Open hannes-ucsc opened 3 months ago
Assignee to consider next steps.
I provided a reproduction. Fixing #6417 did not make this worse. Supposedly a 404 from S3 is better than a 500 from Azul.
Also, there is a mitigating factor. See last paragraph of the description.
Assignee to port reproduction to the dev deployment once fix for https://github.com/DataBiosphere/azul/issues/6417 lands there
done.
Make manifest request with filters {"organ":{"is":["blood"]},"fileFormat":{"is":["fastq.gz"]}}.
Follow the redirect:
List the manifest object from the S3 URL above:
Delete said object:
Request the same manifest. ~In this case we reordered the filter keys ({"fileFormat":{"is":["fastq.gz"]},"organ":{"is":["pancreas"]}}), but with #6417 fixed, reordering is not required to reproduce this issue, and the filter key order does not affect the outcome:~
Follow the 301 redirect:
Follow the 302 redirect to the manifest object:
S3 returns a 404.
The expected behavior is to kick off a new manifest generation that recreates the manifest object in S3. Note that in the reproduction, after deleting the manifest object, the token returned in the 301 redirect is the same as that from before the deletion, indicating that Azul is reusing the AWS Step Function execution from the first request. This should be a new execution.
The difficulty is that while the Step Function execution is running and the object does not yet exist, we do want to reuse that execution, both to save cost by avoiding redundant executions and to avoid DoS attacks. We need to somehow distinguish that case from the one where the Step Function execution has finished but the manifest object no longer exists, because it expired or was explicitly deleted.
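A minimal sketch of the distinction described above, not Azul's actual code. It assumes the caller has already looked up the status of the cached execution and checked whether the manifest object exists in S3. The names ExecutionStatus, Action and next_action are hypothetical, and treating a failed execution as a reason to regenerate is an assumption.

```python
from enum import Enum
from typing import Optional


class ExecutionStatus(Enum):
    # A subset of execution states, enough for this illustration
    RUNNING = 'RUNNING'
    SUCCEEDED = 'SUCCEEDED'
    FAILED = 'FAILED'


class Action(Enum):
    WAIT = 'reuse the running execution'
    REDIRECT = 'redirect to the existing manifest object'
    REGENERATE = 'start a new execution'


def next_action(status: Optional[ExecutionStatus],
                object_exists: bool) -> Action:
    """
    Decide how to serve a manifest request (hypothetical helper).

    `status` is None when no execution is known for the cache key.
    """
    if status is None:
        # Nothing cached yet, generate the manifest
        return Action.REGENERATE
    if status is ExecutionStatus.RUNNING:
        # The object is not expected to exist yet; reusing the execution
        # avoids redundant executions
        return Action.WAIT
    if status is ExecutionStatus.SUCCEEDED and object_exists:
        return Action.REDIRECT
    # The execution finished but the object expired or was deleted, or the
    # execution failed (assumption): the cached execution is of no use
    return Action.REGENERATE
```

The case reproduced in this issue corresponds to next_action(ExecutionStatus.SUCCEEDED, False), which under this sketch starts a new execution instead of redirecting to a missing object.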
The fact that we pretty much redeploy Azul at least once a week prevents this from being a bigger issue. The manifest expiration is also one week, so by the time a manifest object expires, the deployed commit will have changed. That invalidates all manifest cache keys and therefore all Step Function executions, so new executions will be kicked off even when manifest requests are repeated.
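To illustrate why a redeploy invalidates cached manifests, here is a sketch of a cache key that incorporates the deployed commit. This is an assumption about the general shape of such a key, not Azul's actual derivation; manifest_cache_key and its parameters are hypothetical.

```python
import hashlib
import json


def manifest_cache_key(commit: str, filters: dict) -> str:
    """
    Derive a cache key from the deployed commit and the request filters
    (hypothetical helper, for illustration only).
    """
    # Sorting the keys makes the key independent of filter key order
    canonical = json.dumps({'commit': commit, 'filters': filters},
                           sort_keys=True,
                           separators=(',', ':'))
    return hashlib.sha256(canonical.encode()).hexdigest()


filters = {'organ': {'is': ['blood']}, 'fileFormat': {'is': ['fastq.gz']}}
reordered = {'fileFormat': {'is': ['fastq.gz']}, 'organ': {'is': ['blood']}}

# Same commit and same filters in a different order yield the same key
same_key = (manifest_cache_key('commit-a', filters)
            == manifest_cache_key('commit-a', reordered))

# A new deployed commit yields a different key, so a repeated request no
# longer maps to the old execution
changed_key = (manifest_cache_key('commit-a', filters)
               != manifest_cache_key('commit-b', filters))
```

With a key of this shape, any execution associated with the old key is simply never looked up again after a redeploy, which is the mitigating effect described above.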