chanzuckerberg / single-cell-data-portal

The data portal supporting the submission, exploration, and management of projects and datasets to cellxgene.
MIT License

Report error to UI when AWS Error occurs during ingestion #7014

Open nayib-jose-gloria opened 5 months ago

nayib-jose-gloria commented 5 months ago

A transient AWS error (TooManyRequests) before the RegisterJobDefinition step in the ingestion SFN caused a dataset to fail during ingestion. The error was not persisted to the UI or to the backend dataset status. Presumably our error-handling setup in happy is not sufficient for this error type: normally we expect SFN errors to trigger the HandleErrors step, which persists such errors to the backend and UI.

As a result, a curator did not realize their dataset upload had failed and waited a day before flagging it to devs, who then identified the issue and resolved it by re-driving the failed SFN.

Look into why this happened and whether any other transient AWS errors could cause the same confusing state. Implement a fix that alerts end users OR engineers to AWS failures in the SFN.
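One possible shape for the fix, as a rough sketch rather than the actual state machine definition: give the RegisterJobDefinition Task state a `Retry` block for the throttling error and a catch-all `Catch` that routes to HandleErrors. The `Retry`/`Catch` fields and `States.ALL` are standard Amazon States Language; the helper name `with_error_handling`, the retry numbers, and the `SubmitJob` next state are assumptions made up for illustration. Note that `Batch.BatchException` (taken from the error output in this issue) may also match non-transient Batch errors.

```python
import copy

# Error name taken from the failure reported in this issue.
# Assumption: retrying on it is acceptable even if it is not always transient.
TRANSIENT_ERRORS = ["Batch.BatchException"]


def with_error_handling(state, handler_state="HandleErrors"):
    """Return a copy of a Task state with Retry and Catch blocks added.

    Hypothetical helper: the retry numbers below are placeholders,
    not tuned against the real Batch API limits.
    """
    patched = copy.deepcopy(state)
    # Retry throttled calls with exponential backoff: 5s, 10s, 20s, 40s.
    patched["Retry"] = [
        {
            "ErrorEquals": TRANSIENT_ERRORS,
            "IntervalSeconds": 5,
            "MaxAttempts": 4,
            "BackoffRate": 2.0,
        }
    ]
    # Anything that still fails is routed to the error-handling step,
    # with the error details stored under $.error in the state input.
    patched["Catch"] = [
        {
            "ErrorEquals": ["States.ALL"],
            "ResultPath": "$.error",
            "Next": handler_state,
        }
    ]
    return patched


# Minimal stand-in for the real state; "SubmitJob" is a hypothetical next step.
register_job_definition = {
    "Type": "Task",
    "Resource": "arn:aws:states:::aws-sdk:batch:registerJobDefinition",
    "Next": "SubmitJob",
}

patched_state = with_error_handling(register_job_definition)
```

If the existing state already has a `Catch`, the thing to check is whether its `ErrorEquals` list actually covers SDK-integration errors like the one above.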

Also, investigate the rate limit we triggered and how we can avoid it.
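For the rate-limit side, one common mitigation is exponential backoff with jitter, so that concurrent ingestions do not all retry at the same moment. The sketch below assumes the Batch call is made from our own Python code (for example a Lambda) rather than directly by the state machine, which may not match the current setup. `backoff_delays` and `call_with_backoff` are hypothetical names, and the base and cap values are placeholders.

```python
import random
import time


def backoff_delays(count, base=1.0, cap=30.0, rng=None):
    """Return `count` delays using exponential backoff with full jitter."""
    rng = rng or random.Random()
    delays = []
    for attempt in range(count):
        # The ceiling doubles each attempt but never exceeds `cap`.
        ceiling = min(cap, base * (2 ** attempt))
        delays.append(rng.uniform(0, ceiling))
    return delays


def call_with_backoff(fn, is_retryable, max_attempts=5, sleep=time.sleep):
    """Call fn(), retrying retryable failures up to max_attempts in total."""
    delays = backoff_delays(max_attempts - 1)
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception as exc:
            # Give up on the last attempt or on non-retryable errors.
            if attempt == max_attempts - 1 or not is_retryable(exc):
                raise
            sleep(delays[attempt])
```

Here `is_retryable` would be whatever check identifies a throttling response. Passing `sleep` as a parameter keeps the helper easy to test without real waits.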

Example error output:

{
  "cause": "Too Many Requests (Service: Batch, Status Code: 429, Request ID: 448c1591-a838-4666-9aec-1bd23c9e643f)",
  "error": "Batch.BatchException",
  "resource": "registerJobDefinition",
  "resourceType": "aws-sdk:batch"
}

AWS docs on this error: https://repost.aws/knowledge-center/aws-batch-requests-error
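To tell throttling apart from other failures when deciding whether to retry or alert, the error payload above can be inspected for the 429 status code. This is a minimal sketch that assumes the `cause` string always has the `Status Code: NNN` shape shown above; `is_throttling_error` is a hypothetical helper, not an existing function in the codebase.

```python
import json
import re

# Payload copied from the error output in this issue.
SAMPLE_ERROR = json.loads("""
{
  "cause": "Too Many Requests (Service: Batch, Status Code: 429, Request ID: 448c1591-a838-4666-9aec-1bd23c9e643f)",
  "error": "Batch.BatchException",
  "resource": "registerJobDefinition",
  "resourceType": "aws-sdk:batch"
}
""")


def is_throttling_error(payload):
    """Return True if the payload's cause reports HTTP status 429."""
    cause = payload.get("cause", "")
    match = re.search(r"Status Code: (\d+)", cause)
    return match is not None and match.group(1) == "429"


sample_is_throttled = is_throttling_error(SAMPLE_ERROR)
```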

brianraymor commented 5 months ago

Just curious - why is this being throttled?

nayib-jose-gloria commented 5 months ago

Good question. I added a note to investigate that as part of this issue as well, plus a link to the AWS docs on that exception.