A transient AWS error (TooManyRequests) before the RegisterJobDefinition step in the ingestion SFN caused a dataset to fail during ingestion. The error was not persisted to the UI or to the backend dataset status. Presumably our error-handling setup in happy is not sufficient for this error type: normally we expect SFN errors to trigger the HandleErrors step, which persists such errors to the backend and UI.
As a result, a curator did not realize their dataset upload had failed and waited a day before flagging it to devs, who then identified the issue and resolved it by re-driving the failed SFN.
Look into why HandleErrors did not run for this error, and whether any other transient AWS errors could cause the same confusing state. Implement a fix that alerts end users OR engineers to AWS failures in the SFN.
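One possible way to alert engineers, independent of whether HandleErrors runs, is an EventBridge rule on Step Functions execution status changes. The sketch below only builds the event pattern and an alert message; the state machine ARN is a made-up placeholder, and how the rule is deployed and where the alert is sent (Slack, SNS, etc.) are left open.

```python
import json

# Hypothetical ARN, for illustration only; substitute the real ingestion SFN ARN.
STATE_MACHINE_ARN = "arn:aws:states:us-west-2:123456789012:stateMachine:ingestion-sfn"

# Terminal statuses that mean the execution did not succeed.
FAILURE_STATUSES = ["FAILED", "TIMED_OUT", "ABORTED"]


def build_failure_event_pattern(state_machine_arn):
    """Return an EventBridge event pattern matching unsuccessful executions
    of a single state machine."""
    return {
        "source": ["aws.states"],
        "detail-type": ["Step Functions Execution Status Change"],
        "detail": {
            "status": FAILURE_STATUSES,
            "stateMachineArn": [state_machine_arn],
        },
    }


def format_alert(event):
    """Turn a matched EventBridge event into a one-line alert message."""
    detail = event["detail"]
    return (
        f"Ingestion SFN execution {detail['executionArn']} "
        f"ended with status {detail['status']}"
    )


if __name__ == "__main__":
    print(json.dumps(build_failure_event_pattern(STATE_MACHINE_ARN), indent=2))
```

A rule like this would fire for any failed execution, including ones where the failure happens before or outside the steps that HandleErrors covers.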
Also, investigate the rate limit we triggered and how we can avoid it.
Example error output:
{
"cause": "Too Many Requests (Service: Batch, Status Code: 429, Request ID: 448c1591-a838-4666-9aec-1bd23c9e643f)",
"error": "Batch.BatchException",
"resource": "registerJobDefinition",
"resourceType": "aws-sdk:batch"
}
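A possible direction for the fix, sketched in Python as a helper that patches a state definition: add a Retry block with exponential backoff for the error name shown above, plus a Catch that routes anything still failing to HandleErrors. This is a sketch under assumptions, not our actual SFN definition: the shape of the RegisterJobDefinition state, the "NextStep" name, and the retry numbers are all hypothetical, and the real state may already have its own Retry/Catch entries. Note that Batch.BatchException is not specific to 429s, so this would also retry non-transient Batch errors before falling through to the Catch.

```python
import json


def with_retry_and_catch(state, handler_state="HandleErrors"):
    """Return a copy of an ASL Task state with a backoff retrier for Batch
    errors and a catch-all that routes remaining failures to handler_state."""
    retrier = {
        "ErrorEquals": ["Batch.BatchException"],  # error name from the output above
        "IntervalSeconds": 2,  # assumed values; tune against the real rate limit
        "MaxAttempts": 5,
        "BackoffRate": 2.0,
    }
    catcher = {
        "ErrorEquals": ["States.ALL"],
        "ResultPath": "$.error",  # keep the original input, attach error info
        "Next": handler_state,
    }
    patched = dict(state)
    # Keep any existing retriers/catchers; put the new retrier first so it
    # takes precedence, and the catch-all last so specific catchers still win.
    patched["Retry"] = [retrier] + list(state.get("Retry", []))
    patched["Catch"] = list(state.get("Catch", [])) + [catcher]
    return patched


# Hypothetical shape of the failing state, for illustration only.
register_job_definition = {
    "Type": "Task",
    "Resource": "arn:aws:states:::aws-sdk:batch:registerJobDefinition",
    "Next": "NextStep",  # hypothetical name
}

if __name__ == "__main__":
    print(json.dumps(with_retry_and_catch(register_job_definition), indent=2))
```

The Catch only takes effect once the retries are exhausted, so a short burst of throttling would be absorbed silently while a persistent failure would still reach HandleErrors and be persisted.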
AWS docs on this error: https://repost.aws/knowledge-center/aws-batch-requests-error