chanzuckerberg / single-cell

A collection of documents that reflect various design decisions that have been made for the cellxgene project.
MIT License

Fix issue with too many batch jobs at once #680

Status: Closed (joyceyan closed this 8 months ago)

joyceyan commented 8 months ago

Some datasets are not being migrated, apparently because too many AWS Batch requests were made at the same time and the service throttled them: https://us-west-2.console.aws.amazon.com/states/home?region=us-west-2#/map-runs/executions/arn:aws:states:us-west-2:699936264352:execution:dp-dev-devstack-schema-migration-sfn/SpanCollections:c5ba7769-2bcc-3463-aa4c-c7b701e42f36

Output from the failed CollectionMigration step:

{
  "can_publish": "False",
  "collection_id": "f5af7a2f-ab4c-4728-829e-48efb9562105",
  "collection_version_id": "a9413b57-87ef-46ff-b3a8-660970950a27",
  "error": {
    "Error": "Batch.AWSBatchException",
    "Cause": "Too Many Requests (Service: AWSBatch; Status Code: 429; Error Code: TooManyRequestsException; Request ID: 946809de-635f-48b4-aa1b-186da2a84b1c; Proxy: null)"
  }
}
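For triage, here is a minimal Python sketch of how an output like the one above could be recognized as a throttling failure rather than a genuine migration error. The helper names (`is_throttled`, `throttled_collections`) are hypothetical and not part of the project; only the field names and error strings are taken from the output above.

```python
# Markers copied from the "Cause" string in the failed output above.
THROTTLE_MARKERS = ("TooManyRequestsException", "Status Code: 429")


def is_throttled(step_output: dict) -> bool:
    """Return True if a step output looks like an AWS Batch throttling failure."""
    error = step_output.get("error") or {}
    cause = error.get("Cause", "")
    return error.get("Error") == "Batch.AWSBatchException" and any(
        marker in cause for marker in THROTTLE_MARKERS
    )


def throttled_collections(outputs: list) -> list:
    """List the collection_ids whose step output indicates throttling (hypothetical helper)."""
    return [o["collection_id"] for o in outputs if is_throttled(o)]


# The failed CollectionMigration output from this issue.
sample = {
    "can_publish": "False",
    "collection_id": "f5af7a2f-ab4c-4728-829e-48efb9562105",
    "collection_version_id": "a9413b57-87ef-46ff-b3a8-660970950a27",
    "error": {
        "Error": "Batch.AWSBatchException",
        "Cause": "Too Many Requests (Service: AWSBatch; Status Code: 429; "
        "Error Code: TooManyRequestsException; "
        "Request ID: 946809de-635f-48b4-aa1b-186da2a84b1c; Proxy: null)",
    },
}

print(is_throttled(sample))  # prints True
```

Collections flagged this way are candidates for a re-run, since the failure came from request volume and not from the dataset itself.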

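This regression was inadvertently introduced when we switched the migration's map run to distributed mode, which lets many more CollectionMigration steps submit batch jobs at the same time.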

We're going to try increasing the max number of transactions per second to 75, as suggested by @Bento007 here: https://czi-sci.slack.com/archives/C06AVGFV222/p1709584188809559?thread_ts=1709581206.920699&cid=C06AVGFV222
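To illustrate what a cap of 75 transactions per second means in practice, here is a rough Python sketch of client-side pacing. It is not the fix applied in this issue, and the issue does not say where the limit is configured; the `Pacer` class and `paced_submit` function are hypothetical names used only for illustration, and the sketch is single-threaded.

```python
import time


class Pacer:
    """Space out calls so that at most `max_tps` of them start per second."""

    def __init__(self, max_tps, clock=time.monotonic, sleep=time.sleep):
        if max_tps <= 0:
            raise ValueError("max_tps must be positive")
        self.interval = 1.0 / max_tps  # minimum gap between two calls
        self._clock = clock            # injectable so the pacing can be tested
        self._sleep = sleep
        self._next_slot = None         # earliest time the next call may start

    def wait(self):
        """Block until the next call is allowed; return the seconds slept."""
        now = self._clock()
        if self._next_slot is None or now >= self._next_slot:
            # No call is pending in this slot, so proceed immediately.
            self._next_slot = now + self.interval
            return 0.0
        delay = self._next_slot - now
        self._sleep(delay)
        self._next_slot += self.interval
        return delay


def paced_submit(jobs, submit, pacer):
    """Call the hypothetical `submit(job)` for each job, paced by `pacer`."""
    results = []
    for job in jobs:
        pacer.wait()
        results.append(submit(job))
    return results


if __name__ == "__main__":
    # Stand-in for a real job submission; it just echoes the job name.
    submitted = paced_submit(["job-a", "job-b", "job-c"], lambda j: j, Pacer(75))
    print(submitted)
```

At 75 transactions per second the minimum gap between submissions is 1/75 s, roughly 13 ms, so a burst of several hundred simultaneous submissions would be spread over a few seconds instead of arriving at once.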