broadinstitute / single_cell_portal_core

Rails/Docker application for the Broad Institute's single cell RNA-seq data portal
https://singlecell.broadinstitute.org
BSD 3-Clause "New" or "Revised" License

Automatic retries of OOM failures for ingest processes (SCP-5827) #2154

Closed bistline closed 1 week ago

bistline commented 2 weeks ago

BACKGROUND

With the addition of AnnData parsing and automated differential expression calculation for qualifying studies, we have seen a rise in ingest processes failing with exit codes 137 or 139, both of which indicate an out-of-memory (OOM) failure. Usually, these jobs can be rerun on larger GCE instances and will eventually ingest successfully. However, this process is entirely manual and requires direct admin intervention, which can be quite cumbersome because files and parameters must be reconstituted from various sources.
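For reference, exit statuses above 128 conventionally encode the signal that terminated the process (128 plus the signal number). The sketch below decodes the two codes mentioned above; `terminating_signal` is a hypothetical helper written for illustration, not part of the application. 137 maps to SIGKILL, which is what the kernel OOM killer sends, and 139 maps to SIGSEGV, which can also surface when a process exhausts memory.

```ruby
# Hypothetical helper (not from the codebase): decode an exit status above 128
# into the name of the signal that terminated the process, following the
# shell convention of 128 + signal number.
def terminating_signal(exit_status)
  return nil unless exit_status > 128

  Signal.signame(exit_status - 128)
end

[137, 139].each do |code|
  puts "#{code} => SIG#{terminating_signal(code)}"
end
```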

CHANGES

Now, any ingest process that fails with either of the above exit codes and uses a parameters class (i.e. AnnDataIngestParameters, DifferentialExpressionParameters) will automatically retry the job using the next available machine type. For instance, if a file fails with n2d-highmem-8, it will automatically retry with n2d-highmem-16. This process continues all the way up to n2d-highmem-64; if the file still fails to ingest at that size, all retries cease and the end user is notified. Normal admin messaging and cleanup procedures still apply to every retry, ensuring visibility into the process. Additionally, we could set up Mixpanel alerts for any job failures with the corresponding exit codes to further enhance visibility.
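The escalation described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in this PR: the constant and method names are hypothetical, and the ladder of machine types (in particular the intermediate n2d-highmem-32 step) is assumed rather than taken from the application.

```ruby
# Illustrative sketch only; all names here are hypothetical.
# Assumed ladder of machine types, smallest to largest.
MACHINE_TYPES = %w[n2d-highmem-8 n2d-highmem-16 n2d-highmem-32 n2d-highmem-64].freeze

# Exit codes treated as out-of-memory failures in the description above.
OOM_EXIT_CODES = [137, 139].freeze

# Returns the next larger machine type, or nil when already at the top of the
# ladder or when the current type is not on the ladder at all.
def next_machine_type(current)
  index = MACHINE_TYPES.index(current)
  return nil if index.nil?

  MACHINE_TYPES[index + 1]
end

# Returns the machine type to retry with, or nil when no retry should happen
# (non-OOM failure, no parameters class, or ladder exhausted).
def retry_machine_type(exit_status:, machine_type:, uses_params_class:)
  return nil unless uses_params_class && OOM_EXIT_CODES.include?(exit_status)

  next_machine_type(machine_type)
end

p retry_machine_type(exit_status: 137, machine_type: 'n2d-highmem-8', uses_params_class: true)
p retry_machine_type(exit_status: 137, machine_type: 'n2d-highmem-64', uses_params_class: true)
```

A `nil` result corresponds to the point where retries stop and the normal failure notification goes to the end user.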

The reason for the requirement on the parameters class is that normal ingest jobs of "classic" SCP files do not employ dynamic scaling of machines, and enabling it would require more significant refactoring. Furthermore, since we started tracking exitStatus for ingest jobs, there have been no instances of non-AnnData files failing due to OOM exceptions.

MANUAL TESTING

Proper testing requires a large AnnData file that is known to fail on the default machine_type used for its ingest. If you do not have access to one locally, such a file can be found here: https://console.cloud.google.com/storage/browser/_details/broad-singlecellportal-staging-testing-data/SCP2807/HRCA_snRNA_AC.h5ad;tab=live_object?project=broad-singlecellportal-staging

  1. Boot all services and create a new study, selecting the AnnData upload UX
  2. Copy the above file to the study's bucket and save the file using the "bucket path" option
  3. In development.log, look for the following message (the ID/accession will be different):
    Retrying ingest_anndata after 137 failure for HRCA_snRNA_AC.h5ad:6706c4a794ec8f2272dc977f (SCP161) with machine_type: n2d-highmem-16
  4. Note a new job is launched immediately afterwards with the correct machine_type:
    Request object sent to Google Life Sciences API, excluding 'environment' parameters:
    ---
    :actions:
    :commands:
    - python
    - ingest_pipeline.py
    - "--study-id"
    - 66d8bcf394ec8f692a4ebea8
    - "--study-file-id"
    - 6706c4a794ec8f2272dc977f
    - "--user-metrics-uuid"
    - c6a08597-c735-4c93-b069-4b4a74b708ee
    - ingest_anndata
    - "--ingest-anndata"
    - "--anndata-file"
    - gs://fc-b04eabc0-55c0-4047-9a02-5d0f98a8f528/HRCA_snRNA_AC.h5ad
    - "--obsm-keys"
    - '["X_umap"]'
    - "--extract"
    - '["cluster", "metadata", "processed_expression", "raw_counts"]'
    :image_uri: gcr.io/broad-singlecellportal-staging/scp-ingest-pipeline:1.35.0
    :labels: {}
    :timeout: {}
    :resources:
    :regions:
    - us-central1
    :virtual_machine:
    :boot_disk_size_gb: 300
    :labels:
      :study_accession: scp161
      :user_id: 63e27b72f39faa00560c1e25
      :filename: hrca_snrna_ac_h5ad
      :action: ingest_pipeline
      :docker_image: scp-ingest-pipeline
      :docker_tag: '1_35_0'
      :environment: development
      :file_type: anndata
      :machine_type: n2d-highmem-16
      :boot_disk_size_gb: '300'
    :machine_type: n2d-highmem-16 <==== now on 16 instead of 8
    ...

Note: you do not need to let the file complete the ingest process, as this will take quite a while; it was still running 90 minutes into the retry in my manual test.
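If scanning development.log by eye is tedious, a small script along these lines can pull out the retry messages from step 3. It is only a sketch: the regular expression is inferred from the single example message above, and `RETRY_PATTERN` and `retry_events` are hypothetical names, not part of the application.

```ruby
# Hypothetical helper for manual testing; the pattern is inferred from the
# example log message in step 3 and may need adjusting for other messages.
RETRY_PATTERN = /Retrying (?<action>\w+) after (?<code>\d+) failure for (?<file>.+):(?<id>\h+) \((?<accession>SCP\d+)\) with machine_type: (?<machine_type>\S+)/

# Accepts any enumerable of log lines and returns one hash of captured fields
# (with string keys) per line that matches the retry message.
def retry_events(lines)
  lines.filter_map do |line|
    match = RETRY_PATTERN.match(line)
    match&.named_captures
  end
end

sample = 'Retrying ingest_anndata after 137 failure for HRCA_snRNA_AC.h5ad:' \
         '6706c4a794ec8f2272dc977f (SCP161) with machine_type: n2d-highmem-16'
p retry_events([sample])

# Against a real log (path assumed to be the Rails default):
#   retry_events(File.foreach('log/development.log'))
```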

codecov[bot] commented 2 weeks ago

Codecov Report

Attention: Patch coverage is 97.43590% with 1 line in your changes missing coverage. Please review.

Project coverage is 70.07%. Comparing base (d2f033c) to head (51b193a). Report is 10 commits behind head on development.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| app/models/ingest_job.rb | 95.83% | 1 Missing :warning: |

Additional details and impacted files (full report: https://app.codecov.io/gh/broadinstitute/single_cell_portal_core/pull/2154)

```diff
@@               Coverage Diff               @@
##           development    #2154      +/-   ##
===============================================
+ Coverage        69.97%   70.07%   +0.10%
===============================================
  Files              331      331
  Lines            27990    28015      +25
  Branches          2452     2452
===============================================
+ Hits             19585    19631      +46
+ Misses            8259     8238      -21
  Partials           146      146
```

| Files with missing lines | Coverage Δ | |
|---|---|---|
| app/models/ann_data_ingest_parameters.rb | `100.00% <100.00%> (ø)` | |
| app/models/concerns/compute_scaling.rb | `100.00% <100.00%> (ø)` | |
| app/models/delete_queue_job.rb | `63.52% <100.00%> (+0.70%)` | :arrow_up: |
| app/models/differential_expression_parameters.rb | `100.00% <100.00%> (ø)` | |
| app/models/ingest_job.rb | `55.26% <95.83%> (+3.60%)` | :arrow_up: |

... and 3 files with indirect coverage changes (https://app.codecov.io/gh/broadinstitute/single_cell_portal_core/pull/2154/indirect-changes)