Closed bistline closed 1 week ago
Attention: Patch coverage is 97.43590% with 1 line in your changes missing coverage. Please review.
Project coverage is 70.07%. Comparing base (d2f033c) to head (51b193a). Report is 10 commits behind head on development.
Files with missing lines | Patch % | Lines |
---|---|---|
app/models/ingest_job.rb | 95.83% | 1 Missing :warning: |
BACKGROUND
With the addition of AnnData parsing and automated differential expression calculation for qualifying studies, we have seen a rise in ingest processes failing with error codes of `137` or `139`, both of which indicate an out of memory exception. Usually, these jobs can be rerun using larger GCE instances and will eventually ingest successfully. However, this process is entirely manual and requires direct admin intervention, which can be quite cumbersome because it requires reconstituting files and parameters from various sources.
CHANGES
Now, any ingest process that fails with either of the above error codes and uses a parameters class (i.e. `AnnDataIngestParameters`, `DifferentialExpressionParameters`) will automatically retry the job using the next available machine type. For instance, if a file fails with `n2d-highmem-8`, it will automatically retry with `n2d-highmem-16`. This process continues all the way up to `n2d-highmem-64`; if the file still fails to ingest at that size, all retries cease and the end user is finally notified. Normal admin messaging and cleanup procedures still apply for all retries, ensuring visibility into the process. Additionally, we could set up Mixpanel alerts for any job failures with the corresponding error codes to further enhance visibility.
The reason for the requirement on the parameters class is that normal ingest jobs of "classic" SCP files do not employ dynamic scaling of machines, and would require more significant refactoring to enable. Furthermore, since we started tracking `exitStatus` for ingest jobs, there have been no instances of non-AnnData files failing due to OOM exceptions.
MANUAL TESTING
To properly test, this requires a large AnnData file that is known to fail on the default `machine_type` used for its ingest. Such a file can be found here if you do not have access to one locally: https://console.cloud.google.com/storage/browser/_details/broad-singlecellportal-staging-testing-data/SCP2807/HRCA_snRNA_AC.h5ad;tab=live_object?project=broad-singlecellportal-staging
In `development.log`, look for the following message (the ID/accession will be different):
`machine_type`:
Note: you do not need to let the file complete the ingest process, as this will take quite a while. It was still running 90 minutes into the retry in my manual test.
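For reviewers, the retry escalation described under CHANGES can be sketched roughly as follows. This is a minimal illustration, not the code in `app/models/ingest_job.rb`: the constant and method names are hypothetical, and the `n2d-highmem-32` step is an assumption, since the description only names `n2d-highmem-8`, `n2d-highmem-16`, and `n2d-highmem-64`.

```ruby
# Hypothetical sketch of the retry decision; names are illustrative only
# and are not taken from app/models/ingest_job.rb.

# Exit codes treated as out-of-memory failures in this PR description
OOM_EXIT_CODES = [137, 139].freeze

# Assumed ladder of machine types, smallest to largest.
# n2d-highmem-32 is an assumption; the other three appear in the description.
MACHINE_TYPES = %w[
  n2d-highmem-8
  n2d-highmem-16
  n2d-highmem-32
  n2d-highmem-64
].freeze

# Return the next larger machine type, or nil if there is none
# (or if the current type is not in the ladder).
def next_machine_type(current)
  idx = MACHINE_TYPES.index(current)
  return nil if idx.nil?

  MACHINE_TYPES[idx + 1]
end

# Decide which machine type to retry with, or nil if no retry should happen.
# Only jobs that use a parameters class and failed with an OOM exit code qualify.
def retry_machine_type(exit_status:, machine_type:, uses_params_class:)
  return nil unless uses_params_class
  return nil unless OOM_EXIT_CODES.include?(exit_status)

  next_machine_type(machine_type)
end
```

In this sketch, a `nil` result corresponds to the point where retries cease and the end user is notified, for example when the job has already failed on `n2d-highmem-64`.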