Add secondary cleanup for deleted files, only extract raw_counts when specified (SCP-5782, SCP-5783)

BACKGROUND

In the AnnData UX, after a file finishes the extraction phase, the 3 main datatypes (expression, metadata, clustering) are all ingested in parallel (if specified). While this is the most performant way to process data, it leads to corner cases where one process fails (expression ingest, for instance) but the other processes succeed. Because the data cleanup happens immediately upon detection of the failure, any processes that are still writing data will not be affected. This leads to a case where orphaned data exists in the database that cannot be cleaned up by the user, and effectively blocks all further ingests using the AnnData UX.

CHANGES

Now, upon successful completion of an ingest process, a "secondary cleanup" check is run to determine if the file has been queued for deletion by another ingest run. This will in practice only apply to AnnData files, as other file types are ingested atomically, and ancillary processes like subsampling or differential expression have their own cleanup policies. Additionally, this will skip sending a "success" email to a user with a file that is in the process of being deleted, meaning the last email the user gets about the file is the failure that caused the delete cascade.

Additionally, this fixes a bug where AnnData files that were not marked as having raw counts data were still extracting raw cell names. Now, the raw_counts extraction only happens when specified.

MANUAL TESTING

TL;DR - this is nearly impossible to test manually. It relies on having a file where only one data type fails to validate, but the jobs are spaced out just so that one fails before the others succeed. That being said, I was able to get it to work using the following steps, though it requires manually changing the polling interval for some jobs.

Download a copy of trimmed_compliant_pbmc3K.h5ad from the testing bucket. This AnnData file has duplicate gene entries, but all other data is valid.

In app/models/ingest_job.rb, modify the else block in poll_for_completion to change the run_at value for expression jobs. This makes the polling happen much quicker, and will allow the expression job to (hopefully) fail before the clustering & metadata:

# app/models/ingest_job.rb, line 385
else
run_at = action == :ingest_expression ? 10.seconds.from_now : run_at
Rails.logger.info "IngestJob poller: #{pipeline_name} is not done; queuing check for #{run_at}"
delay(run_at: run_at).poll_for_completion
end

Boot all services (including DelayedJob) and sign in
Create a new study and upload the file, specifying a clustering in the X_umap or X_tsne slots, but select No for raw counts data

In development.log, look for the job parameters and note that raw_counts is not listed in extract:

Remote found for trimmed_compliant_pbmc3K.h5ad:66d8c20b01de553d34536123, launching Ingest job
Request object sent to Google Life Sciences API, excluding 'environment' parameters:
---
:actions:
:commands:
- python
- ingest_pipeline.py
- "--study-id"
- 66d8bcf394ec8f692a4ebea8
- "--study-file-id"
- 66d8c20b01de553d34536123
- "--user-metrics-uuid"
- c6a08597-c735-4c93-b069-4b4a74b708ee
- ingest_anndata
- "--ingest-anndata"
- "--anndata-file"
- gs://fc-b04eabc0-55c0-4047-9a02-5d0f98a8f528/trimmed_compliant_pbmc3K.h5ad
- "--obsm-keys"
- '["X_umap"]'
- "--extract"
- '["cluster", "metadata", "processed_expression"]'
:image_uri: gcr.io/broad-singlecellportal-staging/scp-ingest-pipeline:1.35.0
...

Wait for the expression job to fail, looking for an email (or log message) that states 'trimmed_compliant_pbmc3K.h5ad' has failed during parsing. Duplicate values are not allowed. There are 1 duplicates in the gene file.

In development.log, once the cluster/metadata jobs complete, look for the similar messages:

IngestJob poller: projects/342448556558/locations/us-central1/operations/16392329314662791330 is done!
IngestJob poller: projects/342448556558/locations/us-central1/operations/16392329314662791330 status: Completed
Checking for secondary cleanup on 66d8c20b01de553d34536123 after projects/342448556558/locations/us-central1/operations/16392329314662791330 completion
Performing secondary cleanup on 66d8c20b01de553d34536123 due to upstream failure
Removing all ClusterGroup records for 66d8c20b01de553d34536123
Removing all CellMetadatum records for 66d8c20b01de553d34536123
Removing all Gene records for 66d8c20b01de553d34536123
Removing all DataArray records for 66d8c20b01de553d34536123
Secondary cleanup for 66d8c20b01de553d34536123 complete
...
IngestJob poller: projects/342448556558/locations/us-central1/operations/3261380939259729291 is done!
IngestJob poller: projects/342448556558/locations/us-central1/operations/3261380939259729291 status: Completed
Checking for secondary cleanup on 66d8c20b01de553d34536123 after projects/342448556558/locations/us-central1/operations/3261380939259729291 completion
Performing secondary cleanup on 66d8c20b01de553d34536123 due to upstream failure
Removing all ClusterGroup records for 66d8c20b01de553d34536123
Removing all CellMetadatum records for 66d8c20b01de553d34536123
Removing all Gene records for 66d8c20b01de553d34536123
Removing all DataArray records for 66d8c20b01de553d34536123
Secondary cleanup for 66d8c20b01de553d34536123 complete

Load the "Study Details" page for this study, and confirm the study is still empty (no genes, clusters, or metadata)

Codecov Report

Attention: Patch coverage is 60.71429% with 11 lines in your changes missing coverage. Please review.

Project coverage is 69.62%. Comparing base (8f0f038) to head (8c043ec). Report is 4 commits behind head on development.

Files with missing lines	Patch %	Lines
app/models/ingest_job.rb	50.00%	11 Missing :warning:

Additional details and impacted files

[![Impacted file tree graph](https://app.codecov.io/gh/broadinstitute/single_cell_portal_core/pull/2121/graphs/tree.svg?width=650&height=150&src=pr&token=HMWE5BO2a4&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=broadinstitute)](https://app.codecov.io/gh/broadinstitute/single_cell_portal_core/pull/2121?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=broadinstitute) ```diff @@ Coverage Diff @@ ## development #2121 +/- ## =============================================== + Coverage 69.55% 69.62% +0.07% =============================================== Files 327 327 Lines 27556 27572 +16 Branches 2289 2289 =============================================== + Hits 19166 19197 +31 + Misses 8256 8241 -15 Partials 134 134 ``` | [Files with missing lines](https://app.codecov.io/gh/broadinstitute/single_cell_portal_core/pull/2121?dropdown=coverage&src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=broadinstitute) | Coverage Δ | | |---|---|---| | [app/lib/file\_parse\_service.rb](https://app.codecov.io/gh/broadinstitute/single_cell_portal_core/pull/2121?src=pr&el=tree&filepath=app%2Flib%2Ffile_parse_service.rb&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=broadinstitute#diff-YXBwL2xpYi9maWxlX3BhcnNlX3NlcnZpY2UucmI=) | `61.98% <ø> (ø)` | | | [app/models/ann\_data\_ingest\_parameters.rb](https://app.codecov.io/gh/broadinstitute/single_cell_portal_core/pull/2121?src=pr&el=tree&filepath=app%2Fmodels%2Fann_data_ingest_parameters.rb&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=broadinstitute#diff-YXBwL21vZGVscy9hbm5fZGF0YV9pbmdlc3RfcGFyYW1ldGVycy5yYg==) | `100.00% <100.00%> (ø)` | | | [app/models/ingest\_job.rb](https://app.codecov.io/gh/broadinstitute/single_cell_portal_core/pull/2121?src=pr&el=tree&filepath=app%2Fmodels%2Fingest_job.rb&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=broadinstitute#diff-YXBwL21vZGVscy9pbmdlc3Rfam9iLnJi) | `44.57% <50.00%> (+2.99%)` | :arrow_up: | ... and [3 files with indirect coverage changes](https://app.codecov.io/gh/broadinstitute/single_cell_portal_core/pull/2121/indirect-changes?src=pr&el=tree-more&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=broadinstitute)

broadinstitute / single_cell_portal_core