Closed bistline closed 1 month ago
Attention: Patch coverage is 60.71429% with 11 lines in your changes missing coverage. Please review.
Project coverage is 69.62%. Comparing base (8f0f038) to head (8c043ec).
Report is 4 commits behind head on development.
Files with missing lines | Patch % | Lines |
---|---|---|
app/models/ingest_job.rb | 50.00% | 11 Missing :warning: |
BACKGROUND
In the AnnData UX, after a file finishes the extraction phase, the 3 main datatypes (expression, metadata, clustering) are all ingested in parallel (if specified). While this is the most performant way to process data, it leads to corner cases where one process fails (expression ingest, for instance) but the other processes succeed. Because the data cleanup happens immediately upon detection of the failure, any processes that are still writing data are not affected, and whatever they write after the cleanup has run is left behind. This leaves orphaned data in the database that cannot be cleaned up by the user, and effectively blocks all further ingests using the AnnData UX.
CHANGES
Now, upon successful completion of an ingest process, a "secondary cleanup" check is run to determine if the file has been queued for deletion by another ingest run. This will in practice only apply to AnnData files, as other file types are ingested atomically, and ancillary processes like subsampling or differential expression have their own cleanup policies. Additionally, this will skip sending a "success" email to a user with a file that is in the process of being deleted, meaning the last email the user gets about the file is the failure that caused the delete cascade.
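As a rough illustration of the "secondary cleanup" check described above, here is a minimal sketch in plain Ruby. It is not the code from this PR: `StudyFileStub`, `IngestCompletionSketch`, `queued_for_deletion`, `parsed_data` and `handle_success` are all hypothetical names used only to show the control flow (clean up and stay silent if the file is already queued for deletion, otherwise notify as usual).

```ruby
# Hypothetical stand-in for a study file record; the real model is not shown here.
StudyFileStub = Struct.new(:name, :queued_for_deletion, :parsed_data, keyword_init: true)

# Hypothetical sketch of what runs after one parallel ingest process succeeds.
class IngestCompletionSketch
  attr_reader :emails_sent

  def initialize(file)
    @file = file
    @emails_sent = []
  end

  # Called when a single ingest process (e.g. :clustering) finishes successfully.
  # Returns :cleaned_up if the file was already queued for deletion, else :notified.
  def handle_success(data_type)
    if @file.queued_for_deletion
      # Another ingest run failed and queued this file for deletion, so remove
      # whatever this process just wrote and skip the "success" email.
      @file.parsed_data.delete(data_type)
      :cleaned_up
    else
      @emails_sent << "#{data_type} ingest succeeded for #{@file.name}"
      :notified
    end
  end
end
```

In this sketch the last message a user sees for a doomed file is still the original failure, since the success branch is never reached once the file is queued for deletion.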
Additionally, this fixes a bug where AnnData files that were not marked as having raw counts data were still extracting raw cell names. Now, the `raw_counts` extraction only happens when specified.
MANUAL TESTING
TL;DR - this is nearly impossible to test manually. It relies on having a file where only one data type fails to validate, but the jobs are spaced out just so that one fails before the others succeed. That being said, I was able to get it to work using the following steps, though it requires manually changing the polling interval for some jobs.
In `app/models/ingest_job.rb`, modify the `else` block in `poll_for_completion` to change the `run_at` value for expression jobs. This makes the polling happen much quicker, and will allow the expression job to (hopefully) fail before the clustering & metadata:
[...] `X_umap` or `X_tsne` slots, but select No for raw counts data
In `development.log`, look for the job parameters and note that `raw_counts` is not listed in `extract`: