AlexsLemonade / refinebio

Refine.bio harmonizes petabytes of publicly available biological data into ready-to-use datasets for cancer researchers and AI/ML scientists.
https://www.refine.bio/

SCAN.UPC silently fails #2019

Open davidsmejia opened 4 years ago

davidsmejia commented 4 years ago

Context

After running GSE131617 today, two samples ended up with a failure_reason saying the file was missing at the time of sha1 generation. The real cause was that SCAN.UPC never wrote the output file. One of the silenced warnings was:

RRuntimeWarning: /home/user/data_store/raw/TEST/CEL/GSM3791176_Expression_BN_V-VI_13-FC_090612.CEL has a disproportionate number of zero values, so it cannot be processed.

Problem or idea

We don't have a good log of why processing failed for these samples: from our side it just looks like the output file is missing.

Solution or next step

We could capture the warnings emitted while scan_upc runs, then check whether the output file was generated. If it was not, throw an exception carrying the last warning from scan_upc. This would give us a better way of determining which samples are salvageable.

This would happen at the end of processors/array_express.py:_run_scan_upc
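
A minimal sketch of that check, assuming the R warnings surface through Python's `warnings` module (as the `RRuntimeWarning` prefix suggests). `run_scan`, `ScanUpcError`, and `run_and_verify` are hypothetical names used for illustration, not existing refinebio code; `run_scan` stands in for whatever actually invokes SCAN.UPC.

```python
import os
import warnings


class ScanUpcError(Exception):
    """Hypothetical exception: SCAN.UPC ran but produced no output file."""


def run_and_verify(run_scan, input_path, output_path):
    """Run `run_scan`, record its warnings, and fail loudly if no output exists.

    `run_scan` is a placeholder callable taking (input_path, output_path).
    Returns the recorded warning messages when the output file was written.
    """
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")  # record every warning, even repeats
        run_scan(input_path, output_path)

    if not os.path.exists(output_path):
        # Surface the last warning as the failure reason instead of
        # letting a later step report "file missing".
        last = str(caught[-1].message) if caught else "no warning was emitted"
        raise ScanUpcError(f"SCAN.UPC wrote no output for {input_path}: {last}")

    return [str(w.message) for w in caught]
```

The message of the raised exception could then be stored as the sample's failure_reason, so the zero-values warning is recorded instead of a missing-file error.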

cgreene commented 4 years ago

@srp33: is there a reason this is a warning instead of an error?

srp33 commented 4 years ago

Sorry for the delay. I designed it to give a warning so that the whole dataset wouldn't fail if one or a few samples had a problem.

cgreene commented 4 years ago

Ok - so we should record that those samples failed with the above error. @srp33 : would it be easy to make a strict or single-sample mode that raises any warnings that result in a file failing to process as an error instead?

srp33 commented 4 years ago

Are you saying to fail the whole dataset if there's an error with any of the samples in the dataset? @cgreene

cgreene commented 4 years ago

I am imagining a scenario where someone has one task that downloads a dataset and extracts an archive, and then a separate task for each sample that processes the individual sample with SCAN.

srp33 commented 4 years ago

I'm not sure if I understand the full vision of what you're wanting to do, but at least for Affymetrix samples, you should be able to specify a GSM ID rather than a GSE ID to process a single sample. (At least that's how I remember coding it.)

cgreene commented 4 years ago

Ahh - yes - we can run it one sample at a time. I'm just wondering if it would be easy to add an option where, at the user's request, any warning that would lead to a sample being left out of the result produces an error instead (which we catch and handle, as opposed to warnings which we generally don't).
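
Until such an option exists upstream, something similar can be approximated on the calling side with Python's standard `warnings` filters. This is only a sketch: `run_strict` and `run_scan` are hypothetical names, the default pattern is taken from the warning text quoted at the top of this issue, and it assumes the R warning actually reaches Python's `warnings` module.

```python
import warnings


def run_strict(run_scan, *args, message=".*cannot be processed", **kwargs):
    """Call `run_scan`, raising any warning whose text matches `message`.

    `message` is a regular expression matched (case-insensitively) against
    the start of the warning text, which is how `warnings.filterwarnings`
    interprets it. Warnings that do not match keep their default behaviour.
    """
    with warnings.catch_warnings():  # filters are restored on exit
        warnings.filterwarnings("error", message=message)
        return run_scan(*args, **kwargs)
```

Matching on the message text keeps harmless warnings non-fatal, at the cost of depending on the exact wording SCAN.UPC uses.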

srp33 commented 4 years ago

Unfortunately, it wouldn't be a quick and easy solution to put that in the Bioconductor version. But it would probably be doable to build a custom version of SCAN.UPC that would behave this way. However, before I do that, I wonder if it would work instead to run it in verbose mode. If the warning is shown in verbose mode, perhaps your scripts could catch and handle that.
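
If the message does show up in captured output, picking out the affected files could look roughly like this. It is a sketch under assumptions: the pattern is copied from the warning quoted at the top of this issue, `find_unprocessable` is a hypothetical helper, and whether verbose mode prints the message in this form is not confirmed here.

```python
import re

# Pattern based on the warning text quoted in this issue; file paths are
# assumed to contain no whitespace.
UNPROCESSABLE = re.compile(
    r"(?P<path>\S+) has a disproportionate number of zero values, "
    r"so it cannot be processed"
)


def find_unprocessable(captured_output):
    """Return the file paths that the captured text reports as unprocessable."""
    return [m.group("path") for m in UNPROCESSABLE.finditer(captured_output)]
```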