Open davidsmejia opened 4 years ago
@srp33: is there a reason this is a warning instead of an error?
Sorry for the delay. I designed it to give a warning so that the whole dataset wouldn't fail if one or a few samples had a problem.
Ok - so we should record that those samples failed with the above error. @srp33 : would it be easy to make a strict or single-sample mode that raises any warnings that result in a file failing to process as an error instead?
Are you saying to fail the whole dataset if there's an error with any of the samples in the dataset? @cgreene
I am imagining a scenario where someone has one task that downloads a dataset and extracts an archive, and then a separate task for each sample that processes the individual sample with SCAN.
I'm not sure if I understand the full vision of what you're wanting to do, but at least for Affymetrix samples, you should be able to specify a GSM ID rather than a GSE ID to process a single sample. (At least that's how I remember coding it.)
Ahh - yes - we can run it one sample at a time. I'm just wondering if it would be easy to add an option where, at the user's request, any warning that would lead to a sample being left out of the result produces an error instead (which we catch and handle, as opposed to warnings which we generally don't).
Unfortunately, it wouldn't be a quick and easy solution to put that in the Bioconductor version. But it would probably be doable to build a custom version of SCAN.UPC that would behave this way. However, before I do that, I wonder if it would work instead to run it in verbose mode. If the warning is shown in verbose mode, perhaps your scripts could catch and handle that.
Context
After running GSE131617 today, there were 2 samples whose failure_reason was that the file was missing at the time of sha1 generation. This was because SCAN.UPC was not actually writing the output file; one of the silenced warnings was:
RRuntimeWarning: /home/user/data_store/raw/TEST/CEL/GSM3791176_Expression_BN_V-VI_13-FC_090612.CEL has a disproportionate number of zero values, so it cannot be processed.
Problem or idea
We don't have a good log of why processing failed for these samples; from our side it just looks like the file is missing.
Solution or next step
We could capture the warnings emitted while scan_upc runs, then check whether the output file was generated. If not, throw an exception with the last warning provided from scan_upc. This will give us a better way of determining which samples are salvageable. This would happen at the end of processors/array_express.py:_run_scan_upc
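A rough sketch of what that check could look like, assuming the R warnings reach us through Python's standard warnings module. The names run_and_verify, ScanUpcError, and the run_scan callable are hypothetical placeholders for illustration, not existing code in the processor:

```python
import os
import warnings


class ScanUpcError(Exception):
    """Hypothetical exception for a SCAN.UPC run that produced no output."""


def run_and_verify(run_scan, input_path, output_path):
    """Call run_scan(input_path, output_path) while recording warnings.

    run_scan is a stand-in for whatever actually invokes SCAN.UPC.
    Raises ScanUpcError with the last recorded warning if the output
    file does not exist afterward; otherwise returns the warning messages.
    """
    with warnings.catch_warnings(record=True) as caught:
        # Record every warning, including repeats of the same message.
        warnings.simplefilter("always")
        run_scan(input_path, output_path)

    if not os.path.exists(output_path):
        last = str(caught[-1].message) if caught else "no warning was emitted"
        raise ScanUpcError(f"No output for {input_path}: {last}")

    return [str(w.message) for w in caught]
```

The caller could then store the exception message as the sample's failure_reason instead of the generic missing-file message. Samples that produce output but still emit warnings are not treated as failures here; their warnings are simply returned for logging.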