AlexsLemonade / refinebio

Refine.bio harmonizes petabytes of publicly available biological data into ready-to-use datasets for cancer researchers and AI/ML scientists.
https://www.refine.bio/

Encountered error in R code while running gene_convert_illumina.R pipeline during processing of /... #571

Open cgreene opened 6 years ago

cgreene commented 6 years ago

Pretty sure the line with the error message should be using result.returncode and result.stderr, not e. I've assigned myself, but if anyone else wants to tackle it before I get to things today, feel free. Should be a quick fix.
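A minimal sketch of the kind of fix described here, not the actual refinebio code: build the message from the finished process's returncode and stderr rather than from an exception object. The function names `format_r_error` and `run_script` are hypothetical, and the handling of a missing stderr is an assumption about how a None value could end up in a string operation.

```python
import subprocess
import sys


def format_r_error(script_name, input_path, result):
    """Build a log message from a finished subprocess.CompletedProcess."""
    # stderr is bytes when captured without text mode and None when it was
    # not captured at all; normalise both cases to str before formatting.
    stderr = result.stderr
    if stderr is None:
        stderr_text = "<stderr not captured>"
    elif isinstance(stderr, bytes):
        stderr_text = stderr.decode("utf-8", errors="replace").strip()
    else:
        stderr_text = stderr.strip()
    return (
        "Encountered error in R code while running {} pipeline during "
        "processing of {} (exit code {}): {}"
    ).format(script_name, input_path, result.returncode, stderr_text)


def run_script(command, script_name, input_path):
    """Run a command; return None on success or an error message on failure."""
    result = subprocess.run(
        command, stdout=subprocess.PIPE, stderr=subprocess.PIPE
    )
    if result.returncode != 0:
        return format_r_error(script_name, input_path, result)
    return None
```

Using str.format (or an explicit decode) instead of string concatenation means a None or bytes value cannot raise a TypeError while the error message itself is being built.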

https://sentry.io/greenelab/staging-refinebio/issues/667326691/

Encountered error in R code while running gene_convert_illumina.R pipeline during processing of /home/user/data_store/GSE98897/raw/GSM2627179-tbl-1.txt.fixed: Can't convert 'NoneType' object to str implicitly

jaclyn-taroni commented 6 years ago

I've dug into this a bit more and merged the issues together as they arise during the processing of a single accession: https://sentry.io/greenelab/staging-refinebio/issues/667302064/?query=is:unresolved%20gene_convert_illumina

On #491, I randomly selected Illumina experiments that I knew were on supported platforms, and that selection included GSE98897. GSE98897 happens to be a SuperSeries that includes GSE98895, which is on a supported platform, and the way I constructed my list did not account for that (see also: https://github.com/jaclyn-taroni/beadarray-platform-detection#results). The other samples under the SuperSeries umbrella were run on miRNA arrays, which we do not support.

Given that there is likely a bug in the Python code as noted above, it's hard to say exactly what happened here. I would expect that we would not attempt to convert the gene IDs in these samples because they are not on a supported platform.

jaclyn-taroni commented 6 years ago

I forgot to mention that there are 15 of these events and I would expect this to occur for all 40 samples on the miRNA arrays.

cgreene commented 6 years ago

It's possible that some got rate limited. It looks like ~25% were dropped due to rate limiting. Though 15 isn't 25% of 40, if this experiment was handled at a time when a higher proportion was dropped, that could explain the lower number of reports.
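For reference, the arithmetic behind those numbers can be written out as a quick check. It uses only the figures quoted in this thread (40 expected events, 15 reported, roughly 25% dropped), and the variable names are illustrative.

```python
expected_events = 40      # samples on the miRNA arrays
observed_events = 15      # events actually reported
assumed_drop_rate = 0.25  # approximate share dropped by rate limiting

# Reports we would expect to see if exactly 25% were dropped.
expected_after_drops = expected_events * (1 - assumed_drop_rate)

# Drop rate that would be needed to explain seeing only 15 of 40.
implied_drop_rate = 1 - observed_events / expected_events

print(expected_after_drops)  # 30.0
print(implied_drop_rate)     # 0.625
```

So a 25% drop rate alone would leave about 30 reports; seeing 15 would require closer to 62.5% of events being dropped while this experiment was processed.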

jaclyn-taroni commented 6 years ago

Digging a bit further, the miRNA array probe identifiers in GSM2627173-tbl-1.txt, one of the failures, do not appear to overlap with any of the whole genome chip identifiers (Human v1-4), which I would expect to get caught at the "detect database" step:

https://github.com/AlexsLemonade/refinebio/blob/dev/workers/data_refinery_workers/processors/no_op.py#L187
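To illustrate the expectation, here is a rough sketch of an overlap check that would reject probe identifiers matching none of the supported chips. It is not the code at the link above: the function name `detect_platform`, the `min_overlap` threshold, and all identifiers in the example are made up for illustration.

```python
def detect_platform(probe_ids, platform_probe_sets, min_overlap=0.5):
    """Return the best-matching platform name, or None if nothing matches.

    platform_probe_sets maps a platform name to the set of probe
    identifiers known for that platform. min_overlap is a hypothetical
    threshold on the fraction of input probes found on a platform.
    """
    probes = set(probe_ids)
    if not probes:
        return None
    best_name, best_fraction = None, 0.0
    for name, known in platform_probe_sets.items():
        fraction = len(probes & known) / len(probes)
        if fraction > best_fraction:
            best_name, best_fraction = name, fraction
    return best_name if best_fraction >= min_overlap else None


# Placeholder identifiers, not real probe IDs.
platforms = {
    "human_v3": {"ILMN_0000001", "ILMN_0000002", "ILMN_0000003"},
    "human_v4": {"ILMN_0000002", "ILMN_0000003", "ILMN_0000004"},
}
mirna_like = ["MIR_PROBE_1", "MIR_PROBE_2"]
whole_genome_like = ["ILMN_0000003", "ILMN_0000004"]

print(detect_platform(mirna_like, platforms))         # None
print(detect_platform(whole_genome_like, platforms))  # human_v4
```

With a check like this, samples with no overlap would return None and could be skipped before any attempt at gene ID conversion.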