AlexsLemonade / refinebio

Refine.bio harmonizes petabytes of publicly available biological data into ready-to-use datasets for cancer researchers and AI/ML scientists.
https://www.refine.bio/

Dataset request GSE31704 #2274

Open jaclyn-taroni opened 4 years ago

jaclyn-taroni commented 4 years ago

Context

A user requested GSE31704.

Problem or idea

GSE31704 is an Illumina HumanHT-12 V4.0 expression BeadChip experiment with 6 samples, none of which appear to be downloadable from refine.bio.

Copying an error message from the debug information for GSM786857:


Encountered error in R code while running illumina.R pipeline during processing of /home/user/data_store/processor_job_1319230/GSE31704_non-normalized.txt.sanitized: Command '['/usr/bin/Rscript', '--vanilla', '/home/user/data_refinery_workers/processors/illumina.R', '--probeId', 'ID_REF', '--expression', '8,6,12,10,2,4,2,4,6,8,10,12', '--detection', 'Detection Pval', '--platform', 'illuminaHumanv4', '--inputFile', '/home/user/data_store/processor_job_1319230/GSE31704_non-normalized.txt.sanitized', '--outputFile', '/home/user/data_store/processor_job_1319230/GSE31704_non-normalized.PCL', '--cores', '64']' returned non-zero exit status 1

The following jumps out at me:

 '--expression', '8,6,12,10,2,4,2,4,6,8,10,12',

This means we are passing the index of every expression column twice: once as 8,6,12,10,2,4 and again as 2,4,6,8,10,12.
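
To make the duplication concrete, here is a minimal Python sketch that splits the `--expression` value from the error message and reports any index that occurs more than once. The function name `find_duplicate_indices` is hypothetical and is not part of refine.bio.

```python
from collections import Counter
from typing import List


def find_duplicate_indices(expression_arg: str) -> List[int]:
    """Return the column indices that appear more than once, sorted ascending."""
    indices = [int(part) for part in expression_arg.split(",")]
    counts = Counter(indices)
    return sorted(index for index, count in counts.items() if count > 1)


# Value copied from the failing Rscript command in the error message.
expression_arg = "8,6,12,10,2,4,2,4,6,8,10,12"
print(find_duplicate_indices(expression_arg))  # prints [2, 4, 6, 8, 10, 12]
```

Every one of the six distinct indices is reported, so the whole set of expression columns was selected twice rather than a single column being repeated.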

This might have to do with the headers in GSE31704_non-normalized.txt - here's an excerpt:

ID_REF IMR90 control-A.AVG_Signal IMR90 control-A.Detection Pval IMR90 control-B.AVG_Signal IMR90 control-B.Detection Pval
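
As a rough illustration of what an unambiguous match on these headers could look like, the sketch below classifies columns by their suffix. It assumes the file is tab-delimited and that the suffixes `.AVG_Signal` and `.Detection Pval` mark expression and detection columns. The function `classify_columns` is hypothetical and does not reflect how the refine.bio processor selects columns.

```python
from typing import Dict, List

# Header excerpt from GSE31704_non-normalized.txt (tab delimiter is an assumption).
HEADER = "\t".join([
    "ID_REF",
    "IMR90 control-A.AVG_Signal",
    "IMR90 control-A.Detection Pval",
    "IMR90 control-B.AVG_Signal",
    "IMR90 control-B.Detection Pval",
])


def classify_columns(header_line: str, delimiter: str = "\t") -> Dict[str, List[int]]:
    """Group 1-based column positions by header suffix."""
    columns: Dict[str, List[int]] = {"expression": [], "detection": [], "other": []}
    names = header_line.rstrip("\n").split(delimiter)
    for position, name in enumerate(names, start=1):
        if name.endswith(".AVG_Signal"):
            columns["expression"].append(position)
        elif name.endswith(".Detection Pval"):
            columns["detection"].append(position)
        else:
            columns["other"].append(position)
    return columns


print(classify_columns(HEADER))
```

For the excerpt this puts columns 2 and 4 under expression and columns 3 and 5 under detection. If the remaining four samples follow the same pattern, the expression columns would be 2, 4, 6, 8, 10, and 12, which are the distinct indices in the error message.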

Solution or next step

  1. Is it due to the specific headers in this file or is something else causing the error?
  2. Is this a case that we should be able to handle or would it harm our ability to process BeadChip data overall?
  3. Either reprocess and notify the user, or notify the user that we will not be able to handle this case.

jaclyn-taroni commented 4 years ago

I forgot to note that it looks like the last time we tried this experiment was December 2018. I think the most recent changes to the Illumina BeadChip processors had to do with the p-value columns (I'm thinking specifically of #2130) and not the expression columns, but it's possible we have made other changes since then that would allow us to handle this experiment.

davidsmejia commented 4 years ago

I wrote a test to verify what would be matched. I'm not sure we should support this, because the headers are ambiguous, which makes them hard to handle automatically. We might want to consider adding logic to determine which columns should be selected.

Currently we would match the control p-value column, which would be incorrect.
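
One form the additional logic could take is a sanity check on the selected columns before they are handed to the R script. The sketch below is only an illustration: `validate_column_selection` is a hypothetical helper, and the rule that each expression column has exactly one detection column is an assumption based on the header excerpt in this issue.

```python
from typing import List


def validate_column_selection(expression: List[int], detection: List[int]) -> List[str]:
    """Return a list of problems found in the selected column indices."""
    problems: List[str] = []
    if len(set(expression)) != len(expression):
        problems.append("expression columns contain duplicates")
    if len(set(detection)) != len(detection):
        problems.append("detection columns contain duplicates")
    overlap = sorted(set(expression) & set(detection))
    if overlap:
        problems.append(f"columns selected as both expression and detection: {overlap}")
    # Assumption: every sample has one expression and one detection column.
    if len(set(expression)) != len(set(detection)):
        problems.append("number of expression and detection columns differ")
    return problems


# Expression indices from the error message; detection indices are illustrative.
print(validate_column_selection(
    [8, 6, 12, 10, 2, 4, 2, 4, 6, 8, 10, 12],
    [3, 5, 7, 9, 11, 13],
))
```

A check like this would not resolve the ambiguity in the headers, but it would let the job fail early with a message that names the problem instead of a non-zero exit status from Rscript.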

[Image: Screen Shot 2020-05-26 at 11 43 56 AM]