AlexsLemonade / refinebio

Refine.bio harmonizes petabytes of publicly available biological data into ready-to-use datasets for cancer researchers and AI/ML scientists.
https://www.refine.bio/

Questions about handling of submitter-processed Illumina BeadArrays #3405

Closed jaclyn-taroni closed 7 months ago

jaclyn-taroni commented 9 months ago

Context

We had a user question come in about a submitter-processed Illumina BeadArray experiment that was partly about why there were fewer features when downloading from refine.bio vs. GEO.

I thought we were using the Illumina refinery TSVs for both refine.bio-processed and submitter-processed data and said as much, but in reviewing the code, I think the dropoff could exclusively be because of the mapping from probes to Ensembl gene IDs using a given annotation package on Bioconductor.
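To illustrate how probe-to-gene mapping alone could shrink the feature count, here is a minimal Python sketch. The probe IDs, gene IDs, and the `map_probes_to_genes` helper are all hypothetical; the real mapping comes from a Bioconductor annotation package, and averaging probes that share a gene is only an illustrative choice, not necessarily what refine.bio does.

```python
from collections import defaultdict
from statistics import mean


def map_probes_to_genes(expression, probe_to_gene):
    """Collapse probe-level values to gene-level values (hypothetical helper).

    Probes with no entry in probe_to_gene are dropped, and probes that
    share a gene are averaged (an illustrative choice only).
    """
    by_gene = defaultdict(list)
    for probe_id, value in expression.items():
        gene_id = probe_to_gene.get(probe_id)
        if gene_id is None:
            continue  # unmapped probe: this feature is lost
        by_gene[gene_id].append(value)
    return {gene_id: mean(values) for gene_id, values in by_gene.items()}


# Made-up probe-level data, as a submitter might deposit it in GEO.
expression = {
    "ILMN_0000001": 8.0,
    "ILMN_0000002": 9.0,
    "ILMN_0000003": 7.5,
    "ILMN_0000004": 6.0,
}
# Made-up annotation: two probes share a gene, one probe is unmapped.
probe_to_gene = {
    "ILMN_0000001": "ENSG00000000001",
    "ILMN_0000002": "ENSG00000000001",
    "ILMN_0000003": "ENSG00000000002",
}

gene_level = map_probes_to_genes(expression, probe_to_gene)
print(len(expression), "probes ->", len(gene_level), "genes")
```

In this toy case four probes become two genes: one feature is lost because it has no mapping, and another because two probes collapse onto the same gene.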

The user also asked about log2-transformation. Our docs say we perform some modifications to submitter-processed data and use log2 transformation as an example.

Problem or idea

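I'm not the most acquainted with our codebase at this point, so I'll need some Engineering support to help answer the following questions:

  • Where, if anywhere, do we perform log2-transformation for submitter-processed microarray data (both Affymetrix and Illumina)? (I expect most submitter-processed microarray datasets to be downloaded from the web interface and quantile normalized, so this step might not matter much in practice and requires a documentation change only.)
  • Do we use the Illumina refinery TSVs for submitter-processed data (i.e., the "NOOP" pipeline)?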

Solution or next step

@davidsmejia, can we please slot this for the next sprint beginning October 9th?

davidsmejia commented 8 months ago

I think the best outline of our intent for how we handle samples can be found here: https://github.com/AlexsLemonade/refinebio/blob/dev/foreman/data_refinery_foreman/surveyor/array_express.py#L413

There are many possible data situations for a sample:

  • If the sample only has raw data available:
    • If it is on a platform that we support:
      • Download this raw data and process it
    • If it is not on a platform we support:
      • Don't download anything, don't process anything
  • If the sample has both raw and derived data:
    • If the raw data is on a platform we support:
      • Download the raw data and process it, abandon the derived data
    • If the raw data is not on a platform we support:
      • Download the derived data and no-op it, abandon the raw data
  • If the sample only has derived data:
    • Download the derived data and no-op it.
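
The cases above can be condensed into a small decision function. This is only a sketch of the stated intent, not the surveyor's actual code; the `Action` enum, `choose_action`, and its three boolean parameters are hypothetical names.

```python
from enum import Enum


class Action(Enum):
    """Hypothetical labels for the three outcomes in the list above."""
    PROCESS_RAW = "download raw data and process it"
    NO_OP_DERIVED = "download derived data and no-op it"
    SKIP = "download nothing, process nothing"


def choose_action(has_raw, has_derived, platform_supported):
    """Pick what to do with a sample, following the outline above."""
    if has_raw and platform_supported:
        # Raw data on a supported platform always wins; any derived
        # data is abandoned.
        return Action.PROCESS_RAW
    if has_derived:
        # Either there is no raw data, or its platform is unsupported.
        return Action.NO_OP_DERIVED
    # Raw-only data on an unsupported platform (or no data at all).
    return Action.SKIP


print(choose_action(has_raw=True, has_derived=True, platform_supported=False))
```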

So an array express surveyor creates an array_express downloader, which then creates a processor job based on this method here: https://github.com/AlexsLemonade/refinebio/blob/dev/common/data_refinery_common/job_lookup.py#L96

So submitter-processed data goes only to the NO_OP pipeline; for raw data on a supported platform, we run the appropriate _TO_PCL pipeline instead.
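
As a rough illustration of that routing (not the code in job_lookup.py), a lookup might look like the sketch below. The `determine_pipeline` function, the platform keys, and the `AFFY_TO_PCL` name are assumptions made for the example; only `NO_OP` and `ILLUMINA_TO_PCL` are named in this thread.

```python
# Hypothetical mapping from platform to raw-data pipeline.
# "AFFY_TO_PCL" is an assumed name following the _TO_PCL pattern.
RAW_PIPELINES = {
    "illumina": "ILLUMINA_TO_PCL",
    "affymetrix": "AFFY_TO_PCL",
}


def determine_pipeline(is_submitter_processed, platform):
    """Return the pipeline name a sample's file would be sent to."""
    if is_submitter_processed:
        # Derived data is never reprocessed, whatever the platform.
        return "NO_OP"
    try:
        return RAW_PIPELINES[platform]
    except KeyError:
        raise ValueError(f"no raw-data pipeline for platform {platform!r}")


print(determine_pipeline(True, "illumina"))   # submitter-processed
print(determine_pipeline(False, "illumina"))  # raw data
```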

It looks like log2-transformation is applied to submitter-processed data only when a downloadable dataset is created.

Log2-transformation

Where, if anywhere, do we perform log2-transformation for submitter-processed microarray data (both Affymetrix and Illumina)? (I expect most submitter-processed microarray datasets to be downloaded from the web interface and quantile normalized, so this step might not matter much in practice and requires a documentation change only.)

I do not see any evidence that log2 scaling occurs in the NO_OP pipeline, but the comment in smashing_utils suggests that we are supposed to do it.
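
For reference, a conditional log2 step could be sketched as follows. This is not what smashing_utils does; the function names and the `max_threshold` heuristic (treating data whose maximum is at most 100 as already log-scaled) are assumptions chosen only to show the idea.

```python
import math


def looks_log2_transformed(values, max_threshold=100.0):
    """Heuristic guess: raw intensities usually run into the thousands,
    while log2-scaled values rarely exceed ~20. The threshold is an
    assumption for this sketch."""
    return max(values) <= max_threshold


def log2_if_needed(values):
    """Return log2-scaled values, leaving already-scaled data untouched."""
    if looks_log2_transformed(values):
        return list(values)
    # Non-positive intensities have no log2; mark them as missing.
    return [math.log2(v) if v > 0 else float("nan") for v in values]


raw_intensities = [1024.0, 256.0, 16.0]
print(log2_if_needed(raw_intensities))
```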

Illumina Probe Maps

Do we use the Illumina refinery TSVs for submitter-processed data (i.e., the "NOOP" pipeline)?

It does not look like we do. We only use them in the ILLUMINA_TO_PCL pipeline in illumina.py, which calls illumina.R.

This is where we use the TSVs in illumina.R: https://github.com/AlexsLemonade/refinebio/blob/dev/workers/data_refinery_workers/processors/illumina.R#L428

Let me know if I can clarify anything.