AlexsLemonade / refinebio

Refine.bio harmonizes petabytes of publicly available biological data into ready-to-use datasets for cancer researchers and AI/ML scientists.
https://www.refine.bio/

Humanize Pipeline Names #170

Closed Miserlou closed 6 years ago

Miserlou commented 6 years ago

Currently "pipelines" are represented internally as long UNIX commands or as R functions.

When we display this information, we'll want descriptive, intuitive, human-friendly ways of representing it.

I'm guessing we'll probably need a Pipeline table just to house name and description information for each of our pipelines.
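A minimal sketch of what such a table could look like, using Python's built-in sqlite3 module purely for illustration. The table name, column names, and helper functions here are all hypothetical, not the project's actual schema:

```python
import sqlite3


def create_pipeline_table(conn):
    """Create a hypothetical table holding display info for each pipeline."""
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS pipeline (
            id INTEGER PRIMARY KEY,
            internal_name TEXT NOT NULL UNIQUE,  -- e.g. the command or R function
            display_name TEXT NOT NULL,          -- short human-friendly name
            description TEXT NOT NULL DEFAULT '' -- longer "read more" copy
        )
        """
    )


def add_pipeline(conn, internal_name, display_name, description=""):
    """Register one pipeline with its human-friendly name and description."""
    conn.execute(
        "INSERT INTO pipeline (internal_name, display_name, description) "
        "VALUES (?, ?, ?)",
        (internal_name, display_name, description),
    )


def display_name_for(conn, internal_name):
    """Return the display name, falling back to the internal name if unknown."""
    row = conn.execute(
        "SELECT display_name FROM pipeline WHERE internal_name = ?",
        (internal_name,),
    ).fetchone()
    return row[0] if row else internal_name
```

Falling back to the internal name means an unregistered pipeline still shows something, just not something pretty.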

Miserlou commented 6 years ago

From @jaclyn-taroni:

Salmon is an alignment-free method for estimating transcript abundances from RNA-seq data. We use it in quasi-mapping mode, which is significantly faster than alignment-based approaches and requires us to build a Salmon transcriptome index. We build a custom reference transcriptome (using RSEM rsem-prepare-reference) by filtering the Ensembl genomic DNA assembly to remove pseudogenes, which we expect could negatively impact the quantification of protein-coding genes. This means we're obtaining abundance estimates for coding as well as non-coding transcripts. We include the flags --seqBias to correct for random hexamer priming and, if this is a paired-end experiment, --gcBias to correct for GC content when running salmon quant.
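To make the flag logic above concrete, here is a rough sketch of assembling the salmon quant arguments. The function name and the choice of `-l A` (automatic library type detection) are assumptions for illustration; only `--seqBias` and the paired-end-only `--gcBias` come from the description above:

```python
def build_salmon_quant_command(index_dir, output_dir, reads_1, reads_2=None):
    """Assemble a salmon quant argument list (hypothetical helper).

    --seqBias is always included; --gcBias is added only for paired-end
    experiments, as described in the copy above.
    """
    cmd = [
        "salmon", "quant",
        "-i", index_dir,   # Salmon transcriptome index
        "-l", "A",         # assumption: let Salmon infer the library type
        "-o", output_dir,
        "--seqBias",
    ]
    if reads_2 is not None:
        # Paired-end: pass both mates and correct for GC content.
        cmd += ["-1", reads_1, "-2", reads_2, "--gcBias"]
    else:
        # Single-end: unmated reads, no GC bias correction.
        cmd += ["-r", reads_1]
    return cmd
```

The resulting list could be handed to something like `subprocess.run(cmd, check=True)` on a machine with Salmon installed.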

SCAN (Single Channel Array Normalization) is a normalization method for single channel (Affymetrix) microarrays that allows us to process individual samples. SCAN models and corrects for the effect of technical bias, such as GC content, using a mixture-modeling approach. For more information about this approach, see the primary publication (Piccolo, et al. Genomics. 2012. DOI: 10.1016/j.ygeno.2012.08.003) and the SCAN.UPC bioconductor package documentation (DOI: 10.18129/B9.bioc.SCAN.UPC).

Miserlou commented 6 years ago

Rich [3:32 PM] Do “Abundance Estimation” and “Array Normalized” make sense?

jaclyn.taroni [3:38 PM] RNA-seq abundance estimation makes sense and you can probably just say Single Channel Array Normalized

Miserlou commented 6 years ago

We need one for "NO-OP".

jaclyn-taroni commented 6 years ago

"Submitter-processed" ?

Miserlou commented 6 years ago

That's what I was thinking, but it seems vague to the point of uselessness. Can we be more descriptive than "processed"?

jaclyn-taroni commented 6 years ago

Well, it will likely be processed in multiple (not-entirely-standardized) ways, and the field on GEO that tells us that is the "data processing" field. So unless we normalize that information (which may be possible), I don't know how we would get more specific.

kurtwheeler commented 6 years ago

@jaclyn-taroni can you provide a human readable name for each pipeline we have?

jaclyn-taroni commented 6 years ago

Chatted with @dvenprasad about this. We think the users that will be most interested in this information at a glance (e.g., in the sample table without additional documentation) will be interested in the specific tools used in the pipeline. That is to say, I don't think "Abundance estimation" is an appropriate level of detail.

With that in mind, here's what I propose for the humanized pipeline names:

Illumina BeadArray processor: Illumina SCAN
Affymetrix processor: Affymetrix SCAN
NOOP*: Submitter-processed
RNA-seq: Salmon and tximport

*As mentioned above, we're probably not gonna be able to get more specific than that because it will not be standardized
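For reference, the proposal above could be captured as a simple lookup. The dictionary keys and the function name below are made up for illustration; only the display names on the right-hand side come from the list above:

```python
# Keys are hypothetical internal identifiers; values are the proposed names.
HUMANIZED_PIPELINE_NAMES = {
    "illumina_beadarray": "Illumina SCAN",
    "affymetrix": "Affymetrix SCAN",
    "noop": "Submitter-processed",
    "rnaseq": "Salmon and tximport",
}


def humanize_pipeline_name(internal_name):
    """Return the human-friendly name, or the internal name if none is defined."""
    return HUMANIZED_PIPELINE_NAMES.get(internal_name, internal_name)
```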

dvenprasad commented 6 years ago

@jaclyn-taroni Can you provide copy for tximport, please?

jaclyn-taroni commented 6 years ago

Here are the "read more" links for Salmon:

We might not want to use every single one of these, so should discuss options. Also might want to include this: http://deweylab.biostat.wisc.edu/rsem/rsem-prepare-reference.html

jaclyn-taroni commented 6 years ago

Here's a draft of the tximport copy:

tximport imports transcript (tx)-level abundance estimates generated by salmon quant and summarizes them to the gene level. We use the tx-to-gene mapping generated as part of our reference transcriptome processing pipeline. Our tximport implementation generates "lengthScaledTPM" values: gene-level counts obtained by scaling TPM by the average transcript length across samples and then to the library size. Note that tximport is applied at the experiment level rather than to single samples. For additional information, see the tximport Bioconductor page, the tximport tutorial "Importing transcript abundance datasets with tximport", and Soneson, et al. F1000Research. 2015.
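To illustrate the "lengthScaledTPM" idea, here is a simplified pure-Python sketch of the scaling step as I understand it. This is not tximport's code (tximport is an R package); the function name and the gene-by-sample nested-list layout are assumptions made for the example:

```python
def length_scaled_tpm(tpm, lengths, counts):
    """Simplified sketch of lengthScaledTPM-style counts.

    All inputs are gene x sample nested lists:
      tpm     -- gene-level TPM
      lengths -- gene-level effective lengths
      counts  -- gene-level estimated counts (used only for library sizes)
    """
    n_genes = len(tpm)
    n_samples = len(tpm[0])

    # Scale each gene's TPM by its average length across samples.
    mean_len = [sum(row) / n_samples for row in lengths]
    scaled = [
        [tpm[g][s] * mean_len[g] for s in range(n_samples)]
        for g in range(n_genes)
    ]

    # Rescale each sample so its total matches the original library size.
    for s in range(n_samples):
        library_size = sum(counts[g][s] for g in range(n_genes))
        current_total = sum(scaled[g][s] for g in range(n_genes))
        factor = library_size / current_total if current_total else 0.0
        for g in range(n_genes):
            scaled[g][s] *= factor
    return scaled
```

Because the average length is taken across all samples, this step needs the whole experiment at once, which is consistent with the note above about applying tximport at the experiment level.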

jaclyn-taroni commented 6 years ago

Looks like @dvenprasad has all the copy she needs for Keytar Kurt, so I will close this.