Miserlou closed this issue 6 years ago.
From @jaclyn-taroni:
Salmon is an alignment-free method for estimating transcript abundances from RNA-seq data. We use it in quasi-mapping mode, which is significantly faster than alignment-based approaches and requires us to build a Salmon transcriptome index. We build a custom reference transcriptome (using RSEM rsem-prepare-reference) by filtering the Ensembl genomic DNA assembly to remove pseudogenes, which we expect could negatively impact the quantification of protein-coding genes. This means we're obtaining abundance estimates for coding as well as non-coding transcripts. When running salmon quant, we include the flag --seqBias to correct for random hexamer priming and, if this is a paired-end experiment, --gcBias to correct for GC content.
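Since these pipelines are invoked internally as UNIX commands, here is a minimal sketch of how a salmon quant invocation with those flags might be assembled in a worker script. The function name and paths are hypothetical; the flags (-i, -l, -1/-2, -r, -o, --seqBias, --gcBias) are real salmon quant options.

```python
# Hypothetical helper for assembling a salmon quant command line.
# Only the flags mentioned in the copy above are assumed; everything
# else (names, paths) is illustrative.

def build_salmon_quant_cmd(index_dir, reads, output_dir, paired_end):
    """Return a salmon quant argument list.

    reads: one FASTQ path (single-end) or an (R1, R2) tuple (paired-end).
    """
    # -l A lets salmon infer the library type automatically.
    cmd = ["salmon", "quant", "-i", index_dir, "-l", "A", "-o", output_dir]
    # --seqBias corrects for random hexamer priming bias (always on).
    cmd.append("--seqBias")
    if paired_end:
        r1, r2 = reads
        cmd += ["-1", r1, "-2", r2]
        # --gcBias (GC-content correction) only applies to paired-end runs.
        cmd.append("--gcBias")
    else:
        cmd += ["-r", reads]
    return cmd
```

The list form is ready to hand to subprocess.run without shell quoting concerns.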
SCAN (Single Channel Array Normalization) is a normalization method for single channel (Affymetrix) microarrays that allows us to process individual samples. SCAN models and corrects for the effect of technical bias, such as GC content, using a mixture-modeling approach. For more information about this approach, see the primary publication (Piccolo, et al. Genomics. 2012. DOI: 10.1016/j.ygeno.2012.08.003) and the SCAN.UPC bioconductor package documentation (DOI: 10.18129/B9.bioc.SCAN.UPC).
Rich [3:32 PM] Do “Abundance Estimation” and “Array Normalized” make sense?
jaclyn.taroni [3:38 PM] RNA-seq abundance estimation makes sense and you can probably just say Single Channel Array Normalized
We need one for "NO-OP".
"Submitter-processed" ?
That's what I was thinking, but it seems vague to the point of uselessness. Can we be more descriptive than "processed"?
Well, it will likely be processed in multiple (not-entirely-standardized) ways, and the field on GEO that tells us that is "data processing". So unless we normalize that information (which may be possible), I don't know how we would get more specific.
@jaclyn-taroni can you provide a human readable name for each pipeline we have?
Chatted with @dvenprasad about this. We think the users who will be most interested in this information at a glance (e.g., in the sample table without additional documentation) will be interested in the specific tools used in the pipeline. That is to say, I don't think "Abundance estimation" is an appropriate level of detail.
With that in mind, here's what I propose for the humanized pipeline names:
Illumina BeadArray processor: Illumina SCAN
Affymetrix processor: Affymetrix SCAN
NOOP*: Submitter-processed
RNA-seq: Salmon and tximport
*As mentioned above, we're probably not gonna be able to get more specific than that because it will not be standardized
@jaclyn-taroni Can you provide copy for tximport, please?
Here are the "read more" links for Salmon:
We might not want to use every single one of these, so should discuss options. Also might want to include this: http://deweylab.biostat.wisc.edu/rsem/rsem-prepare-reference.html
Here's a draft of the tximport copy:
tximport imports transcript (tx)-level abundance estimates generated by salmon quant and summarizes them to the gene level. We use the tx-to-gene mapping generated as part of our reference transcriptome processing pipeline. Our tximport implementation generates "lengthScaledTPM", which are gene-level counts produced by scaling TPM using the average transcript length across samples and then the library size. Note that tximport is applied at the experiment level rather than to single samples. For additional information, see the tximport Bioconductor page, the tximport tutorial Importing transcript abundance datasets with tximport, and Soneson, et al. F1000Research. 2015.
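To make the "lengthScaledTPM" scaling concrete, here is an illustrative sketch of the arithmetic: TPM is multiplied by the per-gene average length across samples, then each sample's column is rescaled to its library size. This is not the R package's actual code; the function name and input layout are hypothetical, and it assumes gene-level TPM and lengths are already summarized.

```python
# Illustrative sketch of the lengthScaledTPM idea from tximport
# (the real implementation lives in the tximport R package).

def length_scaled_tpm(tpm, lengths, lib_sizes):
    """Compute length-scaled, library-size-scaled counts.

    tpm, lengths: dict mapping gene -> list of per-sample values.
    lib_sizes: list of per-sample library sizes (total read counts).
    """
    genes = list(tpm)
    n = len(lib_sizes)
    # Average (effective) gene length across samples, per gene.
    mean_len = {g: sum(lengths[g]) / n for g in genes}
    # Scale TPM by the average length.
    scaled = {g: [tpm[g][j] * mean_len[g] for j in range(n)] for g in genes}
    # Rescale each sample so its total matches that sample's library size.
    col_sums = [sum(scaled[g][j] for g in genes) for j in range(n)]
    return {g: [scaled[g][j] * lib_sizes[j] / col_sums[j] for j in range(n)]
            for g in genes}
```

Because the per-gene length factor is averaged across samples, this step is inherently experiment-level, which matches the note above that tximport is not applied to single samples.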
Looks like @dvenprasad has all the copy she needs for Keytar Kurt, so I will close this.
Currently "pipelines" are represented internally as long UNIX commands or as R functions.
When we display this information, we'll want descriptive, intuitive, human-friendly ways of representing it.
I'm guessing we'll probably need a Pipeline table just to house name and description information for each of our pipelines.