If, in the future, we would like to enable users to perform differential gene expression analysis for an experiment processed through the refinery, we should consider generating files (e.g., .RDS or .Rda files) suitable for use with `DESeq2::DESeqDataSetFromTximport`.
If so, we should probably run `tximport::summarizeToGene` twice:
1. `countsFromAbundance = "no"` -- the resulting object is to be saved for use with DESeq2 (also suitable for use with edgeR; it will contain offset info).
2. `countsFromAbundance = "lengthScaledTPM"` -- we're thinking this will likely be most useful for downstream ML/data integration applications with some additional processing (it can also be used with limma-voom).
See the differential gene expression section of the tximport vignette for more information.
Running both ways will allow us to be the most flexible. It's also entirely possible that lengthScaledTPM (option 2) can be obtained from option 1; that would require more thought on my part, and going that route may not be desirable. I wanted to get these initial thoughts down for the moment; a rough sketch of the two calls is below.
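A minimal sketch of what those two calls might look like, assuming Salmon output (the file paths, sample names, and tx2gene mapping below are placeholders, not what refine.bio actually uses):

```r
library(tximport)

# Hypothetical inputs: a named vector of per-sample salmon quant.sf paths and a
# transcript-to-gene data frame (transcript IDs first, then gene IDs)
salmon_files <- file.path("salmon_output", c("sample1", "sample2"), "quant.sf")
names(salmon_files) <- c("sample1", "sample2")
tx2gene <- read.delim("tx2gene.tsv", header = FALSE,
                      col.names = c("tx_id", "gene_id"))

# Import once at the transcript level, then summarize to genes two ways
txi_tx <- tximport(salmon_files, type = "salmon", txOut = TRUE)

# Option 1: counts as estimated -- save for DESeq2/edgeR (keeps length offset info)
gene_counts <- summarizeToGene(txi_tx, tx2gene, countsFromAbundance = "no")
saveRDS(gene_counts, "experiment_tximport_counts.RDS")

# Option 2: length-scaled TPM -- for downstream ML/data integration, limma-voom
gene_lstpm <- summarizeToGene(txi_tx, tx2gene, countsFromAbundance = "lengthScaledTPM")
```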
@cgreene and @dvenprasad - Do you have any thoughts about saving an R object for use with downstream differential expression analyses? What additional info (e.g., file size for some experiments) would be helpful for you to weigh in?
My understanding is that the R object would primarily be useful for differential expression analysis of the single experiment in question. Is that correct?
And knowing file size (which would let us estimate cost to store) would be great. If it is negligible, then I don't see a downside other than needing to design the UI to account for/explain many types of downloads.
> My understanding is that the R object would primarily be useful for differential expression analysis of the single experiment in question. Is that correct?
Yes, that is correct.
I have a few questions about the R object:
What is the skill level required for users to incorporate the R object easily into their workflow? (On a scale of 1-5, 1 being novice and 5 being expert)
I would say 3+, you'll need familiarity with R.
Is it plug and play or are there additional steps/inputs needed to use the R object?
One would need to read the .RDS into R and then pass it to `DESeq2::DESeqDataSetFromTximport`, but the user would also have to supply the `colData` and `design` arguments for that function (in the DESeq2 case). Then it would be ready for use with `DESeq2::DESeq`.
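As a rough sketch of that workflow (the RDS filename, the colData columns, and the design formula below are made-up placeholders the user would replace):

```r
library(DESeq2)

# Read the gene-level tximport object saved as an RDS (hypothetical filename)
txi <- readRDS("SRP012345_tximport.RDS")

# The user supplies colData (sample metadata) and the design formula;
# rows of colData must line up with the columns of txi$counts
col_data <- data.frame(
  row.names = colnames(txi$counts),
  condition = factor(c("control", "control", "treated", "treated"))
)

dds <- DESeqDataSetFromTximport(txi, colData = col_data, design = ~ condition)
dds <- DESeq(dds)
res <- results(dds)
```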
So the files for download would be the R object + the RDS file? And this would be targeted mainly towards more advanced users?
I'm not opposed to including it but I'm not sure how useful it would be. From what I can tell from speaking to users, they are more interested in having resources which would help them do differential analysis across datasets.
> So the files for download would be the R object + the RDS file?

The "R object" is saved as an RDS file; I apologize that was unclear.
> And this would be targeted mainly towards more advanced users? I'm not opposed to including it but I'm not sure how useful it would be.
So I'm imagining that it could make it a little quicker to do differential expression analysis for a single dataset at the moment. In the future, it could be useful to have if we have a graphical user interface where a less advanced user could group samples and that grouping info would be supplied as those other arguments I mentioned.
I'm gonna snag a human experiment with ~500 samples and subsample it (10, 25, 50, 100, 250, 500 samples) and compare the RDS size to the (uncompressed) tsv -- would that be sufficient information about the file size @cgreene?
That sounds reasonable to estimate sizes.
Related, I think: https://github.com/AlexsLemonade/refinebio/issues/113
Can any of you explain to me what "blessed ref" means in the picture? (The handwriting in the yellow box reads: `tximport` + "blessed" ref.)
By the way, @jaclyn-taroni, are we going to save the data frame file in the processor (as you did in your sample R script)?
> Can any of you explain to me what "blessed ref" means in the picture? (The handwriting in the yellow box reads: `tximport` + "blessed" ref.)
"Blessed ref" refers to the reference genome used to generate the transcriptome index (using salmon index
). Instead of using just the coding sequences (cDNA) of a genome, we're going to use all of the genome and specifically filter out pseudogenes, since we expect including pseudogenes will negatively impact our ability to quantify coding sequences.
During the process of "refining" indices for all organisms, we will generate a two-column text file (TSV?) that maps between transcripts and genes -- this is given to `rsem-prepare-reference` as `{genes_to_transcripts}` here: https://github.com/AlexsLemonade/refinebio/blob/261d057ac69e0e731e747918406c9f395a91cc12/workers/data_refinery_workers/processors/transcriptome_index.py#L162
We'll need to take that same file and use it here with `tximport`, but I believe the order of the columns will need to be changed. (I think `rsem-prepare-reference` wants columns `genes` `transcripts` and `tximport` requires `transcripts` `genes`, but check the documentation to be sure.)
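A small sketch of that column swap, assuming the mapping file is a headerless two-column TSV with gene IDs first (the filenames and the `salmon_files` vector are placeholders):

```r
# genes_to_transcripts.tsv is assumed to be gene_id<TAB>transcript_id,
# the orientation rsem-prepare-reference takes
genes_to_tx <- read.delim("genes_to_transcripts.tsv", header = FALSE,
                          col.names = c("gene_id", "tx_id"),
                          stringsAsFactors = FALSE)

# tximport wants the transcript column first, then the gene column
tx2gene <- genes_to_tx[, c("tx_id", "gene_id")]

# salmon_files: named vector of paths to per-sample quant.sf files (defined elsewhere)
txi <- tximport::tximport(salmon_files, type = "salmon", tx2gene = tx2gene)
```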
> By the way, @jaclyn-taroni, are we going to save the data frame file in the processor (as you did in your sample R script)?
The original plan was to save the `lengthScaledTPM` output (last bit in my script). The question posed above (here specifically: https://github.com/AlexsLemonade/refinebio/issues/82#issuecomment-366818673) was whether or not we should also save an RDS where we have not used `countsFromAbundance`. We have not answered that question.
In order to address this:

> I'm gonna snag a human experiment with ~500 samples and subsample it (10, 25, 50, 100, 250, 500 samples) and compare the RDS size to the (uncompressed) tsv

it would be helpful to run `ERP001942` through the refine.bio Salmon pipeline (since we're set up with Aspera).
@dongbohu asked me what the distribution of sample sizes (per experiment) we expect to encounter would look like. To do a quick check, I downloaded a list of experiments from the recount2 web app and summarized the number of samples:
```
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   1.00    4.00    8.00   24.57   18.00 1720.00
```
A couple notes: there are some single-cell experiments included, which will skew the results (as they generally have higher sample sizes) and this is human only. I expect human studies may skew towards higher sample sizes because there should be plenty of cases where non-genetically identical samples (e.g., solid tissue biopsies or primary cells rather than cell lines) are being studied.
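For context, the summary above comes from something along these lines (the filename and the sample-count column name are guesses at what the recount2 download provides):

```r
# Assumes a metadata table downloaded from the recount2 web app; the column
# holding the per-project sample count may be named differently in practice
recount_meta <- read.delim("recount_selection.tsv", stringsAsFactors = FALSE)
summary(recount_meta$number_of_samples)
```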
Note that what we need for the aforementioned subsampling experiment is the `quant.sf` files output by `salmon quant`.
I decided not to run `ERP001942` until the related fixes by @Miserlou are merged to `dev`. I spent a lot of time patching the code from various branches but it still didn't work. I will focus on the tximport implementation part for now.
Latest RSEM package: https://github.com/deweylab/RSEM/archive/v1.3.0.tar.gz
The `tximport` analysis is done at the experiment level, but the `salmon quant` step before `tximport` is done at the sample level. So we need to implement a mechanism to collect all of the per-sample results together before starting `tximport`.
Since the rest of the processors produce output that is on a single-sample level, it probably makes sense to split up the `tximport` results into "single-sample PCL files" where the first column contains the ENSG gene identifiers and the second column contains the expression values for that sample.
@kurtwheeler Is it safe to assume that all samples of a certain experiment will be located in the same data directory? We need this information before doing `tximport` analysis.
@jaclyn-taroni and I tested the size of the RDS file based on an experiment that includes 6 samples. The size is 2.8 MB. If its size scales linearly with the number of samples, it is probably acceptable to save the whole `txi.out` as an RDS.
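A minimal sketch of that size check, assuming `txi.out` is the full tximport output (the accession in the filename is a placeholder):

```r
# Save the full tximport output and check the on-disk size of the RDS
saveRDS(txi.out, file = "SRP012345_txi_out.RDS")
file.size("SRP012345_txi_out.RDS") / 1024^2   # size in MB
```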
Sorry I just saw this:
> @kurtwheeler Is it safe to assume that all samples of a certain experiment will be located in the same data directory? We need this information before doing tximport analysis.
Yes. That is correct. I determined this by looking at what the SRA downloader is doing here, which is to put the downloaded files in a directory named after the job's accession_code, which you can see here is set to be the experiment's accession code.
Done in PR #278.
Use tximport along with the gene-to-transcript mapping already contained within the processor to implement this part of the salmon pipeline:
This code does essentially what we need, we just need to include this into the Data Refinery's salmon pipeline: https://github.com/jaclyn-taroni/ref-txome/blob/79e2f64ffe6a71c5103a150bd3159efb784cddeb/4-athaliana_tximport.R (Note that this script contains a link to a tutorial.)
Note that this should be done on a per-experiment basis, rather than a per-sample basis.