AlexsLemonade / refinebio

Refine.bio harmonizes petabytes of publicly available biological data into ready-to-use datasets for cancer researchers and AI/ML scientists.
https://www.refine.bio/

Implement tximport for RNA-seq pipeline #82

Closed kurtwheeler closed 6 years ago

kurtwheeler commented 6 years ago

Use tximport along with the gene-to-transcript mapping already contained within the processor to implement this part of the salmon pipeline:

gene_level_counts

This code does essentially what we need; we just need to incorporate it into the Data Refinery's salmon pipeline: https://github.com/jaclyn-taroni/ref-txome/blob/79e2f64ffe6a71c5103a150bd3159efb784cddeb/4-athaliana_tximport.R (Note that this script contains a link to a tutorial.)

Note that this should be done on a per-experiment basis, rather than a per-sample basis.
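The per-experiment step described above might look something like this in R (a hedged sketch, not the refinery's actual code; the file paths, sample names, and mapping-file name are placeholders):

```r
library(tximport)

# quant.sf files for every sample in the experiment
# (per-experiment, not per-sample) -- paths are hypothetical
samples <- c("SRR000001", "SRR000002")
salmon_files <- file.path("experiment_dir", samples, "quant.sf")
names(salmon_files) <- samples

# two-column mapping: transcript ID first, then gene ID
tx2gene <- read.delim("genes_to_transcripts.tsv", header = FALSE,
                      col.names = c("tx", "gene"))

# summarize transcript-level estimates to gene-level counts
txi <- tximport(salmon_files, type = "salmon", tx2gene = tx2gene)
head(txi$counts)  # gene-by-sample matrix
```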

jaclyn-taroni commented 6 years ago

If, in the future, we would like to enable users to perform differential gene expression analysis for an experiment processed through the refinery, we should consider generating files (e.g., .RDS or .Rda files) suitable for use with DESeq2::DESeqDataSetFromTximport.

If so, we should probably run tximport::summarizeToGene twice:

  1. with countsFromAbundance = "no" -- the resulting object is to be saved for use with DESeq2 (also suitable for use with edgeR, will contain offset info)
  2. with countsFromAbundance = "lengthScaledTPM", as we're thinking that this will likely be most useful for downstream ML/data integration applications with some additional processing (also can be used with limma-voom)
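The two runs above can be sketched via the top-level `tximport()` call (which performs the gene-level summarization internally); `files` and `tx2gene` are assumed to be the quant.sf paths and transcript-to-gene mapping discussed elsewhere in this thread:

```r
library(tximport)

# 1. raw counts + length offsets, to be saved for DESeq2/edgeR
txi_counts <- tximport(files, type = "salmon", tx2gene = tx2gene,
                       countsFromAbundance = "no")

# 2. length-scaled TPM, for downstream ML/data integration (and limma-voom)
txi_scaled <- tximport(files, type = "salmon", tx2gene = tx2gene,
                       countsFromAbundance = "lengthScaledTPM")
```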

See the differential gene expression section of the tximport vignette for more information.

Running it both ways will give us the most flexibility. It's also entirely possible that the lengthScaledTPM values (option 2) could be derived from the output of option 1; that would require more thought on my part, and going that route may not be desirable. I wanted to get these initial thoughts down for the moment.

jaclyn-taroni commented 6 years ago

@cgreene and @dvenprasad - Do you have any thoughts about saving an R object for use with downstream differential expression analyses? What additional info (e.g., file size for some experiments) would be helpful for you to weigh in?

cgreene commented 6 years ago

My understanding is that the R object would primarily be useful for differential expression analysis of the single experiment in question. Is that correct?

And knowing filesize (which would let us estimate cost to store) would be great. If it is negligible, then I don't see a downside other than needing to design the UI to account for/explain many types of downloads.

jaclyn-taroni commented 6 years ago

My understanding is that the R object would primarily be useful for differential expression analysis of the single experiment in question. Is that correct?

Yes, that is correct.

dvenprasad commented 6 years ago

I have a few questions about the R object:

jaclyn-taroni commented 6 years ago

What is the skill level required for users to incorporate the R object easily into their workflow? (On a scale of 1-5, 1 being novice and 5 being expert)

I would say 3+; you'll need familiarity with R.

Is it plug and play or are there additional steps/inputs needed to use the R object?

One would need to read the .RDS into R and then pass it to DESeq2::DESeqDataSetFromTximport, but the user would also have to supply the colData and design arguments for that function (in the DESeq2 case). Then it would be ready for use with DESeq2::DESeq.

dvenprasad commented 6 years ago

So the files for download would be the R object + the RDS file? And this would be targeted mainly towards more advanced users?

I'm not opposed to including it, but I'm not sure how useful it would be. From what I can tell from speaking to users, they are more interested in having resources that would help them do differential analysis across datasets.

jaclyn-taroni commented 6 years ago

So the files for download would be the R object + the RDS file?

The "R object" is saved as an RDS file, I apologize that was unclear.

And this would be targeted mainly towards more advanced users? I'm not opposed to including it but I'm not sure how useful it would be.

So I'm imagining that it could make it a little quicker to do differential expression analysis for a single dataset at the moment. In the future, it could be useful to have if we have a graphical user interface where a less advanced user could group samples and that grouping info would be supplied as those other arguments I mentioned.

jaclyn-taroni commented 6 years ago

I'm gonna snag a human experiment with ~500 samples and subsample it (10, 25, 50, 100, 250, 500 samples) and compare the RDS size to the (uncompressed) tsv -- would that be sufficient information about the file size @cgreene?
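A minimal sketch of that size comparison (assuming a gene-by-sample counts matrix `txi_counts` is already in scope; everything else here is illustrative):

```r
# compare on-disk size of RDS vs. uncompressed TSV for random
# subsamples of increasing size
subsample_sizes <- c(10, 25, 50, 100, 250, 500)

sizes <- sapply(subsample_sizes, function(n) {
  sub <- txi_counts[, sample(ncol(txi_counts), n)]
  rds <- tempfile(fileext = ".RDS")
  tsv <- tempfile(fileext = ".tsv")
  saveRDS(sub, rds)                                  # compressed by default
  write.table(sub, tsv, sep = "\t", quote = FALSE)   # uncompressed
  c(rds_bytes = file.size(rds), tsv_bytes = file.size(tsv))
})
colnames(sizes) <- subsample_sizes
```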

cgreene commented 6 years ago

That sounds reasonable to estimate sizes.

-- Casey S. Greene, Ph.D.

Assistant Professor Dept. of Systems Pharmacology and Translational Therapeutics Perelman School of Medicine University of Pennsylvania web: http://www.greenelab.com phone: 215-573-2991

Director Childhood Cancer Data Lab Alex's Lemonade Stand Foundation web: http://ccdatalab.org

Miserlou commented 6 years ago

Related, I think: https://github.com/AlexsLemonade/refinebio/issues/113

dongbohu commented 6 years ago

Can any of you explain to me what "blessed ref" means in the picture? (The handwriting in the yellow box reads: tximport + "blessed" ref.)

dongbohu commented 6 years ago

By the way, @jaclyn-taroni, are we going to save the data frame file in the processor (as you did in your sample R script)?

jaclyn-taroni commented 6 years ago

Can any of you explain to me what "blessed ref" means in the picture? (The handwriting in the yellow box reads: tximport + "blessed" ref.)

"Blessed ref" refers to the reference genome used to generate the transcriptome index (using salmon index). Instead of using just the coding sequences (cDNA) of a genome, we're going to use all of the genome and specifically filter out pseudogenes, since we expect including pseudogenes will negatively impact our ability to quantify coding sequences.

During the process of "refining" indices for all organisms, we will generate a two-column text file (TSV?) that maps between transcripts and genes; this is given to rsem-prepare-reference as {genes_to_transcripts} here: https://github.com/AlexsLemonade/refinebio/blob/261d057ac69e0e731e747918406c9f395a91cc12/workers/data_refinery_workers/processors/transcriptome_index.py#L162

We'll need to take that same file and use it here with tximport, but I believe the order of the columns will need to be changed. (I think rsem-prepare-reference wants columns genes transcripts and tximport requires transcripts genes, but check the documentation to be sure.)
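If the mapping file really is genes-then-transcripts, swapping the columns before handing it to tximport is a one-liner (a sketch; the file name is a placeholder, and the actual column order should be checked against the documentation first):

```r
# genes_to_transcripts.tsv assumed to be: gene ID, transcript ID (verify!)
g2t <- read.delim("genes_to_transcripts.tsv", header = FALSE,
                  col.names = c("gene", "tx"))

# tximport wants transcript ID first, gene ID second
tx2gene <- g2t[, c("tx", "gene")]
```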

jaclyn-taroni commented 6 years ago

By the way, @jaclyn-taroni, are we going to save the data frame file in the processor (as you did in your sample R script)?

The original plan was to save the lengthScaledTPM (last bit in my script). The question posed above (here specifically https://github.com/AlexsLemonade/refinebio/issues/82#issuecomment-366818673) was whether or not we should also save an RDS where we have not used countsFromAbundance. We have not answered that question.

In order to address this

I'm gonna snag a human experiment with ~500 samples and subsample it (10, 25, 50, 100, 250, 500 samples) and compare the RDS size to the (uncompressed) tsv

It would be helpful to run ERP001942 through the refine.bio Salmon pipeline (since we're set up with Aspera).

jaclyn-taroni commented 6 years ago

@dongbohu asked me what the distribution of sample sizes (per experiment) we expect to encounter would look like. To do a quick check, I downloaded a list of experiments from the recount2 web app and summarized the number of samples:

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   1.00    4.00    8.00   24.57   18.00 1720.00 

A couple notes: there are some single-cell experiments included, which will skew the results (as they generally have higher sample sizes) and this is human only. I expect human studies may skew towards higher sample sizes because there should be plenty of cases where non-genetically identical samples (e.g., solid tissue biopsies or primary cells rather than cell lines) are being studied.

jaclyn-taroni commented 6 years ago

Note that what we need for the aforementioned subsampling experiment are the quant.sf files output by salmon quant.

dongbohu commented 6 years ago

I decided not to run ERP001942 until the related fixes by @Miserlou are merged to dev. I spent a lot of time patching the code from various branches but it still didn't work. I will focus on the tximport implementation part for now.

dongbohu commented 6 years ago

Latest RSEM package: https://github.com/deweylab/RSEM/archive/v1.3.0.tar.gz

dongbohu commented 6 years ago

The tximport analysis is done at the experiment level, but the preceding salmon quant step is done at the sample level, so we need to implement a mechanism to collect all of the per-sample results together before starting tximport.

jaclyn-taroni commented 6 years ago

Since the rest of the processors produce output that is on a single-sample level, it probably makes sense to split up the tximport results into "single-sample PCL files" where the first column is the ENSG gene identifiers and the second column contains the expression values for that sample.
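One way to write those per-sample files from the tximport output (a sketch; "PCL" here is treated as a plain two-column TSV, the column headers and `.pcl` file naming are illustrative, and `txi` is assumed to be the tximport result):

```r
# write one two-column file per sample: gene ID, expression value
for (sample_id in colnames(txi$counts)) {
  per_sample <- data.frame(Gene  = rownames(txi$counts),
                           Value = txi$counts[, sample_id])
  write.table(per_sample, paste0(sample_id, ".pcl"),
              sep = "\t", quote = FALSE, row.names = FALSE)
}
```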

dongbohu commented 6 years ago

@kurtwheeler Is it safe to assume that all samples of a certain experiment will be located in the same data directory? We need this information before doing tximport analysis.

dongbohu commented 6 years ago

@jaclyn-taroni and I tested the size of the RDS file based on an experiment that includes 6 samples. The size is 2.8 MB. If the size scales linearly with the number of samples, it is probably acceptable to save the whole txi.out as RDS.

kurtwheeler commented 6 years ago

Sorry I just saw this:

@kurtwheeler Is it safe to assume that all samples of a certain experiment will be located in the same data directory? We need this information before doing tximport analysis.

Yes. That is correct. I determined this by looking at what the SRA downloader is doing here, which is to put the downloaded files in a directory named after the job's accession_code, which you can see here is set to be the experiment's accession code.

dongbohu commented 6 years ago

Done in PR #278.