COMBINE-lab / salmon

šŸŸ šŸ£ šŸ± Highly-accurate & wicked fast transcript-level quantification from RNA-seq reads using selective alignment
https://combine-lab.github.io/salmon
GNU General Public License v3.0
774 stars 164 forks source link

Using salmon quantification with DeSeq2 #437

Open thkitapci opened 4 years ago

thkitapci commented 4 years ago

Hi, I run salmon then used quantmerge to combine the results as

salmon quantmerge --quantscat list_of_quant_folders--column numreads -o Merged_quants.txt

Can I use this as input for the DeSeq2 analysis ? One problem is "numreads" column is not integer and DeSeq2 requires integer input (read counts) can I convert the numbers in this column to integer then use as DeSeq2 input?

Thanks Hamdi

roryk commented 4 years ago

Hi Hamdi,

http://bioconductor.org/packages/release/bioc/vignettes/tximport/inst/doc/tximport.html has a really great description of how to use the output from Salmon with DESeq2.

thkitapci commented 4 years ago

Hi @roryk I know about tximport but is there any way to generate the input data for DESeq2 without using R? I am processing the data on one platform and then transfer to another platform for R/DESeq2 analysis. I would like to be able to generate the output of the first part (salmon) without using an R library.

If it is not possible and I have run R to get the count matrix for DEseq2, I can figure out a way to do it.

DESeq2 input file is a simple matrix of counts and "salmon quantmerge" already generates this, can you please explain to me why an external library is required ? Is there something I am missing that tximport package is doing to the data? Does tximport takes into account gene lengths or library size to generate the output?

Thanks Best Regards Hamdi

tamuanand commented 4 years ago

Hi Hamdi

Bit confused about your logic here - why would you not want to use tximport in R when your next step (DESeq2) is still going to be R?

I am curious to know your reasoning

I am processing the data on one platform and then transfer to another platform for R/DESeq2 analysis. I would like to be able to generate the output of the first part (salmon) without using an R library.

DustinChen1986 commented 4 years ago

This is a really torturous step for none R knowledge to use tximport for data transmission to DESeq2. Main problem is that the count matrix for DESeq2, which can be easily prepared by Python, must be integer, but tximport did not explain how to deal with the decimals.

qins commented 3 years ago

I also wonder why salmon not output original reads counts

update:

Because it is more accurate.

DeSeq2 should accept it.

As for now, maybe we could simply round to the nearest integer

NumReads ā€” This is salmonā€™s estimate of the number of reads mapping to each transcript that was quantified. It is an ā€œestimateā€ insofar as it is the expected number of reads that have originated from each transcript given the structure of the uniquely mapping and multi-mapping reads and the relative abundance estimates for each transcript.

rob-p commented 3 years ago

If you can already use DESeq2, then using tximport should not make it any harder at all. Given the tximport data, getting it into DESeq2 is as easy as

dds <- DESeqDataSetFromTximport(txi, sampleTable, ~condition)

as shown in the tximport vignette.

Regarding outputting "original read counts"; salmon does output the estimates for the number of reads deriving from each transcript. If the question is, why is this number not an integer, that's because the best estimate (the maximum likelihood estimate) is often not integral. Tools that simply count reads (e.g. HTSeq) produce integer counts, but these are in no way "original read counts" for the corresponding genes, and are usually less accurate (farther from the true number of fragments deriving from a transcript / gene) than the estimates produced by salmon. The fact that the best estimate is often not an integer is a direct result of the fact one is considering a statistical model and taking expectations.

qins commented 3 years ago

Still maybe it's better to have an integer version of read counts file. There are cloud services and they do not accept non-integer read counts.

rob-p commented 3 years ago

You mean like cloud services to perform the DE analysis? Itā€™s always possible to round the non-integer counts to the nearest integer. However, reliable abundance estimation tools (e.g. RSEM) have been around long enough now that itā€™s worth pushing any cloud service you might be using to properly deal with these types of inputs. We do differential analysis quite commonly with DESeq2, and salmon -> tximport -> DESeq2 is a quite low-friction solution.

authentic-zz commented 3 years ago

I am also confused about how to use salmon quantification for DeSeq2, because my further use of DeSeq2 is on an online tool, not r. I would be very grateful if someone could answer my doubts.

DustinChen1986 commented 3 years ago

Hi authentic-zz: I have succeeded to use tximport package to transmit salmon results to DESeq2. Now I have 6 salmon results named T11.sf, T12.sf, T13.sfļ¼ŒT21.sfļ¼ŒT22.sfļ¼ŒT23.sf. And I have the transcripts and genes relationship table file trans2gene.csv (each line have one transcript and gene separated by tab). library(DESeq2) library("tximport") library("readr") tx2gene <- read_csv("trans2gene.csv") samples <- data.frame(run = c('T11', 'T12', 'T13', 'T21', 'T22', 'T23'), condition = c("T1","T1","T1","T2","T2","T2")) files <- file.path('.', paste(substring(samples$run, 1,3),".sf", sep = "")) # set salmon result files, each sample names plus '.sf' names(files) <- samples$run txi <- tximport(files, type="salmon", tx2gene=tx2gene) head(txi$counts) dds <- DESeqDataSetFromTximport(txi, colData=samples, design= ~ condition) dds <- DESeq(dds) res <- results(dds) write.table(res,"T1_2.csv", sep = ",", row.names = TRUE)

Hope give you some help.

authentic-zz commented 3 years ago

Hi DustinChen1986ļ¼š Thank you very much for your useful help, but the problem I am currently experiencing is rather special. The data in my hand is quantified by salmon and sorted into a text file (row name is the ID of the gene, and the column name is the sample name), which seems to be different from the file input by tximport package, which is a troublesome thing. If you have a corresponding solution, I would be very grateful.

DustinChen1986 commented 3 years ago

To authentic-zz: You mean you have get the counts into a data frame. I don't know how to transfer the salmon data frame into DESeq2ļ¼Œand I fail to handle data frame from salmon, too. So I have these recommendations to you:

  1. Get the salmon raw results, the sf files, as the tximport input files.
  2. Run the salmon again or choose other pipeline like HISAT2-Stringtie/featurecounts/HTseq.
  3. Try to generate the salmon sf file. The sf file's row name is "Name Length EffectiveLength TPM NumReads". You have the gene ID and counts, so fill the length, effect length, and TPM by tab or something. But I do not sure if this handle is correct. https://salmon.readthedocs.io/en/latest/file_formats.html#fileformats
Yang-amd commented 1 year ago

@DustinChen1986 Hello DustinChen1986 Your codes are clean and easy to read,while I have a problem.How can I create "transcript2Gene.csvā€œ? Thank you for your patient. Yang RT

DustinChen1986 commented 1 year ago

@DustinChen1986 Hello DustinChen1986 Your codes are clean and easy to read,while I have a problem.How can I create "transcript2Gene.csvā€œ? Thank you for your patient. Yang RT

The first column is the transcript's names, the second column is the gene's names, and the header is required

Yang-amd commented 1 year ago

@DustinChen1986 Hello DustinChen1986 Your codes are clean and easy to read,while I have a problem.How can I create "transcript2Gene.csvā€œ? Thank you for your patient. Yang RT

The first column is the transcript's names, the second column is the gene's names, and the header is required

Thank you very much,it helps me a lot! @DustinChen1986

caodudu commented 1 year ago

You mean like cloud services to perform the DE analysis? Itā€™s always possible to round the non-integer counts to the nearest integer. However, reliable abundance estimation tools (e.g. RSEM) have been around long enough now that itā€™s worth pushing any cloud service you might be using to properly deal with these types of inputs. We do differential analysis quite commonly with DESeq2, and salmon -> tximport -> DESeq2 is a quite low-friction solution.

I noticed that now salmon can export the quant.gene.sf file if I add the parameters"-g xx.gtf". What's difference between this file and the result of tximport? Can I use the result to replace tximport?

rob-p commented 1 year ago

It is possible to output gene counts directly, but using tximport is the preferred and officially supported method. The reason for this is that the gene-level aggregation built into salmon is per-sample, that is each sample is aggregated to the gene level independently. On the other hand, tximport considers all samples in an experiment to determine e.g. the average expressed length of a gene over all samples. This leads to better aggregation. Likewise, tximport (specifically via tximeta) provides a nice interface to track and propagate reference annotation provenance, which ultimately leads to more reproducible analyses.