Open thkitapci opened 4 years ago
Hi Hamdi,
http://bioconductor.org/packages/release/bioc/vignettes/tximport/inst/doc/tximport.html has a really great description of how to use the output from Salmon with DESeq2.
Hi @roryk I know about tximport but is there any way to generate the input data for DESeq2 without using R? I am processing the data on one platform and then transfer to another platform for R/DESeq2 analysis. I would like to be able to generate the output of the first part (salmon) without using an R library.
If it is not possible and I have run R to get the count matrix for DEseq2, I can figure out a way to do it.
DESeq2 input file is a simple matrix of counts and "salmon quantmerge" already generates this, can you please explain to me why an external library is required ? Is there something I am missing that tximport package is doing to the data? Does tximport takes into account gene lengths or library size to generate the output?
Thanks Best Regards Hamdi
Hi Hamdi
Bit confused about your logic here - why would you not want to use tximport in R when your next step (DESeq2) is still going to be R?
I am curious to know your reasoning
I am processing the data on one platform and then transfer to another platform for R/DESeq2 analysis. I would like to be able to generate the output of the first part (salmon) without using an R library.
This is a really torturous step for none R knowledge to use tximport for data transmission to DESeq2. Main problem is that the count matrix for DESeq2, which can be easily prepared by Python, must be integer, but tximport did not explain how to deal with the decimals.
I also wonder why salmon not output original reads counts
update:
Because it is more accurate.
DeSeq2 should accept it.
As for now, maybe we could simply round to the nearest integer
NumReads ā This is salmonās estimate of the number of reads mapping to each transcript that was quantified. It is an āestimateā insofar as it is the expected number of reads that have originated from each transcript given the structure of the uniquely mapping and multi-mapping reads and the relative abundance estimates for each transcript.
If you can already use DESeq2, then using tximport should not make it any harder at all. Given the tximport data, getting it into DESeq2 is as easy as
dds <- DESeqDataSetFromTximport(txi, sampleTable, ~condition)
as shown in the tximport vignette.
Regarding outputting "original read counts"; salmon does output the estimates for the number of reads deriving from each transcript. If the question is, why is this number not an integer, that's because the best estimate (the maximum likelihood estimate) is often not integral. Tools that simply count reads (e.g. HTSeq) produce integer counts, but these are in no way "original read counts" for the corresponding genes, and are usually less accurate (farther from the true number of fragments deriving from a transcript / gene) than the estimates produced by salmon. The fact that the best estimate is often not an integer is a direct result of the fact one is considering a statistical model and taking expectations.
Still maybe it's better to have an integer version of read counts file. There are cloud services and they do not accept non-integer read counts.
You mean like cloud services to perform the DE analysis? Itās always possible to round the non-integer counts to the nearest integer. However, reliable abundance estimation tools (e.g. RSEM) have been around long enough now that itās worth pushing any cloud service you might be using to properly deal with these types of inputs. We do differential analysis quite commonly with DESeq2, and salmon -> tximport -> DESeq2 is a quite low-friction solution.
I am also confused about how to use salmon quantification for DeSeq2, because my further use of DeSeq2 is on an online tool, not r. I would be very grateful if someone could answer my doubts.
Hi authentic-zz: I have succeeded to use tximport package to transmit salmon results to DESeq2. Now I have 6 salmon results named T11.sf, T12.sf, T13.sfļ¼T21.sfļ¼T22.sfļ¼T23.sf. And I have the transcripts and genes relationship table file trans2gene.csv (each line have one transcript and gene separated by tab). library(DESeq2) library("tximport") library("readr") tx2gene <- read_csv("trans2gene.csv") samples <- data.frame(run = c('T11', 'T12', 'T13', 'T21', 'T22', 'T23'), condition = c("T1","T1","T1","T2","T2","T2")) files <- file.path('.', paste(substring(samples$run, 1,3),".sf", sep = "")) # set salmon result files, each sample names plus '.sf' names(files) <- samples$run txi <- tximport(files, type="salmon", tx2gene=tx2gene) head(txi$counts) dds <- DESeqDataSetFromTximport(txi, colData=samples, design= ~ condition) dds <- DESeq(dds) res <- results(dds) write.table(res,"T1_2.csv", sep = ",", row.names = TRUE)
Hope give you some help.
Hi DustinChen1986ļ¼ Thank you very much for your useful help, but the problem I am currently experiencing is rather special. The data in my hand is quantified by salmon and sorted into a text file (row name is the ID of the gene, and the column name is the sample name), which seems to be different from the file input by tximport package, which is a troublesome thing. If you have a corresponding solution, I would be very grateful.
To authentic-zz: You mean you have get the counts into a data frame. I don't know how to transfer the salmon data frame into DESeq2ļ¼and I fail to handle data frame from salmon, too. So I have these recommendations to you:
@DustinChen1986 Hello DustinChen1986 Your codes are clean and easy to read,while I have a problem.How can I create "transcript2Gene.csvā? Thank you for your patient. Yang RT
@DustinChen1986 Hello DustinChen1986 Your codes are clean and easy to read,while I have a problem.How can I create "transcript2Gene.csvā? Thank you for your patient. Yang RT
The first column is the transcript's names, the second column is the gene's names, and the header is required
@DustinChen1986 Hello DustinChen1986 Your codes are clean and easy to read,while I have a problem.How can I create "transcript2Gene.csvā? Thank you for your patient. Yang RT
The first column is the transcript's names, the second column is the gene's names, and the header is required
Thank you very much,it helps me a lot! @DustinChen1986
You mean like cloud services to perform the DE analysis? Itās always possible to round the non-integer counts to the nearest integer. However, reliable abundance estimation tools (e.g. RSEM) have been around long enough now that itās worth pushing any cloud service you might be using to properly deal with these types of inputs. We do differential analysis quite commonly with DESeq2, and salmon -> tximport -> DESeq2 is a quite low-friction solution.
I noticed that now salmon can export the quant.gene.sf file if I add the parameters"-g xx.gtf". What's difference between this file and the result of tximport? Can I use the result to replace tximport?
It is possible to output gene counts directly, but using tximport
is the preferred and officially supported method. The reason for this is that the gene-level aggregation built into salmon is per-sample, that is each sample is aggregated to the gene level independently. On the other hand, tximport
considers all samples in an experiment to determine e.g. the average expressed length of a gene over all samples. This leads to better aggregation. Likewise, tximport
(specifically via tximeta
) provides a nice interface to track and propagate reference annotation provenance, which ultimately leads to more reproducible analyses.
Hi, I run salmon then used quantmerge to combine the results as
salmon quantmerge --quants
cat list_of_quant_folders--column numreads -o Merged_quants.txt
Can I use this as input for the DeSeq2 analysis ? One problem is "numreads" column is not integer and DeSeq2 requires integer input (read counts) can I convert the numbers in this column to integer then use as DeSeq2 input?
Thanks Hamdi