COMBINE-lab / salmon

🐟 🍣 🍱 Highly-accurate & wicked fast transcript-level quantification from RNA-seq reads using selective alignment
https://combine-lab.github.io/salmon
GNU General Public License v3.0

reproducibility issue #102

Closed ramezrawas closed 8 years ago

ramezrawas commented 8 years ago

I am trying to test salmon v0.7.2 using the following commands:

```sh
salmon index -t test-data/transcripts.fasta -i output --type quasi -k 31 --threads 4 --sasamp 1
salmon quant -l IU -i output -1 test-data/reads_1.fastq -2 test-data/reads_2.fastq -o output_quant
```

The .sf file is not reproducible between runs. Any clue?

rob-p commented 8 years ago

Hi @ramezrawas,

Can you say precisely what you mean by reproducible? Do you mean that the values in the .sf file are not identical between runs? If so, this is expected behavior, for a number of reasons. The big one is that the initial phase of salmon uses an online inference algorithm, so the specific details of the solution depend on the order in which the reads are processed (which is effectively random, since multiple threads parse reads and update estimates asynchronously). The more important point, however, is that the inference estimates returned by Salmon (and, for that matter, by every other transcript-level expression tool) are the result of a statistical optimization procedure that cannot guarantee a unique globally optimal solution (in fact, even if convergence to a global optimum could be guaranteed, there may be multiple distinct optima). Thus, there is uncertainty inherent in the statistical problem being solved. Of course, if one ordered the updates in the same way and set up the initial conditions precisely the same, the runs would converge to the same result, but any sense of confidence that gives is illusory.
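To see this concretely, you can quantify the same reads twice against the same index and compare the outputs; a minimal sketch (run1 and run2 are just arbitrary output directory names, and quant.sf is the abundance file salmon writes into each output directory):

```sh
# Two quantification runs over identical input
salmon quant -l IU -i output -1 test-data/reads_1.fastq -2 test-data/reads_2.fastq -o run1
salmon quant -l IU -i output -1 test-data/reads_1.fastq -2 test-data/reads_2.fastq -o run2

# The estimates will typically differ slightly between the runs
diff run1/quant.sf run2/quant.sf | head
```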

However, Salmon does provide a way to quantify, statistically, the confidence in the result. The --numBootstraps option will perform bootstrap sampling, and the --numGibbsSamples option will perform posterior Gibbs sampling. Both techniques provide samples from the posterior distribution, and the variance of these samples tells you how much of the variance in the results is due purely to the inherent statistical uncertainty in the problem. In the scripts folder there is a Python script, ConvertBootstrapsToTSV.py, that will convert either the bootstrap or Gibbs samples to an easily readable TSV format. These samples represent the estimated number of reads coming from each transcript when sampling from the posterior, and they can be used to empirically estimate the statistical uncertainty in the abundance estimates of the different transcripts.
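For example, a sketch of the bootstrap workflow (the exact arguments the conversion script expects may differ between versions):

```sh
# Quantify with 100 bootstrap draws (use --numGibbsSamples instead for Gibbs sampling)
salmon quant -l IU -i output --numBootstraps 100 \
    -1 test-data/reads_1.fastq -2 test-data/reads_2.fastq -o output_boot

# Convert the compressed bootstrap samples into a plain TSV for inspection
python scripts/ConvertBootstrapsToTSV.py output_boot output_boot/tsv
```

The spread of each transcript's values across the bootstrap replicates in the resulting TSV is an empirical estimate of its inferential uncertainty.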

Finally, I'll note that while, for the reasons described above, the output is not purely deterministic, the difference between subsequent runs of salmon (which varies depending on the order in which reads are parsed and processed) is typically small, and much smaller than the statistical uncertainty in the abundance estimates of the transcripts.
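If you'd like to put a number on how small, you can compare the point estimates of two runs directly. A rough sketch, assuming the run1/run2 directories from the example above and the standard quant.sf layout with TPM in the fourth column:

```sh
# Report the largest absolute TPM difference between the two runs (header line skipped)
paste run1/quant.sf run2/quant.sf | awk 'NR > 1 {
    d = $4 - $9; if (d < 0) d = -d;
    if (d > max) max = d
} END { print "max |TPM delta|:", max }'
```

I'd be happy to answer any other questions you have.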

--Rob

ramezrawas commented 8 years ago

Hi @rob-p,

Thank you very much for the thorough explanation :)

nh13 commented 5 months ago

@rob-p we're looking to use Salmon but would like it to be deterministic. Is there a deterministic way for the threads to chunk the reads, à la bwa? Any other issues?

rob-p commented 5 months ago

Hi @nh13,

This is not something we currently support, nor something we currently plan to add. I'd be open to it, but I'm honestly not sure how to do it cleanly in the current architecture, and doing so would certainly incur a performance hit. Salmon runs two phases of inference: an online phase and an offline phase. The online phase has access to fragment-level information (the specific locations of each read, the length of each observed fragment, etc.) that is then summarized away during the offline phase; that information goes away when the reads are collapsed into range-factorized equivalence classes. Moreover, some of the model parameters learned during the online phase depend, in their details, on the order in which observations are made. Ostensibly, observing the same data in the same order and issuing updates to shared model parameters from worker threads in the same order should result in identical values; however, this has never been tested and was never a design goal.
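If you want to probe how much of the run-to-run variation comes from threading, one illustrative experiment (emphatically not a determinism guarantee, per the above) is to pin salmon to a single thread, which fixes the order in which reads are parsed, and compare two runs:

```sh
# Single-threaded runs process reads in a fixed order; identical output was
# never a design goal, so treat this as an experiment rather than a guarantee
salmon quant -l IU -i output -p 1 -1 test-data/reads_1.fastq -2 test-data/reads_2.fastq -o st1
salmon quant -l IU -i output -p 1 -1 test-data/reads_1.fastq -2 test-data/reads_2.fastq -o st2
diff st1/quant.sf st2/quant.sf
```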

The main reason we haven't prioritized this is that the differences between runs are within the bounds of the inherent inferential uncertainty of the estimated parameters anyway. That is, if one is relying on a specific value at a level of precision where a different run of salmon would produce a value different enough to change a downstream analysis, then one is imputing more precision to the estimates than they can provide. Other methods may produce identical results between runs, but the accuracy of their output at that level of precision shouldn't be trusted either. The uncertainty of the parameter estimates can be evaluated from the Gibbs samples (or bootstrap replicates) that salmon computes. In practice, the small differences between runs rarely lead to differences in downstream analysis (almost certainly not at the gene level, and not at the transcript level either if you use a differential testing method that is aware of inferential uncertainty).

On the other hand, if you need something that produces identical output between runs (while still letting you assess uncertainty), then you can give piscem => piscem-infer a try. That tool already works well, but it is in active development, and we'd certainly be happy to help build features that you and others might find useful, and to chat about that if you like.