COMBINE-lab / salmon

🐟 🍣 🍱 Highly-accurate & wicked fast transcript-level quantification from RNA-seq reads using selective alignment
https://combine-lab.github.io/salmon
GNU General Public License v3.0

Salmon read number discrepancy across identical runs #613

Open rmurray2 opened 3 years ago

rmurray2 commented 3 years ago

We noticed recently that running quantification multiple times (with the exact same settings) on the same file using salmon v0.99.0 resulted in some transcripts having different read-count values (NumReads column) across runs.

What seems to happen is that, for the transcripts that differ across runs, one value will be zero and the other will be non-zero (I recall seeing ~30 and ~75). We only looked at one biological replicate, but we didn't see any case in which repeated runs produced more than two distinct values: a transcript's estimate was either zero or one particular non-zero number, never a second, different non-zero number. Fewer than 10 transcripts were affected in total.

We tried restricting quantification to a single CPU core, thinking that the discrepancy might somehow be coming from multithreading, but we observed the same phenomenon.

```bash
salmon quant -i salmon_index_noThread_2 -l A -r input_file.fq.gz \
  -g Mus_musculus.GRCm38.100.gtf --validateMappings -o out/fq_quant
```

rob-p commented 3 years ago

Hi @rmurray2,

Thank you for the report. First, I just want to mention that I don't believe v0.99.0 is an officially released version number. That is, there was a v0.14.x and a v0.15.0 (released in source form only), and then the versions moved to 1.0.0 and beyond.

However, this behavior certainly isn't related to that. There are two things going on that can lead to this effect. The first, which is relatively easy to test, is that there may be small run-to-run differences in when the inferred library type starts to be enforced (if it is not IU) when automatic library-type detection (-l A) is used (see this issue and the comments therein).
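
One way to rule that out (a small sketch, assuming salmon's standard output layout and the output path from the command above): check which library format salmon inferred from the lib_format_counts.json it writes, and then pass that concrete type via -l on subsequent runs instead of -l A, so nothing depends on when auto-detection locks in.

```python
# Minimal sketch: read the library format salmon inferred for this run from
# its lib_format_counts.json output. The "expected_format" field holds the
# detected type (e.g. "U" for an unstranded single-end library); passing that
# explicitly via -l takes auto-detection out of the picture entirely.
import json

with open("out/fq_quant/lib_format_counts.json") as fh:
    lib_counts = json.load(fh)

print(lib_counts["expected_format"])
```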

The second, more fundamental point is that the observed behavior is intended. Even when salmon is given a single thread of execution for mapping and quantification, a separate background thread consumes reads from the file and places them in memory for quantification. While, e.g., pairing information between files is guaranteed to be preserved, exact read order is not. This can lead to differences in the order in which reads are processed and, as a result, in the initialization conditions of the optimization.

The ultimate result is that, for transcripts with large inferential uncertainty, different numbers of reads can be assigned between runs. We have thought a lot about this behavior, what it means, and how the NumReads values should best be communicated to users. At the end of the day, the NumReads values constitute the expected values of latent variables inferred in a very high-dimensional space (the number of parameters is at least the number of transcripts). There are therefore certain transcripts whose estimated read counts simply carry tremendous inferential uncertainty, and small perturbations in the initial conditions of the optimization will lead to different estimated values for their abundances.

For the transcripts where you observe such fluctuations between runs, this is simply evidence that the precision that can be confidently placed on those estimates is below the degree of variation you observe. Treating these transcripts as more certain in downstream analysis can easily lead to spurious inferences about things like differential transcript expression or usage.
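
To make the mechanism concrete, here is a toy sketch (purely illustrative, not salmon's actual algorithm): a streaming allocation over two hypothetical transcripts, where ambiguous reads are split according to the current running abundance estimates. Shuffling the read order changes how the ambiguous mass is divided, even though the multiset of input reads is identical.

```python
# Toy illustration of order-sensitive streaming inference. Reads unique to
# transcript A or B are assigned outright; ambiguous reads are split by the
# *current* abundance estimates, which drift as reads stream in. The read
# order therefore changes the final split, mimicking the effect of the
# background reader thread described above. This is NOT salmon's code.
import random

def streaming_em(reads, lr=0.05):
    theta = {"A": 0.5, "B": 0.5}   # running abundance estimates
    counts = {"A": 0.0, "B": 0.0}  # expected read assignments (NumReads analog)
    for r in reads:
        if r == "ambig":
            # Split an ambiguous read according to the current estimates.
            p_a = theta["A"] / (theta["A"] + theta["B"])
            resp = {"A": p_a, "B": 1.0 - p_a}
        else:
            # Uniquely mapping reads are assigned outright.
            resp = {"A": float(r == "A"), "B": float(r == "B")}
        for t in theta:
            counts[t] += resp[t]
            # Online update: estimates drift as reads arrive, so the order
            # in which ambiguous reads show up changes the final split.
            theta[t] = (1 - lr) * theta[t] + lr * resp[t]
    return counts

reads = ["A"] * 30 + ["B"] * 20 + ["ambig"] * 50
for seed in (1, 2):
    shuffled = reads[:]
    random.Random(seed).shuffle(shuffled)
    # Totals always sum to 100 reads, but the A/B split differs with order.
    print(seed, {k: round(v, 1) for k, v in streaming_em(shuffled).items()})
```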

One can make an argument for trying to provide a way to enforce removal of this variation (which, granted, would be a challenge). However, the reason we decided against even attempting this is that it doesn't properly address any issue with respect to an actual biological analysis. That is, even if you could fix, precisely, the update order and initialization conditions for a specific sample to eliminate any variation between runs, almost all experiments consist of multiple samples. In other samples, the same transcript fractions could give rise to a slightly different set of observed fragments that induce exactly the same type of variation under uncertainty; and since that uncertainty is baked into the sample, it cannot and should not be removed. Having exact replication of a sample at a numerical threshold below the inferential uncertainty for a transcript conveys false confidence in the precision of the estimate.

This is why, for transcript-level analysis, we highly recommend having salmon produce posterior Gibbs samples (with the --numGibbsSamples flag). This will draw samples from the posterior distribution over the abundance estimates and allow you to determine which inferences can be made robustly and which cannot. We have spent a good deal of time thinking about how to properly perform statistical inference on these uncertain quantities, so I'd point you at swish, a tool for differential analysis at the transcript level that makes use of a non-parametric test over the inferential replicates (Gibbs samples) to incorporate uncertainty into the differential analysis. We also developed a tool, terminus, that uses salmon's Gibbs samples and point estimates to group together transcripts whose individual abundances cannot be reliably inferred given the fragments in the sample.

While the best way to properly assess, propagate, and handle uncertainty in transcript-level inference is still, in my opinion, an active area of research in the field, these are some of the solutions we've come up with to address this challenge so far. And while, as a computer scientist myself, I certainly appreciate the desire to have, e.g., exactly the same numerical output for a particular sample, we feel that doing so might convey a false sense of certainty in the resulting estimates (and it would also be very difficult to do, technically, given the streaming, asynchronous phase of the method). This also means, of course, that you should be wary of the precision between runs even for methods that produce their estimates in 100% deterministic ways (e.g., RSEM); you may get identical or near-identical numbers, but without an estimate of the uncertainty, that precision can be artifactual and potentially misleading.
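
As a rough illustration of how the Gibbs samples can be used (a sketch under assumptions, not the swish methodology): suppose you have rerun quantification with --numGibbsSamples and loaded the inferential replicates into a replicates-by-transcripts matrix (for instance via tximport's infReps in R, exported to a file Python can read; the loading step and the cutoff below are assumptions for illustration). Transcripts whose replicate-to-replicate spread is large relative to their mean are exactly the ones whose NumReads you should expect to fluctuate between runs.

```python
# Hedged sketch: `inf_reps` is assumed to be a NumPy array of shape
# (num_gibbs_samples, num_transcripts) holding salmon's inferential
# replicates; how you export it into Python is up to you. We flag
# transcripts whose coefficient of variation across replicates exceeds a
# cutoff, i.e. whose point estimates carry more inferential uncertainty
# than the run-to-run differences observed above.
import numpy as np

def flag_uncertain(inf_reps: np.ndarray, cv_cutoff: float = 0.5) -> np.ndarray:
    mean = inf_reps.mean(axis=0)
    sd = inf_reps.std(axis=0)
    # Coefficient of variation, defined as 0 where the mean is 0.
    cv = np.divide(sd, mean, out=np.zeros_like(sd), where=mean > 0)
    return cv > cv_cutoff  # True where the estimate is unreliable

# Tiny synthetic example: transcript 0 is stable, transcript 1 is not.
reps = np.array([[100.0, 30.0], [101.0, 0.0], [99.0, 75.0], [100.0, 2.0]])
print(flag_uncertain(reps))  # -> [False  True]
```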