COMBINE-lab / salmon

🐟 🍣 🍱 Highly-accurate & wicked fast transcript-level quantification from RNA-seq reads using selective alignment
https://combine-lab.github.io/salmon
GNU General Public License v3.0
769 stars 161 forks source link

Potential bug with Gibbs and questions on bootstraps #466

Open PlantDr430 opened 4 years ago

PlantDr430 commented 4 years ago

Hello,

I am using the recent downloaded executable for v1.1.0 and am using salmon (bulk-mode). Noticed one potential bug and have some general questions regarding bootstraps.

  1. The potential bug I noticed was when I use --numGibbsSamples the logs/salmon_quant.log file is always blank. When I remove this flag and re-run the program the log file is correctly printed out.

  2. Regarding Bootstraps:

I've been working with parameters to min/max my predicted estimates to quantified cDNA results that we have. Through this process I was not performing bootstraps and was just using the TPM results that were located within the quants.sf file and have been getting some good results, with R-squared values of ~0.98 for actual v. predicted plots. As a note, even after running hundreds of runs with the same parameters, the TPM values in the quants.sf file never really fluctuated that much between runs and were generally nearly identical.

However, I thought it would be best to bootstrap --numBootstraps each Salmon run and average the bootstraps to get more accurate results. After doing 1,000 bootstraps per sample I noticed that the TPM values I calculated from numbers of mapped reads in the bootstraps.gz fluctuated a lot more and overall brought my R-squared values down to ~0.87. (I used your ConvertBootstrapsToTSV.py script to get the read counts from the bootstrap file and then calculated the TPMs using the effective lengths from the quants.sf file. As looking through previous issues (#246) I was under the assumption that the bootstrap file only contained new mapped read estimates and the effective lengths should be the same for all bootstrap runs.)

My question is why do the TPM values in the quants.sf file not fluctuate as much (even after 100+ runs using the same parameters), while calculated TPM values from bootstraps of the same run are showing greater variance?

rob-p commented 4 years ago

Hi @PlantDr430,

Thanks for the detailed report. I'll provjde a detailed explanation of the bootstrap variance later when I'm at my computer. However, regarding the gibbs issue with the log file not being properly populated; you still get the appropriate gibbs samples as output, right? The issue is just with writing the log? We'll check into what might be causing this.

PlantDr430 commented 4 years ago

Thanks, for the response and I'll wait for the detail explanation.

As for the Gibbs, yes I get the appropriate gibbs sample outputs, it's just appears that the log file is not being written.

rob-p commented 4 years ago

Hi Stephen,

So, the variation you see when you re-run salmon multiple times is expected to be different (and much smaller) than the variance you see when bootstrapping. Why is this? When you re-run salmon, the only variance you are seeing is due to small differences in the order of observations / updates from the streaming collapsed variational Bayes phase of the algorithm. This, in turn can have a slight effect on the initialization conditions of the offline phase of the algorithm, and some of the parameters learned for the auxiliary parameters. However, in each run, you are observing exactly the same set of reads and salmon is producing exactly the same set of alignments; only the order and therefore some of the streaming updates change. So, we expect the final estimated abundances to be very similar to each other.

However, when salmon performs bootstrapping, it is actually resampling with replacement, from the counts of the range-factorized equivalence classes. Roughly, we expect this resampling to be similar to if we re-sampled with replacement from the original set of input reads. That is, we are re-sampling from our population sample — the observed set of reads — to estimate the variance due to inference. So, for the bootstrap re-samplings, we expect significantly more variance than between subsequent runs of salmon, because the observations from which we are making the inference are actually changing. It is possible e.g. that some uniquely mapped reads may not be chosen in some bootstrap sample (since we are re-sampling the observed read count, but doing so with replacement), and so the estimates of sets of related isoforms will change in those samples. Thus, since the observations themselves are changing, we expect the estimates to display greater variance. In fact, this is the main goal of performing the bootstrapping (or Gibbs sampling) — to estimate the uncertainty due to inference if we had observed many reads coming from the same underlying "population" as the ones we have in our specific sample, but subject to the random sampling effect induced by sequencing and all of the subsequent downstream effects it has on our estimator (i.e. salmon's computational procedure for estimating transcript abundance via the variational Bayesian optimization algorithm).

From the practical perspective, one would not necessarily expect taking the average of the bootstrap estimates to produce a more accurate point estimate than taking the normal point estimate produced by salmon. The main purpose of performing the Gibbs sampling or bootstrapping is to allow accurate assessment of the posterior variance of the point estimates (to build things like credible intervals). The mean of the bootstrap estimates should be highly-correlated with the normal point estimates, but I wouldn't expect it to be identical. Also, you might try seeing what you get with a different summary statistic, like the median. However, the main point of producing this information is to allow you to assess the posterior variance, and also to pass these samples to uncertainty-aware differential analysis tools, like swish, downstream of salmon.

Anyway, thanks again for the detailed report! We'll look into the logging issue, and please let me know if my description above answers your question.

PlantDr430 commented 4 years ago

Sorry for the delay. Thank you for the explanation, it is very detailed and definitely answers my questions / points me in a better direction.