COMBINE-lab / salmon

🐟 🍣 🍱 Highly-accurate & wicked fast transcript-level quantification from RNA-seq reads using selective alignment
https://combine-lab.github.io/salmon
GNU General Public License v3.0
751 stars 159 forks source link

Adding/removing transcripts to/from run drastically changes quantification of other transcripts #523

Open PlantDr430 opened 4 years ago

PlantDr430 commented 4 years ago

Hello,

I am currently using Salmon v1.2.0 which was build from source and is run on Ubuntu Linux

My objective is to obtain the proportions of different alternative spliced transcripts, which I am doing via quant and using the TPMs to get the proportions of 'primary' transcript versus 'spliced' transcript. I am currently working with 5 genes (Genes A-E) from Soybean (Glycine max Wm82.a2.v1), to which I also created decoys using DecoyTranscriptome.sh and the soybean genome and gtf. In addition, I am only using trimmed paired-end reads of 100bp.

An example of a concern (See Table below) I have is that when I run Genes A, B, C, and D I get a certain set a results. However, when I run all the genes together (Genes A-E), the results from Gene C and Gene D are drastically different, but this is only under certain parameters (red highlight). If I alter the parameters and compare the results of the 4 gene run and the 5 gene run the results are comparable.

Param 1 = -- validateMappings --gcBias --recoverOrphans --softclipOverhangs --allowDovetail --posBias --scoreExp 0.0 --consensusSlack 0.95 --rangeFactorizationBins 5 --numBootstraps 1000 --bootstrapReproject --minScoreFraction 0.9

Param 2 = -- validateMappings --gcBias --scoreExp 1.0 --consensusSlack 0.2 --rangeFactorizationBins 4 --numBootstraps 1000 --bootstrapReproject --minScoreFraction 0.85

I am a trying to understand how the addition of one gene into the run would affect the results of other genes so much and causes concern to me as results can differ so much with the removal or addition of even one gene. It almost appears that the proportions are just switched for first run, except the V1 trifolate dataset of Gene C which showed at least similar proportions between runs with Param 1 (yellow highlight).

Capture

Please let me know if you need to see additional files, I just wanted to summarize what I was seeing here. I didn't receive any errors during these runs, just the results of Gene C and Gene D varying for a reason I cannot pinpoint.

rob-p commented 4 years ago

Hi @PlantDr430,

Thanks for the detailed report. I have a general thought, and then a more specific thought given your use case and parameter settings.

The general thought is that it is not true, broadly, that one should expect transcript (or even gene) abundances to stay the same under a change of annotation. The estimates computed by salmon (and by all transcript-level abundance estimation tools) is one that maximizes the likelihood of the data (or maximizes the ELBO in the case of VI) conditioned on the observed fragments and the transcripts. When one changes the transcripts, they change the variable upon which the inference is conditioned, and the results, in general, can change (a lot, or a little bit). This is specifically most prone to happen when transcripts / genes are added to the annotation that are similar to other transcripts or genes in the annotation.

Now, my specific thought based on your settings of parameters. They are quite different, but the three big factors I see here are (1) the setting for --scoreExp, (2) the setting(s) for dovetail and softclipOverhangs and (3) the setting for --consensusSlack Why are they a big deal?

So, where to go from here? I'd try these flags 1-by-1, roughly in the order I listed them above, to see which ones are having the most drastic effect on your result. Also, if you'd be willing to share the meta_info.json files in each quantification directory, I'd be happy to look into them and see if anything else looks strange under either of these parameter settings. Let me know if you have any questions on my explanations above.

PlantDr430 commented 4 years ago

Thanks @rob-p,

Your explanations are helpful, and I think it may my concern may just be more associated with your general thought as I've tested this with multiple parameters. The one I represented here was just an example, but I also can see how the parameters can be affecting these results. It was just strange to see such a huge shift with the addition/removal of one gene, which makes me think it more associated with how the inference of the variables are conditioned.

As for providing the meta_info.json files, I currently have thousands of them as I am running triplicates of ~150 parameter combinations for multiple tissue types and stages. In the end I don't think it will be necessary as we will likely be changing our approach a bit, which should be fine with the system I have in place.

Also, as for --scoreExp our main goal is to try and use Salmon to get quantification of individual genes (primary versus spliced forms). From my analysis, it appears that some genes perform better with scores > 0, however, some genes do perform better with a --scoreExp of 0. Although, this could be a factor in running Salmon with such a narrow view (i.e. two transcripts and some housekeeping genes) and might not be the case as more genes are added to the run.

rob-p commented 4 years ago

Hi @PlantDr430,

Thanks for the context! As always, we'd be interested in learning anything interesting you find about the general behavior of salmon in different contexts and with different parameter settings etc.

Out of curiosity, when you mention that genes perform "better" with one or another --scoreExp, is it the case that this is data where you have some sort of ground truth expectation for the abundance of the primary vs. spliced forms? If so, super interesting!

One other thought I had about this. While it is true, as I mentioned in my original post, that the conditioning on the transcripts is fundamental in the case of salmon and other transcript expression tools that don't, themselves, try to assemble new transcripts, it's not necessarily true that there is no evidence in the quantifications that something my be awry. Specifically, I noticed that you are using posterior confidence estimation (bootstrapping). We actually have a recent paper that discusses how to use the uncertainty estimates from salmon (though we rely on the Gibbs sampler rather than bootstrapping) to group together transcripts whose abundances cannot be individually estimated with confidence (with evidenced provided by the posterior samples). It might be useful to identify such cases in your analysis.

Let me know if there's any other way I can help!

PlantDr430 commented 4 years ago

Hi @rob-p,

As for our main goal, we are looking at the precision of Salmon versus biological PCR expectations of abundances. However, it is on a small number of genes as there have been some initial challenges with using "accurate" transcripts versus "computational predicted transcripts". Anyway, we hope to have this paper ready this year, but again I think we need to look at it from a different angle. While this project is a bit of a side project, my hope it that it will accomplish our initial goal and to provide the community with the general behavior of Salmon with different spliced types (i.e. Exon skipping, IR, AA, AD, etc). Again, it'll be on a small set of genes, but still should be interesting and hopeful can improve the future use of Salmon. I'll make sure to let you know when it is available.