PlantDr430 opened 4 years ago
Hi @PlantDr430,
Thanks for the detailed report. I have a general thought, and then a more specific thought given your use case and parameter settings.
The general thought is that it is not true, broadly, that one should expect transcript (or even gene) abundances to stay the same under a change of annotation. The estimates computed by salmon (and by all transcript-level abundance estimation tools) are those that maximize the likelihood of the data (or maximize the ELBO, in the case of VI) conditioned on the observed fragments and the transcripts. When one changes the transcripts, one changes the variables upon which the inference is conditioned, and the results, in general, can change (a lot, or a little). This is most prone to happen when transcripts or genes are added to the annotation that are similar to other transcripts or genes already present.
Now, my specific thought based on your parameter settings. The two sets are quite different, but the three big factors I see here are (1) the setting for `--scoreExp`, (2) the setting(s) for `--allowDovetail` and `--softclipOverhangs`, and (3) the setting for `--consensusSlack`.
Why are they a big deal?
`--scoreExp` determines how much we down-weight the scores of sub-optimal alignments. Setting `--scoreExp` to 0 says that a sub-optimal alignment is, at least in terms of alignment probability, just as good as the optimal alignment. So, imagine a pair of genes with a few read-length regions that each differed by 1 or 2 SNPs. When `--scoreExp` is 0, the model considers alignments (say, to transcript 2) with 2 substitutions to be just as likely as alignments (say, to transcript 1) that are perfect (with no substitutions). While you can play around with different values of `--scoreExp` to determine how differences from the optimal alignment should be weighted, I'd strongly suggest against setting `--scoreExp` equal to 0.
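As a toy illustration of this point (this is not Salmon's exact internal formula, and the scores here are made up), an exponent parameter like this can be thought of as controlling how fast an alignment's weight decays with its score deficit relative to the best alignment:

```python
import math

def alignment_weight(score, best_score, score_exp):
    """Toy down-weighting: weight decays exponentially in the score
    deficit from the best alignment, scaled by score_exp.
    Illustrative only -- not Salmon's actual code."""
    return math.exp(score_exp * (score - best_score))

# A perfect alignment (score 100) vs. one with 2 substitutions (score 92),
# under a hypothetical scoring scheme:
best, suboptimal = 100, 92

for s in (1.0, 0.5, 0.0):
    w = alignment_weight(suboptimal, best, s)
    print(f"scoreExp={s}: sub-optimal weight relative to optimal = {w:.4f}")
```

With the exponent at 0, the sub-optimal alignment gets weight exactly 1.0, i.e., it is treated as equally likely as the perfect alignment, which is the failure mode described above.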
`--allowDovetail` and `--softclipOverhangs` may or may not have a significant effect, depending on the quality of your library and annotation. Ideally, you would have no dovetailed mappings and no reads overhanging annotated transcripts. However, if you have an incomplete assembly or a library of questionable quality, both can occur in practice. The `meta_info.json` file in the `aux_info` directory will give you stats about the number of dovetailed reads, so you can see whether this is likely to have an effect here.

`--consensusSlack` determines which reads pass through the mapping-based filtering, based on their chaining score, and are therefore subject to alignment validation. While the chaining score is a decent proxy for alignment score, it's not perfect (otherwise, we would not waste compute cycles computing the optimal alignment). Setting `--consensusSlack` to a large value allows many mappings to be subject to alignment validation, while setting it to a smaller value doesn't. If the value is too small, you may see situations where the mapping that yields the optimal alignment never gets a chance to be scored because its chain score is too low. Now, it is true that the only reads used will be those surpassing `--minScoreFraction` in terms of their alignment score, so changing `--consensusSlack` won't let poor alignments through; but if you set it too conservatively, you might lose some mappings that could have yielded the optimal alignment while keeping only sub-optimal ones (personally, though, I'd guess this flag is the least likely to be having an effect here).
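For concreteness, the final score-based filter can be sketched as a simple fraction-of-maximum threshold (a sketch of the behavior described above; the function name and scoring values here are hypothetical):

```python
def passes_validation(aln_score, max_possible_score, min_score_fraction):
    """Keep an alignment only if its score clears the
    fraction-of-maximum threshold (sketch of a --minScoreFraction
    style filter; not Salmon's actual implementation)."""
    return aln_score >= min_score_fraction * max_possible_score

# Hypothetical scheme: a 100 bp read with a perfect-match score of 100.
print(passes_validation(92, 100, 0.90))  # a few mismatches: True
print(passes_validation(82, 100, 0.90))  # too many differences: False
```

This is why a conservative `--consensusSlack` can only cost you candidate mappings before this filter runs; it cannot admit alignments that the filter would reject.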
So, where to go from here? I'd try these flags one by one, roughly in the order I listed them above, to see which is having the most drastic effect on your results. Also, if you'd be willing to share the `meta_info.json` files in each quantification directory, I'd be happy to look into them and see if anything else looks strange under either of these parameter settings. Let me know if you have any questions about my explanations above.
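Pulling the relevant stats out of a `meta_info.json` is a one-liner with the standard library; note that the field names below are examples, and the exact keys can vary by Salmon version, so check them against your own files:

```python
import json

# Example meta_info.json contents (field names illustrative; inspect
# your own aux_info/meta_info.json for the keys your version emits).
example = """{
    "num_processed": 1000000,
    "num_mapped": 950000,
    "num_dovetail_fragments": 1200
}"""

info = json.loads(example)
rate = info["num_mapped"] / info["num_processed"]
print(f"mapping rate: {rate:.1%}, "
      f"dovetailed fragments: {info['num_dovetail_fragments']}")
```

In practice you would replace the inline string with `json.load(open(path))` over each quantification directory, which makes it easy to scan many runs for outlying dovetail counts.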
Thanks @rob-p,
Your explanations are helpful, and I think my concern may just be more associated with your general thought, as I've tested this with multiple parameter sets. The one I presented here was just an example, but I can also see how the parameters could be affecting these results. It was just strange to see such a huge shift with the addition/removal of one gene, which makes me think it is more associated with how the inference is conditioned on the transcripts.
As for providing the meta_info.json files, I currently have thousands of them as I am running triplicates of ~150 parameter combinations for multiple tissue types and stages. In the end I don't think it will be necessary as we will likely be changing our approach a bit, which should be fine with the system I have in place.
Also, as for `--scoreExp`: our main goal is to use Salmon to quantify individual genes (primary versus spliced forms). From my analysis, it appears that some genes perform better with values > 0; however, some genes do perform better with a `--scoreExp` of 0. Although, this could be an artifact of running Salmon with such a narrow view (i.e., two transcripts and some housekeeping genes) and might not hold as more genes are added to the run.
Hi @PlantDr430,
Thanks for the context! As always, we'd be interested in learning anything interesting you find about the general behavior of salmon in different contexts and with different parameter settings etc.
Out of curiosity, when you mention that genes perform "better" with one `--scoreExp` or another, is it the case that this is data where you have some sort of ground-truth expectation for the abundance of the primary vs. spliced forms? If so, super interesting!
One other thought I had about this. While it is true, as I mentioned in my original post, that conditioning on the transcripts is fundamental for salmon and other transcript expression tools that don't, themselves, try to assemble new transcripts, it's not necessarily true that the quantifications carry no evidence that something may be awry. Specifically, I noticed that you are using posterior confidence estimation (bootstrapping). We actually have a recent paper that discusses how to use the uncertainty estimates from salmon (though we rely on the Gibbs sampler rather than bootstrapping) to group together transcripts whose abundances cannot be individually estimated with confidence (with evidence provided by the posterior samples). It might be useful to identify such cases in your analysis.
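A simple way to see this signal in your own bootstrap replicates (this is just a sketch of the idea, not our paper's method, and the numbers are hypothetical) is to compare the relative spread of each transcript's estimates against the spread of their sum:

```python
import statistics

def cv(samples):
    """Coefficient of variation across bootstrap/posterior replicates."""
    m = statistics.mean(samples)
    return statistics.stdev(samples) / m if m > 0 else float("inf")

# Hypothetical bootstrap TPMs for two highly similar transcripts: the
# individual estimates trade off against each other across replicates,
# so each transcript has a large CV even though their sum is stable.
primary = [60, 10, 55, 15, 58, 12]
spliced = [10, 60, 15, 55, 12, 58]

print(f"primary CV:  {cv(primary):.2f}")  # high: individually uncertain
combined = [p + s for p, s in zip(primary, spliced)]
print(f"combined CV: {cv(combined):.2f}")  # low: the sum is well determined
```

Transcripts that show this anticorrelated pattern are exactly the ones whose individual abundances the data cannot pin down, which is the kind of case the grouping approach is designed to flag.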
Let me know if there's any other way I can help!
Hi @rob-p,
As for our main goal, we are looking at the precision of Salmon versus biological PCR expectations of abundances. However, it is on a small number of genes, as there have been some initial challenges with using "accurate" transcripts versus computationally predicted transcripts. Anyway, we hope to have this paper ready this year, but again, I think we need to look at it from a different angle. While this is a bit of a side project, my hope is that it will accomplish our initial goal and provide the community with the general behavior of Salmon across different splice types (i.e., exon skipping, IR, AA, AD, etc.). Again, it'll be on a small set of genes, but it should still be interesting and hopefully can improve the future use of Salmon. I'll make sure to let you know when it is available.
Hello,
I am currently using Salmon v1.2.0, which was built from source and is run on Ubuntu Linux.
My objective is to obtain the proportions of different alternatively spliced transcripts, which I am doing via `salmon quant` and using the TPMs to get the proportion of the 'primary' transcript versus the 'spliced' transcript. I am currently working with 5 genes (Genes A-E) from Soybean (Glycine max Wm82.a2.v1), for which I also created decoys using DecoyTranscriptome.sh and the soybean genome and GTF. In addition, I am only using trimmed paired-end reads of 100 bp.

An example of a concern (see table below) is that when I run Genes A, B, C, and D I get a certain set of results. However, when I run all the genes together (Genes A-E), the results for Gene C and Gene D are drastically different, but only under certain parameters (red highlight). If I alter the parameters and compare the results of the 4-gene run and the 5-gene run, the results are comparable.
Param 1 = `--validateMappings --gcBias --recoverOrphans --softclipOverhangs --allowDovetail --posBias --scoreExp 0.0 --consensusSlack 0.95 --rangeFactorizationBins 5 --numBootstraps 1000 --bootstrapReproject --minScoreFraction 0.9`

Param 2 = `--validateMappings --gcBias --scoreExp 1.0 --consensusSlack 0.2 --rangeFactorizationBins 4 --numBootstraps 1000 --bootstrapReproject --minScoreFraction 0.85`
I am trying to understand how the addition of one gene to the run could affect the results of other genes so much; it concerns me that results can differ this much with the removal or addition of even one gene. It almost appears that the proportions are simply switched in the first run, except for the V1 trifoliate dataset of Gene C, which showed at least similar proportions between runs with Param 1 (yellow highlight).
Please let me know if you need to see additional files, I just wanted to summarize what I was seeing here. I didn't receive any errors during these runs, just the results of Gene C and Gene D varying for a reason I cannot pinpoint.
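For reference, the proportion calculation described above is essentially the following (the TPM values here are hypothetical, just to show the computation):

```python
def isoform_proportions(tpms):
    """Convert one gene's per-transcript TPMs into within-gene
    proportions (each transcript's share of the gene's total TPM)."""
    total = sum(tpms.values())
    return {tx: tpm / total for tx, tpm in tpms.items()}

# Hypothetical TPMs for one gene's primary and spliced transcripts:
gene_c = {"primary": 30.0, "spliced": 10.0}
print(isoform_proportions(gene_c))  # primary 0.75, spliced 0.25
```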