COMBINE-lab / alevin-fry

🐟 🔬🦀 alevin-fry is an efficient and flexible tool for processing single-cell sequencing data, currently focused on single-cell transcriptomics and feature barcoding.
https://alevin-fry.readthedocs.io
BSD 3-Clause "New" or "Revised" License
165 stars 15 forks source link

Difference in expression of spliced/ambigous transcripts, between old and new versions of Salmon/Alevin-fry #82

Closed biOliver closed 1 year ago

biOliver commented 1 year ago

Hello again.

I originally posted about a "Difference in Agrp expression depending on alignment strategy" (https://github.com/COMBINE-lab/alevin-fry/issues/81)

I closed the issue, since the issue never had anything to do with alignment strategy (Thanks a ton, @rob-p, for the work leading to this conclusion!). And the issue was ultimately solved by upgrading Salmon and Alevin-fry to the newest versions.

But we are still left wondering why we see the discrepancy between spliced counts (depending on which version of Alevin-fry/Salmon is used).


So, I am going to try and pick this up where we left it as good as I can. But first some nomenclature:

Original: versions: Salmon v1.5.0, Alvein-fry v0.3.0 Updated: versions: Salmon v1.9.0, Alvein-fry v0.7.0 U-, S- & A-counts: unspliced, spliced & ambiguous counts


In an e-mail exchange you, @rob-p, proposed the following:

...I believe 0.4.0 is where we added a “prefer spliced gene” heuristic. If a read maps to more than one gene (but not too many), and all but one of the mappings are to unspliced sequence, then we allocate the read to the gene where it mapped to the spliced sequence....

After thinking about it, I do not think that this is the case. E.g. if you plot the total Agrp count across the same cells in the original and updated object you get this: Screenshot 2022-09-20 at 13 08 23 Here you can see that the extra abundance of Agrp counts doesn't come from the unspliced assay, but rather it just appears from nothing.

For scale, here is the same plot, but for accumulated gene counts: Screenshot 2022-09-20 at 13 10 59 Here, it is not as apparent, but it is still clear that we get a portion of S-counts (when going from original to updated) that is too big to be 'just' coming from the U-counts.


Another thing we tested was to plot all individual gene counts against each other (between original and updated object). We did it for three tissues types (brain, adipose & liver), and distinguished between U-, S- & A-counts. Screenshot 2022-09-20 at 13 13 51 Here, you can see that there is a high agreement in the U-counts, but that there seems to be a significant 'bleeding' of expression towards the updated object in A-counts and particular in S-counts.


Given this info, would you have any idea of what why upgrading Salmon/Alevin-fry, in the described manner, could cause this discrepancy? Let me know what you think and if you want me to send more plots (or code) along.

Kind regards Oliver

rob-p commented 1 year ago

Hi @biOliver,

Thank you again for the brilliantly detailed analysis here. So, let me try to be a bit more clear about what I think are the main things that changed between these versions. There are of course several updates here, so there are also other minor changes and those may have some effect as well. The two main things that come to mind are :

  1. The "prefer-spliced" heuristic. This is the one I mentioned in my e-mail. This was present since, I believe 0.4.0 (which corresponds to when the paper was first published), but was absent in 0.3.0. Consider that I have a mapping like the following: { (UMI1, G1-spliced), (UMI1, G2—unspliced) }. That is, this is a gene ambiguous UMI. It maps both to a spliced molecule of G1 and an unspliced molecule of G2. How this gets resolved depends on the UMI resolution mode one is using. If you are using the cr-like resolution mode though (e.g. the one we used throughout the paper), then in v0.3.0 those UMIs are being discarded. They are gene ambiguous, and so they are not assigned to any gene (i.e. we expect the total UMI count in the output matrix will be lower for not including these). However, under the "prefer-spliced" heuristic, this UMI is assigned to G1. Specifically, the rule is that if the UMI maps to only a single spliced molecule (say, belonging to G1) then we ask:
    • Is it splicing ambiguous in G1 (i.e. does it also map to G1-unspliced)?; in that case it is counted as A
    • Does it map to "too many" other unspliced molecules $< \tau$ (I believe the default is 10). If it maps in a spliced manner only to G1 and maps to $< \tau$ other molecules of different genes, it is assigned to G1, otherwise if it maps to $\ge \tau$ other molecules of different genes, it is discarded.

So, this may explain how you see extra counts for this gene appearing from "nowhere" with respect to the 0.3.0 result. Previously, those were being discarded, but in the new version (and back to 0.4) they are being assigned as spliced because they align to the spliced version of this gene and not to too many other molecules. Further, as I mentioned above, the differences you observe here can depend on the resolution mode you are using. While cr-like discards gene-ambiguous reads (initially all of them, and then eventually those that aren't rescued by the "prefer-spliced" heuristic), the cr-like-em algorithm instead considers probabilistically allocating those UMIs among the different places where they map. The confidence of those assignments may not be high if there is not a lot of other evidence to prefer one gene over the others, but at least in that case you'd expect the mass to be more conserved.

  1. The other relevant change is more on the salmon side (and I suspect that this may relate to changes you are seeing in the A matrix). Specifically, between 1.5.2 and 1.9.0 several changes were made to improve sensitivity. Salmon places a cutoff on how many occurrences a k-mer can have and still be used for mapping, as well as the maximum number of mappings an entire read can have and still be reported. This is primarily to prevent super-repetitive regions from slowing down the entire analysis. However, the nature of repetitive sequence in non-coding regions is different than that in coding regions, and so one may want to be a bit more permissive in what they allow. To address this, we implemented a "recovery" procedure. Essentially, we have a default threshold $n_1$ such that if seeds occur $> n_1$ times they are not used. Previously, if all seeds occurred $> n_1$ times, then the read would go unmapped. The recovery procedure instead says, if all seeds occur $> n_1$ times, then let $m = \min( \text{occ}_1, \text{occ}_2, \dots, \text{occ}_j )$ where $\text{occ}_j$ is the number of occurrences of seed $j$. Then, if $m < n_2$ (where $n_2$ is some number much larger than $n_1$) we collect all seeds that occur $m$ times and perform the mapping based on these seeds. That's a lot of technical detail, but the long-and-short of it is that this procedure recovers mappings for reads that map in highly repetitive regions. Instead of those reads simply appearing as unmapped (in the older version) because their seeds matched to too many places, they may map in the newer version. This can also explain why the newer pipeline has more reads to work with.

Together, I think these differences could explain the effects you are seeing. However, I'm happy to discuss any followup questions or comments you may have here. Again, thanks for the wonderfully detailed issue!

biOliver commented 1 year ago

Thank you so much for this explanation!

I see now, that I didn't understand the "prefer-spliced" heuristic the first time around. We now agree that it probably is the main reason for the 'emerging' counts that we see. So far so good. :)

I believe that I saw somewhere [unfortunately, I cannot find where I originally saw it] that the rational for "prefer spliced" was that 'origin of reads are mainly spliced'. But that doesn't hold as an argument for our data, which is mainly (~80%) unspliced.

Have I understood the rationale right? And if so, would it be wrong of us to use "prefer-spliced" when our data is mainly unspliced?

Kind regards Oliver

rob-p commented 1 year ago

Hi @biOliver,

Thanks for the follow-up. I have one more clarification, that may shift the understanding a bit. But first, I think I mentioned the prefer-spliced logic in an e-mail. However, as I mentioned, this idea goes back further and also exists in other workflows / tools. Alex Dobin describes the basic idea in this tweet (https://twitter.com/a_dobin/status/1417921407204438017?s=20&t=tK8iFy8qJoKXKY5399f5gA), with reference to the Ding et al. paper ('20). That rule was later implemented in STARSolo as well.

Now, to the clarification ;P. The rule is actually implemented here : https://github.com/COMBINE-lab/alevin-fry/blob/master/src/utils.rs#L296. Specifically, the code the implements the "prefer spliced" rule is this: https://github.com/COMBINE-lab/alevin-fry/blob/2ebfaa95bac8badf328db1f5c9eaff183bb70b9c/src/utils.rs#L343-L373

I recognize you may not be a rust native, but hopefully the comments clarify a bit more what is happening. Specifically, though I've called the heuristic "prefer spliced", it will look for the gene that has a spliced transcript mapping this UMI. However, if the UMI is ambiguous within this gene (i.e. it maps to both a spliced and unspliced transcript of this gene), then the UMI itself will be assigned as ambiguous for this gene. So the UMI has been "rescued" (it's no longer discarded), but it is assigned as an ambiguous count for the gene for which it is rescued. Of course, if it is not ambiguous in this gene (maps only to the spliced transcript), then it will be assigned as spliced.

The "prefer-spliced" heuristic is, of course, still a heuristic. And so, I agree, it may not be optimal in all cases. However, in cases such as this, where the UMI is otherwise gene-ambiguous, such UMIs would instead simply be discarded under e.g. the cr-like UMI resolution policy. The other way to deal with such cases is to instead use a resolution policy like cr-like-em or parsimony-em that will allow UMIs to map to multiple genes, but then assign them probabilistically after all other UMIs in the cell have been allocated. However, it is worth mentioning that a good probabilistic model for tagged-end single-cell RNA-seq is still a topic of active research, since the very low coverage per-cell means that the EM algorithm has relatively little other information from which to make an informed decision — much of the time this means that the multimapping UMIs may just be split evenly among the multi-mapping locations. This, of course, isn't just an challenge for alevin-fry, but for all existing single-cell quantification methods.

In such cases, which method to prefer (how to handle such UMIs) is really up to the end user based on their expectations and the analysis they wish to perform. However, I'll mention that these cases are exactly the types of things we're interested in, since it's one of the areas where there is the most room to make substantial progress with improved methods. So, we'd be happy to follow up further or help dig in to this data in more detail; perhaps it will help suggest a fundamentally better approach to such cases.

Best, Rob

biOliver commented 1 year ago

Thank you a lot for the detailed explanation. :)

I have looked at the tweet and through the Ding et al. paper (and eventually the STARsolo preprint) and I am not able to find any mention of any prefer spliced heuristic/rational. Would you maybe be able to clarify where in e.g. Ding et al. I could find any mention of and/or justification for it?

Kind regards Oliver

rob-p commented 1 year ago

Hi @biOliver,

Sure; the relevant text in the paper is:

Annotating each alignment with a gene tag We use featureCounts40 from the Subread package, v.1.6.2, to add a gene tag to each alignment. To count reads overlapping with introns for single-nucleus RNA-seq data, we used a two-step approach to first count the reads overlapping with exons. In the second step, the reads not overlapping with exons were recounted if they overlapped with introns. We only included reads aligning in the sense orientation with the genome annotation, except for Smart-seq2, which does not generate strand-specific data.

They don’t specifically call it the “prefer spliced” heuristic, but this is what Alex is referring to and that is what this rule is explicitly doing. So they first look to assign a UMI to any exon it overlaps. Only in the case that it fails to overlap an exon (i.e. if there is no “spliced” alignment), do they consider annotating it with an intron.

Best, Rob

biOliver commented 1 year ago

Ah, got it. Thank you.

You have answered all the questions I had regarding this heuristic. And so I will close this issue.

Thank you a lot Rob, again. :)

Kind regards Oliver