Difference in Agrp expression depending on alignment strategy

biOliver commented 2 years ago

Hello Alevin-Fry team, and thank you for you cool tool!

I have a question about Alevin-Fry, which I use in my daily work.

There is a discrepancy between the expression of some genes depending on if you run Alevin-Fry in regular pseudoalignment mode or in selective alignment mode.

We saw this, most notably, with the gene Agrp in snRNAseq mouse brain tissue.

Agrp cells are usually clustering together [in UMAP space], but we noticed that Agrp seemed quite dispersed when running selective alignment, compared to regular pseudoalignment on the exact same data:

Selective alignment: argp_featureplot_selective-alignment argp_vlnplot_selective-alignment

Pseudoalignment: argp_featureplot_pseudoalignment argp_vlnplot_pseudoalignment

I have not been able to make sense of where this deficit in Agrp expression would arise from reading your docs/tutorials, so I really hope that you can help shed some light on this for/with me.

Selective alignment was run according to https://combine-lab.github.io/alevin-fry-tutorials/2021/improving-txome-specificity/ and pseudoalignment was run according to https://combine-lab.github.io/alevin-fry-tutorials/2021/running-alevin-fry-fast/

Both tutorials using cellranger's refdata-gex-mm10-2020-A genome.

The only changes to the above tutorials are that both pipestances used unfiltered quant with 10x's whitelist, library type was changed to ISR and read length was corrected to 90 in the make_splici_txome part of selective alignment, to reflect our sequencing output.

I hope that I was able to explain this, and I look forward to hear your take on it. :)

rob-p commented 2 years ago

Thanks for the report @biOliver! I'm tagging @DongzeHE on this as well. My first question is always, is the data sharable? That always makes direct investigation easier. However, if not, we can probably diagnose with a back and forth.

Some other important details:

Are both of these approaches using the splici reference? The reason I ask is that the pseudoalignment with structural constraints tutorial you point to ( https://combine-lab.github.io/alevin-fry-tutorials/2021/running-alevin-fry-fast/) uses mapping directly against only the spliced transcriptome.
You mentioned this is snRNA-seq; can you describe what counts you are using in the analysis and if it's the same "layers" in both cases? Specifically, when run in splici-mode, alevin-fry's count matrix contains counts for unspliced (U), spliced (S) and ambiguous (A) UMI assignments. For snRNA-seq analysis, we generally recommend you combine (add) all 3 together into a single gene by cell matrix for analysis; is this happening in both pipelines being compared here?

Let us know if you have any questions about the above or need any further clarification on this.

Thanks! Rob

biOliver commented 2 years ago

Thanks for the quick reply and the warm engagement. :)

Ill get back to you about sharing data when I have finished consulting the owner about that.

Only the 'selective-alignment' approach uses splici. What I called 'pseudoalignment' is, as you say, pseudoalignment with structural constraints. And it is exactly the discrepancy between these two approaches, applied to our data at least, that I am curious about.

When running 'selective alignment' I use all three layers, and when running 'pseudoalignment' usa_mode == false, so i just use 'the matrix'.

rob-p commented 2 years ago

Great — so this is certainly likely to cause a difference, since the references being mapped are quite distinct. In the mean time, I'd suggest trying to see what happens with a more apples-to-apples comparison. Specifically, what if you use the splici reference with pseudoalignment with structural constraints? This will normalize out the reference against which you're mapping and will help determine if the difference is arising primarily due to alignment methodology or the reference being used. One other question, what is the number of reads mapped / the mapping rate among these two strategies?

biOliver commented 2 years ago

Alright, maybe my problem is more theoretical? "since the references being mapped are quite distinct." yes!, but they are both build from the same cellranger reference. I don't quite see why a gene like Agrp would have that amount of discrepancy when the t2/3g's and indicies are made on the same basis, even if its two different tutorials.

Apples-to-apples make sense. Here are the corresponding plots for pseudoalignment using the splici reference: Screenshot 2022-08-19 at 10 22 58 Screenshot 2022-08-19 at 10 32 22

Thank you for pointing out the uneven comparison. Pseudoalignment from the splici index seems to solve the dispersion issue (from looking at the feature plots). I am still unsure about the difference in expression level.

This is a new violin plot where i use the the [unnormalised] RNA assay, instead of the SCT assay, to compare expression between the, now, three approaches. Screenshot 2022-08-19 at 10 53 19

From this it seems that 'pseudo - splici' mimics the expression pattern of 'pseudo - structCon' but also mimics the lower expression levels of 'selective - splici'. Do you agree that it seems like a lot of Agrp expression disappears when going from splici reference to structural constraint reference?

Mapping rates for SI-TT-A4: 'selective - splici'. 80.3 % 'pseudo - structCon': 15.0 % 'pseudo - splici': 80.6 %

rob-p commented 2 years ago

Hi @biOliver,

Thanks for this comparison; I think it helps further our understanding a lot. So what I see now is that when using the splici transcriptome (regardless of selective alignment (SA) versus pseudo alignment with structural constraints (PSAC)) you seem to get a very similar pattern of expression. On the other hand, when you use PSAC with just the spliced transcriptome, you get much higher putative expression for this gene, but unexpected concentration for the cells in which it is most strongly expressed.

You could try selective alignment against just the spliced transcriptome (the only quadrant missing from our comparison). There I suspect you may get something between what you see with PSAC - splici and PSAC - spliced only. Since this is single nucleus data, and we expect a decent fraction of reads to derive from unspliced transcripts, I think usage of the splici transcriptome is probably very important in this context. As to why the expression of this gene looks so much higher in PSAC - spliced only, it’s hard to say definitively without looking at the data, but a guess might be that, since that mode is considering only a subset of the reference from which we might expect reads to derive, and PSAC is much more lenient than SA at taking the “best match” without explicitly scoring it to determine if it is good enough, it may be the case that reads are being spurious assigned to this gene under that mode. For example, imagine there is another highly-expressed locus in the unspliced transcriptome that better explains these reads. When using PSAC for alignment, if all matching k—mers from the read hit the target gene, then it will be assigned there, even if there are other k-mers that don’t match anywhere in the (spliced in this case) reference. On the other hand, SA will explicitly score each mapping to see if enough of the read aligns well to assign the read to this spliced transcript.

So, to have the most complete picture, I’d suggest we look at SA -> spliced only and see if we observe what I predict above. Again, if the data is shareable, we’d be happy to dig deeper and actually track down where some of the specific reads go between the modes. Overall, however, my guess is that this is just a case where using the splici txome may be especially important given the preponderance of unspliced reads in the data!

biOliver commented 2 years ago

Good idea with the final comparison. Before I go any further, or digest your message fully, I have gotten permission to share the data.

I am thinking of sending R1 and R2 fastqs from one of the reactions through a link. Would you prefer a specific way of transferring the data and do you have more info/data you want than the fastqs?

rob-p commented 2 years ago

Great! No preferred mechanism for sharing; anything works. In terms of other info, just the chemistry and tissue type and if you have them handy the few lines of analysis code you are using to make the plots above.

Thanks! Rob

rob-p commented 2 years ago

Hi @biOliver,

We made some substantial progress on this issue via our e-mail exchanges. I just wanted to loop back around here on GitHub to see:

Are there any remaining unresolved questions you have regarding the above? If so we can hash it out here or in another issue if you think it's conceptually separate.
Do you think it's worth us trying to summarize our findings for folks who stumble upon this issue / question in the future?

Thanks! Rob

rob-p commented 2 years ago

Any thoughts on this @biOliver?

biOliver commented 2 years ago

Yes, sorry for the late reply.

Everything [related to this issue] has been solved for me. In the end, upgrading salmon and alevin-fry to latest versions fixed everything.

So, it was not a difference in alignment strategy - as I incorrectly assumed - that made the difference, and therefor this issue should probably just be closed.

Thank you for all your help @rob-p :)

COMBINE-lab / alevin-fry

Difference in Agrp expression depending on alignment strategy #81