COMBINE-lab / alevin-fry

🐟 🔬🦀 alevin-fry is an efficient and flexible tool for processing single-cell sequencing data, currently focused on single-cell transcriptomics and feature barcoding.
https://alevin-fry.readthedocs.io
BSD 3-Clause "New" or "Revised" License

Calculating spliced/unspliced ratio and what to do with the ambiguous count #90

Closed mdmanurung closed 1 year ago

mdmanurung commented 1 year ago

Dear authors,

I want to calculate the spliced/unspliced gene ratio, but I am not sure what to do with the ambiguous count table. Should I just remove it, or combine it with either the spliced or the unspliced counts?

I am a beginner in this area and so I'd like to apologise in advance for the naive question.

Best, Mikhael

DongzeHE commented 1 year ago

Hello @mdmanurung

Thanks so much for choosing alevin-fry.

In short, either removing the ambiguous counts or splitting them 50/50 into spliced (S) and unspliced (U) counts is fine. These are what people usually do in their research.

TL;DR: When we say a gene in a cell has an ambiguous UMI, it means the splicing status of the mRNA molecule represented by this UMI cannot be determined; i.e., the reads of this UMI mapped equally well to some spliced transcripts and some introns of this gene. Therefore, when calculating the S/U ratio, the counts of these ambiguous UMIs can either be ignored, because their splicing status is unknown, or be split half and half into the S and U counts, because their reads mapped equally well to S and U.
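To make the two options concrete, here is a minimal Python sketch for a single gene in a single cell. It is not part of alevin-fry; the function name `spliced_unspliced_ratio`, its `ambiguous` parameter, and the example counts are all made up for illustration, and it assumes you have already extracted the S, U, and ambiguous (A) UMI counts from the output.

```python
def spliced_unspliced_ratio(s, u, a, ambiguous="discard"):
    """Return the S/U ratio for one gene in one cell.

    s, u, a are the spliced, unspliced and ambiguous UMI counts.
    ambiguous="discard" ignores a entirely;
    ambiguous="split" adds half of a to each of S and U.
    (Hypothetical helper, not an alevin-fry function.)
    """
    if ambiguous == "split":
        s = s + a / 2
        u = u + a / 2
    elif ambiguous != "discard":
        raise ValueError(f"unknown strategy: {ambiguous!r}")
    if u == 0:
        # The ratio is undefined when there are no unspliced UMIs.
        return float("nan")
    return s / u


# Made-up counts: 30 spliced, 10 unspliced, 4 ambiguous UMIs.
ratio_discard = spliced_unspliced_ratio(30, 10, 4, ambiguous="discard")
ratio_split = spliced_unspliced_ratio(30, 10, 4, ambiguous="split")
print(ratio_discard)  # 30 / 10
print(ratio_split)    # (30 + 2) / (10 + 2)
```

Note that splitting always pulls the ratio toward 1, since the same amount is added to both the numerator and the denominator, while discarding leaves the unambiguous S and U counts untouched.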

More generally, this relates to an active research question: can we compare S and U counts directly, without any transfer learning or domain adaptation? The concern is that introns contain internal poly-A stretches, which can act as priming sites. If this happens, the priming mechanism of spliced transcripts (poly-A tail priming) might differ substantially from that of unspliced transcripts (poly-A tail priming + internal poly-A priming). See this technical note from 10x and this paper.

In addition, one caveat of the spliced and unspliced counts inferred by alevin-fry, and by all other mainstream quantification tools, is that unspliced UMI counts are represented by intronic UMI counts. However, as we know, unspliced transcripts also contain exons, so a UMI whose reads fall only in exons is counted as spliced even though it could have come from an unspliced molecule. In other words, we prefer to assign UMIs as spliced rather than unspliced. People do this because they (and we) want to include as many UMIs as possible in our (spliced) count matrix.

These are all the dark sides of the question. Nonetheless, if we assume that the usual assumptions of single-cell analysis hold and that the effect of these caveats is minor, simply removing the ambiguous counts or splitting them 50/50 into S and U is fine.

Best, Dongze

mdmanurung commented 1 year ago

Hi Dongze,

Thank you so much for the detailed answer. I'll need some time to let that sink in.

Best, Mikhael