COMBINE-lab / alevin-fry

🐟 🔬🦀 alevin-fry is an efficient and flexible tool for processing single-cell sequencing data, currently focused on single-cell transcriptomics and feature barcoding.
https://alevin-fry.readthedocs.io
BSD 3-Clause "New" or "Revised" License
170 stars 15 forks

Transcript level quantification? #34

Closed (zhanglab2008 closed this issue 2 years ago)

zhanglab2008 commented 2 years ago

Hi team, very nice work! I noticed that the tutorial is set up to generate gene-level quantification (https://combine-lab.github.io/alevin-fry-tutorials/2021/improving-txome-specificity/). Can I use alevin-fry to get transcript-level counts for each single cell? Thanks for developing the tool!

rob-p commented 2 years ago

Hi @zhanglab2008,

Sorry to have missed this for so long!

In general, transcript-level analysis is highly unreliable with tagged-end single-cell sequencing data. This is because the vast majority of sequencing reads are drawn from close to the 3' end of the underlying transcripts, meaning that reads usually derive from the final or penultimate exon in a transcript (or an intervening intron), and the resulting reads are therefore usually ambiguous as to their transcript of origin within a gene. This is in contrast to bulk RNA-seq where reads are drawn from all along a transcript's body. In the bulk context, it is therefore often very effective to use an inference algorithm to infer the assignment of ambiguous reads from reads that can be uniquely assigned to a transcript, considering also the expected distribution of reads across the transcripts' bodies. However, such methods are (currently) much less effective in the tagged-end context.
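
To make the ambiguity concrete, here is a toy sketch in Python. The gene model, the coordinates, and the function name `compatible_transcripts` are all hypothetical and made up for illustration. It also ignores spliced reads and assumes a forward-strand gene, so the 3' end sits at the highest coordinates.

```python
# Hypothetical toy gene: each transcript is a list of (start, end) exon intervals.
# Both isoforms share the final exon (400, 500); only T1 contains the middle exon.
TRANSCRIPTS = {
    "T1": [(0, 100), (200, 300), (400, 500)],
    "T2": [(0, 100), (400, 500)],  # skips the middle exon
}


def compatible_transcripts(read_start, read_end, transcripts=TRANSCRIPTS):
    """Return the sorted names of transcripts with an exon fully containing the read."""
    return sorted(
        name
        for name, exons in transcripts.items()
        if any(s <= read_start and read_end <= e for s, e in exons)
    )


# A read near the 3' end falls in the shared final exon.
three_prime_read = compatible_transcripts(420, 470)
# A read from the transcript body, as one would often see in bulk RNA-seq.
internal_read = compatible_transcripts(220, 270)

print(three_prime_read)  # ['T1', 'T2']
print(internal_read)     # ['T1']
```

The 3' read is compatible with both isoforms, while the internal read identifies `T1` uniquely. Tagged-end protocols yield mostly reads of the first kind, so an inference procedure has few uniquely assignable reads to work from.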

It is certainly possible to "trick" tools into doing a transcript-level analysis. This can be done by providing a "transcript-to-gene" map to alevin-fry that simply maps each relevant transcript to itself. The mapping is how alevin-fry determines which input alignments (which are computed against the index, which is almost always built over transcripts) should be aggregated and assumed to derive from the same gene. Those aggregated alignments are then used, e.g., for UMI deduplication. If one instead provides a mapping that performs no aggregation (every input transcript is mapped to itself rather than to a gene feature), the subsequent processing will take place at the transcript level. Of course, due to the caveats mentioned above, this should be done with appropriate caution and awareness of the limitations involved.
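
As a rough sketch of how such an identity map could be generated, the Python snippet below writes one `transcript<TAB>transcript` line per FASTA record. It makes two assumptions that are not stated in this thread: that the map is a plain two-column, tab-separated file, and that the transcript name in the index is the first whitespace-delimited token of each FASTA header. The function name `write_identity_t2g` and the toy records are hypothetical. References with a different map layout (for example, one carrying extra columns) are not covered here.

```python
import io


def write_identity_t2g(fasta_lines, out):
    """Write 'name<TAB>name' for every FASTA header; return the number of records."""
    n = 0
    for line in fasta_lines:
        if not line.startswith(">"):
            continue  # sequence line, not a header
        tokens = line[1:].split()
        if not tokens:
            continue  # skip malformed, empty headers
        # Assumption: the index names each transcript by the first header token.
        name = tokens[0]
        out.write(f"{name}\t{name}\n")
        n += 1
    return n


# Toy demonstration with in-memory data (hypothetical transcript names).
fasta = io.StringIO(">ENST0001.1 gene=A\nACGT\n>ENST0002.1 gene=A\nGGCC\n")
buf = io.StringIO()
count = write_identity_t2g(fasta, buf)
print(buf.getvalue(), end="")
```

With real data, one would pass open file handles instead of the `io.StringIO` objects. It is worth checking that the names written here match the names in the index exactly, since any mismatch would leave those transcripts unmapped.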

Anyway, sorry again for the long delay in replying, and feel free to reach out if you have further questions.

Best, Rob