Transcript level quantification?

Hi @zhanglab2008,

Sorry to have missed this for so long!

In general, transcript-level analysis is highly unreliable with tagged-end single-cell sequencing data. The vast majority of sequencing reads are drawn from close to the 3' end of the underlying transcripts, so they usually derive from the final or penultimate exon of a transcript (or an intervening intron). Such reads are usually ambiguous as to their transcript of origin within a gene. This is in contrast to bulk RNA-seq, where reads are drawn from all along a transcript's body. In the bulk context, it is therefore often very effective to use an inference algorithm that assigns ambiguous reads based on the reads that can be uniquely assigned to a transcript, also taking into account the expected distribution of reads across the transcripts' bodies. However, such methods are (currently) much less effective in the tagged-end context.
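To make the ambiguity concrete, here is a toy sketch (not how any real mapper works). The transcript names, exon labels, and the `compatible_transcripts` helper are all made up for illustration; the only point is that a read from a shared 3' exon is compatible with every isoform, while a read from an internal, isoform-specific exon is not.

```python
# Toy model (hypothetical): each transcript is an ordered list of exon labels.
# Both isoforms share the first exon "e1" and the 3'-terminal exon "e4".
transcripts = {
    "tx1": ["e1", "e2", "e4"],
    "tx2": ["e1", "e3", "e4"],
}


def compatible_transcripts(exon, transcripts):
    """Return the sorted names of all transcripts that contain the given exon."""
    return sorted(name for name, exons in transcripts.items() if exon in exons)


# A tagged-end read typically falls in the last exon, which both isoforms share.
three_prime_read = compatible_transcripts("e4", transcripts)

# A bulk RNA-seq read can fall in an internal exon that only one isoform has.
internal_read = compatible_transcripts("e2", transcripts)

print("3' read is compatible with:", three_prime_read)
print("internal read is compatible with:", internal_read)
```

In bulk data, reads like the second one give an inference algorithm something to anchor on; in tagged-end data, almost every read looks like the first.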

It is certainly possible to "trick" the tools into doing a transcript-level analysis. This can be done by providing a "transcript-to-gene" map to alevin-fry that simply maps each relevant transcript to itself. This map is how alevin-fry determines which input alignments (which are performed with respect to the index, which is almost always over transcripts) should be aggregated and assumed to derive from the same gene. Those aggregated alignments are then used for, e.g., UMI deduplication. If one instead provides a map that performs no aggregation (every input transcript is mapped to itself rather than to a gene feature), the subsequent processing will take place at the transcript level. Of course, given the caveats mentioned above, this should be done with appropriate caution and with the resulting limitations in mind.
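As a rough sketch of how such an identity map could be produced: the snippet below assumes the map is a two-column, tab-separated file (transcript name, then the feature it is aggregated to), and that the transcript names in the index are the first whitespace-delimited token of each FASTA header. Both are assumptions you should check against your own index, and the helper names and the example FASTA content are hypothetical.

```python
import io


def transcript_names_from_fasta(handle):
    """Collect the first whitespace-delimited token of every FASTA header line."""
    names = []
    for line in handle:
        if line.startswith(">"):
            tokens = line[1:].split()
            if tokens:  # skip malformed headers with no name
                names.append(tokens[0])
    return names


def identity_t2g_lines(names):
    """Map every transcript to itself, so that nothing is aggregated to a gene."""
    return [f"{name}\t{name}" for name in names]


# Hypothetical in-memory FASTA; in practice, open the transcriptome FASTA
# that was used to build the index.
example_fasta = io.StringIO(">ENST_A some description\nACGT\n>ENST_B\nGGCC\n")

names = transcript_names_from_fasta(example_fasta)
t2g_lines = identity_t2g_lines(names)

# Print the map; redirect this to a file to pass to the quantification step.
print("\n".join(t2g_lines))
```

The names in the first column have to match the names in the index exactly, so if your indexing step altered the headers in any way, apply the same transformation here.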

Anyway, sorry again for the long delay in replying, and feel free to reach out if you have further questions.

Best, Rob

COMBINE-lab / alevin-fry

Transcript level quantification? #34