COMBINE-lab / simpleaf

A Rust framework to make using alevin-fry even simpler
BSD 3-Clause "New" or "Revised" License

Cell-by-isoform matrix #115

Open sameer-aryal opened 8 months ago

sameer-aryal commented 8 months ago

I wanted to ask if it was possible to generate a barcode-by-isoform count matrix (instead of gene-level counts) using simpleaf; thanks very much.

rob-p commented 8 months ago

Hi @sameer-aryal,

Yes, this is possible. To do this, you'd want to replace the t2g or t2g_3col file, which is a mapping from transcripts to genes (or transcripts to gene + splicing status), with a corresponding file that maps transcripts to themselves. You can, of course, decide how you want to handle the splicing status in this case (e.g. consider each merged intronic span as a separate transcript, group them all together into a single intronic supertranscript for the gene, etc.).
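As a rough illustration of the replacement described above, here is a minimal Python sketch that rewrites a tab-separated t2g (2-column) or t2g_3col (3-column) table so that each transcript maps to itself. It is a sketch under stated assumptions, not simpleaf's own tooling: the function names, the example IDs, the `S`/`U` status letters in the third column, and the `-U` suffix used for the grouped intronic target are all assumptions made for the example, so check them against your actual reference files.

```python
import csv
import io


def read_t2g(text):
    """Parse tab-separated t2g text into a list of tuples, skipping blank lines."""
    return [tuple(row) for row in csv.reader(io.StringIO(text), delimiter="\t") if row]


def write_t2g(rows):
    """Serialize rows back to tab-separated text."""
    buf = io.StringIO()
    csv.writer(buf, delimiter="\t", lineterminator="\n").writerows(rows)
    return buf.getvalue()


def transcript_level_t2g(rows, group_unspliced=False):
    """Map every transcript to itself instead of to its gene.

    rows: (transcript, gene) or (transcript, gene, status) tuples.
    group_unspliced: if True, rows whose status is "U" are collapsed into one
    per-gene target named "<gene>-U" (a hypothetical naming choice); otherwise
    each unspliced/intronic entry is kept as its own target.
    """
    out = []
    for row in rows:
        txp = row[0]
        if len(row) == 2:
            # 2-column file: plain identity mapping.
            out.append((txp, txp))
            continue
        gene, status = row[1], row[2]
        if status == "U" and group_unspliced:
            target = f"{gene}-U"  # assumed name for the intronic "supertranscript"
        else:
            target = txp
        out.append((txp, target, status))
    return out


# Hypothetical 3-column input: two spliced isoforms and two intronic spans of geneA.
example = "tx1\tgeneA\tS\ntx2\tgeneA\tS\ngeneA-I\tgeneA\tU\ngeneA-I1\tgeneA\tU\n"
converted = write_t2g(transcript_level_t2g(read_t2g(example), group_unspliced=True))
print(converted, end="")
```

With `group_unspliced=False`, each intronic span would instead remain its own target, which corresponds to the first option mentioned above. The resulting text would be saved to a file and supplied in place of the original t2g map.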

However, the big caveat here is that while this is easy to do technically, current 3' tag-based protocols are unlikely to give you reliable isoform-level information. This is because they are sequencing in a strongly biased way from the 3' end of the transcripts, so, at most, you may be able to distinguish families of transcripts that share different terminal exons. Likewise, the per-cell depth of coverage is very low, so there is not much information to help with resolving ambiguous reads (I'm guessing you'd want to use a UMI resolution method in this case that turns on the EM to help avoid losing too many reads to multimapping).
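To make the point about ambiguous reads concrete, here is a toy Python sketch of how an EM step can apportion multimapping UMIs among transcripts rather than discarding them. It is only an illustration of the general idea, not alevin-fry's implementation; the function name, the equivalence-class representation, and the counts are all made up for the example.

```python
def em_allocate(eq_classes, n_iter=100):
    """Toy EM: split UMI counts of ambiguous equivalence classes among transcripts.

    eq_classes: dict mapping a tuple of compatible transcript names to the
    number of UMIs observed with exactly that compatibility set.
    Returns a dict of estimated per-transcript UMI counts.
    """
    txps = sorted({t for labels in eq_classes for t in labels})
    total = sum(eq_classes.values())
    # Start from a uniform guess.
    abundance = {t: total / len(txps) for t in txps}
    for _ in range(n_iter):
        updated = {t: 0.0 for t in txps}
        for labels, count in eq_classes.items():
            denom = sum(abundance[t] for t in labels)
            if denom == 0:
                continue  # only reachable for classes with zero count
            # E-step: share this class's UMIs in proportion to current estimates.
            for t in labels:
                updated[t] += count * abundance[t] / denom
        # M-step: the new estimates are the expected counts.
        abundance = updated
    return abundance


# Hypothetical cell: 8 UMIs unique to tx1, 2 unique to tx2, 10 compatible with both.
estimates = em_allocate({("tx1",): 8, ("tx2",): 2, ("tx1", "tx2"): 10})
print(estimates)  # tx1 converges to about 16, tx2 to about 4
```

In this example the 10 ambiguous UMIs are split 8:2, following the uniquely mapping evidence. When a cell has few or no uniquely mapping UMIs for an isoform, there is little evidence to guide that split, which is the low-coverage concern raised above.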

Anyway, we're happy to help you out if you want to give this a try. I'm pinging @DongzeHE so he can chime in here as well if he wants.

Best, Rob

sameer-aryal commented 8 months ago

Dear @rob-p,

Thanks very much for the guidance, as well as for creating and maintaining this excellent tool.

…so, at most, you may be able to distinguish families of transcripts that share different terminal exons.

This is exactly the case I wish to use this approach for.

you'd want to replace the t2g or t2g_3col file, which is a mapping from transcripts to genes (or transcripts to gene + splicing status), with a corresponding file that maps transcripts to themselves.

I will give this a try; thanks very much again.