How is collapsing of the corrected UMIs for transcript quantification done?

epi2me-labs / wf-single-cell

How is collapsing of the corrected UMIs for transcript quantification done? #135

Closed skudashev closed 3 months ago

skudashev commented 3 months ago

Ask away!

Hello, I see that you use UMI-tools for UMI correction, but how do you collapse UMIs for quantification? Do you just randomly pick one read per UMI, or do you use a process similar to UMI-tools, which selects the read with the best mapping score? I have been trying to use the tagged BAM generated by your pipeline to do transcript quantification with a different tool, but the only way I can run umi_tools dedup is to first remove all reads where the corrected UMI (UB) is shorter than 12 nt.

Kind regards, Sofia

nrhorner commented 3 months ago

Hi Sofia

The UMIs are collapsed for the expression matrix creation only. This is done by grouping reads by cell barcode and gene/transcript and counting the unique UMIs in each group.
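As an illustration only (this is not the workflow's actual code), the grouping described above can be sketched in a few lines of Python. The function name `count_unique_umis`, the tuple layout, and the example barcodes and gene names are all hypothetical; in practice the three values would come from each read's barcode, gene/transcript, and corrected-UMI assignments.

```python
from collections import defaultdict


def count_unique_umis(reads):
    """Count distinct UMIs per (cell barcode, feature) pair.

    `reads` is an iterable of (cell_barcode, feature, umi) tuples, where
    `feature` is a gene or transcript identifier. Hypothetical helper,
    not part of wf-single-cell.
    """
    umis_per_group = defaultdict(set)
    for cell_barcode, feature, umi in reads:
        # Reads sharing barcode, feature and UMI collapse into one set entry.
        umis_per_group[(cell_barcode, feature)].add(umi)
    return {group: len(umis) for group, umis in umis_per_group.items()}


# Made-up example reads.
reads = [
    ("AAAC", "GeneA", "TTTTGGGGCCCC"),
    ("AAAC", "GeneA", "TTTTGGGGCCCC"),  # same UMI again: counted once
    ("AAAC", "GeneA", "ACGTACGTACGT"),
    ("AAAC", "GeneB", "TTTTGGGGCCCC"),
    ("GGGT", "GeneA", "TTTTGGGGCCCC"),
]
counts = count_unique_umis(reads)
print(counts)
```

No read is chosen as a representative here: duplicates only affect the count, which is why no mapping-score tie-break is needed for the matrix.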

The tagged BAMs are not subjected to deduplication. All reads that were assigned a barcode and UMI are written to them.

Your last comment suggests there is an issue when the corrected UMI sequence is shorter than the expected 12 nt. At what sort of frequency do you see this happening?

skudashev commented 3 months ago

Hello,

Thank you for your explanation. That makes sense: the quantification is done using the transcript and UMI assignments, rather than by collapsing UMIs before mapping to transcripts. UMIs shorter than 12 nt are not a big issue, as these short-UMI reads make up only 0.005% of the total UMI-tagged reads.
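For anyone wanting to reproduce the filter and the frequency check, here is a minimal sketch, assuming the read IDs and corrected UMIs (the UB tag) have already been pulled out of the tagged BAM, for example with pysam. The function name `split_by_umi_length` and the example records are hypothetical.

```python
EXPECTED_UMI_LENGTH = 12  # expected UMI length mentioned in this thread


def split_by_umi_length(records, min_length=EXPECTED_UMI_LENGTH):
    """Separate reads whose corrected UMI is shorter than `min_length`.

    `records` is an iterable of (read_id, umi) tuples. Hypothetical helper
    for illustration only.
    """
    kept, dropped = [], []
    for read_id, umi in records:
        if len(umi) < min_length:
            dropped.append((read_id, umi))
        else:
            kept.append((read_id, umi))
    return kept, dropped


# Made-up example records.
records = [
    ("read1", "ACGTACGTACGT"),
    ("read2", "ACGTACGTAC"),  # 10 nt: too short, dropped
    ("read3", "TTTTGGGGCCCC"),
    ("read4", "ACGTACGTACGA"),
]
kept, dropped = split_by_umi_length(records)
short_fraction = len(dropped) / len(records)
print(f"{len(dropped)} of {len(records)} reads dropped ({short_fraction:.3%})")
```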

Best, Sofia