difference in deduplicated read numbers between BAM and FASTQ

Daniel-Liu-c0deb0t / UMICollapse

Accelerating the deduplication and collapsing process for reads with Unique Molecular Identifiers (UMI). Heavily optimized for scalability and orders of magnitude faster than a previous tool.

MIT License


difference in deduplicated read numbers between BAM and FASTQ #23

Closed (ajjeon closed this issue 1 year ago)

ajjeon commented 1 year ago

Hi,

Thank you for developing this tool.

I tried deduplicating the same sample with UMICollapse in both bam mode (with genome alignment) and fastq mode (without genome alignment). The numbers of deduplicated reads are very different. Could this be due to multimapped reads? For instance, if a single read with a unique UMI maps to multiple loci, will UMICollapse consider each alignment a separate read?

FYI, the input contained 53,774,378 reads. Deduplicating the bam file left 23,073,402 reads, while deduplicating the fastq file left 1,673,427 reads. (The fastq mode output also reported 2,666,997 unique reads.)

Thank you

Daniel-Liu-c0deb0t commented 1 year ago

You are correct that with the bam mode, reads that map to different locations will not be deduped. Reads that map to the same location are deduped by their UMIs.
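A toy sketch of the bam-mode behaviour described above may help. It is not UMICollapse's actual implementation: the tuple layout and the function name `dedup_by_position` are made up for illustration, and UMIs are compared by exact match only. It shows a multimapped read surviving once per locus:

```python
def dedup_by_position(alignments):
    """Keep the first alignment seen for each (chrom, pos, UMI) combination.

    alignments: iterable of (read_name, chrom, pos, umi) tuples.
    This layout is hypothetical and chosen only for the example.
    """
    seen = set()
    kept = []
    for name, chrom, pos, umi in alignments:
        key = (chrom, pos, umi)
        if key not in seen:
            seen.add(key)
            kept.append((name, chrom, pos, umi))
    return kept


alignments = [
    ("r1", "chr1", 100, "AACG"),  # r1 is multimapped: first locus
    ("r1", "chr5", 900, "AACG"),  # r1 again, at a second locus
    ("r2", "chr1", 100, "AACG"),  # same locus and UMI as r1: duplicate
    ("r3", "chr1", 100, "TTGC"),  # same locus, different UMI: kept
]

kept = dedup_by_position(alignments)
kept_names = [a[0] for a in kept]
print(len(kept), kept_names)
```

Here `r1` is counted at both loci, so multimapped reads can inflate the deduplicated count relative to the number of distinct molecules.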

For fastq mode, UMICollapse dedupes reads directly by their sequences. If the UMI has been extracted from the sequence and placed in the read header, the UMI is not taken into account when deduping. This is probably why you are seeing significantly fewer reads with fastq mode: reads with different UMIs but identical sequences are collapsed together. To avoid this, don't extract the UMIs before using fastq mode.
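The effect of extracting the UMI can be sketched with a toy example. This is only an illustration under assumptions, not UMICollapse's code: the function name `dedup_by_sequence`, the `name_UMI` header convention, and the exact-match comparison are all hypothetical choices for this sketch.

```python
def dedup_by_sequence(reads):
    """Keep the first read seen for each distinct sequence; headers are ignored.

    reads: iterable of (header, sequence) pairs (hypothetical layout).
    """
    seen = set()
    kept = []
    for header, seq in reads:
        if seq not in seen:
            seen.add(seq)
            kept.append((header, seq))
    return kept


insert = "GATTACA"

# UMI already extracted into the header: the sequences are identical,
# so reads with different UMIs collapse into one.
extracted = [
    ("r1_AACG", insert),
    ("r2_TTGC", insert),
    ("r3_AACG", insert),
]

# UMI left at the start of the sequence: different UMIs give different
# sequences, so only true duplicates (r1 and r3) collapse.
inline = [
    ("r1", "AACG" + insert),
    ("r2", "TTGC" + insert),
    ("r3", "AACG" + insert),
]

n_extracted = len(dedup_by_sequence(extracted))
n_inline = len(dedup_by_sequence(inline))
print(n_extracted, n_inline)
```

With the UMI in the header, the three reads reduce to one; with the UMI kept in the sequence, two remain, one per distinct UMI.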

ajjeon commented 1 year ago

Thank you for clarifying. I will try the fastq mode again without the UMI extracted. I'm closing this issue. Cheers!