You are correct that with the bam mode, reads that map to different locations will not be deduped. Reads that map to the same location are deduped by their UMIs.
For fastq mode, UMICollapse directly dedupes reads based on their sequences. If the UMI is extracted out of the sequence and placed in the read headers, then fastq mode will not take the UMI into account when deduping. This is probably why you are seeing significantly fewer reads when deduping with fastq mode, as reads with different UMIs but identical sequences are collapsed together. If you want to avoid this, don't extract the UMIs when using fastq mode.
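To make the difference concrete, here is a minimal Python sketch of the three situations described above. It is only an illustration: it uses exact matching on made-up toy reads and does not reproduce UMICollapse's actual algorithm. The function names and the read tuples are hypothetical.

```python
# Toy reads as (mapped_position, umi, sequence); all values are made up.
reads = [
    (100, "AAAA", "ACGTACGT"),
    (100, "AAAA", "ACGTACGT"),  # true duplicate: same position, UMI, sequence
    (100, "CCCC", "ACGTACGT"),  # different molecule with the same sequence
    (100, "GGGG", "ACGTACGT"),  # another molecule with the same sequence
    (500, "AAAA", "TTGGCCAA"),
]

def dedupe_bam_like(reads):
    """Alignment-based: reads at the same position are deduped by UMI."""
    return {(pos, umi) for pos, umi, _seq in reads}

def dedupe_fastq_umi_extracted(reads):
    """Sequence-based with the UMI moved to the header: UMI is ignored."""
    return {seq for _pos, _umi, seq in reads}

def dedupe_fastq_umi_inline(reads):
    """Sequence-based with the UMI left in the read: UMI is compared too."""
    return {umi + seq for _pos, umi, seq in reads}

bam_like = dedupe_bam_like(reads)
extracted = dedupe_fastq_umi_extracted(reads)
inline = dedupe_fastq_umi_inline(reads)

print(len(bam_like), len(extracted), len(inline))  # prints "4 2 4"
```

With the UMI extracted, the three distinct molecules that share a sequence collapse into one, which is the same kind of over-collapsing that would explain the much lower fastq count.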
Thank you for clarifying. I will try the fastq mode again without the UMI extracted. I'm closing this issue. Cheers!
Hi,
Thank you for developing this tool.
I tried deduplicating the same sample with UMICollapse in both bam mode (with genome alignment) and fastq mode (without genome alignment). The numbers of deduplicated reads are very different. Is this possibly due to multimapped reads? For instance, if a single read with a unique UMI gets mapped to multiple loci, will UMICollapse consider the alignments as separate reads?
FYI, the input number of reads was 53,774,378. The number of deduplicated reads using the bam file was 23,073,402, and using the fastq file it was 1,673,427. (The fastq output also said the number of unique reads was 2,666,997.)
Thank you