Daniel-Liu-c0deb0t / UMICollapse

Accelerating the deduplication and collapsing process for reads with Unique Molecular Identifiers (UMI). Heavily optimized for scalability and orders of magnitude faster than a previous tool.
MIT License
62 stars 8 forks

Number of unique reads fastq mode #20

Open lisagrigoreva opened 1 year ago

lisagrigoreva commented 1 year ago

Hi, I was wondering what exactly the 'Number of unique reads' output represents when collapsing reads from FASTQ.

Number of input read 6609696
Number of unique reads 3885326
Number of reads after deduplicating 3028828

It seems like the number of unique reads should be similar to the number of reads after deduplicating.

Daniel-Liu-c0deb0t commented 1 year ago

Unique reads in this case represents the number of reads with UMIs that are not exactly identical to any other read's UMI. This does not account for errors in the UMIs, which is why the count is greater than the number of reads after deduplicating. The deduplication process allows similar (but not exactly identical) UMIs to be grouped together.
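To make the difference between the two counts concrete, here is a minimal Python sketch. It is an illustration only and not UMICollapse's actual grouping algorithm: the function names (`count_unique`, `count_after_dedup`) and the toy UMIs are hypothetical, and the grouping is a simple greedy pass that merges a UMI into an existing group when it is within `k` substitutions of that group's representative.

```python
from collections import Counter


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def count_unique(umis):
    """Count distinct UMIs using exact matching only (no error tolerance)."""
    return len(set(umis))


def count_after_dedup(umis, k=1):
    """Count groups after merging UMIs within k substitutions.

    Hypothetical greedy scheme: visit distinct UMIs from most to least
    frequent and start a new group only if no existing representative
    is within k mismatches.
    """
    counts = Counter(umis)
    # Sort by descending frequency, then alphabetically for determinism.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    representatives = []
    for umi, _ in ordered:
        close = any(
            len(rep) == len(umi) and hamming(rep, umi) <= k
            for rep in representatives
        )
        if not close:
            representatives.append(umi)
    return len(representatives)


# Toy input: AAAT and CCGC each differ by one base from a more common UMI.
umis = ["AAAA", "AAAA", "AAAT", "CCCC", "CCCC", "CCGC", "GGGG"]

print("input reads:", len(umis))                           # 7
print("unique reads:", count_unique(umis))                 # 5
print("after dedup (k=1):", count_after_dedup(umis, k=1))  # 3
```

In this toy case the exact-match count (5) is larger than the grouped count (3) because two of the distinct UMIs are absorbed as likely errors. With `k=0` in the sketch, no merging happens and the two counts coincide.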

lisagrigoreva commented 1 year ago

Thank you! Is it possible to somehow get reads with identical UMIs? I suppose by setting p=1?

Daniel-Liu-c0deb0t commented 1 year ago

If you want to deduplicate reads only when they have exactly the same UMI, you should pass in -k 0 to indicate that zero errors are tolerated.