Daniel-Liu-c0deb0t / UMICollapse

Accelerating the deduplication and collapsing process for reads with Unique Molecular Identifiers (UMI). Heavily optimized for scalability and orders of magnitude faster than a previous tool.
MIT License
62 stars, 8 forks

Subtle annoyances with `umiDist` and `charSet` #7

Open Daniel-Liu-c0deb0t opened 3 years ago

Daniel-Liu-c0deb0t commented 3 years ago

The handling of undetermined (N) bases is not ideal right now: using `umiDist` and `charSet` together will produce bugs when there are Ns in the UMI. This is because `charSet` does not update the separate N bit set. Similar issues appear when cloning UMIs that contain Ns.

This isn't an issue in most of the code right now, but it needs to be looked into before any future changes.
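
To make the failure mode concrete, here is a minimal, self-contained sketch of the kind of desynchronization described above. It is not the UMICollapse implementation: the class `PackedUmi`, its 2-bit encoding, the `nMask` field, and the `charSetBuggy`/`charSetFixed` methods are all hypothetical names invented for illustration. The only idea taken from the issue is that N positions live in a separate bit set that a setter can forget to update.

```java
import java.util.BitSet;

// Hypothetical illustration only; these are NOT the UMICollapse classes.
final class PackedUmi {
    private static final String ALPHABET = "ACGT";

    private final int length;
    private long codes = 0L;                    // 2 bits per base; N is stored as code 0
    private final BitSet nMask = new BitSet();  // bit i set means position i is N

    PackedUmi(String umi) {
        if (umi.length() > 32) {
            throw new IllegalArgumentException("this sketch supports at most 32 bases");
        }
        length = umi.length();
        for (int i = 0; i < length; i++) {
            charSetFixed(i, umi.charAt(i));
        }
    }

    private static long encode(char base) {
        if (base == 'N') {
            return 0L;
        }
        int code = ALPHABET.indexOf(base);
        if (code < 0) {
            throw new IllegalArgumentException("unsupported base: " + base);
        }
        return code;
    }

    // Buggy variant: rewrites the packed code but leaves the N mask stale.
    void charSetBuggy(int idx, char base) {
        long cleared = codes & ~(3L << (2 * idx));
        codes = cleared | (encode(base) << (2 * idx));
    }

    // Fixed variant: keeps the N mask in sync with the packed code.
    void charSetFixed(int idx, char base) {
        charSetBuggy(idx, base);
        nMask.set(idx, base == 'N');
    }

    // Hamming distance: a position differs if the codes or the N flags differ.
    int umiDist(PackedUmi other) {
        int dist = 0;
        for (int i = 0; i < length; i++) {
            boolean codeDiff = (((codes ^ other.codes) >>> (2 * i)) & 3L) != 0;
            boolean nDiff = nMask.get(i) != other.nMask.get(i);
            if (codeDiff || nDiff) {
                dist++;
            }
        }
        return dist;
    }
}
```

In this sketch, building `ACNT` and then overwriting position 2 with `G` through `charSetBuggy` leaves the stale N flag behind, so the distance to a freshly built `ACGT` stays at 1 even though the two sequences now hold the same bases. Doing the same overwrite through `charSetFixed` clears the flag and the distance drops to 0. A copy or clone routine that duplicates the packed codes without the N mask would go wrong in the same way.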