CGATOxford / UMI-tools

Tools for handling Unique Molecular Identifiers in NGS data sets
MIT License

AssertionError: not all umis are the same length - Why is this an assertion? #560

Closed exsquire closed 2 years ago

exsquire commented 2 years ago

There are several issues closed on this topic that have to do with how the UMIs are first placed into the read IDs, but this issue has to do with why this is an assertion at all.

According to the UMI-tools paper, network deduplication depends on edit distances - this specifically includes insertions and deletions, i.e. UMIs that may have more or fewer bases than intended. If all UMIs must pass the assertion of equal length, in what cases is deduplication being used to address UMI indels?

For context, unaligned BAMs (uBAMs) from PacBio sequencers contain leading and trailing UMI sequences. PacBio's own software, 'mimux', extracts these UMIs and places them into tags of the uBAM, and these can differ in length. Converting the uBAMs to single-end FASTQs with the ID pattern modified to contain the concatenated UMIs (e.g. READ1_) produces UMIs of varying length.

exsquire commented 2 years ago

Ah, I looked over the paper again and noted the section where UMI indels were deemed rare enough to be left out, so UMI-tools is specifically not for UMIs of different lengths.

IanSudbery commented 2 years ago

Designing UMI-tools to handle UMIs of different lengths would have meant doing a full alignment of UMIs, which is far more computationally expensive than a simple Hamming distance calculation. As UMI-tools spends most of its time doing edit distance calculations, and indels producing unintentionally different-length UMIs in Illumina data were rare (and no one had dreamed of barcoding PacBio back in 2014), we decided that this wasn't worth the run-time trade-off.

If you want to experiment with making UMI-tools handle UMIs of different lengths you could fork the repo and swap out the code of the edit distance function in _dedup_umi.pyx.
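
To make the trade-off above concrete, here is a minimal Python sketch contrasting the two kinds of distance. It is not the code in _dedup_umi.pyx; the function names `hamming_distance` and `levenshtein_distance` are hypothetical and chosen only for illustration. Hamming distance needs one pass over equal-length strings, while an indel-aware (Levenshtein) distance fills a dynamic-programming table, so its cost grows with the product of the two UMI lengths.

```python
def hamming_distance(a: str, b: str) -> int:
    """Count mismatched positions; only defined for equal-length UMIs."""
    if len(a) != len(b):
        raise ValueError("Hamming distance requires equal-length UMIs")
    return sum(x != y for x, y in zip(a, b))


def levenshtein_distance(a: str, b: str) -> int:
    """Minimum number of substitutions, insertions and deletions
    needed to turn a into b (works for UMIs of different lengths)."""
    # previous[j] holds the distance between the processed prefix of a
    # and the first j characters of b.
    previous = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty string
        for j, y in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,             # delete x from a
                current[j - 1] + 1,          # insert y into a
                previous[j - 1] + (x != y),  # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]
```

A fork that swaps in an indel-aware function along these lines would also need to relax the equal-length assertion, and should expect deduplication to run slower, since this calculation is where most of the time is spent.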