Interpreting log output

Daniel-Liu-c0deb0t / UMICollapse

Accelerating the deduplication and collapsing process for reads with Unique Molecular Identifiers (UMI). Heavily optimized for scalability and orders of magnitude faster than a previous tool.

MIT License

Interpreting log output #11

Open abracarambar opened 3 years ago

abracarambar commented 3 years ago

Dear Daniel Liu, does "Max number of UMIs over all alignment positions" mean the maximum number of UMIs recovered at a given alignment position?

Done reading input file into memory!
Number of input reads 8779688
Number of removed unmapped reads 8746016
Number of unremoved reads 33672
Number of unique alignment positions 266
Average number of UMIs per alignment position 126.28195488721805
Max number of UMIs over all alignment positions 5466
Number of reads after deduplicating 32818
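As a quick sanity check on how the numbers in this log relate to each other, the arithmetic can be sketched in Python. The variable names are made up for illustration, and the second step assumes the reported average is a plain mean over alignment positions, which the log itself does not state.

```python
# Values copied from the log above; variable names are illustrative only.
input_reads = 8779688
removed_unmapped = 8746016
unremoved_reads = 33672
positions = 266
avg_umis_per_position = 126.28195488721805
reads_after_dedup = 32818

# Input reads minus removed unmapped reads gives the unremoved reads.
assert input_reads - removed_unmapped == unremoved_reads

# Assumption: the average is a plain mean over positions, so multiplying it
# back by the number of positions recovers the total count of
# (alignment position, UMI) pairs that entered the average.
total_umis_over_positions = round(avg_umis_per_position * positions)
print(total_umis_over_positions)  # 33591
```

Under that assumption, the three counts decrease in order: 33672 unremoved reads, about 33591 UMIs summed over positions, and 32818 reads after deduplicating.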

Daniel-Liu-c0deb0t commented 3 years ago

Both the average and the max statistics are calculated from the number of unique UMIs at each alignment position. Unique UMIs are counted by exact identity (no error tolerance). This differs slightly from the number of grouped/collapsed UMIs at each alignment position, because grouping clusters UMIs that may contain errors. Error-tolerant collapsing is performed after the unique UMIs have been counted.
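A minimal Python sketch of how such statistics could be computed, assuming reads are available as (alignment position, UMI) pairs. The function name `umi_stats` and the input layout are hypothetical and are not taken from UMICollapse's code.

```python
from collections import defaultdict


def umi_stats(reads):
    """Return (positions, average, max) of unique UMI counts per position.

    `reads` is an iterable of (alignment_position, umi) pairs. UMIs are
    compared by exact identity, with no error tolerance.
    """
    umis_at = defaultdict(set)
    for pos, umi in reads:
        umis_at[pos].add(umi)  # exact duplicates collapse inside the set
    counts = [len(umis) for umis in umis_at.values()]
    return len(counts), sum(counts) / len(counts), max(counts)


reads = [
    (100, "ACGT"), (100, "ACGT"), (100, "ACGA"),  # two unique UMIs
    (250, "TTTT"),                                # one unique UMI
]
n_positions, average, maximum = umi_stats(reads)
print(n_positions, average, maximum)  # 2 1.5 2
```

Note that `ACGT` and `ACGA` differ by a single base but are still counted as two unique UMIs here, because this count ignores possible sequencing errors.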

These statistics are reported because they help identify whether error-tolerant grouping/collapsing could be the speed bottleneck.

abracarambar commented 3 years ago

I see. So is the number of reads after deduplicating counted before or after grouping/collapsing?

Daniel-Liu-c0deb0t commented 3 years ago

Deduplicating typically means the whole process. There are two steps: (1) find the unique UMIs, and (2) group the unique UMIs in an error-tolerant way. "Collapsing" is sometimes used because only one UMI from each group is kept (the group is collapsed). Sorry, I haven't been very clear in how I use these terms.
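The two steps can be illustrated with a small Python sketch for a single alignment position. This is a naive greedy scheme written only to show the idea. It is not UMICollapse's algorithm, and `hamming`, `dedup_position`, and `max_dist` are hypothetical names. It also assumes all UMIs have the same length.

```python
from collections import Counter


def hamming(a, b):
    """Number of mismatching bases between two equal-length UMIs."""
    return sum(x != y for x, y in zip(a, b))


def dedup_position(umis, max_dist=1):
    """Deduplicate the UMIs observed at one alignment position."""
    # Step 1: find the unique UMIs (exact match) and their read counts.
    counts = Counter(umis)
    # Step 2: group unique UMIs in an error-tolerant way. Visit UMIs from
    # most to least frequent and keep one only if it is not within
    # max_dist mismatches of a UMI that was already kept.
    kept = []
    for umi, _ in counts.most_common():
        if not any(hamming(umi, rep) <= max_dist for rep in kept):
            kept.append(umi)
    return len(counts), kept


umis = ["ACGT", "ACGT", "ACGT", "ACGA", "TTTT"]
n_unique, kept = dedup_position(umis)
print(n_unique, kept)  # 3 ['ACGT', 'TTTT']
```

In this toy input there are three unique UMIs after step 1, but `ACGA` is one mismatch away from the more frequent `ACGT`, so only two representatives remain after step 2. Read this way, a count reported "after deduplicating" would reflect both steps.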