Open abracarambar opened 3 years ago
Both the average and the max statistics are calculated from the number of unique UMIs at each alignment position. Unique UMIs are counted by exact identity (no error tolerance). This differs slightly from the number of grouped/collapsed UMIs at each alignment position, because grouping clusters UMIs that may contain errors. Error-tolerant collapsing is performed only after the unique UMIs have been counted.
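To make the two statistics concrete, here is a minimal sketch of how they could be computed. It assumes reads are available as `(alignment_position, umi)` pairs; the function name `umi_stats` and that input format are made up for illustration and are not taken from the tool's code.

```python
from collections import defaultdict

def umi_stats(reads):
    """reads: iterable of (alignment_position, umi) pairs (hypothetical format)."""
    unique_umis = defaultdict(set)
    for pos, umi in reads:
        # Exact identity only: two UMIs differing by one base count as two.
        unique_umis[pos].add(umi)
    counts = [len(umis) for umis in unique_umis.values()]
    return {
        "positions": len(counts),
        "average": sum(counts) / len(counts),
        "max": max(counts),
    }

# Toy data: position 100 has two distinct UMIs, position 250 has one.
reads = [(100, "ACGT"), (100, "ACGT"), (100, "ACGA"), (250, "TTTT")]
stats = umi_stats(reads)
```

Note that "ACGT" and "ACGA" are counted separately here even though an error-tolerant grouping step would probably merge them later.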
These statistics help identify whether error-tolerant grouping/collapsing could be the bottleneck in terms of speed.
I see. So is the number of reads after deduplicating counted before or after grouping/collapsing?
Deduplicating typically means the whole process, so that number is reported after grouping/collapsing. There are two steps: 1. find the unique UMIs; 2. group the unique UMIs in an error-tolerant way. "Collapsing" is sometimes used because only one UMI from each group is kept (the group is collapsed). Sorry, I'm not very clear when I use these terms.
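The two steps can be sketched roughly as follows. This is only an illustration under stated assumptions: it uses a simple greedy grouping by Hamming distance, which is not necessarily the algorithm the tool implements, and the names `deduplicate`, `hamming`, and `max_mismatches` are hypothetical. UMIs are assumed to have equal length.

```python
from collections import Counter, defaultdict

def hamming(a, b):
    # Number of mismatching bases; assumes len(a) == len(b).
    return sum(x != y for x, y in zip(a, b))

def deduplicate(reads, max_mismatches=1):
    # Step 1: find the unique UMIs (exact identity) at each position.
    per_pos = defaultdict(Counter)
    for pos, umi in reads:
        per_pos[pos][umi] += 1

    # Step 2: group unique UMIs in an error-tolerant way and keep one
    # representative per group (the group is "collapsed").
    kept = []
    for pos, counts in per_pos.items():
        representatives = []
        # Most frequent UMIs first; ties broken alphabetically.
        for umi, _ in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
            if not any(hamming(umi, rep) <= max_mismatches
                       for rep in representatives):
                representatives.append(umi)
        kept.extend((pos, rep) for rep in representatives)
    return kept

reads = [(100, "ACGT"), (100, "ACGT"), (100, "ACGA"),
         (100, "TTTT"), (250, "ACGT")]
kept = deduplicate(reads)
```

In this toy input, position 100 has three unique UMIs after step 1, but "ACGA" is within one mismatch of the more frequent "ACGT", so only two remain there after step 2.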
Dear Daniel Lu, does "Max number of UMIs over all alignment positions" mean the maximum number of UMIs recovered at a given alignment position?
Done reading input file into memory!
Number of input reads 8779688
Number of removed unmapped reads 8746016
Number of unremoved reads 33672
Number of unique alignment positions 266
Average number of UMIs per alignment position 126.28195488721805
Max number of UMIs over all alignment positions 5466
Number of reads after deduplicating 32818
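As a sanity check, the numbers in this log can be related to each other with a little arithmetic. The sketch below assumes that the reported average is the total number of unique UMIs divided by the number of unique alignment positions, and that each read left after deduplicating corresponds to one kept UMI group. Both are assumptions about how the tool reports its numbers, and the variable names are made up.

```python
# Values copied from the log output.
input_reads = 8779688
removed_unmapped = 8746016
unremoved_reads = 33672
positions = 266
average_umis = 126.28195488721805
reads_after_dedup = 32818

# Reads that survive the unmapped filter.
assert input_reads - removed_unmapped == unremoved_reads

# Assumption: average = total unique UMIs / number of positions.
total_unique_umis = round(average_umis * positions)

# Reads that were exact duplicates (same position, identical UMI).
exact_duplicates = unremoved_reads - total_unique_umis

# Unique UMIs merged into another group by error-tolerant grouping.
merged_by_grouping = total_unique_umis - reads_after_dedup
```

Under those assumptions, the 33672 mapped reads contain 33591 unique UMIs across all positions; 81 reads are exact duplicates and a further 773 unique UMIs are merged during error-tolerant grouping, which leaves the 32818 reads reported after deduplicating.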