I liked your idea of writing the reads with invalid UMIs to a separate file rather than doing nothing at all with them. I think I will use this in my script as well.
Your algorithm looks like it will capture all cases of duplicates and correctly distinguish PCR duplicates from biological ones.
I wonder though, when you parse each chromosome in the sorted .sam, are the 192 sets of positions meant to be the values of a lookup table with the UMI and strand as the key? If so, I wonder if it would be better to use the positions as keys and save the set of corresponding UMI+strand data as the values. You would create far more sets per chromosome, but each set could only ever get so big. I think the advantage is that you can clear this data more often: once you have moved sufficiently far down the chromosome, there is no way a new read could match reads that are far behind it, so you could clear your memory every thousand or so lines.
For that very last sentence above, I meant to say: remove the sets corresponding to positions more than a thousand positions back, not clear out the entire dictionary every thousand lines!
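To make the suggestion concrete, here is a rough sketch of what I have in mind, not a drop-in replacement for your code. The function name `dedupe_chromosome`, the `(position, umi, strand)` tuple layout, and the `window` default of 1000 are all my own assumptions for illustration, and I am assuming the position has already been adjusted for soft clipping before it gets here.

```python
def dedupe_chromosome(reads, window=1000):
    """Keep the first read seen for each (position, UMI, strand) combination.

    reads: iterable of (position, umi, strand) tuples for ONE chromosome,
           roughly sorted by position (hypothetical layout, for illustration).
    window: how far behind the current read a position may fall before
            its set is dropped (assumed value, tune as needed).
    """
    seen = {}   # position -> set of (umi, strand) already written out
    kept = []   # reads that are not PCR duplicates

    for position, umi, strand in reads:
        # Remove only the sets for positions that are too far behind,
        # rather than clearing the whole dictionary.
        stale = [p for p in seen if p < position - window]
        for p in stale:
            del seen[p]

        tags = seen.setdefault(position, set())
        if (umi, strand) in tags:
            continue  # same position, UMI and strand: PCR duplicate
        tags.add((umi, strand))
        kept.append((position, umi, strand))

    return kept, len(seen)


example_reads = [
    (100, "AAC", "+"),
    (100, "AAC", "+"),   # duplicate of the read above
    (100, "AAC", "-"),   # same UMI, other strand: kept
    (100, "GGT", "+"),   # different UMI: kept
    (1500, "AAC", "+"),  # far downstream: kept, and position 100 is dropped
]
kept_reads, positions_in_memory = dedupe_chromosome(example_reads)
```

Because stale positions are pruned as you go, the dictionary never holds more than about `window` positions at a time, no matter how long the chromosome is. You would still want to start with a fresh dictionary for each chromosome.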