@deweesd Ah, you are totally correct; I forgot to include building the UMI dict to remove ill-tagged reads. I don't currently specify how the outputs of each read are compared, but I'm thinking of using local variable assignments within the comparator function's for-loop rather than a dict, since only two reads are being compared at any given time and there isn't much state to reference (see the sketch below); I'll make this clear in the pseudocode.
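Roughly what I have in mind for the comparator (just a sketch; the tuple layout and field names are placeholders, and it assumes positions have already been corrected for soft clipping):

```python
# Sketch of the comparator: only the two reads being compared are
# held in local variables, so no shared dict is needed.
def is_duplicate(ref_read, cand_read):
    """Return True if cand_read looks like a PCR duplicate of ref_read.

    Each read is assumed to be a (chrom, adjusted_pos, strand, umi)
    tuple, where adjusted_pos has already been corrected for soft
    clipping using the CIGAR string.
    """
    ref_chrom, ref_pos, ref_strand, ref_umi = ref_read
    cand_chrom, cand_pos, cand_strand, cand_umi = cand_read
    return (ref_chrom == cand_chrom
            and ref_pos == cand_pos
            and ref_strand == cand_strand
            and ref_umi == cand_umi)
```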
And yes, I did make changes to the script to include the UMI removal; good catch! 👍
Any thoughts on how to avoid reading through the same lines many times? The main issue I have with the current algorithm is that every non-duplicate read has to end up as a reference read at some point and be compared against the reads after it, which makes it roughly O(n²) in the number of reads... Most other strategies I've read through just compare adjacent lines, but as I explain here, edge cases where duplicates are separated by non-duplicates are possible and would be missed by that method.
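One idea I'm considering is a single pass that hashes each read's identifying fields into a set, so duplicates are caught even when separated by non-duplicates, in O(n) expected time. Just a sketch, not the final script: the filenames and the UMI-appended-to-QNAME convention are assumptions, and the position would still need soft-clip correction:

```python
# Hypothetical single-pass dedup: hash each read's identifying fields
# into a set so non-adjacent duplicates are still caught. O(n) expected
# time, at the cost of O(n) memory for the set.
seen = set()

with open("aligned.sam") as sam, open("deduped.sam", "w") as out:
    for line in sam:
        if line.startswith("@"):        # pass SAM header lines through
            out.write(line)
            continue
        fields = line.split("\t")
        umi = fields[0].split(":")[-1]  # assumes UMI was appended to QNAME
        flag = int(fields[1])
        strand = "-" if flag & 16 else "+"  # bit 0x10 = reverse strand
        chrom = fields[2]
        pos = int(fields[3])            # would still need soft-clip adjustment
        key = (chrom, pos, strand, umi)
        if key not in seen:             # keep first occurrence, drop later copies
            seen.add(key)
            out.write(line)
```

The trade-off is memory: the set grows with the number of unique reads, but no line ever has to be revisited.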
First off, I feel your presentation of the problem was very clear, and you obviously have a good understanding of how to solve this issue. Given that we are disregarding any aspect of normalization, I feel your two functions plus the main function in the pseudocode make sense for properly removing PCR duplicates. I'm assuming you're setting up some sort of global dictionary (or dictionaries) to collect the reads for which the comparator returns True while parsing through the SAM file (see the sketch below)? Again, just referencing your pseudocode for this.
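Purely illustrative of what I had in mind (the names are made up):

```python
# Illustrative global dict: collect reads the comparator flagged as
# duplicates so they can be skipped when writing the output file.
duplicates = {}  # qname -> the reference read it duplicates

def flag_duplicate(qname, ref_read, comparator_result):
    """Record a read in the global dict when the comparator returns True."""
    if comparator_result:
        duplicates[qname] = ref_read
    return comparator_result
```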
All in all, it's very clear, and this has been helpful for me to fully understand the scope of the problem and how to go about it with a Python script. Hope this comment was helpful! I noticed you've made some adjustments recently, which I was initially going to comment on... Let me know if you have any questions!