Hi Emily,
Your pseudocode is a great start. You clearly have a grasp of the problem and have identified your approach to solving it. Some things that came up for me while reading through it, which you might find useful, are listed below:
Your soft clipping function is a great idea, but I'm confused by the input and output example. I can't tell whether it is for a forward or a reverse read, and I have trouble understanding why the output starting position would be 19. This may just be my misunderstanding, but I would love to talk it over with you to get some clarity.
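In case it helps that conversation, here is a minimal sketch of how I would expect strand to change the adjusted position. It is not your function, just my understanding: the name `five_prime_position` is made up, I am assuming the reverse strand is read from bit 16 of the FLAG, and I am ignoring hard clips for simplicity.

```python
import re

# Matches each CIGAR operation, e.g. "3S10M" -> [("3", "S"), ("10", "M")]
CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")


def five_prime_position(pos, cigar, flag):
    """Return the 5' position of a read after accounting for soft clipping.

    Hypothetical helper. pos is the 1-based leftmost mapping position,
    cigar is the CIGAR string, flag is the bitwise FLAG.
    Assumes hard clips (H) are absent.
    """
    ops = [(int(n), op) for n, op in CIGAR_RE.findall(cigar)]
    is_reverse = bool(flag & 16)

    if not is_reverse:
        # Forward read: only a soft clip at the left end shifts the start.
        if ops and ops[0][1] == "S":
            return pos - ops[0][0]
        return pos

    # Reverse read: the 5' end is at the right end of the alignment.
    # M, D, N, = and X consume reference bases; I and S do not.
    ref_len = sum(n for n, op in ops if op in "MDN=X")
    end = pos + ref_len - 1
    # A soft clip at the right end pushes the 5' end further right.
    if ops and ops[-1][1] == "S":
        end += ops[-1][0]
    return end
```

With this sketch, the same POS and CIGAR give different answers depending on strand, which is why I wanted to know which case your example covers.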
I think your approach to comparing UMIs is a great one, but I wonder if looping through the same file multiple times might take longer than handling it in an earlier filtering step, or doing it as you go line by line along with your other comparisons.
This comment is about steps 3 and 4, and about when you choose to sort with samtools. In steps 3 and 4, are you comparing each line only to the next one, or comparing one line against all the others and looping multiple times? The latter would get expensive in processing time and memory. I think we will be given SAM files that are already sorted, but if not, it would probably be better to sort before you start reading the file. Also, as you go line by line, duplicates will only sit next to each other if the file is sorted on the columns you are comparing, after the start positions have been adjusted. Otherwise they may be spread throughout the file, and a line-by-line comparison could miss some of them.
Also, make sure the reads in the final output SAM file keep their original 1-based leftmost mapping position rather than the adjusted position, since writing the adjusted position will cause errors in downstream applications that use the SAM file.
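To make the last few points concrete, here is a rough sketch of one way to do everything in a single pass. It remembers the keys it has already seen in a set, so duplicates do not have to be adjacent, and it writes each kept line out untouched so the original position is preserved. The assumptions are mine: the function name `dedupe_lines` is made up, the UMI is taken to be the last colon-separated part of the read name, the file is assumed to be sorted by chromosome, and `position_fn` stands in for whatever position adjustment you end up with.

```python
def dedupe_lines(lines, position_fn):
    """Yield header lines and the first read seen for each duplicate key.

    Hypothetical sketch. lines is an iterable of SAM lines;
    position_fn(pos, cigar, flag) returns the adjusted position.
    """
    seen = set()
    current_chrom = None

    for line in lines:
        # Header lines pass straight through.
        if line.startswith("@"):
            yield line
            continue

        fields = line.rstrip("\n").split("\t")
        qname = fields[0]
        flag = int(fields[1])
        chrom = fields[2]
        pos = int(fields[3])
        cigar = fields[5]

        # Assumes input sorted by chromosome, so the set can be
        # emptied at each new chromosome to keep memory small.
        if chrom != current_chrom:
            seen.clear()
            current_chrom = chrom

        # Assumption: UMI is the last colon-separated part of QNAME.
        umi = qname.rsplit(":", 1)[-1]
        is_reverse = bool(flag & 16)

        # The adjusted position is used only for the comparison key.
        key = (umi, is_reverse, position_fn(pos, cigar, flag))
        if key not in seen:
            seen.add(key)
            # The original line, with its original POS, is emitted.
            yield line
```

If the input turns out not to be sorted by chromosome, dropping the `seen.clear()` step would still give correct results at the cost of more memory.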
Lastly, I really like the question you raise at the very end about how to decide which of the duplicates to keep. I think quality would be a great criterion for that.
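As one possible reading of "quality", here is a small sketch that keeps the duplicate with the highest mean base quality. The function names are made up, and I am assuming the QUAL column uses Phred+33 encoding and is not empty or "*". Using MAPQ instead would be just as reasonable.

```python
def mean_phred(qual, offset=33):
    """Mean Phred score of a QUAL string (assumes Phred+33 encoding)."""
    return sum(ord(ch) - offset for ch in qual) / len(qual)


def best_by_quality(sam_lines):
    """Return the line with the highest mean base quality.

    Hypothetical helper. sam_lines is a non-empty list of SAM alignment
    lines that were already judged to be duplicates of each other.
    On a tie, the first line in the list is kept.
    """
    # QUAL is the 11th column of a SAM alignment line.
    return max(
        sam_lines,
        key=lambda line: mean_phred(line.rstrip("\n").split("\t")[10]),
    )
```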
Overall, great job! Let me know if you have any questions for me!