Closed grahamgower closed 7 years ago
Maybe something like the attached patch will do the trick (it needs testing)...
Oops, try this one instead.
Hey Graham,
Thank you very much for the report and for the patches!
While I see the problem, I am not sure that treating clipped bases as part of the alignment (which they are not, by definition) is something that you can rely on. The possibility of clipped reads extending past the contig termini also poses some questions, since the clipped bases could, potentially, map to different contigs (or not at all).
I have investigated this briefly, and between Picard MarkDuplicates and SAMTools rmdup, Picard appears to follow your strategy while SAMTools does not. I intend to look into this further, as time permits, and potentially add your patch in an upcoming update.
Best, Mikkel
Hey Graham, I apologize for taking so long, but unfortunately I have been busy wrapping up several projects these last few months.
I have been working on a updated version of the script, inspired by your patch, which I intend to finalize by next week. The current version is attached, if you want to have a chance to look at it now.
Best, Mikkel
Hi Mikkel,
I've looked over your new code and it looks right. I've not tested it much though. Do you have a test framework that you use for such things?
-G
Hi Graham, Unfortunately I do not yet have a framework for automatically testing that, though it is something I am interested in implementing. So I have manually been carrying out tests on various datasets, big and small, to ensure that the new version of rmdup_collapsed performs as expected. Long story short, I have have now released a version of PALEOMIX (v1.2.9) that includes the improved script, which should address this issue.
Once again thank you for reporting this issue, and apologies for taking so long. Do not hesitate to open additional issues if you should run into other problems with PALEOMIX.
Cheers
Hi Mikkel,
We've noticed a problem with deduplication after trying out bwa-mem on some of our data. The underlying reason seems to be that PCR duplicates can exhibit different sequencing errors, and thus be soft-clipped at different positions. Deduplication based on the length and cigar string is thus problematic. Further, if the reads are soft-clipped at the beginning of a forward-mapping sequence (or the end of a reverse-mapping sequence), the reported mapping positions will also be different. The attached file shows 4 reads which should be marked as duplicates but are not.
-Graham
z.txt