Soft clipped reads are not correctly deduplicated

MikkelSchubert / paleomix

Pipelines and tools for the processing of ancient and modern HTS data.

https://paleomix.readthedocs.io/en/stable/

MIT License

43 stars 19 forks source link

Soft clipped reads are not correctly deduplicated #10

Closed grahamgower closed 7 years ago

grahamgower commented 7 years ago

Hi Mikkel,

We've noticed a problem with deduplication after trying out bwa-mem on some of our data. The underlying reason seems to be that PCR duplicates can exhibit different sequencing errors, and thus be soft-clipped at different positions. Deduplication based on the length and cigar string is thus problematic. Further, if the reads are soft-clipped at the beginning of a forward-mapping sequence (or the end of a reverse-mapping sequence), the reported mapping positions will also be different. The attached file shows 4 reads which should be marked as duplicates but are not.

-Graham

z.txt

grahamgower commented 7 years ago

Maybe something like the attached patch will do the trick (it needs testing)...

softclip-dedup.patch.txt

grahamgower commented 7 years ago

Oops, try this one instead.

softclip-dedup.patch2.txt

MikkelSchubert commented 7 years ago

Hey Graham,

Thank you very much for the report and for the patches!

While I see the problem, I am not sure that treating clipped bases as part of the alignment (which they are not, by definition) is something that you can rely on. The possibility of clipped reads extending past the contig termini also poses some questions, since the clipped bases could, potentially, map to different contigs (or not at all).

I have investigated this briefly, and between Picard MarkDuplicates and SAMTools rmdup, Picard appears to follow your strategy while SAMTools does not. I intend to look into this further, as time permits, and potentially add your patch in an upcoming update.

Best, Mikkel

MikkelSchubert commented 7 years ago

Hey Graham, I apologize for taking so long, but unfortunately I have been busy wrapping up several projects these last few months.

I have been working on a updated version of the script, inspired by your patch, which I intend to finalize by next week. The current version is attached, if you want to have a chance to look at it now.

Best, Mikkel

rmdup_collapsed_softclip.txt

grahamgower commented 7 years ago

Hi Mikkel,

I've looked over your new code and it looks right. I've not tested it much though. Do you have a test framework that you use for such things?

-G

MikkelSchubert commented 7 years ago

Hi Graham, Unfortunately I do not yet have a framework for automatically testing that, though it is something I am interested in implementing. So I have manually been carrying out tests on various datasets, big and small, to ensure that the new version of rmdup_collapsed performs as expected. Long story short, I have have now released a version of PALEOMIX (v1.2.9) that includes the improved script, which should address this issue.

Once again thank you for reporting this issue, and apologies for taking so long. Do not hesitate to open additional issues if you should run into other problems with PALEOMIX.

Cheers