Closed carlosmag closed 3 years ago
We're currently working on getting concordance with Picard MarkDuplicates. Arun is doing the work; maybe he can comment.
From: Carlos Magalhães notifications@github.com Sent: Thursday, June 25, 2020 11:34 AM To: amplab/snap snap@noreply.github.com Cc: Subscribed subscribed@noreply.github.com Subject: [amplab/snap] High number of duplicates with -so option (#126)
Hi, I am getting ≃ 2.5 times more duplicates marked in snap paired with the -so parameter than with snap paired without the -so option + pipe to samtools markdup or picard MarkDuplicates. Stats obtained with samtools flagstat: 646886 vs 264372 duplicates.
Is there any issue with snap or interoperability with other tools?
Test genome and bam files here: https://mega.nz/folder/dERWDI6I#37xE-kQd_NjtP06-xEyVAg
Reference genome here: https://figshare.com/articles/Genome_of_the_inferred_most_recent_common_ancestor_of_the_Mycobacterium_tuberculosis_complex/11500980/1
We are aware of the differences between SNAP duplicate marking and Picard MarkDuplicates. The differences in SNAP are mainly due to: (1) not taking soft-clips into account and (2) not marking singleton read duplicates (i.e., when only one read in the pair is mapped).
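To illustrate point (1), here is a minimal Python sketch of how soft-clips can change which reads look like duplicates. It assumes the duplicate key is built from the unclipped 5' coordinate (alignment start minus leading clips on the forward strand, alignment end plus trailing clips on the reverse strand), which is my understanding of the Picard-style approach. The function name and the example coordinates are hypothetical, and this is not SNAP's or Picard's actual code.

```python
import re

# One CIGAR element: a length followed by an operation code.
CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")
REF_CONSUMING = set("MDN=X")   # operations that advance along the reference
CLIP_OPS = set("SH")           # soft and hard clips


def unclipped_five_prime(pos, cigar, is_reverse):
    """Return the 1-based unclipped 5' reference coordinate of a read.

    pos is the SAM POS field (leftmost aligned base, 1-based).
    """
    ops = [(int(n), op) for n, op in CIGAR_RE.findall(cigar)]
    if not is_reverse:
        # Forward strand: the 5' end is on the left, so undo leading clips.
        leading = 0
        for n, op in ops:
            if op not in CLIP_OPS:
                break
            leading += n
        return pos - leading
    # Reverse strand: the 5' end is on the right, so extend past trailing clips.
    ref_len = sum(n for n, op in ops if op in REF_CONSUMING)
    trailing = 0
    for n, op in reversed(ops):
        if op not in CLIP_OPS:
            break
        trailing += n
    return pos + ref_len - 1 + trailing


# Two forward reads from the same fragment, one with 5 soft-clipped bases.
# Their POS values differ (100 vs 105), but the unclipped 5' ends agree.
a = unclipped_five_prime(100, "50M", False)
b = unclipped_five_prime(105, "5S45M", False)
print(a, b)
```

A marker that compares raw POS values would treat these two reads as distinct, while one that compares unclipped coordinates would treat them as candidates for the same duplicate set, so the two approaches can disagree on any read with clipping.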
I tried your dataset on a version of a new release that we are currently working on and saw significantly fewer differences (~1000-2000 reads relative to Picard MarkDuplicates). Unfortunately, the new version of the code has diverged significantly from the version that you are using, making it difficult for us to backport these changes onto your version. The new release will include bug fixes as well as support for affine gap scoring and performance improvements. We plan to release it in the next few months, before the end of summer.
I will keep this issue open and will update here once we have resolved the discrepancies.
--Arun
The newly released 1.0 version has nearly identical duplicate marking to Picard. I'm going to close this; if you still see problems, please reopen it or open a new issue.
Hi, I am getting ≃ 2.5 times more duplicates marked in snap paired with the -so parameter than with snap paired without the -so option + pipe to samtools markdup or picard MarkDuplicates. Stats obtained with samtools flagstat: 646886 vs 264372 duplicates.
Is there any issue with snap or interoperability with other tools?
Test genome and bam files here Reference genome here
SNAP version 1.0dev.102
samtools 1.10
Picard 2.23.0
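For anyone wanting to reproduce the comparison, a small Python sketch like the following can pull the duplicate count out of saved samtools flagstat text and compute the ratio. The two counts are the ones reported above; the "+ 0" QC-failed column in the sample strings and the function name are assumptions made for illustration.

```python
import re


def duplicate_count(flagstat_text):
    """Return the QC-passed duplicate count from `samtools flagstat` text."""
    for line in flagstat_text.splitlines():
        # flagstat lines look like "<passed> + <failed> <category>".
        m = re.match(r"(\d+) \+ (\d+) duplicates", line.strip())
        if m:
            return int(m.group(1))
    raise ValueError("no 'duplicates' line found in flagstat output")


# Sample lines built from the counts in this issue (QC-failed column assumed 0).
snap_so = "646886 + 0 duplicates"
external_markdup = "264372 + 0 duplicates"

ratio = duplicate_count(snap_so) / duplicate_count(external_markdup)
print(f"SNAP -so marks {ratio:.2f}x as many duplicates")
```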