amplab / snap

Scalable Nucleotide Alignment Program -- a fast and accurate read aligner for high-throughput sequencing data
https://www.microsoft.com/en-us/research/project/snap/
Apache License 2.0

High number of duplicates with -so option #126

Closed carlosmag closed 3 years ago

carlosmag commented 4 years ago

Hi, I am getting roughly 2.5 times more duplicates marked by snap paired with the -so option than by snap paired without -so, piped to samtools markdup or Picard MarkDuplicates. Stats obtained with samtools flagstat: 646886 vs 264372 duplicates.

Is there any issue with snap or interoperability with other tools?

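Test genome and BAM files: https://mega.nz/folder/dERWDI6I#37xE-kQd_NjtP06-xEyVAg

Reference genome: https://figshare.com/articles/Genome_of_the_inferred_most_recent_common_ancestor_of_the_Mycobacterium_tuberculosis_complex/11500980/1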

SNAP version 1.0dev.102, samtools 1.10, Picard 2.23.0
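For anyone reproducing the comparison, here is a minimal Python sketch of how the two duplicate counts could be pulled out of `samtools flagstat` output and compared. It assumes the usual `N + M duplicates` line format (QC-passed + QC-failed). The helper names `parse_flagstat_duplicates` and `duplicate_ratio` are hypothetical, and the QC-failed count of 0 in the sample lines is an assumption made only for illustration; the two duplicate counts are the ones reported above.

```python
import re

# Matches the flagstat line "<qc_passed> + <qc_failed> duplicates".
# Newer samtools also print "primary duplicates"; that line does not match here.
_DUP_LINE = re.compile(r"^(\d+) \+ (\d+) duplicates\b")


def parse_flagstat_duplicates(flagstat_text):
    """Return (qc_passed, qc_failed) duplicate counts from flagstat text."""
    for line in flagstat_text.splitlines():
        match = _DUP_LINE.match(line.strip())
        if match:
            return int(match.group(1)), int(match.group(2))
    raise ValueError("no 'duplicates' line found in flagstat output")


def duplicate_ratio(count_a, count_b):
    """Ratio of two duplicate counts (count_a relative to count_b)."""
    if count_b == 0:
        raise ZeroDivisionError("second duplicate count is zero")
    return count_a / count_b


# Abbreviated sample lines: only the duplicates line is shown, and the
# "+ 0" QC-failed part is assumed for illustration.
snap_so_flagstat = "646886 + 0 duplicates"
external_markdup_flagstat = "264372 + 0 duplicates"

snap_dups, _ = parse_flagstat_duplicates(snap_so_flagstat)
other_dups, _ = parse_flagstat_duplicates(external_markdup_flagstat)
ratio = duplicate_ratio(snap_dups, other_dups)
print(f"SNAP -so marks {ratio:.2f}x as many duplicates")
```

In practice the text would come from running `samtools flagstat` on each BAM file and capturing its standard output.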

bolosky commented 4 years ago

We’re currently working on getting concordance with Picard MarkDuplicates. Arun is doing the work; maybe he can comment.

arun-sub commented 4 years ago

We are aware of the differences in SNAP duplicate marking with respect to Picard MarkDuplicates. The differences in SNAP are mainly due to: (1) not taking soft-clips into account and (2) not marking singleton read duplicates (i.e., when only one read in the pair is mapped).
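To illustrate why soft-clips matter for point (1), here is a rough Python sketch of computing a read's unclipped 5' position from its SAM `POS` and CIGAR string. This is only an illustration of the general idea, under the assumption that duplicate detection keys on the clip-adjusted 5' coordinate; it is not SNAP's or Picard's actual code, and the function name `unclipped_five_prime` is hypothetical.

```python
import re

_CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
_REF_CONSUMING = set("MDN=X")  # CIGAR operations that advance along the reference
_CLIPS = set("SH")             # soft and hard clips


def unclipped_five_prime(pos, cigar, is_reverse):
    """Return the 1-based reference position of the read's 5' end with
    clipped bases added back.

    pos        -- 1-based leftmost aligned position (SAM POS field)
    cigar      -- CIGAR string, e.g. "5S95M"
    is_reverse -- True if the read maps to the reverse strand
    """
    ops = [(int(n), op) for n, op in _CIGAR_OP.findall(cigar)]

    if not is_reverse:
        # Forward strand: the 5' end is on the left, so subtract leading clips.
        leading = 0
        for length, op in ops:
            if op not in _CLIPS:
                break
            leading += length
        return pos - leading

    # Reverse strand: the 5' end is on the right, so take the alignment end
    # and add trailing clips.
    ref_len = sum(length for length, op in ops if op in _REF_CONSUMING)
    trailing = 0
    for length, op in reversed(ops):
        if op not in _CLIPS:
            break
        trailing += length
    return pos + ref_len - 1 + trailing


# Two forward reads whose aligned starts differ only because of a soft-clip.
clipped = unclipped_five_prime(100, "5S95M", is_reverse=False)
unclipped = unclipped_five_prime(95, "100M", is_reverse=False)
print(clipped, unclipped)
```

Both reads in the usage example resolve to the same clip-adjusted position even though their `POS` values differ (100 vs 95), so a tool that compares raw `POS` and one that compares clip-adjusted positions can group them differently.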

I tried your dataset on a version of a new release that we are currently working on and saw significantly fewer differences (about 1,000 to 2,000 reads differ from Picard MarkDuplicates). Unfortunately, the new version of the code has diverged significantly from the version you are using, making it difficult for us to backport these changes to your version. The new release will include bug fixes as well as support for affine gap scoring and performance improvements. We plan to release it within the next few months, before the end of summer.

I will keep this issue open and will update here once we have resolved the discrepancies.

--Arun

bolosky commented 3 years ago

The newly released 1.0 version has nearly identical duplicate marking to Picard. I'm going to close this; if you still see problems, please reopen it or make a new issue.