genome / pindel

Pindel can detect breakpoints of large deletions, medium-sized insertions, inversions, tandem duplications and other structural variants at single-base resolution from next-gen sequence data. It uses a pattern growth approach to identify the breakpoints of these variants from paired-end short reads.
GNU General Public License v3.0
162 stars, 90 forks

unique supporting read counts in pindel >v0.2.5b5 #34

Open conodera opened 8 years ago

conodera commented 8 years ago

I recently compared pindel v0.2.5b8 with v0.2.5a7. For deletions, small insertions, and tandem duplications, the unique supporting read counts always match the total supporting read counts in v0.2.5b8, which is not the case in v0.2.5a7. I believe the source of this issue is the line linked below, from a commit dated 7/26/15, which requires reads to have the same name in order to be considered duplicates. The line is in the function MarkDuplicates in reporter.cpp, which is later used for SI, TD, and deletions/DI, but not for inversions, and that is consistent with the observed behavior. https://github.com/genome/pindel/commit/33943f76c0a001e80be34b4ead344f4ba47f2ebb#diff-535281b8d88bf426c63d1f7988dd461dR966

It's simple for me to delete the line and recompile, but I wanted to alert you to this issue and double check that I'm not misunderstanding something. Thank you for your help!
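To make the reported effect concrete, here is a minimal sketch of why a same-name requirement can make the unique count equal the total count. It is not Pindel's actual MarkDuplicates code: the `SupportingRead` struct, its fields, and the functions `sameAlignment`, `countUniqueReads`, and `exampleReads` are all hypothetical names chosen for illustration, and the duplicate criterion (same strand, position, and bases) is an assumption.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical, simplified read record; not Pindel's actual data structure.
struct SupportingRead {
    std::string name;      // read (query) name
    char strand;           // '+' or '-'
    long anchorPos;        // assumed position used to compare alignments
    std::string sequence;  // read bases
};

// Assumed duplicate criterion: same strand, same position, same bases.
bool sameAlignment(const SupportingRead& a, const SupportingRead& b) {
    return a.strand == b.strand && a.anchorPos == b.anchorPos &&
           a.sequence == b.sequence;
}

// Counts reads that are not duplicates of any earlier read in the list.
// With requireSameName set, two reads must also share a name to be duplicates.
std::size_t countUniqueReads(const std::vector<SupportingRead>& reads,
                             bool requireSameName) {
    std::size_t unique = 0;
    for (std::size_t i = 0; i < reads.size(); ++i) {
        bool duplicate = false;
        for (std::size_t j = 0; j < i && !duplicate; ++j) {
            bool match = sameAlignment(reads[i], reads[j]);
            if (requireSameName) {
                match = match && reads[i].name == reads[j].name;
            }
            duplicate = match;
        }
        if (!duplicate) {
            ++unique;
        }
    }
    return unique;
}

// Three supporting reads, two of which are identical copies with different names.
std::vector<SupportingRead> exampleReads() {
    return {
        {"readA", '+', 1000, "ACGTACGT"},
        {"readB", '+', 1000, "ACGTACGT"},  // same alignment as readA, different name
        {"readC", '-', 1042, "TTGACCAT"},
    };
}
```

In this toy example, `countUniqueReads(exampleReads(), true)` returns 3, the same as the total, because reads with identical alignments but different names are never flagged. With `requireSameName` set to `false` it returns 2.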

ZhenyuZ commented 5 years ago

We detected the same issue in the most recent Pindel version and found that some variants are supported solely by a set of PCR duplicate reads. If Pindel could use the existing MarkDup flags in input BAMs, or drop the requirement that reads have the same name in order to be considered duplicates, these artifacts would not have been called. Thanks
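As a rough illustration of the first suggestion, the sketch below checks the duplicate bit in a SAM FLAG value. In the SAM format, bit 0x400 marks a read as a PCR or optical duplicate, and duplicate-marking tools set it. The helper names `isMarkedDuplicate` and `parseSamFlag` are hypothetical, and the code works on a text SAM line rather than through Pindel's own BAM reader, so it shows only the flag test, not an actual patch.

```cpp
#include <sstream>
#include <string>

// 0x400 is the SAM FLAG bit for "PCR or optical duplicate".
constexpr unsigned kDuplicateFlag = 0x400;

// True when the duplicate bit is set in the given FLAG value.
bool isMarkedDuplicate(unsigned samFlag) {
    return (samFlag & kDuplicateFlag) != 0;
}

// Reads the FLAG field (second tab-separated column) of a SAM alignment line.
// Returns false when the line has no parsable FLAG column.
bool parseSamFlag(const std::string& samLine, unsigned& flagOut) {
    std::istringstream in(samLine);
    std::string qname;
    if (!std::getline(in, qname, '\t')) {
        return false;
    }
    return static_cast<bool>(in >> flagOut);
}
```

A caller could skip any read for which `isMarkedDuplicate` returns true before counting it as supporting evidence. Whether that filter should happen at read loading or at reporting time is a design choice this sketch does not address.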