genome / pindel

Pindel can detect breakpoints of large deletions, medium-sized insertions, inversions, tandem duplications and other structural variants at single-base resolution from next-gen sequence data. It uses a pattern growth approach to identify the breakpoints of these variants from paired-end short reads.
GNU General Public License v3.0

Duplicate entries in Pindel output #109

Open stevekm opened 5 years ago

stevekm commented 5 years ago

I was having problems annotating the .vcf output from Pindel because the .vcf contains duplicate entries. For example:


| #CHROM | POS | ID | REF | ALT | QUAL | FILTER | INFO | FORMAT | NORMAL | TUMOR |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| chr2 | 113983582 | . | T | TGGGAGTCCGGGGCCAGGAGGGACAGAGGAGTCAGTATTCTGTATTTTCAACGCCCCCCACCCGGACGGGTGGGAGGGT | . | PASS | END=113983582;HOMLEN=0;SVLEN=78;SVTYPE=INS | GT:AD | 0/0:1083,1 | 0/0:1115,0 |
| chr2 | 113983582 | . | T | TGGGAGTCCGGGGCCAGGAGGGACAGAGGAGTCAGTATTCTGTATTTTCAACGCCCCCCACCCGGACGGGTGGGAGGGT | . | PASS | END=113983582;HOMLEN=0;SVLEN=78;SVTYPE=INS | GT:AD | 0/0:1083,1 | 0/0:1115,0 |

There are many such entries in the .vcf file produced.

I thought this might be an issue with the .vcf conversion from the original data format, but the duplicates actually appear in the raw Pindel output as well:

$ grep 113983582 pindel_output/*
pindel_output/_SI:530   I 78    NT 78 "GGGAGTCCGGGGCCAGGAGGGACAGAGGAGTCAGTATTCTGTATTTTCAACGCCCCCCACCCGGACGGGTGGGAGGGT"  ChrID chr2  BP 113983582    113983583   BP_range 113983581  113983583   Supports 1  1   + 1 1   - 0 0   S1 2    SUM_MS 60   2   NumSupSamples 1 1   NORMAL 1083 1071 1 1 0 0    TUMOR 1115 1105 0 0 0 0
pindel_output/_SI:552   I 78    NT 78 "GGGAGTCCGGGGCCAGGAGGGACAGAGGAGTCAGTATTCTGTATTTTCAACGCCCCCCACCCGGACGGGTGGGAGGGT"  ChrID chr2  BP 113983582    113983583   BP_range 113983581  113983583   Supports 1  1   + 1 1   - 0 0   S1 2    SUM_MS 60   2   NumSupSamples 1 1   NORMAL 1083 1071 1 1 0 0    TUMOR 1115 1105 0 0 0 0

Why are duplicate entries being reported? And is it safe to remove them? What is the recommended removal method?

I am using Pindel version 0.2.5b9.
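
For illustration, exact duplicates like the .vcf records shown above could be filtered with a short script. This is only a sketch, not a method recommended by the Pindel authors: the function name `dedupe_vcf_lines` is made up here, and it assumes two records count as duplicates only when the entire line is identical. Whether dropping them is actually safe is still the open question.

```python
def dedupe_vcf_lines(lines):
    """Yield VCF lines, skipping data records that exactly repeat an earlier one.

    Assumption: a duplicate is a data line whose full text matches a
    previously seen line. Header lines are never dropped.
    """
    seen = set()
    for line in lines:
        # Header and meta lines start with '#'; pass them through unchanged.
        if line.startswith("#"):
            yield line
            continue
        # Ignore line endings so a missing final newline does not hide a match.
        key = line.rstrip("\r\n")
        if key in seen:
            continue
        seen.add(key)
        yield line


# Shortened, made-up records for demonstration (not real Pindel output).
example = [
    "#CHROM\tPOS\tID\tREF\tALT\n",
    "chr2\t113983582\t.\tT\tTGGG\n",
    "chr2\t113983582\t.\tT\tTGGG\n",
]
deduped = list(dedupe_vcf_lines(example))
```

Keying on the whole line is deliberately conservative: records that share a position but differ in any field (for example in `AD` counts) are kept. Note that this would not work directly on the raw Pindel output, where the two lines above differ in their leading index (530 vs. 552).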