Pindel can detect breakpoints of large deletions, medium-sized insertions, inversions, tandem duplications and other structural variants at single-base resolution from next-gen sequence data. It uses a pattern growth approach to identify the breakpoints of these variants from paired-end short reads.
I was having problems with the annotation of the .vcf output from Pindel, due to the presence of duplicate entries in the .vcf. For example:
#CHROM | POS | ID | REF | ALT | QUAL | FILTER | INFO | FORMAT | NORMAL | TUMOR
chr2 | 113983582 | . | T | TGGGAGTCCGGGGCCAGGAGGGACAGAGGAGTCAGTATTCTGTATTTTCAACGCCCCCCACCCGGACGGGTGGGAGGGT | . | PASS | END=113983582;HOMLEN=0;SVLEN=78;SVTYPE=INS | GT:AD | 0/0:1083,1 | 0/0:1115,0
chr2 | 113983582 | . | T | TGGGAGTCCGGGGCCAGGAGGGACAGAGGAGTCAGTATTCTGTATTTTCAACGCCCCCCACCCGGACGGGTGGGAGGGT | . | PASS | END=113983582;HOMLEN=0;SVLEN=78;SVTYPE=INS | GT:AD | 0/0:1083,1 | 0/0:1115,0
There are many such entries in the .vcf file produced.
I thought this might be an issue with the .vcf conversion from the original data format, but the duplicates actually appear in the raw Pindel output as well.
Why are duplicate entries being reported? Is it safe to remove them, and if so, what is the recommended way to do it?
I am using Pindel version 0.2.5b9.
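For concreteness, this is the kind of removal I have in mind: a minimal Python sketch that drops only data records that are exact repeats of an earlier line and leaves header lines untouched. The function name `drop_exact_duplicates` is my own and is not part of Pindel. The sketch also assumes that an exact repeat carries no extra information, which is precisely the part I am unsure about.

```python
from typing import Iterable, Iterator


def drop_exact_duplicates(lines: Iterable[str]) -> Iterator[str]:
    """Yield VCF lines in order, skipping data records already seen.

    Assumption: two records are duplicates only if the whole line matches
    (CHROM, POS, REF, ALT, INFO and all sample columns). Header lines
    (starting with '#') are always passed through unchanged.
    """
    seen = set()
    for line in lines:
        if line.startswith("#"):
            yield line
            continue
        # Compare without the line ending so a missing final newline
        # does not make an otherwise identical record look different.
        record = line.rstrip("\r\n")
        if record in seen:
            continue
        seen.add(record)
        yield line
```

It could be applied to a file by passing an open file handle, for example `out.writelines(drop_exact_duplicates(vcf))`. Matching on the whole line is deliberately conservative: records that share a position but differ in any field are kept.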