genome / pindel

Pindel can detect breakpoints of large deletions, medium-sized insertions, inversions, tandem duplications and other structural variants at single-base resolution from next-gen sequence data. It uses a pattern growth approach to identify the breakpoints of these variants from paired-end short reads.
GNU General Public License v3.0

Question: Disagreement in the coordinates of VCF and internal formats for Pindel #70

Open javang opened 6 years ago

javang commented 6 years ago

I am observing strange discrepancies between the information present in the VCF created by Pindel and the internal file format. Here is what I did:

and the internal format:

3305 D 96 NT 96 "GTACTTGGGATTACTGGCGTACGCCACCACGCCCAGCTAATTTTTGTATTTTTAGTAGAGACGGGGTTTCACCATGTCAACCAGGCTGGTCTCGAA" ChrID 1 BP 10290620 10290717 BP_range 10290620 10290717 Supports 1 1 + 0 0 - 1 1 S1 2 SUM_MS 99 1 NumSupSamples 1 1 pFDA_simTruth_76x_0.4_FEMALE 0 0 0 0 1 1

1 10289908 . GGAGTGAGACTCCGTCTCAAAAAAAAAAAAAAAAAAAAAAAAAAAAGAAAAGAAAATTAGGGGCCAGACGTGGTGGCTCACACCTATAATCCCAGC GGAGTGAGACTCCGTCTCAAAAAAAAAAAAAAAAAAAAAAAAAAAAGAAAAGAAAATTAGGGGCCAGACGTGGTGGCTCACACCTATAATCCCAGCTATTCAGGAGGCTGAGGCAGGAGAATCACTTGAACCCAGGAGGTGGAGGTTGCAGTGAGCTGAGATCGCACCACTGCACTCCAGCCTGGGTCACAGAGTGAGACTCCGTCTCAAAAAAAAAAAAAAAAAAAAAAAAAAAAGAAAAGAAAATTAGGGGCCAGACGTGGTGGCTCACACCTATAATCCCAGC . PASS END=10290003;HOMLEN=0;SVLEN=95;SVTYPE=DUP:TANDEM;NTLEN=95 GT:AD 0/0:0,1

168 TD 95 NT 95 "TATTCAGGAGGCTGAGGCAGGAGAATCACTTGAACCCAGGAGGTGGAGGTTGCAGTGAGCTGAGATCGCACCACTGCACTCCAGCCTGGGTCACA" ChrID 1 BP 10289908 10290004 BP_range 10289908 10290004 Supports 1 1 + 0 0 - 1 1 S1 2 SUM_MS 99 1 NumSupSamples 1 1 pFDA_simTruth_76x_0.4_FEMALE 0 0 0 0 1 1

In this case the (begin, end) coordinates are: reference (10289908, 10290003), VCF (10289908, 10290003), and internal format (10289908, 10290004).
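The arithmetic behind this comparison can be checked with a small script. The following is a minimal sketch, not part of Pindel: the helper names `parse_pindel_summary` and `parse_vcf_record` are hypothetical, the records are split on whitespace as they appear in this thread (a real VCF is tab-separated), and the long sequences are replaced by the placeholders `SEQ`, `REF` and `ALT` for brevity. For the tandem duplication quoted above, the raw BP pair satisfies end - start - 1 = size, while the VCF record satisfies END - POS = SVLEN. That is consistent with the raw BP pair naming the two positions that flank the event, but this reading is an assumption drawn from the numbers and is not confirmed in this thread.

```python
def parse_pindel_summary(line):
    """Return (type, size, bp_start, bp_end) from a raw Pindel summary line."""
    fields = line.split()
    bp = fields.index("BP")  # exact token "BP", distinct from "BP_range"
    return fields[1], int(fields[2]), int(fields[bp + 1]), int(fields[bp + 2])


def parse_vcf_record(line):
    """Return (POS, END, SVLEN) from a whitespace-separated VCF record."""
    fields = line.split()
    # INFO is the eighth column: semicolon-separated key=value pairs.
    info = dict(item.split("=", 1) for item in fields[7].split(";") if "=" in item)
    return int(fields[1]), int(info["END"]), int(info["SVLEN"])


# Records from this thread, with sequences replaced by placeholders.
raw_td = ('168 TD 95 NT 95 "SEQ" ChrID 1 BP 10289908 10290004 '
          'BP_range 10289908 10290004')
vcf_td = ('1 10289908 . REF ALT . PASS '
          'END=10290003;HOMLEN=0;SVLEN=95;SVTYPE=DUP:TANDEM;NTLEN=95 GT:AD 0/0:0,1')

sv_type, size, bp_start, bp_end = parse_pindel_summary(raw_td)
pos, end, svlen = parse_vcf_record(vcf_td)

print(sv_type, "raw span minus one equals size:", bp_end - bp_start - 1 == size)
print(sv_type, "VCF END minus POS equals SVLEN:", end - pos == svlen)
```

The same check applied to the deletion record gives 10290717 - 10290620 - 1 = 96, which matches its reported size of 96.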

The user manual does not explain any of this, so I am at a loss. Any help is appreciated.

EWLameijer commented 6 years ago

If I remember correctly, whereas biologists start a chromosome at position 1, Pindel starts a genome at position 0; pindel2vcf therefore has to 'shift' the raw Pindel position by one place. That may explain the first discrepancy.

In general, the raw Pindel output is really 'raw', and pindel2vcf is meant not only as a simple converter but also to remove duplicates, shift events that have not been reported at the correct place, etc. Working with the raw Pindel data is something you can do, but it is not straightforward.

Note that working with VCF files isn't very straightforward either: while there is an official VCF standard, the details are a bit vague. For example, the GATK format is different from the more general VCF format, hence pindel2vcf has a GATK option.

I do understand your rationale for wanting quality data. The easiest way to achieve that (in my opinion) is to use the different filtering options in pindel2vcf so you can select events which all have a certain minimum support etc.; pindel2vcf has lots of filtering options.

Hope this helps!