Closed ldgauthier closed 6 years ago
Some clarification questions. For the second input VCF:
Sorry for the incomplete bug report.
The expectation is that the genotype call matches the PLs. The PLs for sample1 at chr20: 1274367 are [591,166,123,542,171,547], with the lowest number being the lowest likelihood of being incorrect. Those positions correspond to [1/1, 1/2, 2/2, 1/3, 2/3, 3/3], so the lowest is actually at 2/2. I agree that the PLs are not accurate. That's an issue from the GATK code that you faithfully copied where we only allow one spanning deletion allele and pick the "best" one using the likelihoods. But given that those are the PLs, the GT should agree.
The VCF spec doesn't seem to specify that the smaller allele number comes first in the genotype (http://samtools.github.io/hts-specs/VCFv4.2.pdf section 1.4.2), but that's been the convention for a long time. Thus, we'd prefer 2/3 to 3/2. I see that you're trying to preserve the order of the alleles from the original variant, but that only matters if the genotype is phased (e.g. 3|2 is permissible). The zeros where data is missing should be ./., which is justified by the specification about missing values being specified with a dot. So if neither allele of the genotype is known it goes to ./.
Some more questions/clarifications:
After poring over the VCF 4.2 spec, it looks like is it NOT required that the GT call match the min PL. It's a convention many tools follow, but not mandatory. The ordering of the alleles is not described in the spec either, so I'm wrong about everything but the 0.
Would your preference be to use the min PL genotype for the GT field in the spanning deletion? This can be done
Personally my preference would be to use the min PL genotype for the GT field, but this mode isn't used in GATK so that may not be important.
fixed
Importing sample1:
and sample2:
into a GDB and querying it with PRODUCE_GT_FIELD set to true produces:
Note that the GT for the first sample at the last position is output as 2/1 (I'm not sure if that's against the spec, but it's certainly against convention) and after examining the PLs, the best likelihood is actually on the 2/2 genotype.