Intel-HLS / GenomicsDB

GenomicsDB
Other
111 stars 28 forks source link

Called GTs are poorly formatted and sometimes inaccurate #161

Closed ldgauthier closed 6 years ago

ldgauthier commented 6 years ago

Importing sample1:

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  Sample1
chr20   1274366 .       A       G,<NON_REF>     562.77  .       DP=26;MQRankSum=-1.011;MQ_DP=26;QUALapprox=591;RAW_MQ=93600.00;ReadPosRankSum=0.578;VarDP=24    GT:AD:DP:GQ:PL:SB       1/1:1,16,0:16:99:368,166,0,379,171,547:1,0,12,11

and sample2:

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  Sample2
chr20   1274365 .       TAA     T,TA,<NON_REF>  562.77  .       DP=26;MQRankSum=-1.011;MQ_DP=26;QUALapprox=591;RAW_MQ=93600.00;ReadPosRankSum=0.578;VarDP=24    GT:AD:DP:GQ:PL:SB       1/2:1,7,16,0:24:99:591,353,368,166,0,123,542,379,171,547:1,0,12,11

into a GDB and querying it with PRODUCE_GT_FIELD set to true produces:

chr20   1274365 .       TAA     T,TA,<NON_REF>  .       .       DP=26;MQRankSum=-1.011e+00;MQ_DP=26;QUALapprox=591;RAW_MQ=93600.00;ReadPosRankSum=0.578;VarDP=24        GT:AD:DP:GQ:PL:SB       0       1/2:1,7,16,0:24:99:591,353,368,166,0,123,542,379,171,547:1,0,12,11
chr20   1274366 .       A       G,*,<NON_REF>   .       .       DP=52;MQRankSum=-1.011e+00;MQ_DP=26;QUALapprox=591;RAW_MQ=93600.00;ReadPosRankSum=0.578;VarDP=24        GT:AD:DP:GQ:PL:SB       1/1:1,16,0,0:16:99:368,166,0,379,171,547,379,171,547,547:1,0,12,11      3/2:1,0,16,0:24:99:591,542,547,166,171,123,542,547,171,547:1,0,12,11
chr20   1274367 .       A       *,<NON_REF>     .       .       DP=26   GT:AD:DP:GQ:PL:SB       0       2/1:1,16,0:24:99:591,166,123,542,171,547:1,0,12,11

Note that the GT for the first sample at the last position is output as 2/1 (I'm not sure if that's against the spec, but it's certainly against convention) and after examining the PLs, the best likelihood is actually on the 2/2 genotype.

kgururaj commented 6 years ago

Some clarification questions. For the second input VCF:

ldgauthier commented 6 years ago

Sorry for the incomplete bug report.

The expectation is that the genotype call matches the PLs. The PLs for sample1 at chr20: 1274367 are [591,166,123,542,171,547], with the lowest number being the lowest likelihood of being incorrect. Those positions correspond to [1/1, 1/2, 2/2, 1/3, 2/3, 3/3], so the lowest is actually at 2/2. I agree that the PLs are not accurate. That's an issue from the GATK code that you faithfully copied where we only allow one spanning deletion allele and pick the "best" one using the likelihoods. But given that those are the PLs, the GT should agree.

The VCF spec doesn't seem to specify that the smaller allele number comes first in the genotype (http://samtools.github.io/hts-specs/VCFv4.2.pdf section 1.4.2), but that's been the convention for a long time. Thus, we'd prefer 2/3 to 3/2. I see that you're trying to preserve the order of the alleles from the original variant, but that only matters if the genotype is phased (e.g. 3|2 is permissible). The zeros where data is missing should be ./., which is justified by the specification about missing values being specified with a dot. So if neither allele of the genotype is known it goes to ./.

kgururaj commented 6 years ago

Some more questions/clarifications:

ldgauthier commented 6 years ago

After poring over the VCF 4.2 spec, it looks like is it NOT required that the GT call match the min PL. It's a convention many tools follow, but not mandatory. The ordering of the alleles is not described in the spec either, so I'm wrong about everything but the 0.

kgururaj commented 6 years ago

Would your preference be to use the min PL genotype for the GT field in the spanning deletion? This can be done

ldgauthier commented 6 years ago

Personally my preference would be to use the min PL genotype for the GT field, but this mode isn't used in GATK so that may not be important.

kgururaj commented 6 years ago

fixed