broadinstitute / gatk

Official code repository for GATK versions 4 and up
https://software.broadinstitute.org/gatk

Revert GGVCFs PGT correction in light of better PGT/PID definition #6952

Open ldgauthier opened 3 years ago

ldgauthier commented 3 years ago

GenotypeGVCFs currently "corrects" a PGT of 0|1 or 1|0 on homozygous-variant genotypes, but in some cases the two haplotypes really do differ. Revert this "correction" so we don't lose or falsify haplotype information.

As discussed in #6937: Under the model assumed by this code, the phaseGT in the code and the PGT format field on each genotype can be interpreted as an indicator of which of the two phased haplotypes in the sample contains the site-specific alternate allele at the site (i.e. excluding `*`, which represents variation that begins upstream of the current variant). Note that this results in cases where PGT is not the same as the phased GT field. For example, in the case of a spanned SNP site with REF allele A and alt alleles C and `*`, GT may be set to 1|2 to represent the spanned SNP, while PGT would be set to 1|0 to represent the fact that it is the first haplotype in the pair of phased haplotypes that contains the site-specific alt allele (in this case C).
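To make that interpretation concrete, here is a minimal Python sketch. It is not GATK code: the function name `pgt_from_phased_gt` and the constant `SPAN_DEL` are hypothetical, and it assumes the non-site-specific allele is the VCF spanning-deletion allele `*`. It marks each haplotype with 1 if it carries a site-specific alt allele and 0 otherwise.

```python
SPAN_DEL = "*"  # VCF spanning-deletion allele (assumed to be the excluded allele)


def pgt_from_phased_gt(gt, alleles):
    """Derive a PGT-style string from a phased GT.

    gt      -- phased genotype string such as "1|2"
    alleles -- list of alleles at the site, REF first, e.g. ["A", "C", "*"]
    """
    flags = []
    for index_text in gt.split("|"):
        index = int(index_text)
        # A haplotype gets "1" only if it carries an alt allele that is
        # specific to this site, i.e. not REF and not the spanning deletion.
        site_specific_alt = index > 0 and alleles[index] != SPAN_DEL
        flags.append("1" if site_specific_alt else "0")
    return "|".join(flags)


# The spanned SNP example from the comment above: REF A, alts C and *.
print(pgt_from_phased_gt("1|2", ["A", "C", "*"]))  # 1|0
```

The point of the sketch is only that GT and PGT can legitimately disagree: GT records the allele on each haplotype, while PGT records which haplotype holds the site-specific alt.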

ldgauthier commented 3 years ago

But do drop phasing for hom ref sites by default so users aren't confused. Maybe add a flag --retain-all-phasing-info to keep it.
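As a rough illustration of that default, here is a Python sketch rather than GATK code. The function `drop_hom_ref_phasing` and its dict-of-FORMAT-fields input are hypothetical, the `retain_all_phasing_info` parameter simply mirrors the flag proposed above (which may not exist in any release), and rewriting GT as unphased is one possible reading of "drop phasing".

```python
import re


def drop_hom_ref_phasing(fields, retain_all_phasing_info=False):
    """Return a copy of a genotype's FORMAT fields, with phasing removed
    from hom-ref calls unless the caller asks to keep it.

    fields -- dict such as {"GT": "0|0", "PGT": "0|1", "PID": "100_A_C"}
    """
    if retain_all_phasing_info:
        return dict(fields)

    calls = re.split(r"[|/]", fields.get("GT", ""))
    is_hom_ref = all(call == "0" for call in calls)
    if not is_hom_ref:
        # Het and hom-var genotypes keep their phasing untouched.
        return dict(fields)

    # Hom-ref: drop PGT/PID and write GT as unphased.
    cleaned = {k: v for k, v in fields.items() if k not in ("PGT", "PID")}
    cleaned["GT"] = "/".join(calls)
    return cleaned


print(drop_hom_ref_phasing({"GT": "0|0", "PGT": "0|1", "PID": "100_A_C"}))
```

With the hypothetical flag set, the same call would return the fields unchanged, which is the opt-in behavior the comment suggests.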