Closed by jzook 8 years ago
Here is at least a data point:
When using the RTG sex-aware variant calling pipeline, calls on the X and Y chromosomes for a male are output as haploid calls, except in the PAR regions, where we produce diploid calls. (I note that when Rutgers ran the RTG trio caller on the AJ trio, they had not enabled sex-aware calling, so the calls for X and Y were made with a diploid model.)
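To make the ploidy rule concrete, here is a minimal sketch of how a tool might decide the expected ploidy at a site. It is not RTG's implementation: the function name `expected_ploidy` and its parameters are hypothetical, the PAR intervals must be supplied by the caller for whichever reference build is in use, and treating the Y copy of PAR as diploid is an assumption of this sketch.

```python
from typing import Sequence, Tuple

# 1-based, inclusive (start, end). Real PAR coordinates depend on the
# reference build and are deliberately NOT hard-coded here.
Interval = Tuple[int, int]


def expected_ploidy(chrom: str, pos: int, sex: str,
                    par_x: Sequence[Interval],
                    par_y: Sequence[Interval]) -> int:
    """Return the expected GT ploidy at a site (hypothetical helper).

    Only X and Y get special handling; every other contig is assumed
    diploid, so MT would need its own rule.
    """
    name = chrom[3:] if chrom.startswith("chr") else chrom

    def in_any(intervals: Sequence[Interval]) -> bool:
        return any(start <= pos <= end for start, end in intervals)

    if name == "X":
        if sex == "female":
            return 2
        # Male: diploid inside PAR, haploid elsewhere on X.
        return 2 if in_any(par_x) else 1
    if name == "Y":
        if sex == "female":
            return 0  # no Y chromosome expected
        # Assumption: PAR on Y is reported as diploid as well.
        return 2 if in_any(par_y) else 1
    return 2
```

For example, with a made-up PAR interval of `(1, 100)`, `expected_ploidy("X", 500, "male", [(1, 100)], [(1, 100)])` gives 1, while position 50 gives 2.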
It seems more correct for the truth set to use the correct ploidy in the GT for X and Y. There is no downside, at least as far as vcfeval is concerned: it does not distinguish between a haploid call and a homozygous call (i.e. "1" is treated the same as "1/1"), so callers that do not use the haploid representation will not be penalized. (They will, of course, be penalized if they make a heterozygous call.)
The consensus from our call was to follow what @Lenbok proposed above. The PAR regions should be represented as diploid on X, and if it is known which PAR variants are on X and which are on Y, they should be phased accordingly. In the other regions of X and Y, genotypes should be haploid, but comparison tools should treat a homozygous variant call as equivalent to a haploid variant call (i.e., 1/1 is the same as 1 in the GT field).
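As a rough illustration of that equivalence rule, a comparison tool could collapse homozygous diploid genotypes to their haploid form before comparing GT strings. This is only a sketch under simple assumptions: `normalize_gt` and `genotypes_equivalent` are hypothetical names, phasing is ignored, and it compares GT strings at a single site rather than doing the haplotype-level matching that vcfeval performs.

```python
import re
from typing import Tuple


def normalize_gt(gt: str) -> Tuple[str, ...]:
    """Collapse a GT string to an order-independent allele tuple.

    A homozygous diploid genotype such as "1/1" or "1|1" is reduced
    to the single allele ("1",) so it matches the haploid call "1".
    """
    alleles = re.split(r"[/|]", gt)  # split on unphased or phased separator
    if len(alleles) == 2 and alleles[0] == alleles[1]:
        alleles = alleles[:1]
    return tuple(sorted(alleles))  # sorting discards phase


def genotypes_equivalent(truth_gt: str, query_gt: str) -> bool:
    """True when two GT values agree under the haploid == homozygous rule."""
    return normalize_gt(truth_gt) == normalize_gt(query_gt)
```

Under this rule `genotypes_equivalent("1", "1/1")` holds, whereas a heterozygous call such as "0/1" does not match a haploid "1" and would still be counted as an error.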
The previous GIAB high-confidence callsets have ignored Y, since NA12878 is female; they have ignored MT, because we haven't developed methods to call high-confidence variants there; and they have treated X the same as the other chromosomes. Our next GIAB genome is a son in a trio, so we will need to treat X and Y differently, both in the creation of high-confidence calls and in benchmarking. As I understand it, there are a few ways that different tools express variants on X and Y in males.