Representing and benchmarking variant calls on X, Y, and MT

jzook commented 8 years ago

Previous GIAB high-confidence callsets have ignored Y because NA12878 is female, ignored MT because we haven't developed methods to call high-confidence variants there, and treated X the same as the other chromosomes. Our next GIAB genome is a son in a trio, so we will need to treat X and Y differently, both when creating high-confidence calls and when benchmarking. As I understand it, there are a few ways that different tools express variants on X and Y in males.

In pseudo-autosomal regions, call variants as heterozygous or homozygous on X and mask Y. Eventually, I assume long reads, linked reads, or other methods will be able to phase variants and determine which are on X and which are on Y, so this might be only a short-term solution, but perhaps it is the best option with current methods?
In other regions of X and Y, for males it seems best to output variants with "1" in the GT field, rather than "1/1", but many variant callers do not do this. Should benchmarking tools allow either representation when doing comparisons and output a warning, or should they reject the "1/1" representation?
Similarly, for MT, should benchmarking tools reject "1/1" in the GT field? How should heteroplasmy be represented?
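
To make the "allow with a warning" versus "reject" options concrete, here is a minimal Python sketch of how a benchmarking tool might check a GT value in a region that is expected to be haploid (for example, non-PAR X or Y in a male). The function name `check_haploid_region_gt` and the `strict` flag are hypothetical and are not taken from any existing tool. The sketch does not address how heteroplasmy on MT should be represented.

```python
import warnings


def check_haploid_region_gt(gt, strict=False):
    """Return True if `gt` is acceptable in a region expected to be haploid.

    Hypothetical helper for illustration only.
    - A haploid GT such as "1" is always accepted.
    - A diploid homozygous GT such as "1/1" is accepted with a warning,
      unless strict=True, in which case it is rejected.
    - A heterozygous GT such as "0/1" is always rejected.
    """
    # Treat phased ("|") and unphased ("/") separators the same way.
    alleles = gt.replace("|", "/").split("/")
    if len(alleles) == 1:
        return True
    if len(set(alleles)) == 1 and not strict:
        warnings.warn(
            f"diploid homozygous GT {gt!r} in a haploid region; treating as haploid"
        )
        return True
    return False
```

With `strict=False`, `"1/1"` passes with a warning. With `strict=True`, only single-allele GT values such as `"1"` pass.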

Lenbok commented 8 years ago

Here is at least a data point:

When using the RTG sex-aware variant calling pipeline, calls on the X and Y chromosomes for a male are output as haploid calls, apart from the PAR regions, where we produce diploid calls. (I note that when Rutgers ran the RTG trio caller on the AJ trio, they had not enabled sex-aware calling, so the calls for X and Y were made with a diploid model.)

It seems more correct for the truth set to use the correct ploidy in the GT for X and Y. There isn't a downside, at least as far as vcfeval is concerned: it does not distinguish between a haploid call and a homozygous call (i.e. "1" is treated the same as "1/1"), so callers that do not use the haploid representation will not be penalized. Of course, they will still be penalized if they make a heterozygous call.
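
The equivalence described above ("1" treated the same as "1/1") can be illustrated with a short Python sketch. This is not vcfeval's implementation. `normalize_gt` and `gt_equivalent` are hypothetical names, and the sketch compares only the allele indices in the GT string, ignoring phasing and the REF/ALT alleles themselves.

```python
import re


def normalize_gt(gt):
    """Return a sorted tuple of allele indices, or None if any allele is missing.

    A haploid GT is expanded to the matching homozygous diploid GT,
    so "1" and "1/1" both normalize to (1, 1).
    """
    alleles = re.split(r"[/|]", gt)
    if any(a == "." for a in alleles):
        return None  # missing call, e.g. "." or "./."
    indices = sorted(int(a) for a in alleles)
    if len(indices) == 1:
        indices = indices * 2
    return tuple(indices)


def gt_equivalent(truth_gt, query_gt):
    """True if the two GT strings denote the same genotype after normalization."""
    truth, query = normalize_gt(truth_gt), normalize_gt(query_gt)
    return truth is not None and truth == query
```

Under this comparison, a haploid truth call of `"1"` matches a query of `"1/1"` or `"1|1"`, but not a heterozygous `"0/1"`.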

jzook commented 8 years ago

The consensus from our call was to follow what @Lenbok proposed above. The PAR regions should be represented as diploid on X, and if it is known which PAR variants are on X and which are on Y, they should be phased accordingly. In the other regions of X and Y, genotypes should be haploid, but comparison tools should treat a homozygous variant call as equivalent to a haploid one (i.e., 1/1 is the same as 1 in the GT field).
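
As a rough illustration of this convention, the Python sketch below returns the expected GT ploidy for a position given the sample's sex. The function names are hypothetical. The PAR intervals are the commonly cited GRCh37 coordinates and should be treated as an assumption to verify against your reference build. Chromosome names are assumed to have no "chr" prefix, and Y is treated as masked (ploidy 0) inside the PAR. MT is deliberately left out because the thread did not settle on a representation for it.

```python
# Commonly cited GRCh37 PAR intervals (1-based, inclusive).
# Assumption: verify these against the reference build you actually use.
PAR_GRCH37 = {
    "X": [(60001, 2699520), (154931044, 155260560)],
    "Y": [(10001, 2649520), (59034050, 59373566)],
}

AUTOSOMES = {str(i) for i in range(1, 23)}


def in_par(chrom, pos, par=PAR_GRCH37):
    """True if (chrom, pos) falls inside a pseudo-autosomal interval."""
    return any(start <= pos <= end for start, end in par.get(chrom, []))


def expected_ploidy(chrom, pos, sex, par=PAR_GRCH37):
    """Expected GT ploidy at a position; 0 means no calls are expected there."""
    if chrom in AUTOSOMES:
        return 2
    if chrom == "X":
        # Females are diploid on all of X; males are diploid only in the PAR.
        return 2 if sex == "female" or in_par("X", pos, par) else 1
    if chrom == "Y":
        if sex == "female":
            return 0
        # PAR variants are represented on X, so Y is masked there.
        return 0 if in_par("Y", pos, par) else 1
    raise ValueError(f"no ploidy convention defined for contig {chrom!r}")
```

A truth-set builder or benchmarking tool could use a lookup like this to decide whether a haploid or diploid GT is the expected representation at each site.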

ga4gh / benchmarking-tools

Representing and benchmarking variant calls on X, Y, and MT #16