Open bioinformed opened 8 years ago
@pkrusche: More questions. For some reason I thought hap.py and xcmp already implemented the consensus intermediate format. Here is the current xcmp output from hap.py for my first example above (using valid ref coordinates):
fileformat=VCFv4.1
...
CHROM POS ID REF ALT QUAL FILTER INFO FORMAT TRUTH QUERY
1 50011 . AT GT 1000 . gtt1=gt_het;type=FN;kind=missing;ctype=hap:match;HapMatch GT 0/1 ./.
1 50011 . A G 1000 . gtt2=gt_het;type=FP;kind=missing;ctype=hap:match;HapMatch GT ./. 0/1
1 50012 . T T 1000 . gtt2=gt_het;type=FP;kind=missing;ctype=hap:match;HapMatch GT ./. 0/1
Why is kind equal to FN or FP in any of the records? The superloci match, which to me implies that kind should equal TP in all records. Is this not the case in the consensus intermediate format?
@bioinformed: about hap.py / xcmp: they will implement the new intermediate format soon, probably in February (the format started out similar to what hap.py writes now, but changed during the discussion).
In the matching case, if the comparison tool chooses to not split any input variants, I guess the only way to output the result is to print the records as they were and add "." genotypes to pad. The BDs would be for strict GT comparison:
CHROM POS REF ALT FORMAT T Q
1 10 AT GT,AC GT:BK:BD 0/1:gm:TP .:gm:TP
1 10 A G GT:BK:BD .:gm:TP 0/1:gm:TP
1 11 T C GT:BK:BD .:gm:TP 0/1:gm:TP
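As a rough illustration of the padding described above, here is a minimal Python sketch. The function name pad_records and the dict-based input are hypothetical and not part of hap.py or xcmp, and it assumes a single BK/BD value applies to every record in the superlocus, which holds for this example but is a simplification.

```python
def pad_records(truth, query, bk, bd):
    """Emit unsplit records, padding the absent side with a "." genotype.

    truth, query -- dicts mapping (chrom, pos, ref, alt) to a GT string
    bk, bd       -- BK and BD values applied to every record (simplification)
    """
    # Sort by chromosome and position, longer REF first, to mirror the example.
    keys = sorted(set(truth) | set(query), key=lambda k: (k[0], k[1], -len(k[2])))
    lines = []
    for key in keys:
        chrom, pos, ref, alt = key
        t = f"{truth.get(key, '.')}:{bk}:{bd}"  # "." when truth lacks this record
        q = f"{query.get(key, '.')}:{bk}:{bd}"  # "." when query lacks this record
        lines.append(f"{chrom} {pos} {ref} {alt} GT:BK:BD {t} {q}")
    return lines


truth = {("1", 10, "AT", "GT,AC"): "0/1"}
query = {("1", 10, "A", "G"): "0/1", ("1", 11, "T", "C"): "0/1"}
for line in pad_records(truth, query, "gm", "TP"):
    print(line)
```

With these inputs the sketch prints the same three records as the table above.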
For the mismatch case, it would depend on whether we want to require the comparison tool to be able to pick up a possible allele match. If so, it would probably output this:
CHROM POS REF ALT FORMAT T Q
1 10 AT GT,AC GT:BK:BD 0/1:am:FP .:am:FP
1 10 A G GT:BK:BD .:am:FP 1/1:am:FP
1 11 T C GT:BK:BD .:am:FP 1/1:am:FP
This gives another corner case, by the way, if the GTs are the other way around:
CHROM POS REF ALT FORMAT T Q
1 10 AT GT,AC GT:BK:BD 1/1:lm:FP .:lm:FP
1 10 A G GT:BK:BD .:lm:FP 0/1:lm:FP
1 11 T C GT:BK:BD .:lm:FP 0/1:lm:FP
The reason I would go for lm instead of am here is that there is a way to phase the query calls that makes the alleles mismatch, by putting the SNPs onto different haplotypes.
Does this look reasonable?
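The phasing argument can be checked with a small sketch that enumerates the haplotype pairs an unphased set of SNP calls can spell. The function name phasings and its input layout are hypothetical, chosen only for illustration; the reference sequence AT at positions 10-11 is taken from the REF column of the records above.

```python
from itertools import product


def phasings(ref, start, calls):
    """Return the set of unordered haplotype pairs consistent with unphased calls.

    ref   -- reference sequence covering the locus, e.g. "AT"
    start -- position of ref[0]
    calls -- list of (pos, alt_base, gt) SNP calls; gt is a pair of 0/1 allele indices
    """
    pairs = set()
    # Each call can be assigned to the two haplotypes in either orientation.
    for flips in product((False, True), repeat=len(calls)):
        haps = [list(ref), list(ref)]
        for (pos, alt, gt), flip in zip(calls, flips):
            alleles = gt[::-1] if flip else gt
            for hap, allele in zip(haps, alleles):
                if allele == 1:
                    hap[pos - start] = alt
        # Sort within the pair so that haplotype order does not matter.
        pairs.add(tuple(sorted("".join(h) for h in haps)))
    return pairs


query = [(10, "G", (0, 1)), (11, "C", (0, 1))]
print(sorted(phasings("AT", 10, query)))
# [('AC', 'GT'), ('AT', 'GC')]
```

With both het SNPs on the same haplotype the query spells AT and GC; on different haplotypes it spells GT and AC, the two ALT alleles listed in the truth record.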
Apologies for jumping into this discussion late. My question is: what are the records in the intermediate VCF format? E.g., given two inputs:
Truth
Query
What does the intermediate output look like for this matching case?
And for this non-matching case:
Truth
Query