ga4gh / benchmarking-tools

Repository for the GA4GH Benchmarking Team work developing standardized benchmarking methods for germline small variant calls
Apache License 2.0

Records in intermediate VCF format #13

Open bioinformed opened 8 years ago

bioinformed commented 8 years ago

Apologies for jumping into this discussion late. My question is: what are the records in the intermediate VCF format? For example, given these two inputs:

Truth

1 10 AT GT,AC 0/1

Query

1 10 A G 0/1
1 11 T C 0/1

What does the intermediate output look like for this matching case?

And for this non-matching case:

Truth

1 10 AT GT,AC 0/1

Query

1 10 A G 1/1
1 11 T C 1/1

bioinformed commented 8 years ago

@pkrusche: More questions. For some reason I thought hap.py and xcmp already implemented the consensus intermediate format. Here is the current xcmp output from hap.py for my first example above (using valid ref coordinates):

##fileformat=VCFv4.1
...
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT TRUTH QUERY
1 50011 . AT GT 1000 . gtt1=gt_het;type=FN;kind=missing;ctype=hap:match;HapMatch GT 0/1 ./.
1 50011 . A G 1000 . gtt2=gt_het;type=FP;kind=missing;ctype=hap:match;HapMatch GT ./. 0/1
1 50012 . T T 1000 . gtt2=gt_het;type=FP;kind=missing;ctype=hap:match;HapMatch GT ./. 0/1

Why is type equal to FN or FP in any of the records? The superloci match, which to me implies that type should equal TP in all records. Is this not the case in the consensus intermediate format?
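To make the fields in question easier to inspect, the INFO column of each record can be split into key/value pairs. This is a minimal Python sketch; the helper name `parse_info` is hypothetical and is not part of hap.py or xcmp, and the INFO strings are copied from the three records quoted above.

```python
def parse_info(info):
    """Split a VCF INFO string into a dict; bare flags map to True."""
    fields = {}
    for item in info.split(";"):
        if not item:
            continue
        key, sep, value = item.partition("=")
        # Entries without "=" (e.g. HapMatch) are flags.
        fields[key] = value if sep else True
    return fields


# INFO strings from the three xcmp records in this comment.
infos = [
    "gtt1=gt_het;type=FN;kind=missing;ctype=hap:match;HapMatch",
    "gtt2=gt_het;type=FP;kind=missing;ctype=hap:match;HapMatch",
    "gtt2=gt_het;type=FP;kind=missing;ctype=hap:match;HapMatch",
]

for info in infos:
    parsed = parse_info(info)
    print(parsed["type"], parsed["kind"], parsed["ctype"], parsed.get("HapMatch", False))
```

Every record carries ctype=hap:match and the HapMatch flag, while the FN/FP label sits in the type field.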

pkrusche commented 8 years ago

@bioinformed: Regarding hap.py / xcmp: they will implement the new intermediate format soon, probably in February. (The format started out similar to what hap.py is writing, but changed during the discussion.)

pkrusche commented 8 years ago

In the matching case, if the comparison tool chooses not to split any input variants, I guess the only way to output the result is to print the records as they were and pad the missing side with "." genotypes. The BD values below assume strict GT comparison:

CHROM POS REF ALT    FORMAT      T          Q
1     10  AT  GT,AC  GT:BK:BD    0/1:gm:TP  .:gm:TP
1     10  A   G      GT:BK:BD    .:gm:TP    0/1:gm:TP
1     11  T   C      GT:BK:BD    .:gm:TP    0/1:gm:TP 
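For anyone wanting to read such rows programmatically, here is a minimal Python sketch that splits each record into its per-sample GT/BK/BD values. It assumes the abbreviated column layout used in this comment (CHROM POS REF ALT FORMAT T Q), not a full VCF line; `parse_record` is a hypothetical helper name.

```python
# Abbreviated column layout used in this thread (not a full VCF header).
HEADER = "CHROM POS REF ALT FORMAT T Q".split()


def parse_record(line):
    """Parse one whitespace-separated row into a dict with nested T and Q fields."""
    rec = dict(zip(HEADER, line.split()))
    keys = rec["FORMAT"].split(":")  # e.g. ["GT", "BK", "BD"]
    for sample in ("T", "Q"):
        rec[sample] = dict(zip(keys, rec[sample].split(":")))
    return rec


# Rows of the matching case shown above.
rows = [
    "1 10 AT GT,AC GT:BK:BD 0/1:gm:TP .:gm:TP",
    "1 10 A  G     GT:BK:BD .:gm:TP   0/1:gm:TP",
    "1 11 T  C     GT:BK:BD .:gm:TP   0/1:gm:TP",
]

records = [parse_record(row) for row in rows]
for rec in records:
    print(rec["POS"], rec["REF"], rec["ALT"], rec["T"], rec["Q"])
```

The "." genotype marks the side that did not contribute the record, while BK and BD are still filled in for both columns.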

For the mismatch case, it would depend on whether we want to require the comparison tool to be able to pick up a possible allele match. If so, it would probably output this:

CHROM POS REF ALT    FORMAT      T          Q
1     10  AT  GT,AC  GT:BK:BD    0/1:am:FP  .:am:FP
1     10  A   G      GT:BK:BD    .:am:FP    1/1:am:FP
1     11  T   C      GT:BK:BD    .:am:FP    1/1:am:FP 

This gives another corner case by the way if GTs are the other way around:

CHROM POS REF ALT    FORMAT      T          Q
1     10  AT  GT,AC  GT:BK:BD    1/1:lm:FP  .:lm:FP
1     10  A   G      GT:BK:BD    .:lm:FP    0/1:lm:FP
1     11  T   C      GT:BK:BD    .:lm:FP    0/1:lm:FP 

The reason I would go for lm instead of am here is that there is a way to phase the query calls that makes the alleles mismatch: putting the SNPs onto different haplotypes.
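To see which haplotypes the unphased query calls can produce, the two heterozygous SNPs can be placed either on the same haplotype or on different ones. The Python sketch below enumerates both options. It assumes the reference bases at positions 10-11 are AT (taken from the truth REF); the function names are hypothetical, and it does not model any comparison tool's matching rules.

```python
from itertools import product


def apply_snps(ref, start, snps):
    """Return ref with each (pos, alt_base) SNP substituted; start is the position of ref[0]."""
    bases = list(ref)
    for pos, alt in snps:
        bases[pos - start] = alt
    return "".join(bases)


def het_phasings(ref, start, het_snps):
    """Enumerate the distinct unordered haplotype pairs for unphased het SNPs."""
    pairs = set()
    for assignment in product((0, 1), repeat=len(het_snps)):
        hap0 = [s for s, a in zip(het_snps, assignment) if a == 0]
        hap1 = [s for s, a in zip(het_snps, assignment) if a == 1]
        pair = tuple(sorted((apply_snps(ref, start, hap0),
                             apply_snps(ref, start, hap1))))
        pairs.add(pair)
    return sorted(pairs)


REF, START = "AT", 10            # assumed reference window at positions 10-11
query = [(10, "G"), (11, "C")]   # the two het query SNPs: A>G at 10, T>C at 11

print(het_phasings(REF, START, query))
# [('AC', 'GT'), ('AT', 'GC')]
```

Placing the SNPs on different haplotypes yields GT and AC; placing both on the same haplotype yields GC alongside the reference AT.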

Does this look reasonable?