Illumina / hap.py

Haplotype VCF comparison tools
Other
407 stars 124 forks source link

Incorrect FN assignment in latest hap.py #33

Open depristo opened 6 years ago

depristo commented 6 years ago

Hi Happy authors,

We've been using hap.py for our internal DeepVariant evaluations and recently did a deep dive into some of the FPs and FNs and discovered what looks like a bug in happy. Here's the situation (these are using the b37 truth from GIAB for HG002 if that's useful):

Candidate variant call:
20      16392574        .       C       CGTGT   0       .       .       GT      1/0

truth:
20      16392574        .       CGTGTGTGT       CGTGTGTGTGTGT,C 50      PASS    .     GT:DP:ADALL:AD:GQ:IGT:IPS:PS    1|2:259:0,33,47:2,43,61:297:2|1:.:PATMAT

happy vcf output:
20      16392574        .       CGTGTGTGT       C       50      .       BS=16392574;Regions=CONF,TS_contained   GT:BD:BK:BI:BVT:BLT:QQ  0/1:FN:lm:d6_15:INDEL:het:. ./.:.:.:.:NOCALL:nocall:0
20      16392574        .       C       CGTGT   0       .       BS=16392574;Regions=CONF,TS_contained   GT:BD:BK:BI:BVT:BLT:QQ  ./.:.:.:.:NOCALL:nocall:.  0/1:FP:lm:i1_5:INDEL:het:0
20      16392582        .       T       TGTGT   50      .       BS=16392574;Regions=CONF,TS_contained   GT:BD:BK:BI:BVT:BLT:QQ  0/1:FN:lm:i1_5:INDEL:het:. ./.:.:.:.:NOCALL:nocall:0

In this situation we have a multi-allelic het-alt in truth and a candidate variant that contains only one of the two alleles (C=>CGTGT is the same as CGTGTGTGT=>CGTGTGTGTGTGT since we can remove the suffix GTGTGTGT to produce C=>CGTGT). However, in the happy VCF output you can see it is saying there's on FN (for the deletion allele), one FP for the hom-ref assignment for the C=>CGTGT allele, and one extra and incorrect FN allele T=>TGTGT. This last allele is actually the same as the C=>CGTGT but it's been put at an unusual position (16392582 not 16392574) as the reference in this region is a repeat of lots of GT dinucleotides.

It's unclear to use if this is just a problem with the VCF emitter, but if it affects the actual TP/FP/FN calculations then it's definitely much more serious.

All the best,

Mark

pkrusche commented 6 years ago

Thanks for reporting this issue!

I think this should be specific to the xcmp comparison engine that hap.py uses by default -- using the --engine=vcfeval command line option should fix this. This is the default in the GA4GH benchmarking recommendations and hap.py precisionFDA app + will also be in future versions of hap.py.

Xcmp can only match variants if they either match by record or if the entire locus matches. Hap.py can use vcfeval internally to do the comparison, which might be able to match these records up since it uses branch-and-bound optimisation to find the best variant to haplotype playback.

The representation / left-shifting issue is related to hap.py trying to avoid changing the possible haplotypes a VCF file can represent: it won't left-shift past the reference end of the previous variant. The current version doesn't handle phasing very well and it doesn't recognize that it could actually shift this particular variant further. We do have an improved normalization engine which we will release soon.

Lenbok commented 6 years ago

I made a quick test case for these input variants and vcfeval will output the following ga4gh intermediate formatted match (give or take other annotations):

20      16392574        .       C       CGTGT   .       .       BS=16392575     GT:BD:QQ:BK     .:N     1/0:FP:0.0:am
20      16392574        .       CGTGTGTGT       CGTGTGTGTGTGT,C .       PASS    BS=16392575     GT:BD:BK:BI     1|2:FN:am:multi .:N

So the am indicates that an allele match is found, as expected.