Illumina / hap.py

Haplotype VCF comparison tools
Other
402 stars 122 forks source link

FP in phased variant overlapping non-phased variant #146

Open arangrhie opened 3 years ago

arangrhie commented 3 years ago

Hello,

It seems like hap.py does not properly handle phased variants that overlap with non-phased variants.

Here is an example:

chr20   55234351    .   A   T   500 PASS    AC=1;AF=0.5;AN=2;DA=39;DP=78;GM=1;MC=0;MF=0;MQ10=1;MQ20=1;MQ30=1;MQ40=1;MQ50=1;SC=TGTATGTGTAAATATATATAT;set=variant2-variant3   GT:GQ:PS:UG:UQ  1|0:99:55134343:0/1:147.24
chr20   55234351    rs139242639 AAT A,TAT   405.06  PASS    AC=1,1;AF=0.5,0.5;AN=2;DB;DP=17;ExcessHet=3.0103;FS=0;MLEAC=1,1;MLEAF=0.5,0.5;MQ=60;QD=23.83;SOR=1.371;VQSLOD=1.56;culprit=DP;set=variant   GT:AD:DP:GQ:PL  1/2:0,11,6:17:99:422,157,163,307,0,444

and here is what hap.py produced for the same chr20:55234351:

chr20   55234351    .   AAT TAT,A   50  .   BS=55234351;Regions=CONF,TS_contained   GT:BD:BK:BI:BVT:BLT:QQ  2/1:FN:lm:d1_5,tv:INDEL:hetalt:.    ./.:.:.:.:NOCALL:nocall:0
chr20   55234351    .   A   T   500 .   BS=55234351;Regions=CONF,TS_contained   GT:BD:BK:BI:BVT:BLT:QQ  ./.:.:.:.:NOCALL:nocall:.   1/1:FP:lm:tv:SNP:homalt:500
chr20   55234351    .   AAT A   405.06  .   BS=55234351;Regions=CONF,TS_contained   GT:BD:BK:BI:BVT:BLT:QQ  ./.:.:.:.:NOCALL:nocall:.   0/1:FP:lm:d1_5:INDEL:het:405.06

It looks like the 1|0 turned into 1/1 while evaluating, and puts both the variants as FPs.

On the other hand, when a filtered set was provided:

chr20   55234351    rs139242639 AAT A,TAT   405.1   PASS    AC=1,1;AF=0.5,0.5;AN=2;DB;DP=17;ExcessHet=3.0103;FS=0;MLEAC=1,1;MLEAF=0.5,0.5;MQ=60;QD=23.83;SOR=1.371;VQSLOD=1.56;culprit=DP;set=variant   GT:AD:DP:GQ:PL  1/2:0,11,6:17:99:422,157,163,307,0,444

Now the non-phased variant becomes TP:

chr20   55234351    .   AAT TAT,A   50  .   BS=55234351;Regions=CONF,TS_contained   GT:BD:BK:BI:BVT:BLT:QQ  2/1:TP:gm:d1_5,tv:INDEL:hetalt:405.1    ./.:.:.:.:NOCALL:nocall:0
chr20   55234351    .   A   T   405.1   .   BS=55234351;Regions=CONF,TS_contained   GT:BD:BK:BI:BVT:BLT:QQ  ./.:.:.:.:NOCALL:nocall:.   0/1:TP:gm:tv:SNP:het:405.1
chr20   55234351    .   AAT A   405.1   .   BS=55234351;Regions=CONF,TS_contained   GT:BD:BK:BI:BVT:BLT:QQ  ./.:.:.:.:NOCALL:nocall:.   0/1:TP:gm:d1_5:INDEL:het:405.1

Is this an expected behavior in hap.py?

I am using

commit 9d128a99e85bacd47f1f5da0e774e92e88b5745a
Merge: 46ee23c ec8172c
Author: Dylan Skola <56270483+dskola@users.noreply.github.com>
Date:   Thu Feb 13 15:21:32 2020 -0800

    Merge pull request #110 from Illumina/issue_109

    Added missing bcftools invocation to command line for region processing

and benchmarked against

HG002_GRCh38_1_22_v4.2.1_benchmark.bcf
HG002_GRCh38_1_22_v4.2.1_benchmark_noinconsistent.bed

Thanks, Arang

jzook commented 3 years ago

Hi Arang,

Good question! I think hap.py is behaving as expected here, but it's a little complicated. It appears your first vcf was probably merged from multiple callers, so I think it is essentially calling the same heterozygous SNV twice. hap.py tries to find a way to combine these calls as separate variant calls, and the only way to do that is to say the "heterozygous" SNVs are on opposite haplotypes, which essentially makes a homozygous SNV, and it is counted as a FP because it doesn't match the correct heterozygous genotype for the SNV. When you remove this extra SNV from the vcf, then everything matches the benchmark vcf even though it is represented as a single line instead of 2 lines.

Does that make sense? Thanks! Justin

nhansen commented 3 years ago

It's true there is redundancy in the unfiltered example, but the het variant marked as phased is essentially contained in the second (also het) unphased one, so why would hap.py assume it lies on both alleles? If one were to assume it was the same variant, it would make room for the two base deletion contained in the second variant, and everything would be consistent.

Lenbok commented 3 years ago

Does this conversion to homozygous snp happen during preprocessing stages, or is it part of the comparison engine? What happens if you use the vcfeval engine? When I ran this example directly with rtg vcfeval (i.e. outside of hap.py), the ga4gh intermediate output file contains:

chr20   55234351        .       A       T       .       PASS    BS=55234351     GT:QQ:BD:BK     .       1|0:50.0:FP:lm
chr20   55234351        .       AAT     A,TAT   .       PASS    BS=55234351     GT:BD:BK:QQ     1/2:TP:gm       1/2:TP:gm:50.0

Which is exactly what I expect it should do -- the spurious het SNP is classified as FP.