Potential bug with hom-alt truth and multiple het candidate representation

Illumina / hap.py

Haplotype VCF comparison tools

Other

420 stars 123 forks source link

Hi hap.py authors:

I think I may have run into a bug in happy, but I'm not 100% certain. Here's the example:

query: 20 57777379 . C CT 0 . . GT 0/1 20 57777394 . TA T 0 . . GT 0/1 20 57777395 . A T 0 . . GT 0/1

truth: 20 57777395 rs3899824 A T 50 PASS . GT:DP:ADALL:AD:GQ:IGT:I\ PS:PS 1|1:639:13,200:12,179:293:1/1:.:HOMVAR

ref: CCTTTTTTTTTTTTTTTAAAAAAAAAAACACATCTAAT starting at 57777378

hap.py output 20 57777379 . C CT 0 . BS=57777379;Regions=CONF,TS_contained GT:BD:BK:BI:BVT:BLT:QQ ./.:.:.:.:NOCALL:nocall:. 0/1:FP:lm:i1_5:INDEL:het:0 20 57777394 . TA T 0 . BS=57777379;Regions=CONF,TS_contained GT:BD:BK:BI:BVT:BLT:QQ ./.:.:.:.:NOCALL:nocall:. 0/1:FP:lm:d1_5:INDEL:het:0 20 57777395 . A T 50 . BS=57777379;Regions=CONF,TS_contained GT:BD:BK:BI:BVT:BLT:QQ 1/1:FN:am:tv:SNP:homalt:. 0/1:FP:am:tv:SNP:het:0

I know this representation isn't ideal, but what is happening here is:

-- query has 0/1 for A/T @ 57777395, which is 50% of the right answer for the truth hom-alt. -- query has a het C=>CT followed by TA=>T, which in effect replaces one of As in the run of Ts with one more T. This has the same effect on the haplotype as the A => T SNP, and since both are het this should produce two matching haplotypes.

My view is that happy should emit no FPs and no FNs here, as these two representations are in fact identical. Seems like a bug to me.

Here's how I run hap.py:

docker run -it -v "${ROOT}:${ROOT}" -v "${OUTPUT_DIR}:${OUTPUT_DIR}" pkrusche/hap.py /opt/hap.py/bin/hap.py "${TRUTH_VCF}" "${NEW_LABELER_VCF}" --preprocess-truth --no-adjust-conf-regions -f "${TRUTH_BED}" -r "${REF%.gz}" -o "${OUTPUT_DIR}/new_labeler.happy.output" -l ${REGION}

##fileformat=VCFv4.1 ##fileDate=20180203 ##source=RTG Tools 3.8.4 / Core ad642ff (2017-09-08) ##CL=vcfeval -t /rtgshare/data/human/ref/1000g_v37_phase2/sdf -b hap-38.vcf.gz -c hap-38.vcf.gz --sample HG002,DV --region 20 -o hap-38 -m ga4gh --ref-overlap ##INFO=<ID=BS,Number=.,Type=Integer,Description="Benchmarking superlocus ID for these variants"> ##INFO=<ID=CALL_WEIGHT,Number=1,Type=Float,Description="Call weight (equivalent number of truth variants). When unspecified, assume 1.0"> ##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype"> ##FORMAT=<ID=BD,Number=1,Type=String,Description="Decision for call (TP/FP/FN/N)"> ##FORMAT=<ID=BK,Number=1,Type=String,Description="Sub-type for decision (match/mismatch type). (Loose match distance is 30)"> ##FORMAT=<ID=BI,Number=1,Type=String,Description="Additional comparison information"> ##FORMAT=<ID=QQ,Number=1,Type=Float,Description="Variant quality for ROC creation"> #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT TRUTH QUERY 20 57777379 . C CT . . BS=57777380 GT:BD:BK .:N 0/1:TP:gm 20 57777394 . TA T . . BS=57777380 GT:BD:BK .:N 0/1:TP:gm 20 57777395 . A T . . BS=57777380 GT:BD:BK 1|1:TP:gm 0/1:TP:gm

Illumina / hap.py

Potential bug with hom-alt truth and multiple het candidate representation #38