Illumina / hap.py

Haplotype VCF comparison tools
Other
420 stars 123 forks source link

Potential bug with hom-alt truth and multiple het candidate representation #38

Open depristo opened 6 years ago

depristo commented 6 years ago

Hi hap.py authors:

I think I may have run into a bug in happy, but I'm not 100% certain. Here's the example:

query: 20 57777379 . C CT 0 . . GT 0/1 20 57777394 . TA T 0 . . GT 0/1 20 57777395 . A T 0 . . GT 0/1

truth: 20 57777395 rs3899824 A T 50 PASS . GT:DP:ADALL:AD:GQ:IGT:I\ PS:PS 1|1:639:13,200:12,179:293:1/1:.:HOMVAR

ref: CCTTTTTTTTTTTTTTTAAAAAAAAAAACACATCTAAT starting at 57777378

hap.py output 20 57777379 . C CT 0 . BS=57777379;Regions=CONF,TS_contained GT:BD:BK:BI:BVT:BLT:QQ ./.:.:.:.:NOCALL:nocall:. 0/1:FP:lm:i1_5:INDEL:het:0 20 57777394 . TA T 0 . BS=57777379;Regions=CONF,TS_contained GT:BD:BK:BI:BVT:BLT:QQ ./.:.:.:.:NOCALL:nocall:. 0/1:FP:lm:d1_5:INDEL:het:0 20 57777395 . A T 50 . BS=57777379;Regions=CONF,TS_contained GT:BD:BK:BI:BVT:BLT:QQ 1/1:FN:am:tv:SNP:homalt:. 0/1:FP:am:tv:SNP:het:0

I know this representation isn't ideal, but what is happening here is:

-- query has 0/1 for A/T @ 57777395, which is 50% of the right answer for the truth hom-alt. -- query has a het C=>CT followed by TA=>T, which in effect replaces one of As in the run of Ts with one more T. This has the same effect on the haplotype as the A => T SNP, and since both are het this should produce two matching haplotypes.

My view is that happy should emit no FPs and no FNs here, as these two representations are in fact identical. Seems like a bug to me.

Here's how I run hap.py:

docker run -it -v "${ROOT}:${ROOT}" -v "${OUTPUT_DIR}:${OUTPUT_DIR}" pkrusche/hap.py /opt/hap.py/bin/hap.py "${TRUTH_VCF}" "${NEW_LABELER_VCF}" --preprocess-truth --no-adjust-conf-regions -f "${TRUTH_BED}" -r "${REF%.gz}" -o "${OUTPUT_DIR}/new_labeler.happy.output" -l ${REGION}

Lenbok commented 6 years ago

Try using the vcfeval engine. It will find the match as long as --ref-overlap is used (which I believe that hap.py does).

##fileformat=VCFv4.1
##fileDate=20180203
##source=RTG Tools 3.8.4 / Core ad642ff (2017-09-08)
##CL=vcfeval -t /rtgshare/data/human/ref/1000g_v37_phase2/sdf -b hap-38.vcf.gz -c hap-38.vcf.gz --sample HG002,DV --region 20 -o hap-38 -m ga4gh --ref-overlap
##INFO=<ID=BS,Number=.,Type=Integer,Description="Benchmarking superlocus ID for these variants">
##INFO=<ID=CALL_WEIGHT,Number=1,Type=Float,Description="Call weight (equivalent number of truth variants). When unspecified, assume 1.0">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=BD,Number=1,Type=String,Description="Decision for call (TP/FP/FN/N)">
##FORMAT=<ID=BK,Number=1,Type=String,Description="Sub-type for decision (match/mismatch type). (Loose match distance is 30)">
##FORMAT=<ID=BI,Number=1,Type=String,Description="Additional comparison information">
##FORMAT=<ID=QQ,Number=1,Type=Float,Description="Variant quality for ROC creation">
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  TRUTH   QUERY
20      57777379        .       C       CT      .       .       BS=57777380     GT:BD:BK        .:N     0/1:TP:gm
20      57777394        .       TA      T       .       .       BS=57777380     GT:BD:BK        .:N     0/1:TP:gm
20      57777395        .       A       T       .       .       BS=57777380     GT:BD:BK        1|1:TP:gm       0/1:TP:gm