Open depristo opened 6 years ago
Try using the vcfeval engine. It will find the match as long as --ref-overlap is used (which I believe that hap.py does).
##fileformat=VCFv4.1
##fileDate=20180203
##source=RTG Tools 3.8.4 / Core ad642ff (2017-09-08)
##CL=vcfeval -t /rtgshare/data/human/ref/1000g_v37_phase2/sdf -b hap-38.vcf.gz -c hap-38.vcf.gz --sample HG002,DV --region 20 -o hap-38 -m ga4gh --ref-overlap
##INFO=<ID=BS,Number=.,Type=Integer,Description="Benchmarking superlocus ID for these variants">
##INFO=<ID=CALL_WEIGHT,Number=1,Type=Float,Description="Call weight (equivalent number of truth variants). When unspecified, assume 1.0">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=BD,Number=1,Type=String,Description="Decision for call (TP/FP/FN/N)">
##FORMAT=<ID=BK,Number=1,Type=String,Description="Sub-type for decision (match/mismatch type). (Loose match distance is 30)">
##FORMAT=<ID=BI,Number=1,Type=String,Description="Additional comparison information">
##FORMAT=<ID=QQ,Number=1,Type=Float,Description="Variant quality for ROC creation">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT TRUTH QUERY
20 57777379 . C CT . . BS=57777380 GT:BD:BK .:N 0/1:TP:gm
20 57777394 . TA T . . BS=57777380 GT:BD:BK .:N 0/1:TP:gm
20 57777395 . A T . . BS=57777380 GT:BD:BK 1|1:TP:gm 0/1:TP:gm
Hi hap.py authors:
I think I may have run into a bug in happy, but I'm not 100% certain. Here's the example:
query: 20 57777379 . C CT 0 . . GT 0/1 20 57777394 . TA T 0 . . GT 0/1 20 57777395 . A T 0 . . GT 0/1
truth: 20 57777395 rs3899824 A T 50 PASS . GT:DP:ADALL:AD:GQ:IGT:I\ PS:PS 1|1:639:13,200:12,179:293:1/1:.:HOMVAR
ref: CCTTTTTTTTTTTTTTTAAAAAAAAAAACACATCTAAT starting at 57777378
hap.py output 20 57777379 . C CT 0 . BS=57777379;Regions=CONF,TS_contained GT:BD:BK:BI:BVT:BLT:QQ ./.:.:.:.:NOCALL:nocall:. 0/1:FP:lm:i1_5:INDEL:het:0 20 57777394 . TA T 0 . BS=57777379;Regions=CONF,TS_contained GT:BD:BK:BI:BVT:BLT:QQ ./.:.:.:.:NOCALL:nocall:. 0/1:FP:lm:d1_5:INDEL:het:0 20 57777395 . A T 50 . BS=57777379;Regions=CONF,TS_contained GT:BD:BK:BI:BVT:BLT:QQ 1/1:FN:am:tv:SNP:homalt:. 0/1:FP:am:tv:SNP:het:0
I know this representation isn't ideal, but what is happening here is:
-- query has 0/1 for A/T @ 57777395, which is 50% of the right answer for the truth hom-alt. -- query has a het C=>CT followed by TA=>T, which in effect replaces one of As in the run of Ts with one more T. This has the same effect on the haplotype as the A => T SNP, and since both are het this should produce two matching haplotypes.
My view is that happy should emit no FPs and no FNs here, as these two representations are in fact identical. Seems like a bug to me.
Here's how I run hap.py:
docker run -it -v "${ROOT}:${ROOT}" -v "${OUTPUT_DIR}:${OUTPUT_DIR}" pkrusche/hap.py /opt/hap.py/bin/hap.py "${TRUTH_VCF}" "${NEW_LABELER_VCF}" --preprocess-truth --no-adjust-conf-regions -f "${TRUTH_BED}" -r "${REF%.gz}" -o "${OUTPUT_DIR}/new_labeler.happy.output" -l ${REGION}