Parsing results VCF gives different counts of TRUTH FN than summary

Illumina / hap.py

Haplotype VCF comparison tools

Other

402 stars 122 forks source link

docker run -v /shared:/shared pkrusche/hap.py /opt/hap.py/bin/hap.py /shared/GIAB_orig/NA12878_GRCh38_1_22_v4.2.1_benchmark.vcf.gz /shared/1KG_reference/NA12878_thinned.vcf.gz --engine vcfeval --engine-vcfeval-template /shared/reference/GRCh38_full_analysis_set_plus_decoy_hla.sdf -o /shared/happy_jobs/GIAB_vs_1KG/results/1kg_vs_giab -r /shared/reference/GRCh38_full_analysis_set_plus_decoy_hla.fa -V

Benchmarking Summary: Type Filter TRUTH.TOTAL TRUTH.TP TRUTH.FN QUERY.TOTAL QUERY.FP QUERY.UNK FP.gt METRIC.Recall METRIC.Precision METRIC.Frac_NA METRIC.F1_Score TRUTH.TOTAL.TiTv_ratio QUERY.TOTAL.TiTv_ratio TRUTH.TOTAL.het_hom_ratio QUERY.TOTAL.het_hom_ratio INDEL ALL 536504 62 536442 0 0 0 0 0.000116 0.000000 0 0.000000 NaN NaN 1.445053 NaN INDEL PASS 536504 62 536442 0 0 0 0 0.000116 0.000000 0 0.000000 NaN NaN 1.445053 NaN SNP ALL 3356592 3284296 72296 3523069 238742 0 2322 0.978461 0.932235 0 0.954789 2.105193 2.063462 1.525983 1.573469 SNP PASS 3356592 3284296 72296 3523069 238742 0 2322 0.978461 0.932235 0 0.954789 2.105193 2.063462 1.525983 1.573469

#!/usr/bin/env python from cyvcf2 import VCF import sys # Description="Decision for call (TP/FP/FN/N)" # name output files truth_fn=open("truth_fn.vcf", 'w') query_fp=open("query_fp.vcf", 'w') truth_tp=open("truth_tp.vcf", 'w') unexplained=open("unexplained.vcf", 'w') # loop over VCF file and get the FP and FN variants for variant in VCF(sys.argv[1]): # skip anything that isn't a SNP if 'SNP' not in variant.format('BVT'): continue # true positives elif variant.format('BD')[0] == "TP": print(variant, file = truth_tp, end = '') # query false positives elif variant.format('BD')[1] == "FP": k +=1 print(variant, file = query_fp, end = '') # truth false negatives elif variant.format('BD')[0] == "FN": print(variant, file = truth_fn, end = '') # anyting else else: print(variant, file = unexplained, end = '')

@BrianLohman I am also parsing the annotated VCF from hap.py because I found it confusing to have a position with FN in the TRUTH field but with FP in the QUERY field. I figured out how add up the counts from my parsing script to have my data match the STDOUT, but I'm still getting different values for QUERY.TOTAL and TRUTH.TOTAL which doesn't make sense to me. Perhaps you or someone else sees something I am missing?

For example, here is my STDOUT:

Benchmarking Summary:
Type Filter  TRUTH.TOTAL  TRUTH.TP  TRUTH.FN  QUERY.TOTAL  QUERY.FP  QUERY.UNK  FP.gt  FP.al  METRIC.Recall  METRIC.Precision  METRIC.Frac_NA  METRIC.F1_Score  TRUTH.TOTAL.TiTv_ratio  QUERY.TOTAL.TiTv_ratio  TRUTH.TOTAL.het_hom_ratio  QUERY.TOTAL.het_hom_ratio
INDEL    ALL       525370    519748      5622       543056      1386          0   1169    144       0.989299          0.997448             0.0         0.993357                     NaN                     NaN                   1.528472                   1.744304
INDEL   PASS       525370    519748      5622       543056      1386          0   1169    144       0.989299          0.997448             0.0         0.993357                     NaN                     NaN                   1.528472                   1.744304
  SNP    ALL      3365341   3337928     27413      3341865      2187          0   1188    245       0.991854          0.999346             0.0         0.995586                2.100079                2.099354                   1.581145                   1.573484
  SNP   PASS      3365341   3337928     27413      3341865      2187          0   1188    245       0.991854          0.999346             0.0         0.995586                2.100079                2.099354                   1.581145                   1.573484

SUM

TRUTH.TOTAL    TRUTH.TP    TRUTH.FN    QUERY.TOTAL    QUERY.FP
3890711        3857676        66035        3884921        3573

Rather than simply counting values in the TRUTH or QUERY column, I joined the column values with a _. As I read in the file, I added the TRUTH_QUERY combo as a key in a Python dictionary the first time and set the value = 0. And then any time after that I +1 to the dictionary value.

And here is what my parsing script returns:

TotalTruthLoci,4048333
._FP,1270
._N,0
._TP,24560
FN_.,30733
FN_FP,2296
FN_N,0
FN_TP,6
N_.,0
N_N,0
TP_.,887
TP_FP,7
TP_N,0
TP_TP,3856782
UNK_UNK,0
TotalTestLoci,3916541

If I then do the following:

TotalTP=(TP_FP + TP_. + TP_TP)
TotalFN=(FN_. + FN_FP + FN_TP)
TotalFP=(._FP + FN_FP + TP_FP)

My output matches the hap.py STDOUT:

TotalTP = 3857676
TotalFN = 30035
TotalFP = 3753

For the math to add up, I have to exclude all ._TP patterns, which mostly seem to be repeats of the same variant also labeled TP_.. The redudancy occurs with both SNPs and INDELs.

1       418660  .       .       NOCALL  nocall  .       TP      0/1     INDEL   het     d1_5
1       418660  TP      0/1     INDEL   het     d1_5    .       .       NOCALL  nocall  .

But there are also similar redundancies with FN_FP, so I'm not sure why these are not excluded either:

1       796491  .       .       NOCALL  nocall  .       FP      0/1     INDEL   het     d6_15
KEY: FN_._INDEL_NOCALL   VALUE:69926
1       796491  FN      0/1     INDEL   het     d6_15   .       .       NOCALL  nocall  .

Where my script currently breaks is that I end up with 31,620 more QUERY.TOTAL variants than the STDOUT. Even subtracting the 24,560 ._TP variants leaves me still with 7,060 more QUERY.TOTAL than STDOUT. I also have 157,622 more variants for TRUTH.TOTAL.

Let me know if you have any thoughts. My processing module is more complex than what you shared, as it's a customized intermediate step in a much larger pipeline. However, if you'd be interested in using it, I could try to make a portable helper script, once I know the math is correct.

Illumina / hap.py

Parsing results VCF gives different counts of TRUTH FN than summary #170