VCF Comparison with same file does not yield perfect result

Illumina / hap.py

Haplotype VCF comparison tools

Other

402 stars 122 forks source link

VCF Comparison with same file does not yield perfect result #143

Open itsroops opened 3 years ago

itsroops commented 3 years ago

I am currently doing benchmarking of the VCF files with the gold standard file HG001_GRCh38_GIAB_highconf_CG-IllFB-IllGATKHC-Ion-10X-SOLID_CHROM1-X_v.3.3.2_highconf_PGandRTGphasetransfer.vcf. While I am giving both the query as well as the truth as the GIAB VCF file themselves, I am getting a perfect score of precision, recall, and f1. However, for any other random VCF files which are compared with themselves, I am not getting a perfect score rather am getting around 90%.

Is this normal behaviour or a there a different reason? Is the gold standard file hardcoded in the hap.py?

serge2016 commented 3 years ago

Same here

Lenbok commented 3 years ago

That usually means the vcf in question is not self-consistent, e.g. not all of the asserted variants can be simultaneously possible in a diploid genome. In a few cases it can be complex situations that vcfeval might resolve that xcmp misses. Are you using xcmp or the vcfeval engine?

You can look at the specific FP variants and they will be nearby other variants that occupy/overlap the same reference coordinates.

itsroops commented 3 years ago

Thank you for the answer. I am not using any other tool for benchmarking but only hap.py. One quick question: Do we need to preprocess the VCF files by running any python code before passing to hap.py ? I am just using hap.py truth.vcf query.vcf -f confident.bed -o output_prefix -r reference.fa command for comparison

TimD1 commented 2 years ago

Adding the option --engine vcfeval usually improves detection of trickier variants. If there are any partially overlapping calls, this might be able to detect them using a combination of this and --ref-overlap.

cwarden45 commented 11 months ago

I can reproduce this error, but it looks like it affects a somewhat small fraction of variants.

For example, I have attached the summary of analyzing the Genome In A Bottle (GIAB) v4.2.1 benchmark with itself (HG002_GRCh38_1_22_v4.2.1_benchmark.vcf.gz, from https://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/release/AshkenazimTrio/HG002_NA24385_son/latest/GRCh38/):

CALLABLE-GAIB_v4.2.1-vs-GAIB_v4.2.1.summary.csv

I generated that result with the following command:

singularity exec --bind $MOUNTIN:$MOUNTOUT hap.py_latest.sif /opt/hap.py/bin/hap.py $REF $TEST -f $BED -o $OUT -r $FA

This uses HG002_GRCh38_1_22_v4.2.1_callablemultinter_gt0.bed.gz to define the regions to test (even if I think _HG002_GRCh38_1_22_v4.2.1_benchmarknoinconsistent.bed.gz is often used).

wcarre commented 3 months ago

Hi, How do you use --ref-overlap ? It's not an option of hap.py. Is it in a config file ?