Benchmark new GATK germline sub-workflow for accuracy and precision

Results: Sample HG002

Benchmarking command:

# Extract sample HG002 from the multi-sample VCF: hap.py does not
# support multi-sample VCFs, and our truth set is for HG002.
module purge
module load bcftools
bcftools view \
    -s HG002 \
    snps_and_indels_recal_refinement_variants.vcf.gz \
    -o HG002.snps_indels.haplotypecaller.recal.refined.vcf.gz \
    -Oz

# Run hap.py to evaluate the performance of the new
# sub-workflow and HaplotypeCaller against the truth set.
module purge
module load singularity
singularity run -B "$PWD" /path/to/hap.py_latest.sif \
    /opt/hap.py/bin/hap.py \
        --threads 12 \
        -o HG002_benchmarking_results \
        -r Homo_sapiens_assembly38.fasta \
        -f truthset/HG002_GRCh38_1_22_v4.1_draft_benchmark.bed \
        truthset/HG002_GRCh38_1_22_v4.1_draft_benchmark.vcf.gz \
        HG002.snps_indels.haplotypecaller.recal.refined.vcf.gz

Benchmarking summary from hap.py:

 Type Filter  TRUTH.TOTAL  TRUTH.TP  TRUTH.FN  QUERY.TOTAL  QUERY.FP  QUERY.UNK  FP.gt  METRIC.Recall  METRIC.Precision  METRIC.Frac_NA  METRIC.F1_Score  TRUTH.TOTAL.TiTv_ratio  QUERY.TOTAL.TiTv_ratio  TRUTH.TOTAL.het_hom_ratio  QUERY.TOTAL.het_hom_ratio
INDEL    ALL       526124    521041      5083       958334      5739     408419   1302       0.990339          0.989564        0.426176         0.989951                     NaN                     NaN                   1.528212                   2.024520
INDEL   PASS       526124    520564      5560       944090      4666     395693   1285       0.989432          0.991492        0.419126         0.990461                     NaN                     NaN                   1.528212                   1.986641
  SNP    ALL      3365379   3339562     25817      3979159     37213     600838   4292       0.992329          0.988985        0.150996         0.990654                2.099711                1.952641                   1.580978                   1.745183
  SNP   PASS      3365379   3233487    131892      3521165      6736     279410    996       0.960809          0.997922        0.079352         0.979014                2.099711                2.053189                   1.580978                   1.618381
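
As a sanity check, the metric columns above can be recomputed from the raw counts. The sketch below is not part of the workflow: `recompute_metrics` is a hypothetical helper name, and it assumes a comma-separated summary file with the same column names as the table (for a real run this would presumably be `HG002_benchmarking_results.summary.csv`). It also assumes precision is taken over assessed query calls only, i.e. (QUERY.TOTAL - QUERY.UNK - QUERY.FP) / (QUERY.TOTAL - QUERY.UNK), which is the formula that reproduces the values reported above.

```shell
# Recompute recall, precision and F1 from the raw counts in a
# hap.py-style summary CSV. Columns are looked up by header name,
# so extra columns or a different column order do not matter.
# NOTE: recompute_metrics is a hypothetical helper, not a hap.py tool.
recompute_metrics() {
    awk -F',' '
        NR == 1 {
            # Map each header name to its column index.
            for (i = 1; i <= NF; i++) col[$i] = i
            next
        }
        {
            tp  = $(col["TRUTH.TP"])
            fn  = $(col["TRUTH.FN"])
            fp  = $(col["QUERY.FP"])
            unk = $(col["QUERY.UNK"])
            tot = $(col["QUERY.TOTAL"])

            # Query calls outside the confident regions (UNK) are not assessed.
            assessed = tot - unk
            if (tp + fn == 0 || assessed == 0) next

            recall    = tp / (tp + fn)
            precision = (assessed - fp) / assessed
            if (recall + precision == 0) next
            f1 = 2 * recall * precision / (recall + precision)

            printf "%s\t%s\t%.6f\t%.6f\t%.6f\n", \
                $(col["Type"]), $(col["Filter"]), recall, precision, f1
        }
    ' "$1"
}

# Demo input: two rows copied from the summary table above.
demo_csv="$(mktemp)"
cat > "$demo_csv" <<'EOF'
Type,Filter,TRUTH.TOTAL,TRUTH.TP,TRUTH.FN,QUERY.TOTAL,QUERY.FP,QUERY.UNK
INDEL,ALL,526124,521041,5083,958334,5739,408419
SNP,PASS,3365379,3233487,131892,3521165,6736,279410
EOF

# Prints: Type, Filter, recall, precision, F1 (tab-separated).
recompute_metrics "$demo_csv"
rm -f "$demo_csv"
```

For the two demo rows, the recomputed values agree with the METRIC.Recall, METRIC.Precision and METRIC.F1_Score columns in the table to six decimal places.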

OpenOmics / genome-seek, issue #21