Illumina / hap.py

Haplotype VCF comparison tools
Other
402 stars 122 forks source link

VCF format issues with --write-vcf, - FORMAT field inconsistencies #175

Open insilicool opened 1 year ago

insilicool commented 1 year ago

Hi,

I've been attempting to parse the vcfs generated by the --write-vcf option into R and encountering issues. I believe this has to do with the inconsistency in the number of fields in the FORMAT section.

e.g. with the CMRG truth set

module load mugqic_dev/python/2.7.13 && \ 0.3.15/hap.py \ --threads 3 --write-vcf --decompose --preprocess-truth \ --engine vcfeval \ HG002_GRCh37_CMRG_smallvar_v1.00.vcf.gz \ HG002_GRCh37_CMRG_smallvar_v1.00.vcf.gz \ -f HG002_GRCh37_CMRG_smallvar_v1.00.bed \ -r Homo_sapiens.hs37d5.fa \ --engine-vcfeval-template Homo_sapiens.hs37d5.SDF \ -o comparisons_0.3.15/HG002-cmrg-happy_0.3.15

This results in variants looking like this: bcftools query -f '%CHROM\t%POS\t%FORMAT\n' HG002-cmrg-happy_0.3.15.vcf.gz

Problematic calls: 21 47407748 GT:BD:BI:BVT:BLT:QQ 0/1:N:ti:SNP:het:. 0/1:N:ti:SNP:het:0 21 47407754 GT:BD:BI:BVT:BLT:QQ 0/1:N:i1_5:INDEL:het:. 0/1:N:i1_5:INDEL:het:0

Correctly formatted: 21 47408099 GT:BD:BK:QQ:BI:BVT:BLT 1/1:TP:gm:30:tv:SNP:homalt 1/1:TP:gm:30:tv:SNP:homalt 21 47408101 GT:BD:BK:QQ:BI:BVT:BLT 0/1:TP:gm:30:i16_plus:INDEL:het 0/1:TP:gm:30:i16_plus:INDEL:het

As you can see BK field is missing and QQ field is in a different location. Additionally, I don't see documentation explaining what a BD ="N" represents.

I first observed this in 0.3.12 but it persists in 0.3.15 as well.

Thanks for any help and insight you can provide.

YuanTian1991 commented 9 months ago

My R code to parse the output:


library("data.table")
library("tidyverse")
library("reshape2")

message("Reading VCF...")
happy_vcf <- fread(happy_vcf_path, sep='\t', header = TRUE, skip = '#CHROM') %>%
    .[,INFO:=NULL]

  format_types <- unique(happy_vcf$FORMAT)

  truth <- list()
  query <- list()
  for(i in format_types) {
    message("Parsing ", i, "...")
    index <- which(happy_vcf$FORMAT == i)
    col_names <- str_split(i, ":")[[1]]

    truth[[i]] <- colsplit(happy_vcf[index, TRUTH], ":", names=col_names) %>% 
      setDT %>% .[,index:=index]

    query[[i]] <- colsplit(happy_vcf[index, QUERY], ":", names=col_names) %>% 
      setDT %>% .[,index:=index]
  }

  truth <- rbindlist(truth, fill = TRUE) %>% .[order(index), ]
  query <- rbindlist(query, fill = TRUE) %>% .[order(index), ]