brentp / vcfanno

annotate a VCF with other VCFs/BEDs/tabixed files
https://genomebiology.biomedcentral.com/articles/10.1186/s13059-016-0973-5
MIT License

Tolerant VCF parsing #156

Open drtconway opened 1 year ago

drtconway commented 1 year ago

G'day. Thanks for making a nice tool.

I'm trying to use vcfanno (0.3.5, linux binary) with a large combined VCF of gnomad v3.1. The combined bgzipped file is ~2TB, so obviously manipulating it is inconvenient at best.

vcfanno is aborting on two header lines (I don't know whether they are standard in the gnomAD downloads):

$ ./vcfanno_linux64 config.toml x.vcf.gz > y.vcf

=============================================
vcfanno version 0.3.5 [built with go1.19.3]

see: https://github.com/brentp/vcfanno
=============================================
vcfanno.go:116: found 6 sources from 1 files
vcfanno.go:146: using 2 worker threads to decompress bgzip file
api.go:796: header error in extra field: VEP version: v101. [line: 914]
header error in extra field: dbSNP version: b154. [line: 915]
$

The offending lines in the gnomad VCF are:

##VEP version: v101
##dbSNP version: b154

For reference, the config.toml I am using is:

[[annotation]]
file="/hpc/genomeref/hg38/annotation/gnomad/gnomad.genomes.v3.1.sites.combined.vcf.bgz"
# ID and FILTER are special fields that pull the ID and FILTER columns from the VCF
fields = [ "ID", "FILTER", "AC", "AN", "AF", "popmax" ]
ops    = [ "self", "self", "self", "self", "self", "self" ]
names  = [ "gnomad_ID", "gnomad_FILTER", "gnomad_AC", "gnomad_AN", "gnomad_AF", "gnomad_popmax" ]

[[postannotation]]
fields=["ANN"]
op="delete"

Any chance that the VCF parsing could be made a bit more tolerant of non-conformant header lines? It would be pretty painful to have to modify the gnomAD VCF.

Tom.

brentp commented 1 year ago

Hi, can you show me where to find this? I looked in this one: https://storage.googleapis.com/gcp-public-data--gnomad/release/3.1.2/vcf/genomes/gnomad.genomes.v3.1.2.sites.chrY.vcf.bgz and don't see those values

brentp commented 1 year ago

The spec: https://samtools.github.io/hts-specs/VCFv4.2.pdf says:

1.2 Meta-information lines
File meta-information is included after the ## string and must be key=value pairs.

So, I get that you want more lenient parsing, but do other parsers handle this? And is it something added at your institution? Or from the original gnomad files?
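For illustration, the key=value rule quoted from the spec can be sketched in a few lines of Python. This is only an assumed reading of the rule (a non-empty key, an "=", a non-empty value), not vcfanno's actual parser; the names META_LINE and parse_meta_line are hypothetical.

```python
import re

# Assumed reading of the spec's "##key=value" rule: a non-empty key that
# contains no '=', then '=', then a non-empty value.
META_LINE = re.compile(r"^##([^=]+)=(.+)$")


def parse_meta_line(line):
    """Return (key, value) for a conformant meta line, else None."""
    match = META_LINE.match(line.rstrip("\r\n"))
    if match is None:
        return None
    return match.group(1), match.group(2)


# The first line is a standard header; the other two are the lines reported
# in this issue, which contain no '=' at all.
for line in ["##fileformat=VCFv4.2", "##VEP version: v101", "##dbSNP version: b154"]:
    print(line, "->", parse_meta_line(line))
```

Under this reading, the two reported lines are rejected simply because they have no "=" separator.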

drtconway commented 1 year ago

Thanks for the fast response!

Yeah, I get that it's non-conformant. I like standards, and I think they are important, so I am sympathetic to the "your data is drunk. Come back when it's sober!" argument.

Especially when it comes to the metadata, I think there are two kinds of non-conformance. In some cases the non-conformance leads to a situation where the program can't figure out how to produce correct output. In other cases the problem is essentially cosmetic and is orthogonal to the production of correct output.

I am pretty sure the file is derived from the individual chromosome files by running VEP and concatenating them, but I don't know the precise provenance. I'm still trying to find out.

The Python and Rust libraries I use (and the C++ I've written) ignore non-conformant meta lines with a warning when reading, but scrupulously make sure they only emit conformant data.
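To make the "warn and skip" behaviour concrete, here is a minimal Python sketch of the idea. It is not taken from any of the libraries mentioned; the function name tolerant_header and the exact conformance check (non-empty key, "=", non-empty value) are assumptions made for illustration.

```python
import warnings


def tolerant_header(lines):
    """Yield VCF lines, dropping malformed '##' meta lines with a warning.

    Lines that do not start with '##' (the '#CHROM' line and the records)
    pass through unchanged, so only conformant meta lines are emitted.
    """
    for number, line in enumerate(lines, start=1):
        text = line.rstrip("\r\n")
        if text.startswith("##"):
            # Split at the first '='; all three parts must be non-empty.
            key, sep, value = text[2:].partition("=")
            if not (key and sep and value):
                warnings.warn(f"skipping non-conformant meta line {number}: {text}")
                continue
        yield line
```

Used as a filter, e.g. `for line in tolerant_header(open_vcf_lines)`, where open_vcf_lines is any iterable of text lines, this drops the two "##VEP version" and "##dbSNP version" lines and leaves everything else untouched.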