EBIvariation / vcf-validator

Validation suite for Variant Call Format (VCF) files, implemented using C++11
Apache License 2.0

Star character in vcf headers #211

Closed V-Catherine closed 1 year ago

V-Catherine commented 3 years ago

Hello, following a discussion we had with Sigve Nakken on VCF validation (cf. https://github.com/sigven/pcgr/issues/124), I would like to suggest allowing the star (*) character in VCF headers, since it appears by default in HLA contig names when working with hg38. Here is an example of the header lines found in VCF files generated by Mutect2:

contig=

Thanks, Best regards, Catherine

ttbek commented 2 years ago

The VCFv4.3 spec states the following:


"Contig names follow the same rules as the SAM format’s reference sequence names: they may contain any printable
ASCII characters in the range [!-~] apart from ‘\ , "‘’ () [] {} <>’ and may not start with ‘*’ or ‘=’. Thus they
match the following regular expression:
[0-9A-Za-z!#$%&+./:;?@^_|~-][0-9A-Za-z!#$%&*+./:;=?@^_|~-]*
In particular, excluding commas facilitates parsing ##contig lines, and excluding the characters ‘<>[]’ and initial ‘*’
avoids clashes with symbolic alleles. The contig names must not use a reserved symbolic allele name."

In other words, "HLA-DRB1*15:01:01:04" should be a valid ID: it contains a '*', but it does not start with one. I'm a bit surprised that the validator has this bug, as the spec gives the regular expression to validate against right there. Maybe this was different in older versions of the spec? Or maybe the validator uses something other than regular expression matching, to try to be more efficient.
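To make the point concrete, here is a minimal Python sketch that applies the regular expression quoted from the spec above to a few names. The function name `is_valid_contig_name` is made up for this example; this is not how vcf-validator itself is implemented, only a quick way to check the rule by hand.

```python
import re

# Regular expression for contig names, copied from the VCFv4.3 spec text quoted above.
# The first character class omits '*' and '='; the second one allows them.
CONTIG_NAME_RE = re.compile(
    r"[0-9A-Za-z!#$%&+./:;?@^_|~-][0-9A-Za-z!#$%&*+./:;=?@^_|~-]*"
)


def is_valid_contig_name(name: str) -> bool:
    """Return True if the whole name matches the VCFv4.3 contig-name pattern.

    Hypothetical helper for illustration; not part of vcf-validator.
    """
    return CONTIG_NAME_RE.fullmatch(name) is not None


if __name__ == "__main__":
    for candidate in ["HLA-DRB1*15:01:01:04", "*HLA", "=chr1", "chr1,chr2"]:
        print(candidate, is_valid_contig_name(candidate))
```

Under this pattern a '*' in the middle of the name is accepted, while a leading '*' or '=' and any comma are rejected.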

I'm not very familiar with vcf-validator. I was just browsing to see if there was a good way to validate BCF files and came across this in the search results, and I habitually check outstanding issues on code before using it.

The VCFv4.3 spec can be found here: https://samtools.github.io/hts-specs/VCFv4.3.pdf

ttbek commented 2 years ago

You might want to check the version of your VCF file. As I mentioned, the rule may be different in older versions of the spec, as was apparently the case for '_' (see https://github.com/EBIvariation/vcf-validator/issues/207). That underscores aren't allowed in v4.2 FORMAT IDs is a bit subtle: you need to catch the mid-paragraph sentence "First a FORMAT field is given specifying the data types and order (colon-separated alphanumeric String)." and mentally register that 'alphanumeric' excludes the underscore. My statements that this ID is valid are contingent on the use of the current version, v4.3. The version should be the very first line of the header.
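If it helps, a small Python sketch for reading the version from that first header line could look like the following. It assumes the line has the usual `##fileformat=VCFv4.x` form; the helper name `vcf_version` is invented for this example and is not part of any existing tool.

```python
import re
from typing import Optional

# Assumed shape of the first header line, e.g. "##fileformat=VCFv4.3".
FILEFORMAT_RE = re.compile(r"##fileformat=VCFv(\d+\.\d+)")


def vcf_version(first_line: str) -> Optional[str]:
    """Return the version string (e.g. "4.3") from the first header line, or None.

    Hypothetical helper for illustration only.
    """
    match = FILEFORMAT_RE.fullmatch(first_line.strip())
    return match.group(1) if match else None


if __name__ == "__main__":
    print(vcf_version("##fileformat=VCFv4.3\n"))
    print(vcf_version("##contig=<ID=chr1>"))
```

A result other than "4.3" would be a hint to re-read the naming rules in the matching version of the spec before trusting what I said above.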