EBIvariation / vcf-validator

Validation suite for Variant Call Format (VCF) files, implemented using C++11
Apache License 2.0

headers with number of # characters other than 2 #118

Closed andrewelamb closed 4 years ago

andrewelamb commented 6 years ago

I had some VCFs that passed this validator but went on to cause errors in other programs. It turns out the issue was that some of the headers had three #'s instead of the two the spec calls for. It would be great if this validator checked for that.

Thanks!

jmmut commented 6 years ago

Hum, interesting. This: ###contig=<...> would define a custom property named #contig. The spec actually doesn't stop you from doing that, so it would actually be correct.

We could issue a warning, but that would flag potentially legitimate usages. For instance, a key named #researchers would be ugly in my opinion, but valid nonetheless.

Our team will discuss whether it's worth issuing a warning.

andrewelamb commented 6 years ago

Please see the second comment here:

https://github.com/googlegenomics/gcp-variant-transforms/issues/119

I was told that headers like:

###FORMAT=...

are not valid according to the spec.

jmmut commented 6 years ago

Sadly, the VCF spec has several cases like this, where the reasonable thing is meant but not written explicitly. This leads to different behaviours in different tools, depending on what "reasonable" means to each organization. I may be wrong, but as far as I know googlegenomics is not involved in the development of the spec.

You can verify for yourself that the spec places no restriction on which characters a metadata key can contain. It only says that "a metadata line is prefixed by "##" and is in the form of key=value". So the only character that would be problematic in a metadata key is "=". https://samtools.github.io/hts-specs/VCFv4.3.pdf
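A minimal sketch of that literal reading, for illustration only. This is not the validator's actual code, and `split_meta_line` is a hypothetical helper: it strips the "##" prefix and splits the rest at the first "=".

```python
def split_meta_line(line):
    """Split a VCF meta-information line into (key, value).

    Literal reading of the spec: the line starts with "##" and the
    remainder has the form key=value, split at the first "=".
    """
    if not line.startswith("##"):
        raise ValueError("not a meta-information line")
    body = line[2:].rstrip("\n")
    key, sep, value = body.partition("=")
    if not sep or not key:
        raise ValueError("meta-information line is not key=value")
    return key, value


print(split_meta_line("##contig=<ID=1>"))   # ('contig', '<ID=1>')
print(split_meta_line("###contig=<ID=1>"))  # ('#contig', '<ID=1>')
```

Under this reading the third "#" simply becomes the first character of the key, so the line parses without error.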

I'm going to create a ticket in the VCF spec repository to confirm that they meant to forbid 3 #'s, and to suggest stating that more clearly in the spec.

jmmut commented 6 years ago

I didn't mention this, but, of course, with lines such as "###FORMAT=..." you would be creating a custom key called "#FORMAT", and whatever you define there, no tool will interpret it as "FORMAT" info. The spec doesn't stop you from doing that; it only says what information tools can read from standard keys such as "FORMAT".
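A rough sketch of what that means for a tool that looks up definitions by key, plus the kind of optional warning discussed above. The function name `scan_header`, the warning text, and the sample header lines are all made up for illustration; none of this is taken from vcf-validator.

```python
def scan_header(lines):
    """Return (format_values, warnings) for a list of VCF header lines."""
    format_values = []
    warnings = []
    for number, line in enumerate(lines, start=1):
        if not line.startswith("##"):
            continue  # column header line (#CHROM ...) or a data line
        key, _, value = line[2:].partition("=")
        if key == "FORMAT":
            format_values.append(value)
        elif key.startswith("#"):
            # Legal under a literal reading of the spec, but likely a typo.
            warnings.append("line %d: key %r starts with '#'" % (number, key))
    return format_values, warnings


header = [
    "##fileformat=VCFv4.3",
    '##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">',
    '###FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read depth">',
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1",
]
formats, warnings = scan_header(header)
print(formats)   # only the GT definition is picked up
print(warnings)  # the "#FORMAT" key on line 3 is reported
```

The DP definition is silently dropped because its key is "#FORMAT", not "FORMAT", which would explain why downstream programs fail on a file that is still well-formed.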

andrewelamb commented 6 years ago

I see, thanks for the clarification!

tcezard commented 4 years ago

Will fix in hts-specs