Closed andrewelamb closed 4 years ago
Hum, interesting. This: ###contig=<...> would define a custom property named #contig. The spec doesn't actually stop you from doing that, so it would be correct.
We could issue a warning, but that would flag potentially legitimate usages. For instance, #researchers would be ugly in my opinion, but valid nonetheless.
Our team will discuss whether it's worth issuing a warning.
Please see the second comment here:
https://github.com/googlegenomics/gcp-variant-transforms/issues/119
I was told headers like:
Were not spec.
Sadly, the VCF spec has several cases like this, where the reasonable thing is meant but not written explicitly. This leads to different behaviours in different tools, depending on what "reasonable" means to each organization. I may be wrong, but as far as I know googlegenomics is not involved in the development of the spec.
You can verify for yourself that the spec places no restriction on which characters a metadata key can contain. It only mentions that "a metadata line is prefixed by "##" and is in the form of key=value". So only the character "=" would be problematic in a metadata key. https://samtools.github.io/hts-specs/VCFv4.3.pdf
I'm going to create a ticket in the VCF spec repository to confirm that they wanted to forbid 3 #'s, and to suggest stating that more clearly in the spec.
I didn't mention this, but, of course, with lines such as "###FORMAT=..." you would be creating a custom key called "#FORMAT", and whatever you define there, no tool will interpret it as "FORMAT" info. The spec doesn't stop you from doing that; it only says what information tools can read from standard keys such as "FORMAT".
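To make the point concrete, here is a minimal sketch of a literal reading of the rule quoted above (strip the "##" prefix, then split at the first "="). The function name `parse_meta_line` is hypothetical and is not taken from any existing validator or library; it only illustrates why a third "#" ends up inside the key.

```python
def parse_meta_line(line):
    """Split a '##key=value' metadata line into (key, value).

    Hypothetical helper: a literal reading of "prefixed by '##' and
    in the form of key=value", with no restriction on key characters.
    """
    if not line.startswith("##"):
        raise ValueError("not a metadata line: missing '##' prefix")
    body = line[2:].rstrip("\n")
    # Split at the first '=' only; the value may itself contain '='.
    key, sep, value = body.partition("=")
    if not sep:
        raise ValueError("metadata line has no '='")
    return key, value


# With two #'s the key is the standard one; with three, the extra '#'
# becomes part of the key, so tools looking for "contig" will not see it.
print(parse_meta_line("##contig=<ID=1>"))
print(parse_meta_line("###contig=<ID=1>"))
```

Under this reading the second call yields the key "#contig", which is a custom key rather than a malformed line.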
I see, thanks for the clarification!
Will fix in hts-specs
I had some VCFs that passed this validator but went on to cause errors in other programs. It turns out the issue was that some of the headers had three #'s instead of the two the spec calls for. It would be great if this validator checked for that.
Thanks!
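For anyone who wants to pre-screen files in the meantime, the check requested here could be sketched roughly as follows. This is not how the validator is implemented; `find_triple_hash_lines` is a hypothetical name, and the sketch assumes the header is the leading run of lines that start with "#".

```python
def find_triple_hash_lines(lines):
    """Return (line_number, text) for header lines starting with '###'.

    Hypothetical pre-check: scans only the leading block of lines that
    begin with '#', which is assumed here to be the VCF header.
    """
    hits = []
    for number, line in enumerate(lines, start=1):
        if not line.startswith("#"):
            break  # first data line reached; the header is over
        if line.startswith("###"):
            hits.append((number, line.rstrip("\n")))
    return hits


example = [
    "##fileformat=VCFv4.3\n",
    "###contig=<ID=1>\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "1\t100\t.\tA\tT\t.\t.\t.\n",
]
for number, text in find_triple_hash_lines(example):
    print(f"warning: line {number} starts with three #'s: {text}")
```

Reporting these lines as warnings rather than errors keeps custom keys such as "#contig" usable while still flagging the likely typo.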