exomiser / Exomiser

A Tool to Annotate and Prioritize Exome Variants
https://exomiser.readthedocs.io
GNU Affero General Public License v3.0

Exomiser VCF output includes whitespace in INFO field, which is forbidden in VCF<4.3 #486

Closed ielis closed 1 year ago

ielis commented 1 year ago

Hi, I think there may be a bug in the VCF file produced by Exomiser.

Specifically, the EXOMISER_ACMG_DISEASE_NAME sub-field may hold a value such as "Presynaptic congenital myasthenic syndromes". Disease names frequently contain whitespace characters, which are not allowed in VCF<4.3.

Section 1.4.1 (8) of the VCF 4.2 specification forbids whitespace characters in INFO:

INFO - additional information: (String, no whitespace, semicolons, or equals-signs permitted; commas are permitted only as delimiters for lists of values) INFO fields are encoded as a semicolon-separated series of short keys with optional values in the format: <key>=<data>[,data]. ...

However, the restriction was apparently lifted in VCF 4.3:

INFO - additional information: Semicolon-separated series of additional information fields, or the MISSING value ‘.’ if none are present. ... Space characters are allowed in values.
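
To make the difference concrete, here is a minimal sketch of a check for the VCF 4.2 wording quoted above. The class and method names (`InfoValueCheck`, `isValidFor42`) are hypothetical and are not part of Exomiser or HTSJDK. The sketch tests one INFO value at a time and does not look at commas, which 4.2 permits only as list delimiters.

```java
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not an Exomiser or HTSJDK class.
final class InfoValueCheck {
    // VCF 4.2, section 1.4.1 (8): no whitespace, semicolons, or equals-signs.
    private static final Pattern FORBIDDEN_IN_42 = Pattern.compile("[\\s;=]");

    private InfoValueCheck() {}

    /** Returns true if a single INFO value contains none of the characters forbidden by VCF 4.2. */
    static boolean isValidFor42(String value) {
        return !FORBIDDEN_IN_42.matcher(value).find();
    }
}
```

Under this check, `isValidFor42("Presynaptic congenital myasthenic syndromes")` returns `false`, while the same name with underscores in place of spaces returns `true`.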

KevinDuringWork commented 1 year ago

Looks like it's an HTSJDK issue: https://github.com/samtools/htsjdk/blob/7719274fe370a51a24e6067de21bbe7e18c160a9/src/main/java/htsjdk/variant/vcf/AbstractVCFCodec.java#L515

julesjacobsen commented 1 year ago

@ielis Damn. I wanted to write out VCF 4.3 just to be able to do this, but HTSJDK will only do 4.2 and I forgot to add underscores back into the disease name...
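
For readers hitting the same problem, the underscore approach mentioned above could look roughly like the sketch below. This is an assumption about the general technique, not the change that was actually made in Exomiser; the class and method names (`InfoValueSanitizer`, `toVcf42Safe`) are hypothetical. It only handles whitespace, so a name containing `;` or `=` would still need separate treatment.

```java
// Hypothetical helper for illustration; not the actual Exomiser fix.
final class InfoValueSanitizer {
    private InfoValueSanitizer() {}

    /**
     * Trims the disease name and replaces each run of whitespace with a single
     * underscore, so the value contains no whitespace when written as VCF 4.2.
     */
    static String toVcf42Safe(String diseaseName) {
        return diseaseName.trim().replaceAll("\\s+", "_");
    }
}
```

With this sketch, `toVcf42Safe("Presynaptic congenital myasthenic syndromes")` yields `Presynaptic_congenital_myasthenic_syndromes`. The replacement is lossy: a consumer cannot tell an original underscore from a substituted one.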