EBIvariation / vcf-validator

Validation suite for Variant Call Format (VCF) files, implemented using C++11
Apache License 2.0

Metadata ID restriction more strict than VCF format for HLA contigs in GRCh38 build with decoys #108

Closed JenniferShelton closed 4 years ago

JenniferShelton commented 6 years ago

Hi,

I love your validator and use it often when developing pipelines. I have one issue where I think you have a bug. I want only format spec violations to cause the program to fail, so I run vcf_validator with the --level error flag. The GRCh38 reference ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/GRCh38_reference_genome/GRCh38_full_analysis_set_plus_decoy_hla.fa includes contigs with many special characters in their names (e.g. 'HLA-DQA1*01:02:01:01'). None of these special characters is a semicolon or whitespace, so I believe the VCF spec allows them.

From the VCFv4.2 spec:

  1. ID - identifier: Semi-colon separated list of unique identifiers where available. If this is a dbSNP variant it is encouraged to use the rs number(s). No identifier should be present in more than one data record. If there is no identifier available, then the missing value should be used. (String, no white-space or semi-colons permitted)

Do you see some violation of the VCF spec here? If not, could you allow all non-whitespace and non-semicolon characters?

Thanks for your time, Jennifer Shelton

cyenyxe commented 6 years ago

Contigs, chromosomes and any other sequences are listed in the first column (CHROM). The description you quote applies to the third column (ID), which would contain identifiers like rs IDs. This is how the CHROM column is described in version 4.2 of the specification:

CHROM - chromosome: An identifier from the reference genome or an angle-bracketed ID String (“<ID>”) pointing to a contig in the assembly file (cf. the ##assembly line in the header). All entries for a specific CHROM should form a contiguous block within the VCF file. The colon symbol (:) must be absent from all chromosome names to avoid parsing errors when dealing with breakends. (String, no white-space permitted, Required).
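To make the difference between the two columns concrete, here is a minimal Python sketch of the two rules as quoted in this thread. It is only an illustration of the v4.2 wording, not the validator's actual implementation; the function names `valid_id_v42` and `valid_chrom_v42` are hypothetical.

```python
import re


def valid_id_v42(identifier: str) -> bool:
    """One identifier from the ID column (column 3), per the quoted v4.2 text:
    non-empty, with no whitespace and no semicolons."""
    return bool(identifier) and re.search(r"[\s;]", identifier) is None


def valid_chrom_v42(chrom: str) -> bool:
    """A CHROM value (column 1), per the quoted v4.2 text:
    non-empty, with no whitespace and no colon."""
    return bool(chrom) and re.search(r"[\s:]", chrom) is None


contig = "HLA-DQA1*01:02:01:01"
print(valid_id_v42(contig))     # the ID rule alone would accept this name
print(valid_chrom_v42(contig))  # the CHROM rule rejects it because of the colons
```

So the HLA contig name satisfies the ID wording but not the CHROM wording, which is the rule that applies to the first column.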

The specification maintainers are aware of this issue, which has been discussed and will be solved in the final revision of VCF v4.3.

jmmut commented 6 years ago

BTW, @JenniferShelton I'm glad you mentioned how you use the --level parameter, because now I think we didn't make this parameter clear enough.

To get all the spec violations, please use -l warning and ignore the warnings.

I'm afraid you are not getting all the spec violations if you use --level error because it not only skips the warnings, but also skips the semantic errors. From the readme:

The validation level can be configured using -l / --level. This parameter is optional and accepts 3 values:

  • error: Display only syntax errors
  • warning: Display both syntax and semantic, both errors and warnings (default)
  • stop: Stop after the first syntax error is found
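If you call the validator from a pipeline script, a small wrapper along these lines may help keep the level explicit. This is only a sketch: the helper name `build_validator_command` is hypothetical, and it assumes the input file is passed with `-i`, which you should check against the readme of the version you have installed.

```python
VALID_LEVELS = {"error", "warning", "stop"}  # the three values listed in the readme


def build_validator_command(vcf_path, level="warning", executable="vcf_validator"):
    """Build the argument list for a validator run.

    Assumption: the input file is passed with -i; only -l / --level is
    confirmed by the readme excerpt quoted in this thread.
    """
    if level not in VALID_LEVELS:
        raise ValueError(f"unknown level {level!r}; expected one of {sorted(VALID_LEVELS)}")
    return [executable, "-i", vcf_path, "-l", level]


# "sample.vcf" is a placeholder path. The list can be handed to
# subprocess.run(cmd) once the flags are confirmed for your installation.
cmd = build_validator_command("sample.vcf")
print(" ".join(cmd))
```

Defaulting to `warning` mirrors the recommendation above: run the full set of checks and filter out the warning lines afterwards.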

For instance, if you look in the git repository at the folder test/input_files/v4.3/failed/ you'll see 219 invalid VCFs, but only 86 of those are detected as wrong VCFs with only syntax checking. Some of the checks the validator doesn't perform with just -l error:

We're really sorry for the confusion here; we'll work to make this more intuitive. As a rule of thumb, I recommend -l warning and ignoring the warning lines.