Closed JenniferShelton closed 4 years ago
Contigs, chromosomes and any other sequences are listed in the first column (CHROM). The description you quote applies to the third column (ID), which would contain identifiers like rs IDs. This is how the CHROM column is described in version 4.2 of the specification:
CHROM - chromosome: An identifier from the reference genome or an angle-bracketed ID String (“<ID>”) pointing to a contig in the assembly file (cf. the ##assembly line in the header). All entries for a specific CHROM should form a contiguous block within the VCF file. The colon symbol (:) must be absent from all chromosome names to avoid parsing errors when dealing with breakends. (String, no white-space permitted, Required).
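As a rough illustration of the rule quoted above, here is a minimal Python sketch that checks a contig name against the two v4.2 CHROM constraints (no colon, no white-space). The function name `chrom_problems` is hypothetical and this is not how the validator itself is implemented; it only restates the quoted text as code.

```python
import re


def chrom_problems(name):
    """Return the reasons why `name` breaks the quoted v4.2 CHROM rule.

    An empty list means the name satisfies both constraints from the
    quoted description: no white-space and no colon.
    """
    problems = []
    if not name:
        problems.append("empty name")
    if re.search(r"\s", name):
        problems.append("contains white-space")
    if ":" in name:
        problems.append("contains a colon")
    return problems


# The second name is the GRCh38 HLA contig mentioned in this issue.
for contig in ["chr1", "HLA-DQA1*01:02:01:01"]:
    print(contig, chrom_problems(contig) or "ok")
```

Under the v4.2 wording, the HLA contig name is flagged only because of its colons; the asterisk and hyphen are not a problem.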
The specification maintainers are aware of this issue, which has been discussed and will be solved in the final revision of VCF v4.3.
BTW, @JenniferShelton I'm glad you mentioned how you use the --level parameter, because now I think we didn't make this parameter clear enough.
To get all the spec violations, please use -l warning and ignore the warnings.
I'm afraid you are not getting all the spec violations if you use --level error because it not only skips the warnings, but also skips the semantic errors. From the readme:
The validation level can be configured using -l/--level. This parameter is optional and accepts 3 values:
- error: Display only syntax errors
- warning: Display both syntax and semantic, both errors and warnings (default)
- stop: Stop after the first syntax error is found
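For anyone scripting this in a pipeline, the three levels listed above could be wrapped in a small helper like the following Python sketch. The helper name `build_validator_command` and the file name `sample.vcf` are hypothetical, and two things are assumptions not confirmed by the readme excerpt: that the executable is on the PATH as `vcf_validator`, and that `-i` selects the input file. Only `-l`/`--level` and its three values come from the excerpt.

```python
# The three accepted values, taken from the readme excerpt above.
VALID_LEVELS = ("error", "warning", "stop")


def build_validator_command(vcf_path, level="warning"):
    """Build an argument list for running the validator at a given level.

    Assumptions: the executable is called `vcf_validator` and `-i` names
    the input file. Only `-l` is confirmed by the readme excerpt.
    """
    if level not in VALID_LEVELS:
        raise ValueError(f"level must be one of {VALID_LEVELS}, got {level!r}")
    return ["vcf_validator", "-i", vcf_path, "-l", level]


# Default to "warning" so that semantic errors are reported as well.
print(build_validator_command("sample.vcf"))
```

The resulting list can be passed to `subprocess.run`. Defaulting to `warning` follows the recommendation in this reply, since `error` reports syntax errors only.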
For instance, if you look in the git repository at the folder test/input_files/v4.3/failed/ you'll see 219 invalid VCFs, but only 86 of those are detected as invalid with syntax checking alone. Some of the checks the validator doesn't perform with just -l error:
We're really sorry for the confusion here; we'll work to make this more intuitive. As a rule of thumb, I recommend using -l warning and ignoring the warning lines.
Hi,
I love your validator and use it often when developing pipelines. I have one issue where I think you have a bug. I want to ensure only format spec violations cause the program to fail, so I run vcf_validator with the --level error flag. The GRCh38 reference ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/GRCh38_reference_genome/GRCh38_full_analysis_set_plus_decoy_hla.fa includes contigs with many special characters in their names (e.g. 'HLA-DQA1*01:02:01:01'). None of these special characters is a semicolon or whitespace, so based on the VCF spec I believe they should be allowed. From the VCFv4.2 spec:
Do you see some violation of the VCF spec here? If not, could you allow all non-whitespace and non-semicolon characters?
Thanks for your time, Jennifer Shelton