EBIvariation / vcf-validator

Validation suite for Variant Call Format (VCF) files, implemented using C++11
Apache License 2.0

vcf_validator finds only few errors at a time #222

Closed sozerberk closed 1 year ago

sozerberk commented 1 year ago

Hi,

I've been testing vcf_validator on Mac and Linux, and the behavior is the same on both. I have a terribly formatted VCF. Most of the errors come from INFO values in the data lines that do not match their definitions in the header. 8-10 fields are affected, but vcf_validator only reports 2-3 of them per run, and not all affected lines are flagged. For example, when I run the validator for the first time, I get the following:

Error: INFO dbSNPBuildID does not match the meta specification Number=1 (expected 1 value(s)). This occurs 814 time(s), first time in line 737.
Error: Info field value is not a comma-separated list of valid strings (maybe it contains whitespaces?). This occurs 20 time(s), first time in line 3487.
Error: INFO p3_1000G_AN does not match the meta specification Number=1 (expected 1 value(s)). This occurs 8 time(s), first time in line 89454.

Then I fix them with debugulator, run vcf_validator again, and get the following:

Error: Info field value is not a comma-separated list of valid strings (maybe it contains whitespaces?). This occurs 20 time(s), first time in line 3487.
Error: INFO p3_1000G_AN does not match the meta specification Number=1 (expected 1 value(s)). This occurs 1 time(s), first time in line 38801.
Error: INFO p3_1000G_DP does not match the meta specification Number=1 (expected 1 value(s)). This occurs 8 time(s), first time in line 89454.

There are two issues here:

  1. p3_1000G_AN was not completely fixed with the first run
  2. p3_1000G_DP was not detected in the first run

And a second run is not enough: it takes 5 runs for a small VCF. So it cannot find p3_1000G_DP unless I fix p3_1000G_AN or dbSNPBuildID first.
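
To make the expected behavior concrete, here is a minimal Python sketch of a single-pass check that counts every INFO field whose number of values disagrees with the header, rather than stopping after the first few. It is not how vcf_validator works internally and does not use its code. The function names `declared_info_numbers` and `count_info_mismatches` are hypothetical, and the sketch assumes tab-separated data lines and only checks fields declared with a fixed integer `Number` (it skips `A`, `R`, `G` and `.`).

```python
import re
from collections import Counter


def declared_info_numbers(lines):
    """Map each INFO ID to its declared Number string from ##INFO header lines."""
    numbers = {}
    for line in lines:
        if line.startswith("##INFO=<"):
            id_match = re.search(r"ID=([^,>]+)", line)
            num_match = re.search(r"Number=([^,>]+)", line)
            if id_match and num_match:
                numbers[id_match.group(1)] = num_match.group(1)
    return numbers


def count_info_mismatches(lines):
    """Return {INFO key: (mismatch count, first line number)} in one pass.

    Only INFO fields declared with a fixed integer Number are checked.
    """
    lines = list(lines)
    numbers = declared_info_numbers(lines)
    counts = Counter()
    first_line = {}
    for lineno, line in enumerate(lines, start=1):
        if line.startswith("#") or not line.strip():
            continue  # skip header lines and blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 8 or fields[7] == ".":
            continue  # no INFO column, or INFO is empty
        for entry in fields[7].split(";"):
            key, sep, value = entry.partition("=")
            declared = numbers.get(key)
            if declared is None or not declared.isdigit():
                continue  # undeclared key or non-integer Number: not checked here
            actual = len(value.split(",")) if sep else 0
            if actual != int(declared):
                counts[key] += 1
                first_line.setdefault(key, lineno)
    return {key: (counts[key], first_line[key]) for key in counts}


if __name__ == "__main__":
    import sys

    # Usage: python check_info.py input.vcf
    with open(sys.argv[1]) as handle:
        for key, (count, first) in sorted(count_info_mismatches(handle).items()):
            print(f"INFO {key}: {count} mismatch(es), first in line {first}")
```

With a check like this, dbSNPBuildID, p3_1000G_AN and p3_1000G_DP would all be listed after one run, each with its full count.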

Is this intentional?

Thank you!