EBIvariation / vcf-validator

Validation suite for Variant Call Format (VCF) files, implemented using C++11
Apache License 2.0

Doesn't seem to handle INFO flags correctly #57

Closed sambrightman closed 7 years ago

sambrightman commented 7 years ago

I see the error "Info DECOMPOSED= does not match the meta specification Number=0, expected 0 values" for INFO columns ending in ...;DECOMPOSED. Swapping it with the previous INFO field doesn't help (the column now ends with e.g. ...;DECOMPOSED;type=snp). The definition is:

##INFO=<ID=DECOMPOSED,Number=0,Type=Flag,Description="The allele was parsed using vcfallelicprimitives.">

I think this format is correct?
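For reference, here is a minimal Python sketch of how an INFO column could be split and its values counted, assuming the usual VCF layout of `;`-separated entries with `,`-separated values. It is not the validator's actual code (which is C++), and the function name `info_value_counts` is made up for illustration. It only shows why a bare key such as DECOMPOSED should count as zero values, which is what Number=0,Type=Flag describes.

```python
def info_value_counts(info_column):
    """Map each INFO key to the number of values it carries.

    Illustrative helper, not part of vcf-validator. A bare key with
    no '=' (a flag) is counted as carrying zero values.
    """
    counts = {}
    if info_column == ".":  # '.' marks an empty INFO column
        return counts
    for entry in info_column.split(";"):
        key, sep, value = entry.partition("=")
        # sep is '' when the entry has no '=', i.e. the entry is a flag.
        counts[key] = len(value.split(",")) if sep else 0
    return counts
```

With this reading, both `TYPE=ins,snp;DECOMPOSED` and `DECOMPOSED;type=snp` give DECOMPOSED a count of zero, regardless of its position in the column.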

jmmut commented 7 years ago

I'm sorry, but I couldn't reproduce the problem, and the current code seems to handle it correctly. Maybe it's a bug that has already been fixed? May I suggest trying the latest version?

If the problem is still there, please come back and let us know. A minimal sample of your VCF (or a made-up one that also fails) would be very helpful, just as the definition of the DECOMPOSED flag already was.

For example, this VCF does not raise any warning about DECOMPOSED in the latest version of the validator.

##fileformat=VCFv4.3
##reference=GRCh37
##contig=<ID=1,Description="chr 1">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##INFO=<ID=DECOMPOSED,Number=0,Type=Flag,Description="The allele was parsed using vcfallelicprimitives.">
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  HG00096 HG00097
1   100 .   C   T   100 PASS    DECOMPOSED  GT  0|0 0|1

gives:

$ vcf_validator -v v4.3 -i /tmp/test3.vcf
Reading from input file...
According to the VCF v4.3 specification, the input file is valid

sambrightman commented 7 years ago

Yes, I'm using the latest version. This:

##fileformat=VCFv4.3
##INFO=<ID=TYPE,Number=A,Type=String,Description="The type of allele, either snp, mnp, ins, del, or complex.">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##INFO=<ID=DECOMPOSED,Number=0,Type=Flag,Description="The allele was parsed using vcfallelicprimitives.">
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  CPCT11111111R   CPCT11111111T
1       53707   .       TA      T,TC    41.0414 .       TYPE=ins,snp;DECOMPOSED GT      .|.     0|1
1       53708   .       TA      T       41.0414 .       TYPE=del;DECOMPOSED     GT      .|.     0|1

Produces this:

Reading from input file...
Line 6: A valid 'reference' entry is not listed in the meta section (warning)
Line 6: Chromosome/contig '1' is not described in a 'contig' meta description (warning)
Line 7: Info DECOMPOSED= does not match the meta specification Number=0, expected 0 values
According to the VCF v4.3 specification, the input file is not valid

There are some odd things going on here. The DECOMPOSED flag problem is reported for the second data line (line 7 of the file). If I remove the second ALT allele from the first data line, it is not reported. If I fix the warning about the missing reference= header, it is not reported. If I remove the final newline of the file, it is not reported. However, on a larger file with reference= and a final newline it still crops up.

I also note that removing the second data line and not ending with a newline removes all warnings - the missing reference= and the missing contig are no longer reported. It feels like something is wrong with how the parser handles optional elements.
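Variations like these can be generated mechanically, which makes it easier to bisect which change hides the error. The following Python sketch is only an illustration: the function name `make_variants` is hypothetical, and the `##reference=GRCh37` line is borrowed from the earlier example in this thread rather than being a value the validator requires. Each variant applies exactly one change to the input text.

```python
def make_variants(vcf_text):
    """Return named variants of a VCF text, each with one change applied.

    Illustrative helper for bisecting a validator error; it assumes the
    first line of vcf_text is the ##fileformat header.
    """
    lines = vcf_text.splitlines()
    variants = {"original": vcf_text}
    # Same content, but with the trailing newline(s) stripped.
    variants["no_final_newline"] = vcf_text.rstrip("\n")
    # A reference header inserted right after the fileformat line.
    with_ref = [lines[0], "##reference=GRCh37"] + lines[1:]
    variants["with_reference"] = "\n".join(with_ref) + "\n"
    # The last record dropped, final newline kept.
    variants["without_last_record"] = "\n".join(lines[:-1]) + "\n"
    return variants
```

Each variant can then be written to its own file and passed to the validator with the same `-v v4.3 -i <file>` invocation shown earlier.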

jmmut commented 7 years ago

Ok, sorry for the delay. Long story short, there were actually two bugs mixed together here. One was in the way we were splitting the fields, which is already fixed in the develop branch.

The other one was about files that didn't have a newline before the end of file. We are currently working on that, but if all the lines in your files end with a newline, this should be no problem for you.
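Until that second bug is fixed, a quick check that a file ends with a newline can rule it out. This is a minimal Python sketch under the assumption that the file uses `\n` line endings; the helper name `ends_with_newline` is made up for illustration and is not part of vcf-validator.

```python
import os


def ends_with_newline(path):
    """Return True if the file is empty or its last byte is a newline."""
    with open(path, "rb") as handle:
        handle.seek(0, os.SEEK_END)  # jump to the end to learn the size
        if handle.tell() == 0:
            return True  # an empty file has no unterminated line
        handle.seek(-1, os.SEEK_END)  # step back to the last byte
        return handle.read(1) == b"\n"
```

Reading only the last byte keeps the check cheap even for large VCF files.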

We will precompile another release version when we fix the second bug. If you want, you can compile the develop branch yourself; that should work fine for your DECOMPOSED problem.

Thanks for your feedback!

sambrightman commented 7 years ago

Great! To be clear, I normally have newlines at the end - the first example still shows the DECOMPOSED problem with a newline at the end. Is that the one which is already fixed in develop? I'm pretty sure I already compiled on develop (or at least as it was when I filed this issue).

jmmut commented 7 years ago

Oh, sorry for the confusion. The develop branch was indeed buggy when you opened the issue. I meant that I fixed it last week, and the current develop should now work (only for the DECOMPOSED bug; we are still working on the no-newline report message). If develop works for you, it would be great if you could confirm it here so we can close the issue.

sambrightman commented 7 years ago

Seems to be fixed indeed.