Open grendeloz opened 3 years ago
It seems the FileDate wasn't a user-defined field, it should be part of "Source". Because it is missing double quotes, so accidentally becomes an extra field. No doubt we should put double quotes to all string values which allows space etc.
the comma inside a quoted string may be ok. But seme colon also good between file name an date inside Source.
What golang
command are you using to test the validity of the vcf?
So it looks like there was a change on the 7th of June 2018 that removed the Source
, FileDate
and Version
attributes from the INFO
field header.
Running vcftools'
vcf-validator
against the vcf file in question throws a number of errors (this vcf harks back to the days where we would ampersand delimit values from samples run on different callers) but doesn't complain about the header.
If there is a way to be more thorough in our vcf validation (perhaps by using your golang
command) then we should incorporate that into our tests.
Sorry, not a golang command - I'm using the github.com/brentp/vcfgo package (like a class but not OO) in my own code and it was borking. It's been working fine on the toy VCFs I use for testing but it didn't like the full Colo-829 VCF I pulled down for testing at proper scale.
Ollie - any idea on why we removed the Source
, FileDate
and Version
? The FileDate is obviously non-standard but the other 2 are fine as long as they are double quoted. And after a more careful reading of the the spec, I don't think my proposed compound Source
field would have worked anyway - it seems pretty clear that a simple value like dbsnp
is what is expected in that field.
Ollie - any idea on why we removed the
Source
,FileDate
andVersion
? The FileDate is obviously non-standard but the other 2 are fine as long as they are double quoted. And after a more careful reading of the the spec, I don't think my proposed compoundSource
field would have worked anyway - it seems pretty clear that a simple value likedbsnp
is what is expected in that field.
I can't remember JP. It could just have been a mistake (I had merged a long running qsnp branch that commented out the Source/Version/FileDate
attributes). If you would like the Source
and Version
to make it back in then please say the word. We would need to identify which INFO fields this should be applied to.
The bug
golang VCF libraries will not read out VCFs.
According to a strict reading of the last few version of the VCF spec (https://samtools.github.io/hts-specs/VCFv4.2.pdf, https://samtools.github.io/hts-specs/VCFv4.3.pdf) the INFO field has a prescribed format which we are departing from in a number of ways:
Source
element is not enclosed in double quotes.Version
element is not enclosed in double quotes.FileDate
To Reproduce
I am using VCF
fbe3b136-dc8b-4c8d-bde3-a6390c91b521.vcf
from the COLO-829 analysisanalysis_fbe3b136-dc8b-4c8d-bde3-a6390c91b521
for testing my code. The following two lines demonstrate the problems shown above - the first line has all 3 problems and the second line has problems 1 and 2:If I cut out the first 1000 lines form this VCF and rectify all 3 problems, then the VCF will parse.
Expected behavior
The golang library appears to be applying the VCF spec strictly and not allowing for the addition of user-defined fields in INFO lines however the spec does not explicitly allow for user-defined fields in INFO lines so I think we should stop using them.
I'm guessing qannotate may be adding these but wherever it is, I'd like to have the quoting fixed. And for the FileDate, we could make Source a composite field that also contained the file date, for example:
If we go down the composite field path, I would suggest that we use semi-colon as the separator because comma is the default separator between subfields within an INFO field.