EBIvariation / vcf-validator

Validation suite for Variant Call Format (VCF) files, implemented using C++11
Apache License 2.0

What does "Reference and alternate alleles do not share the first nucleotide" mean? #189

Closed EvanTheB closed 3 years ago

EvanTheB commented 4 years ago
Warning: Reference and alternate alleles do not share the first nucleotide. This occurs 1 time(s), first time in line 40.

For this variant (+header https://gist.github.com/EvanTheB/98ea93b53d3952697df1e8fcb72efb3b):

MT  105 .   CGGAGCA C,* .   .   AC=0,0;AN=2 GT:AD:DP:GQ:PL  0/0:1736,0,0:1736:99:0,120,1800,120,1800,1800

I cannot see what is wrong with that line, from my limited reading of the VCF spec. bcftools norm doesn't change it, and GATK accepts it...

Any clues?

jmmut commented 4 years ago

The problem the validator is raising is that VCF requires a context base for indels or symbolic alleles where REF or ALT would result in empty strings:

jmmut commented 4 years ago

OK, there's some unresolved ambiguity in the spec (https://github.com/samtools/hts-specs/issues/151), but it seems the overlapping deletion indeed doesn't need a context base. I'll leave this issue open until the bug is fixed in the validator.

Until then, please ignore warnings of this kind when overlapping deletions are involved.
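
To illustrate why the record above trips the warning, here is a minimal Python sketch. It is not the validator's actual code: `first_base_mismatches` and `is_overlapping_deletion` are hypothetical helpers, and the "naive" check is only an assumption about how a first-nucleotide comparison might behave. It shows that the only allele that fails the comparison is the `*` (overlapping deletion) allele.

```python
# Illustrative only: not the validator's actual implementation.

def first_base_mismatches(ref, alts):
    """Naive check: return ALT alleles whose first character differs from REF's."""
    return [alt for alt in alts if alt[:1] != ref[:1]]

def is_overlapping_deletion(alt):
    """'*' marks an allele that is missing due to an overlapping deletion."""
    return alt == "*"

# The record from this issue, shortened (whitespace-separated here; real VCF uses tabs).
record = "MT 105 . CGGAGCA C,* . . AC=0,0;AN=2"
fields = record.split()
ref, alts = fields[3], fields[4].split(",")

naive = first_base_mismatches(ref, alts)
relaxed = [alt for alt in naive if not is_overlapping_deletion(alt)]

print(naive)    # ['*']
print(relaxed)  # []
```

Skipping `*` before comparing first bases leaves nothing to warn about for this record, which matches the suggestion to ignore the warning here.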

jgbaum commented 4 years ago

I'm getting similar errors in my VCF files. It seems to happen in indels, specifically. Here is an example of a line that generates such an error:

chr22   2009    48  GT  ATTC    .   PASS    .   GT:AD:DP    0/1:11,4:15 0/0:2,0:3

Is this line not properly formatted or is this an error in the validator?

Thanks very much!

-Jason

jmmut commented 4 years ago

Hi Jason. First of all, let me assure you that this message is a warning, not an error, so if you only get warnings, your VCF is valid.

Also, looking at your line, I confirm that it's correct. Specifically, VCF requires a context base for indels or symbolic alleles where REF or ALT would result in empty strings. In your case the indel would not result in an empty string in either the REF or the ALT allele, so those alleles don't 'share the first nucleotide', but there's no problem with that.
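
The distinction can be sketched in a few lines of Python. This is only an illustration under the reading of the rule given above, not the validator's logic; `strip_shared_first_base` and `relies_on_context_base` are hypothetical names. The idea: if removing a shared leading base would leave REF or ALT empty, the context base is doing necessary work; if the alleles share no leading base and both are non-empty, nothing is missing.

```python
# Illustrative only: a simplified reading of the context-base rule for
# plain-sequence alleles (no symbolic alleles, no '*').

def strip_shared_first_base(ref, alt):
    """Drop the first base from both alleles if they share it."""
    if ref[:1] == alt[:1]:
        return ref[1:], alt[1:]
    return ref, alt

def relies_on_context_base(ref, alt):
    """True if, without the shared first base, REF or ALT would be empty."""
    stripped_ref, stripped_alt = strip_shared_first_base(ref, alt)
    return stripped_ref == "" or stripped_alt == ""

# Deletion from the original report: the shared 'C' is the context base.
print(relies_on_context_base("CGGAGCA", "C"))  # True

# The line above: no shared base, but neither allele is empty.
print(relies_on_context_base("GT", "ATTC"))    # False
```

In the second case the alleles do not share a first nucleotide, yet neither would be empty, so under this reading there is nothing to flag.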

It seems to me that this detail is an oversight on our side in the validator. It's also a slightly different issue from the original one in this thread (about overlapping deletions), but we can fix this second issue because there is no pending discussion on the spec side.

Let me know if anything remains unclear, thanks.

jgbaum commented 4 years ago

Thanks for the quick reply and the explanation!

I've run into a separate issue for which I will open another ticket. Thank you!

tcezard commented 4 years ago

Tracked in EVA-2050