EBIvariation / vcf-validator

Validation suite for Variant Call Format (VCF) files, implemented using C++11
Apache License 2.0

Sample #2 has 2 allele(s), but 1 were found in others (warning) #193

Closed jgbaum closed 4 years ago

jgbaum commented 4 years ago

I'm using this docker container (https://hub.docker.com/r/thomasvyu/vcf-validator/tags) that I believe has version 0.6 of vcf-validator. I'm currently obligated to use this version and have the VCFs pass validation without any warnings, so I need to figure out a workaround.

I am getting many warning messages like:

Sample #2 has 2 allele(s), but 1 were found in others (warning)

Here's an example row which throws such a message:

chrY 22077202 39635 C A . FAIL . GT:AD:DP .:.,0:0 0/0:1,2:3

I imagine it has something to do with the periods '.' in the GT or AD fields. Should these be replaced with, e.g., 0 to pass this version of the validator?

Thanks very much for your help!

-Jason

jmmut commented 4 years ago

TL;DR: ignore these warnings. It's a bug in that version of the validator.

This is related to https://github.com/samtools/hts-specs/issues/419 and https://github.com/samtools/hts-specs/issues/229. At that time it wasn't clear whether '.' can be used for fields with Number>1, such as GT and AD.

For GT it is preferred that you put the correct ploidy even when the call is missing ('./.' for diploid); for other fields (such as AD) it is OK to put just a single '.'.

In your example, the AD field does have 2 values, so it's OK. However, the GT field of your first sample only has '.' (remember you split the FORMAT fields by ':'). But I see this variant is on chromosome Y, which should have ploidy 1, so the line is indeed correct. For this exact case we decided this was a bug and removed the constraint that every sample must have the same ploidy in https://github.com/EBIvariation/vcf-validator/pull/114, so it is fixed in later versions.
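To see where the warning comes from, here is a minimal Python sketch of the kind of per-sample ploidy comparison described above. It is not the validator's actual code (the validator is written in C++); the function name `gt_ploidies` is made up for illustration, and the sketch assumes the columns contain no embedded whitespace, so that splitting on any whitespace works for both real tab-separated VCF lines and the space-separated row pasted in this thread.

```python
import re


def gt_ploidies(vcf_line):
    """Return the number of alleles in the GT subfield of each sample.

    Hypothetical helper for illustration only; assumes a data line with
    a FORMAT column (index 8) followed by one column per sample.
    """
    fields = vcf_line.split()
    gt_index = fields[8].split(":").index("GT")
    ploidies = []
    for sample in fields[9:]:
        gt = sample.split(":")[gt_index]
        # Alleles are separated by '/' (unphased) or '|' (phased),
        # so a lone '.' counts as a single (missing) allele.
        ploidies.append(len(re.split(r"[/|]", gt)))
    return ploidies


row = "chrY 22077202 39635 C A . FAIL . GT:AD:DP .:.,0:0 0/0:1,2:3"
print(gt_ploidies(row))  # [1, 2]
```

Sample 1 counts as 1 allele and sample 2 as 2, which is exactly the mismatch that version 0.6 reports as a warning.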

The validator is not perfect, especially the older versions (we are also aware of other small bugs in the current version that we haven't fixed yet), and inflexibly requiring it to pass without any warnings at all is sometimes pointless. The only ways to make it pass are putting './.', which is incorrect for chrY, or dropping the variants on chrY, which is a waste.

Sorry there's no way around it in this case.

jgbaum commented 4 years ago

OK, thanks very much for the quick and thoughtful reply!

-J

jmmut commented 4 years ago

I just reread this thread, and I realised another detail.

With "Sample #2 has 2 allele(s), but 1 were found in others", the validator is noting that sample 1 has GT=. and sample 2 has GT=0/0. This would be correct for male/female samples on chrX, but your example uses chrY, which should not be diploid in any case for human. 0 should have been used instead of 0/0. It might be OK if it's a different species.
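If it helps to find such rows before running the validator, the following Python sketch lists samples whose GT is not haploid on a contig expected to be haploid. Everything named here is an assumption for illustration: the helper `non_haploid_samples` is hypothetical, the contig names in `HAPLOID_CONTIGS` are a guess at common human naming, and the sketch treats the whole contig as haploid, which is a simplification. It only reports; it does not rewrite any genotype.

```python
import re

# Assumption: names commonly used for the human Y chromosome.
HAPLOID_CONTIGS = {"chrY", "Y"}


def non_haploid_samples(vcf_line, haploid_contigs=HAPLOID_CONTIGS):
    """Return (sample_number, GT) pairs whose GT has more than one allele.

    Hypothetical helper; returns an empty list for lines on other contigs.
    Sample numbers start at 1, matching the validator's messages.
    """
    fields = vcf_line.split()
    if fields[0] not in haploid_contigs:
        return []
    gt_index = fields[8].split(":").index("GT")
    flagged = []
    for number, sample in enumerate(fields[9:], start=1):
        gt = sample.split(":")[gt_index]
        if len(re.split(r"[/|]", gt)) != 1:
            flagged.append((number, gt))
    return flagged


line = "chrY 22077202 39635 C A . FAIL . GT:AD:DP .:.,0:0 0/0:1,2:3"
print(non_haploid_samples(line))  # [(2, '0/0')]
```

For the row from this thread, only sample 2 is flagged, which matches the observation that 0/0 rather than '.' is the unexpected genotype on chrY.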