EBIvariation / vcf-validator

Validation suite for Variant Call Format (VCF) files, implemented using C++11
Apache License 2.0

check ploidy when validating PL #101

Closed jaredo closed 6 years ago

jaredo commented 6 years ago

Hello! Thanks for your work on this.

I think there is an issue when validating FORMAT/PL for non-diploid genotypes. Consider the following region on chromosome X:

$ bcftools view -H example.vcf
chrX    10980118    rs1265885   C   T   1572    PASS    SNVHPOL=2;MQ=60 GT:GQ:GQX:DP:DPF:AD:ADF:ADR:SB:FT:PL    1/1:159:30:54:0:0,54:0,27:0,27:-88.3:PASS:370,163,0 1/1:135:30:46:0:0,46:0,23:0,23:-77.1:PASS:370,138,0 1:224:30:10:0:0,10:0,6:0,4:-24.3:PASS:231,0

$ ./vcf_validator -i example.vcf 
Reading from input file...
Line 115: Sample #3, PL=231,0 does not match the meta specification Number=G (contains 2 value(s), expected 3)
According to the VCF specification, the input file is not valid

I think the number of PL values for the male haploid sample should be equal to the number of alleles, i.e. 2.
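
As a quick check of that arithmetic, here is a minimal sketch of how many values a Number=G field should carry. It assumes the usual reading of the VCF specification, where Number=G means one value per possible genotype, that is, per unordered combination of `ploidy` alleles drawn from REF plus the ALT alleles. The function name `genotype_count` is made up for illustration and is not part of vcf-validator.

```python
from math import comb


def genotype_count(n_alleles: int, ploidy: int) -> int:
    """Count unordered genotypes: multisets of size `ploidy` over `n_alleles` alleles."""
    return comb(n_alleles + ploidy - 1, ploidy)


# The chrX record above has REF=C and ALT=T, so 2 alleles in total.
haploid = genotype_count(2, 1)  # male sample, GT "1"
diploid = genotype_count(2, 2)  # female samples, GT "1/1"
print(haploid, diploid)
```

With two alleles this gives 2 values for the haploid sample and 3 for the diploid ones, which matches PL=231,0 and PL=370,163,0 in the record.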

thanks

Jared

jmmut commented 6 years ago

I'm going to look into this further to better understand the problem, but a quick thing you can try is the --special-ploidy parameter, which specifies that chrX is haploid. It might work if you add the parameter as --special-ploidy chrX=1

jmmut commented 6 years ago

Although I see that you have diploid genotypes in the first 2 samples, is that correct?

jaredo commented 6 years ago

> Although I see that you have diploid genotypes in the first 2 samples, is that correct?

That is right. For regions where ploidy can vary between samples, such as the non-PAR part of chrX in humans, the validator needs to be flexible about the length of the PL field. You could infer the expected length of PL from the ploidy of GT.
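
A rough sketch of that inference, applied per sample: take the ploidy to be the number of allele entries in GT (separated by `/` or `|`), then compare the PL length against the number of possible genotypes for that ploidy. The helper names (`ploidy_from_gt`, `expected_pl_length`, `pl_matches_gt`) are hypothetical, this is not how vcf-validator implements the check, and missing values such as `PL=.` are not handled.

```python
import re
from math import comb


def ploidy_from_gt(gt: str) -> int:
    # Alleles in GT are separated by "/" (unphased) or "|" (phased).
    return len(re.split(r"[/|]", gt))


def expected_pl_length(gt: str, n_alt: int) -> int:
    # One PL value per unordered genotype over REF + ALT alleles.
    n_alleles = n_alt + 1
    ploidy = ploidy_from_gt(gt)
    return comb(n_alleles + ploidy - 1, ploidy)


def pl_matches_gt(format_keys: str, sample: str, n_alt: int) -> bool:
    # Pair FORMAT keys with the sample's colon-separated values.
    fields = dict(zip(format_keys.split(":"), sample.split(":")))
    pl_values = fields["PL"].split(",")
    return len(pl_values) == expected_pl_length(fields["GT"], n_alt)


fmt = "GT:GQ:GQX:DP:DPF:AD:ADF:ADR:SB:FT:PL"
samples = [
    "1/1:159:30:54:0:0,54:0,27:0,27:-88.3:PASS:370,163,0",  # diploid
    "1:224:30:10:0:0,10:0,6:0,4:-24.3:PASS:231,0",          # haploid
]
for sample in samples:
    print(sample.split(":")[0], pl_matches_gt(fmt, sample, n_alt=1))
```

Both the diploid and the haploid sample from the example record pass under this rule, because the expected length is computed from each sample's own GT rather than from a single ploidy for the whole contig.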

jmmut commented 6 years ago

I see. We had the impression that the specification required all samples to have the same ploidy, but it doesn't actually require that, and your point makes sense from the biological side. The specification also says that PL is expected to have the same ploidy as the GT.

You can expect we will allow this, but we haven't scheduled it yet.

cyenyxe commented 6 years ago

Related discussion taking place in https://github.com/samtools/hts-specs/issues/272

jmmut commented 6 years ago

In #114 we introduced a fix for this. The next version of the validator will accept that VCF as valid:

$ cat example.vcf
##fileformat=VCFv4.3
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  HG00096 HG00097 HG00098
chrX    10980118    rs1265885   C   T   1572    PASS    SNVHPOL=2;MQ=60 GT:GQ:GQX:DP:DPF:AD:ADF:ADR:SB:FT:PL    1/1:159:30:54:0:0,54:0,27:0,27:-88.3:PASS:370,163,0 1/1:135:30:46:0:0,46:0,23:0,23:-77.1:PASS:370,138,0 1:224:30:10:0:0,10:0,6:0,4:-24.3:PASS:231,0

$ ./vcf_validator -i example.vcf
[info] Reading from input file...
[info] According to the VCF specification, the input file is valid

$ cat example.vcf.errors_summary.1519743173046.txt
According to the VCF specification, the input file is valid
Warning: A valid 'reference' entry is not listed in the meta section. This occurs 1 time(s), first time in line 3.
Warning: Chromosome/contig 'chrX' is not described in a 'contig' meta description. This occurs 1 time(s), first time in line 3.

Of course, those warnings are there because I used a minimal header and didn't include any meta-information.

jaredo commented 6 years ago

thanks!