EBIvariation / vcf-validator

Validation suite for Variant Call Format (VCF) files, implemented using C++11
Apache License 2.0

Error to validate vcf file from stacks (gstack tools reference mapping) #218

Closed (FranciscoAscue closed 1 year ago)

FranciscoAscue commented 1 year ago

I worked with Stacks 2 and want to submit this data to EBI. I get the following errors after running vcf_validator_linux and vcf_debugulator_linux:

According to the VCF specification, the input file is not valid
Error: Sample #5, field GL does not match the meta specification Number=G (expected 3 value(s)). This occurs 1842 time(s), first time in line 3160.
Warning: A valid 'reference' entry is not listed in the meta section. This occurs 1 time(s), first time in line 3160.
Error: Sample #7, field GL does not match the meta specification Number=G (expected 3 value(s)). This occurs 1107 time(s), first time in line 3161.
Error: Sample #3, field GL does not match the meta specification Number=G (expected 3 value(s)). This occurs 1509 time(s), first time in line 3162.
.
.
.

I don't know whether the tools for handling VCF files recommended on the EBI submissions help page are mandatory, since Stacks generates VCF files directly. Any insight into the errors above would be helpful for us.

P.S.

I worked with 40 individuals of Cavia porcellus from RAD-seq sequencing and used the scaffold-level genome as reference (cavpor3).

tcezard commented 1 year ago

The GL field is supposed to contain genotype likelihoods. For a diploid individual at a biallelic site there should be 3 values, one for each possible genotype (AA/AB/BB). My guess is that Stacks outputs something slightly different, overloading the GL field. That makes the field incompatible with downstream tools and the VCF specification. You can remove the GL field from the VCF file with bcftools:

bcftools annotate -x FORMAT/GL file.vcf.gz
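
To make the "expected 3 value(s)" message concrete, here is a minimal Python sketch of how many values a Number=G field carries for a given number of ALT alleles and ploidy. The function names and the sample string in the usage note are hypothetical and only illustrate the counting rule; this is not the validator's actual code, and it says nothing certain about what Stacks writes into GL.

```python
from math import comb
from typing import List, Optional


def expected_genotype_count(n_alt_alleles: int, ploidy: int = 2) -> int:
    """Number of values a Number=G field should carry at one site."""
    n_alleles = n_alt_alleles + 1  # REF plus every ALT allele
    # Unordered genotypes: combinations with repetition of the alleles.
    return comb(n_alleles + ploidy - 1, ploidy)


def gl_value_count(sample_field: str, format_keys: List[str]) -> Optional[int]:
    """Count the GL values in one sample column (None if GL is absent or '.')."""
    values = dict(zip(format_keys, sample_field.split(":")))
    gl = values.get("GL")
    if gl is None or gl == ".":
        return None
    return len(gl.split(","))
```

For a diploid sample at a biallelic site, `expected_genotype_count(1)` gives 3, matching the validator message. A sample column whose GL entry holds only two comma-separated values (a made-up case for illustration) would therefore be flagged.
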

FranciscoAscue commented 1 year ago

@tcezard thanks for the advice. After removing GL, the validator and the debugulator no longer report GL problems, but they still report the following (the remaining errors are all "Duplicated variant" errors):

According to the VCF specification, the input file is not valid
Warning: A valid 'reference' entry is not listed in the meta section. This occurs 1 time(s), first time in line 3162.
Error: Contig is not sorted by position. This occurs 10927 time(s), first time in line 3163.
Error: Duplicated variant NT_174338.1:4997:A>C found. This occurs 2 time(s), first time in line 3271.
Error: Duplicated variant NT_174393.1:289:C>G found. This occurs 2 time(s), first time in line 3275.
Error: Duplicated variant NT_174582.1:487:C>A found. This occurs 2 time(s), first time in line 3301.
Error: Duplicated variant NT_174766.1:304:G>C found. This occurs 2 time(s), first time in line 3334.
Error: Duplicated variant NT_174766.1:12005:A>C found. This occurs 2 time(s), first time in line 3339.
Error: Duplicated variant NT_174872.1:4926:G>A found. This occurs 2 time(s), first time in line 3392.
Error: Duplicated variant NT_175047.1:3882:A>C found. This occurs 2 time(s), first time in line 3445.
.
.
.

This is part of the VCF:

NT_174338.1     4997    5233:76:+       A       C       .       PASS    NS=40;AF=0.175  GT:DP:AD:GQ     0/0:1:1,0:21    1/1:1:0,1:14    0/0:1:1,0:21    0/0:1:1,0:21    0/0:1:1,0:21    0/0:1:1,0:21    1/1:1:0,1:14       0/0:1:1,0:21    0/0:1:1,0:21    0/0:1:1,0:21    0/0:1:1,0:21    0/0:1:1,0:21    0/0:1:1,0:21    0/0:1:1,0:21    0/0:1:1,0:21    1/1:1:0,1:14    0/0:1:1,0:21    0/0:1:1,0:21    0/0:1:1,0:21    0/0:1:1,0:21       0/0:1:1,0:21    0/0:1:1,0:21    0/0:1:1,0:21    0/0:1:1,0:21    0/0:1:1,0:21    0/0:1:1,0:21    0/0:1:1,0:21    0/0:1:1,0:21    1/1:1:0,1:14    0/0:1:1,0:21    0/0:1:1,0:21    0/0:1:1,0:21    0/0:1:1,0:21       0/0:1:1,0:21    1/1:1:0,1:14    0/0:1:1,0:21    1/1:2:0,2:18    1/1:1:0,1:14    0/0:1:1,0:21    0/0:1:1,0:21
NT_174338.1     4997    5235:13:-       A       C       .       PASS    NS=40;AF=0.25   GT:DP:AD:GQ     0/0:1:1,0:21    0/0:1:1,0:21    0/0:1:1,0:21    0/0:1:1,0:21    0/0:1:1,0:21    1/1:1:0,1:16    1/1:1:0,1:16       0/0:1:1,0:21    0/0:1:1,0:21    1/1:1:0,1:16    0/0:1:1,0:21    0/0:1:1,0:21    0/0:1:1,0:21    0/0:1:1,0:21    0/0:1:1,0:21    0/0:1:1,0:21    0/0:1:1,0:21    0/0:1:1,0:21    0/0:1:1,0:21    0/0:1:1,0:21       0/0:1:1,0:21    0/0:1:1,0:21    1/1:1:0,1:16    0/0:1:1,0:21    0/0:1:1,0:21    1/1:1:0,1:16    0/0:1:1,0:21    0/0:1:1,0:21    0/0:1:1,0:21    1/1:1:0,1:16    1/1:1:0,1:16    0/0:2:2,0:25    0/0:1:1,0:21       1/1:1:0,1:16    0/0:1:1,0:21    0/0:1:1,0:21    1/1:1:0,1:16    1/1:1:0,1:16    0/0:1:1,0:21    0/0:1:1,0:21
NT_174393.1     289     5563:262:+      C       G       .       PASS    NS=36;AF=0.194  GT:DP:GQ        0/0:1:23        0/0:1:23        0/0:1:23        0/0:1:23        0/0:1:23        0/0:1:23        0/0:1:23  ./.:.:.  0/0:1:23        0/0:1:23        0/0:1:23        0/0:1:23        0/0:1:23        1/1:1:17        1/1:1:17        0/0:2:26        ./.:.:. 0/0:1:23        0/0:1:23        0/0:1:23        0/0:1:23        0/0:1:23   0/0:1:23        ./.:.:. 0/0:1:23        0/0:1:23        1/1:1:17        0/0:1:23        0/0:1:23        0/0:1:23        0/0:1:23        0/0:1:23        1/1:1:17        1/1:1:17        0/0:1:23        0/0:1:23   ./.:.:. 1/1:1:17        1/1:1:17        0/0:1:23
NT_174393.1     289     5564:43:-       C       G       .       PASS    NS=35;AF=0.229  GT:DP:GQ        0/0:1:21        0/0:1:21        0/0:1:21        0/0:1:21        0/0:1:21        0/0:2:24        0/0:1:21  ./.:.:.  0/0:1:21        0/0:1:21        0/0:1:21        1/1:1:15        0/0:1:21        1/1:1:15        1/1:1:15        0/0:1:21        0/0:1:21        1/1:1:15        0/0:1:21        0/0:1:21        0/0:1:21  0/0:1:21 0/0:1:21        ./.:.:. 1/1:1:15        0/0:1:21        0/0:1:21        1/1:1:15        1/1:1:15        0/0:1:21        0/0:1:21        0/0:1:21        1/1:1:15        ./.:.:. 0/0:1:21        0/0:1:21  ./.:.:.  0/0:1:21        0/0:1:21        ./.:.:.

The input data were filtered by MAF and missing data, but the errors remain.
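
For anyone wanting to locate the offending records before resubmitting, the two remaining error types can be reproduced with a short Python sketch. It assumes a duplicate means the same CHROM, POS, REF and ALT (the key shown in the error message, e.g. NT_174338.1:4997:A>C) and that "not sorted" means a position lower than the previous one on the same contig. The function name `scan_records` is hypothetical, and the validator's own rules may differ in detail.

```python
from collections import Counter


def scan_records(lines):
    """Return (duplicates, unsorted_count) for an iterable of VCF text lines."""
    seen = Counter()
    unsorted_count = 0
    last_chrom, last_pos = None, 0
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header/meta lines
        # Whitespace split copes with pasted excerpts; real VCF is tab-delimited.
        chrom, pos, _id, ref, alt = line.split()[:5]
        pos = int(pos)
        if chrom == last_chrom and pos < last_pos:
            unsorted_count += 1
        last_chrom, last_pos = chrom, pos
        for allele in alt.split(","):
            seen[(chrom, pos, ref, allele)] += 1
    duplicates = {key: n for key, n in seen.items() if n > 1}
    return duplicates, unsorted_count
```

Run on the four rows pasted above, this reports NT_174338.1:4997:A>C and NT_174393.1:289:C>G twice each. In both pairs the two records differ in the ID column (for example 5233:76:+ versus 5235:13:-) and in the per-sample data, so the same site appears to be reported from two separate entries.
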