arq5x / gemini

a lightweight db framework for exploring genetic variation.
http://gemini.readthedocs.org
MIT License

gemini annotate float error #676

davetang closed 8 years ago

davetang commented 8 years ago
gemini annotate -f exome.vt.vep.vcf.gz -o list -e AF -t float exome.vt.vep.db
updated 100000 variants
Non float value found in annotation file: 0.021,0.058

gemini query -q 'select variant_id,AF from variants where AF is null limit 2' --header *.db
variant_id      AF
100001  None
100002  None

When I manually checked the variant at line 100001 in the VCF file, the AF has a single value: ;AF=0.208; and so does variant 100002: ;AF=0.34;. The AF of variant 100000 is stored correctly in the database, ;AF=0.118;. I'm not sure where the 0.021,0.058 came from. It was just a bit suspicious that the error occurred on line 100,001.

I'm getting the same error with VQSLOD.

gemini annotate -f exome.vt.vep.vcf.gz -o list -e VQSLOD -t float exome.vt.vep.db
updated 100000 variants
Non float value found in annotation file: -0.0946,-1.352

brentp commented 8 years ago

did you run vt decompose on exome.vt.vep.vcf.gz

davetang commented 8 years ago

Yep.

vt decompose -s $VCF | vt normalize -r $GENOME - | gzip > $BASE.vt.vcf.gz

brentp commented 8 years ago

Thanks for the full example! I'll have a look at how to handle this better. Note that in this case VQSLOD is declared as Type=Float, Number=1, so having 4.5487,4.5487 in that field violates the VCF spec.
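
To make the Number=1 point concrete, here is a minimal Python sketch that reads the `##INFO` header declarations and reports any record whose value for a Number=1 key contains a comma. The function names `parse_info_header` and `check_info_counts` are hypothetical helpers written for this illustration and are not part of gemini; the sketch assumes a plain tab-delimited VCF with INFO in the eighth column.

```python
import re


def parse_info_header(line):
    """Return (id, number, type) from a '##INFO=<...>' header line, or None."""
    m = re.match(r'##INFO=<ID=([^,]+),Number=([^,]+),Type=([^,>]+)', line)
    return m.groups() if m else None


def check_info_counts(lines):
    """Yield (chrom, pos, key, value) where a Number=1 INFO key holds several values."""
    single = set()  # INFO keys declared with Number=1 in the header
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("##INFO="):
            parsed = parse_info_header(line)
            if parsed and parsed[1] == "1":
                single.add(parsed[0])
            continue
        if line.startswith("#") or not line:
            continue
        fields = line.split("\t")
        chrom, pos, info = fields[0], fields[1], fields[7]
        for entry in info.split(";"):
            key, sep, value = entry.partition("=")
            # Flag entries (no '=' sep) are skipped; a comma means more than one value.
            if sep and key in single and "," in value:
                yield chrom, pos, key, value
```

To run it on a gzipped file, something like `check_info_counts(gzip.open(path, "rt"))` should work. Parsing the INFO column per key is slightly stricter than a plain `grep`, since it only looks at the key of interest.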

davetang commented 8 years ago

No problems. But what I was initially reporting is that 4.5487,4.5487 doesn't exist in the file.

zcat UK10K_COHORT.20140722.sites.vt.vep.vcf.gz | grep "4.5487,4.5487"
# returns nothing

In my first example, 0.021,0.058 and -0.0946,-1.352 aren't found in the VCF file either. I also checked all the fields in that VCF file and they all have single values.

brentp commented 8 years ago

understood. I'll have a look this week. I tagged this for 0.18.3 so we'll have a fix soon.

brentp commented 8 years ago

for my own record, I'm testing on:

wget -O - "$url" \
    | zcat - | head -100000 \
    | vt decompose -s - \
    | vt normalize -r /data/human/hs37d5.fa - \
    | grep -v "ID=CSQ" | perl -pe 's/CSQ[^;]+//' \
    | bgzip -c > uk100k.vcf.gz

Then annotate with VEP, and then:

tabix uk100k.vep.vcf.gz
gemini load --cores 4 -v uk100k.vep.vcf.gz -t VEP --no-genotypes uk100k.db

annotate:

gemini annotate -f uk100k.vep.vcf.gz -o list -e VQSLOD -t float uk100k.db

brentp commented 8 years ago

There are some interesting things in that VCF. For example:

1   2942599 .   A   ATGG

occurs twice but with different values in the info field.
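
A quick way to look for such repeated records is to group INFO values by (chrom, pos, ref, alt). The Python sketch below does that for one INFO key; `info_values_by_site` and `duplicated_sites` are hypothetical helper names, not gemini functions. It also joins the values per site with commas, which illustrates one possible way a string like `0.021,0.058` could appear in an error message without existing on any single line of the VCF. That link to the error is an assumption and is not confirmed in this thread.

```python
from collections import defaultdict


def info_values_by_site(lines, key):
    """Map (chrom, pos, ref, alt) to the list of values seen for one INFO key."""
    sites = defaultdict(list)
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        chrom, pos, ref, alt, info = (fields[0], fields[1], fields[3],
                                      fields[4], fields[7])
        for entry in info.split(";"):
            k, sep, v = entry.partition("=")
            if sep and k == key:
                sites[(chrom, pos, ref, alt)].append(v)
    return sites


def duplicated_sites(lines, key):
    """Return sites that occur more than once, with their values comma-joined."""
    return {
        site: ",".join(values)
        for site, values in info_values_by_site(lines, key).items()
        if len(values) > 1
    }
```

For example, two records for the same site carrying `AF=0.021` and `AF=0.058` would come back as `"0.021,0.058"`, while grepping the file for that exact string would return nothing because each value sits on its own line.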

brentp commented 8 years ago

@davetang I just pushed a fix for this that will be out in the next release. You can get it meanwhile with:

gemini_pip install git+https://github.com/arq5x/gemini.git