arq5x / gemini

a lightweight db framework for exploring genetic variation.
http://gemini.readthedocs.org
MIT License
318 stars 120 forks source link

'=' sign in VEP CSQ field results in error #886

Open oleraj opened 6 years ago

oleraj commented 6 years ago

Hi,

I have recently started adding more annotations from VEP to my VCF and noticed that when I add MutPred2 (dbNSFP 2.9.3) that I get an error in Gemini, likely due to the = sign present in the annotation sometimes. Is it possible to skip this assertion?

Here's an example VCF record that reproduces the error message:

#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  CCGO_800390 CCGO_801065
1   1849995 .   T   C   313.26  PASS    AC=2;AF=0.5;AN=4;BaseQRankSum=-1.666;ClippingRankSum=-1.179;DP=26;ExcessHet=3.0295;FS=0;InbreedingCoeff=-0.0774;MQ=17.32;MQRankSum=1.67;QD=12.05;ReadPosRankSum=-0.762;SOR=0.573;VQSLOD=0.424;culprit=MQ;CSQ=start_lost|Atg/Gtg|ENST00000378602.3:c.1A>G|ENSP00000367865.3:p.Met1?|M/V|ENSG00000178821|TMEM52|ENST00000378602|1/2|benign(0)|deleterious_low_confidence(0)|1/194|protein_coding|||-0.9926|T|0.0466|T|-0.001108|2.575|8|||||0.999649&0.999997|D&N|||||1.54|T||0.014|0.29|N|||0.089|0.28835|0.1911398|0.1622601|0.996|Q8NDY8-2|M1V|Loss of disorder (P = 0.1016)& Loss of helix (P = 0.1299)& Loss of loop (P = 0.2897)& Loss of solvent accessibility (P = 0.3103)& Gain of sheet (P = 0.4423)||| GT:AD:DP:GQ:PL  0/1:5,12:17:99:332,0,147    0/1:7,2:9:39:39,0,245

Gemini command (using v0.20.0) and error message:

gemini load -v test_bad.vcf -t VEP  test_bad.db
CADD scores are being loaded (to skip use:--skip-cadd).
GERP per bp is being loaded (to skip use:--skip-gerp-bp).
[W::vcf_parse] contig '1' is not defined in the header. (Quick workaround: index the file with tabix.)
Traceback (most recent call last):
  File "/sysapps/cluster/software/Anaconda/2.3.0Linux-x86_64/envs/geminienv2/bin/gemini", line 7, in <module>
    gemini_main.main()
  File "/sysapps/cluster/software/Anaconda/2.3.0Linux-x86_64/envs/geminienv2/lib/python2.7/site-packages/gemini/gemini_main.py", line 1244, in main
    args.func(parser, args)
  File "/sysapps/cluster/software/Anaconda/2.3.0Linux-x86_64/envs/geminienv2/lib/python2.7/site-packages/gemini/gemini_main.py", line 204, in load_fn
    gemini_load.load(parser, args)
  File "/sysapps/cluster/software/Anaconda/2.3.0Linux-x86_64/envs/geminienv2/lib/python2.7/site-packages/gemini/gemini_load.py", line 51, in load
    l = load_singlecore(args)
  File "/sysapps/cluster/software/Anaconda/2.3.0Linux-x86_64/envs/geminienv2/lib/python2.7/site-packages/gemini/gemini_load.py", line 86, in load_singlecore
    l.populate_from_vcf()
  File "/sysapps/cluster/software/Anaconda/2.3.0Linux-x86_64/envs/geminienv2/lib/python2.7/site-packages/gemini/gemini_load_chunk.py", line 219, in populate_from_vcf
    (variant, variant_impacts) = self._prepare_variation(var, anno_keys)
  File "/sysapps/cluster/software/Anaconda/2.3.0Linux-x86_64/envs/geminienv2/lib/python2.7/site-packages/gemini/gemini_load_chunk.py", line 488, in _prepare_variation
    impacts += [geneimpacts.VEP(e, anno_keys["CSQ"]) for e in var.INFO["CSQ"].split(",")]
  File "/sysapps/cluster/software/Anaconda/2.3.0Linux-x86_64/envs/geminienv2/lib/python2.7/site-packages/geneimpacts/effect.py", line 516, in __init__
    assert not "=" in effect_string
AssertionError

I see the problem in general with having an = sign in the value of an INFO column (since each item is a key=value pair). But INFO column items are separated by ; and we know the list of items in the CSQ field are separated by , so it might be fine to omit this assertion in my opinion.

Thanks!

Andrew

brentp commented 5 years ago

does bcftools accept that CSQ field without error or warning?

oleraj commented 5 years ago

I believe so. Our cluster is down at the moment; I can double-check later this week.

oleraj commented 5 years ago

Yes, I ran bcftools view with the VCF and it prints out the VCF with no errors or warnings.