arq5x / gemini

a lightweight db framework for exploring genetic variation.
http://gemini.readthedocs.org
MIT License

Genotype filter using wildcard with gt_alt_freqs > 0.3 #925

Open OskarSchnappauf opened 5 years ago

OskarSchnappauf commented 5 years ago

Dear gemini team,

I use gemini very frequently, and it is an awesome tool for variant prioritization within large databases. However, one thing I have not yet figured out is how to use gt_alt_freqs in combination with a wildcard. For instance, I want all variants with impact severity MED or HIGH and with at least two affected individuals in our database:

gemini query --header -q "SELECT gene, chrom, start, end FROM variants where impact_severity != 'LOW'" --gt-filter "(gt_types).(Phenotype==2).(==HET).(count >1) and (gt_types).(Phenotype==1).(==HOM_REF).(all)" gemini.db

However, some of the identified variants have very low gt_alt_freqs values. How can I include a gt_alt_freqs threshold for the identified variants? I tried (gt_alt_freqs).(*).(>=0.3).(any), but it did not work.

Thank you very much for your help. Oskar

OskarSchnappauf commented 5 years ago

Anyone?

arq5x commented 5 years ago

When you say it did not work, do you mean you know for certain there are such variants and none were returned?

OskarSchnappauf commented 5 years ago

Hi Aaron, thank you so much for your reply. I don't know about the variants, because the query does not even run; I get an error message.

Here is what I did and what the error message was. I queried the database with this command:

gemini query --header -q "SELECT gene, chrom, start, end FROM variants where impact_severity != 'LOW'" --gt-filter "(gt_types).(Phenotype==2).(==HET).(count >1) and (gt_types).(Phenotype==1).(==HOM_REF).(all) and (gt_alt_freqs).(*).(>=0.3).(any)" gemini.db

And the error message was:

Traceback (most recent call last):
  File "/usr/local/apps/gemini/0.20.1/bin/gemini", line 7, in <module>
    gemini_main.main()
  File "/usr/local/Anaconda/envs_app/gemini/0.20.1/lib/python2.7/site-packages/gemini/gemini_main.py", line 1248, in main
    args.func(parser, args)
  File "/usr/local/Anaconda/envs_app/gemini/0.20.1/lib/python2.7/site-packages/gemini/gemini_main.py", line 439, in query_fn
    gemini_query.query(parser, args)
  File "/usr/local/Anaconda/envs_app/gemini/0.20.1/lib/python2.7/site-packages/gemini/gemini_query.py", line 169, in query
    run_query(args)
  File "/usr/local/Anaconda/envs_app/gemini/0.20.1/lib/python2.7/site-packages/gemini/gemini_query.py", line 135, in run_query
    gene_needed, args.show_families, subjects=subjects)
  File "/usr/local/Anaconda/envs_app/gemini/0.20.1/lib/python2.7/site-packages/gemini/GeminiQuery.py", line 622, in run
    self.gt_filter = self._correct_genotype_filter()
  File "/usr/local/Anaconda/envs_app/gemini/0.20.1/lib/python2.7/site-packages/gemini/GeminiQuery.py", line 1047, in _correct_genotype_filter
    raise ValueError("Wildcard filter should consist of 4 elements. Exiting.")
ValueError: Wildcard filter should consist of 4 elements. Exiting.

I think it is related to the "." in (>=0.3), since it complains about the number of elements. Any suggestions? Thank you so much, Oskar
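The hypothesis about the decimal point can be checked in plain Python, without gemini installed. This is only a sketch: it assumes the parser validates a wildcard filter by counting "." characters, which the traceback itself does not show, and `passes_dot_check` is a hypothetical helper name, not part of gemini.

```python
def passes_dot_check(token):
    """Hypothetical stand-in for a check that expects exactly three '.' separators."""
    return token.count('.') == 3

# A wildcard filter without a decimal number: three '.' separators.
integer_rule = "(gt_types).(Phenotype==2).(==HET).(count >1)"
# The failing filter: the decimal point in 0.3 adds a fourth '.'.
decimal_rule = "(gt_alt_freqs).(*).(>=0.3).(any)"

print(integer_rule.count('.'), passes_dot_check(integer_rule))  # 3 True
print(decimal_rule.count('.'), passes_dot_check(decimal_rule))  # 4 False

# Splitting on '.' also cuts the rule "(>=0.3)" in two, giving five pieces.
print(decimal_rule.split('.'))
```

If the assumption holds, any wildcard rule containing a decimal threshold would be rejected before the query runs, regardless of what is in the database.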

udp3f commented 5 years ago

I encountered the same issue: it raises "ValueError: Wildcard filter should consist of 4 elements". #868 also describes the same problem. Uma

timothee-revil commented 4 years ago

This can be fixed by changing the file ....../python2.7/site-packages/gemini/GeminiQuery.py

Line 1043:
    if token.count('.') != 3 or \
becomes
    if token.count(').(') != 3 or \

Line 1048:
    (column, wildcard, wildcard_rule, wildcard_op) = token.split('.')
becomes
    (column, wildcard, wildcard_rule, wildcard_op) = token.split(').(')

I have no idea whether this breaks other functionality, so make a backup of the original file first.
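To see the effect of the changed delimiter in isolation, here is a minimal standalone sketch. It is not gemini's code: `split_wildcard_filter` is a hypothetical name, and the stripping of leftover parentheses is an assumption about what a caller needs after the split.

```python
def split_wildcard_filter(token):
    """Split "(col).(wildcard).(rule).(op)" on the ').(' delimiter.

    Hypothetical helper for illustration; not part of gemini.
    """
    if token.count(').(') != 3:
        raise ValueError("Wildcard filter should consist of 4 elements. Exiting.")
    parts = token.split(').(')
    # Splitting on ').(' leaves '(' on the first piece and ')' on the last,
    # so strip surrounding parentheses and whitespace from every piece.
    return tuple(p.strip().strip('()').strip() for p in parts)


print(split_wildcard_filter("(gt_alt_freqs).(*).(>=0.3).(any)"))
print(split_wildcard_filter("(gt_types).(Phenotype==2).(==HET).(count >1)"))
```

Because the delimiter is now the three-character sequence ").(", a decimal point inside a rule such as >=0.3 no longer counts as a separator. Whether the rest of GeminiQuery.py handles the resulting pieces correctly is not verified here.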