arq5x / gemini

a lightweight db framework for exploring genetic variation.
http://gemini.readthedocs.org
MIT License
317 stars 119 forks

gemini comp_hets memory issue #921

Open 8nb24 opened 5 years ago

8nb24 commented 5 years ago

Output of gemini --version: gemini 0.20.1

...

When running gemini comp_hets on a large database (~1800 individuals WGS) I get the following error:

Traceback (most recent call last):
  File "/usr/local/apps/gemini/0.20.1/bin/gemini", line 7, in <module>
    gemini_main.main()
  File "/usr/local/Anaconda/envs_app/gemini/0.20.1/lib/python2.7/site-packages/gemini/gemini_main.py", line 1248, in main
    args.func(parser, args)
  File "/usr/local/Anaconda/envs_app/gemini/0.20.1/lib/python2.7/site-packages/gemini/gemini_main.py", line 710, in comp_hets_fn
    CompoundHet(args).run()
  File "/usr/local/Anaconda/envs_app/gemini/0.20.1/lib/python2.7/site-packages/gemini/gim.py", line 307, in run
    for i, s in enumerate(self.report_candidates()):
  File "/usr/local/Anaconda/envs_app/gemini/0.20.1/lib/python2.7/site-packages/gemini/gim.py", line 213, in report_candidates
    for gene, li in self.candidates():
  File "/usr/local/Anaconda/envs_app/gemini/0.20.1/lib/python2.7/site-packages/gemini/gim.py", line 459, in candidates
    for grp, li in self.gen_candidates('gene'):
  File "/usr/local/Anaconda/envs_app/gemini/0.20.1/lib/python2.7/site-packages/gemini/gim.py", line 115, in gen_candidates
    self.gq.run(q, needs_genotypes=True)
  File "/usr/local/Anaconda/envs_app/gemini/0.20.1/lib/python2.7/site-packages/gemini/GeminiQuery.py", line 653, in run
    self.result_proxy = res = iter(self._apply_query())
  File "/usr/local/Anaconda/envs_app/gemini/0.20.1/lib/python2.7/site-packages/gemini/GeminiQuery.py", line 907, in _apply_query
    res = self._execute_query()
  File "/usr/local/Anaconda/envs_app/gemini/0.20.1/lib/python2.7/site-packages/gemini/GeminiQuery.py", line 879, in _execute_query
    res = self.conn.execute(sql.text(self.query))
  File "/usr/local/Anaconda/envs_app/gemini/0.20.1/lib/python2.7/site-packages/sqlalchemy/orm/session.py", line 1176, in execute
    bind, close_with_result=True).execute(clause, params or {})
  File "/usr/local/Anaconda/envs_app/gemini/0.20.1/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 948, in execute
    return meth(self, multiparams, params)
  File "/usr/local/Anaconda/envs_app/gemini/0.20.1/lib/python2.7/site-packages/sqlalchemy/sql/elements.py", line 269, in _execute_on_connection
    return connection._execute_clauseelement(self, multiparams, params)
  File "/usr/local/Anaconda/envs_app/gemini/0.20.1/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 1060, in _execute_clauseelement
    compiled_sql, distilled_params
  File "/usr/local/Anaconda/envs_app/gemini/0.20.1/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 1200, in _execute_context
    context)
  File "/usr/local/Anaconda/envs_app/gemini/0.20.1/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 1416, in _handle_dbapi_exception
    util.reraise(*exc_info)
  File "/usr/local/Anaconda/envs_app/gemini/0.20.1/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 1193, in _execute_context
    context)
  File "/usr/local/Anaconda/envs_app/gemini/0.20.1/lib/python2.7/site-packages/sqlalchemy/engine/default.py", line 507, in do_execute
    cursor.execute(statement, parameters)
MemoryError

I attempted to run this command on a large-memory node allocated specifically to this task, but it failed with the same error. I am wondering whether there is an alternative way to store the database that would alleviate this issue, or how you would otherwise advise.
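(For background, not from this thread: the traceback dies inside `cursor.execute`, and a common way a query over a large SQLite database exhausts memory is by materializing the whole result set with `fetchall()` instead of streaming rows off the cursor. A minimal sketch of the difference, using a throwaway table with hypothetical names:)

```python
import sqlite3

# Build a small throwaway database; in practice this would be the
# (much larger) gemini variants database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE variants (gene TEXT, pos INTEGER)")
conn.executemany("INSERT INTO variants VALUES (?, ?)",
                 [("BRCA1", i) for i in range(1000)])

# fetchall() holds every row in memory at once:
rows = conn.execute("SELECT gene, pos FROM variants").fetchall()

# Iterating the cursor streams rows one at a time instead,
# keeping memory use flat regardless of result-set size:
count = 0
for gene, pos in conn.execute("SELECT gene, pos FROM variants"):
    count += 1  # process each row, then let it be garbage-collected
```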

brentp commented 5 years ago

hmm. I'll have a look and see if I can reduce the memory use a bit or see why this might be happening. Even with 1800 samples, it shouldn't use much memory.

8nb24 commented 5 years ago

Thanks for looking. I got the same error with and without a bcolz index built. I looked at the node statistics, and memory use never exceeded 3.5G.
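(For readers following along: the bcolz index mentioned above is built with gemini's `bcolz_index` subcommand, which creates on-disk indexes for the genotype columns; the database name below is a placeholder.)

```shell
# Build bcolz genotype indexes for an existing gemini database
# ("my.db" is a placeholder for the actual database file).
gemini bcolz_index my.db
```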

brentp commented 5 years ago

Could you add --filter " gene != '' " to your comp_hets call? Or, if you already have a --filter, add AND gene != ''? Let me know if that reduces the memory use.
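(The suggestion above, written out as a full invocation; the database name and the extra columns are hypothetical, only the `--filter " gene != '' "` clause is from the thread.)

```shell
# Exclude variants with an empty gene annotation, which the maintainer
# suspects may be inflating the candidate set held in memory.
gemini comp_hets \
    --columns "chrom, start, end, gene" \
    --filter " gene != '' " \
    my.db
```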