JayYip / m3tl

BERT for Multitask Learning
https://jayyip.github.io/m3tl/
Apache License 2.0

Out-of-memory issue #40

Closed: autapomorphy closed this issue 4 years ago

autapomorphy commented 4 years ago

I tried to run the notebook Run Pre-defined problems.ipynb after calling:

train_bert_multitask(problem='weibo_ner&weibo_cws', num_gpus=1, num_epochs=3)

I got the error message:

Traceback (most recent call last):
  File "/cluster/kappa/90-days-archive///g_transformer/git/bert-multitask-learning/bert_multitask_learning/params.py", line 206, in assign_problem
    self.get_data_info(self.problem_list, self.ckpt_dir)
  File "/cluster/kappa/90-days-archive///g_transformer/git/bert-multitask-learning/bert_multitask_learning/params.py", line 270, in get_data_info
    list(self.read_data_fn[problem](self, ...))
  File "/cluster/kappa/90-days-archive///g_transformer/git/bert-multitask-learning/bert_multitask_learning/create_generators.py", line 300, in create_single_problem_generator
    example_list=example) for example in example_list
  File "/cluster/tufts//lib/anaconda3/envs/1001-nlp/lib/python3.7/site-packages/joblib/parallel.py", line 1017, in __call__
    self.retrieve()
  File "/cluster/tufts//lib/anaconda3/envs/1001-nlp/lib/python3.7/site-packages/joblib/parallel.py", line 909, in retrieve
    self._output.extend(job.get(timeout=self.timeout))
  File "/cluster/tufts//lib/anaconda3/envs/1001-nlp/lib/python3.7/site-packages/joblib/_parallel_backends.py", line 562, in wrap_future_result
    return future.result(timeout=timeout)
  File "/cluster/tufts//lib/anaconda3/envs/1001-nlp/lib/python3.7/concurrent/futures/_base.py", line 435, in result
    return self.__get_result()
  File "/cluster/tufts/**/lib/anaconda3/envs/1001-nlp/lib/python3.7/concurrent/futures/_base.py", line 384, in __get_result
    raise self._exception
joblib.externals.loky.process_executor.TerminatedWorkerError: A worker process managed by the executor was unexpectedly terminated. This could be caused by a segmentation fault while calling the function or by an excessive memory usage causing the Operating System to kill the worker. The exit codes of the workers are {SIGKILL(-9)}

How much RAM do I need?

JayYip commented 4 years ago

You can turn off multiprocessing in the config to save some memory.
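For reference, a minimal sketch of what that might look like. The `DynamicBatchSizeParams` class and the `multiprocess` attribute are assumptions here; the exact names may differ depending on which version of bert-multitask-learning you have installed, so check the params object in your install.

```python
# Sketch only: parameter class and attribute names are assumptions,
# not confirmed API; verify against your installed version.
from bert_multitask_learning import train_bert_multitask, DynamicBatchSizeParams

params = DynamicBatchSizeParams()
# Preprocess data in a single process instead of spawning joblib
# workers, each of which holds its own copy of the examples in RAM.
params.multiprocess = False

train_bert_multitask(
    problem='weibo_ner&weibo_cws',
    num_gpus=1,
    num_epochs=3,
    params=params)
```

With multiprocessing off, preprocessing is slower but peak memory usage should drop, since the SIGKILL(-9) in the traceback indicates the OS killed a joblib worker for exceeding available memory.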
