IndicoDataSolutions / finetune

Scikit-learn style model finetuning for NLP
https://finetune.indico.io
Mozilla Public License 2.0

ResourceExhaustedError #388

Closed · epetros closed this 5 years ago

epetros commented 5 years ago

Greetings, I am unfortunately getting the error below when running:

```python
model.finetune(corpus_text, batch_size=1)
```

```
INFO:finetune: Visible GPUs: {0: Tesla K80}
Epoch 3/3:   0%|          | 23/35897 [00:00<10:13, 58.46it/s]
```


```
ResourceExhaustedError                    Traceback (most recent call last)
/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py in _do_call(self, fn, *args)
   1355     try:
-> 1356       return fn(*args)
   1357     except errors.OpError as e:

... 21 frames ...

ResourceExhaustedError: 2 root error(s) found.
  (0) Resource exhausted: OOM when allocating tensor with shape[1,512,5120] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
	 [[{{node OptimizeLoss/gradients/model/featurizer/h6/mlp/Pow_grad/Pow}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

	 [[OptimizeLoss/global_norm_2/global_norm/_7755]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

  (1) Resource exhausted: OOM when allocating tensor with shape[1,512,5120] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
	 [[{{node OptimizeLoss/gradients/model/featurizer/h6/mlp/Pow_grad/Pow}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

0 successful operations. 0 derived errors ignored.

During handling of the above exception, another exception occurred:

ResourceExhaustedError                    Traceback (most recent call last)
/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py in _do_call(self, fn, *args)
   1368         pass
   1369       message = error_interpolation.interpolate(message, self._graph)
-> 1370       raise type(e)(node_def, op, message)
   1371
   1372   def _extend_graph(self):

ResourceExhaustedError: 2 root error(s) found.
  (0) Resource exhausted: OOM when allocating tensor with shape[1,512,5120] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
	 [[node OptimizeLoss/gradients/model/featurizer/h6/mlp/Pow_grad/Pow (defined at /usr/local/lib/python3.6/dist-packages/finetune/model.py:281) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

	 [[OptimizeLoss/global_norm_2/global_norm/_7755]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

  (1) Resource exhausted: OOM when allocating tensor with shape[1,512,5120] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
	 [[node OptimizeLoss/gradients/model/featurizer/h6/mlp/Pow_grad/Pow (defined at /usr/local/lib/python3.6/dist-packages/finetune/model.py:281) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

0 successful operations. 0 derived errors ignored.

Errors may have originated from an input operation.
Input Source operations connected to node OptimizeLoss/gradients/model/featurizer/h6/mlp/Pow_grad/Pow:
 model/featurizer/h6/mlp/c_fc/Reshape_2 (defined at /usr/local/lib/python3.6/dist-packages/finetune/base_models/gpt2/featurizer.py:41)

Input Source operations connected to node OptimizeLoss/gradients/model/featurizer/h6/mlp/Pow_grad/Pow:
 model/featurizer/h6/mlp/c_fc/Reshape_2 (defined at /usr/local/lib/python3.6/dist-packages/finetune/base_models/gpt2/featurizer.py:41)

Original stack trace for 'OptimizeLoss/gradients/model/featurizer/h6/mlp/Pow_grad/Pow':
  File "/usr/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.6/dist-packages/ipykernel_launcher.py", line 16, in <module>
    app.launch_new_instance()
  File "/usr/local/lib/python3.6/dist-packages/traitlets/config/application.py", line 658, in launch_instance
    app.start()
  File "/usr/local/lib/python3.6/dist-packages/ipykernel/kernelapp.py", line 477, in start
    ioloop.IOLoop.instance().start()
  File "/usr/local/lib/python3.6/dist-packages/tornado/ioloop.py", line 888, in start
    handler_func(fd_obj, events)
  File "/usr/local/lib/python3.6/dist-packages/tornado/stack_context.py", line 277, in null_wrapper
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/zmq/eventloop/zmqstream.py", line 450, in _handle_events
    self._handle_recv()
  File "/usr/local/lib/python3.6/dist-packages/zmq/eventloop/zmqstream.py", line 480, in _handle_recv
    self._run_callback(callback, msg)
  File "/usr/local/lib/python3.6/dist-packages/zmq/eventloop/zmqstream.py", line 432, in _run_callback
    callback(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tornado/stack_context.py", line 277, in null_wrapper
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/ipykernel/kernelbase.py", line 283, in dispatcher
    return self.dispatch_shell(stream, msg)
  File "/usr/local/lib/python3.6/dist-packages/ipykernel/kernelbase.py", line 235, in dispatch_shell
    handler(stream, idents, msg)
  File "/usr/local/lib/python3.6/dist-packages/ipykernel/kernelbase.py", line 399, in execute_request
    user_expressions, allow_stdin)
  File "/usr/local/lib/python3.6/dist-packages/ipykernel/ipkernel.py", line 196, in do_execute
    res = shell.run_cell(code, store_history=store_history, silent=silent)
  File "/usr/local/lib/python3.6/dist-packages/ipykernel/zmqshell.py", line 533, in run_cell
    return super(ZMQInteractiveShell, self).run_cell(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/IPython/core/interactiveshell.py", line 2718, in run_cell
    interactivity=interactivity, compiler=compiler, result=result)
  File "/usr/local/lib/python3.6/dist-packages/IPython/core/interactiveshell.py", line 2828, in run_ast_nodes
    if self.run_code(code, result):
  File "/usr/local/lib/python3.6/dist-packages/IPython/core/interactiveshell.py", line 2882, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input>", line 1, in <module>
    model.finetune(corpus_text, batch_size=1)
  File "/usr/local/lib/python3.6/dist-packages/finetune/target_models/language_model.py", line 34, in finetune
    return super().finetune(X, Y=None, batch_size=batch_size)
  File "/usr/local/lib/python3.6/dist-packages/finetune/target_models/classifier.py", line 141, in finetune
    return super().finetune(X, Y=Y, batch_size=batch_size)
  File "/usr/local/lib/python3.6/dist-packages/finetune/base.py", line 283, in finetune
    estimator.train(train_input_fn, hooks=train_hooks, steps=num_steps)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 367, in train
    loss = self._train_model(input_fn, hooks, saving_listeners)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1156, in _train_model
    return self._train_model_distributed(input_fn, hooks, saving_listeners)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1219, in _train_model_distributed
    self._config._train_distribute, input_fn, hooks, saving_listeners)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1299, in _actual_train_model_distributed
    self.config))
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/distribute/distribute_lib.py", line 1555, in call_for_each_replica
    return self._call_for_each_replica(fn, args, kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/distribute/one_device_strategy.py", line 163, in _call_for_each_replica
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1146, in _call_model_fn
    model_fn_results = self._model_fn(features=features, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/finetune/model.py", line 281, in _model_fn
    variables=params.trained_variables,
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/contrib/layers/python/layers/optimizers.py", line 239, in optimize_loss
    colocate_gradients_with_ops=colocate_gradients_with_ops)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/optimizer.py", line 512, in compute_gradients
    colocate_gradients_with_ops=colocate_gradients_with_ops)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/gradients_impl.py", line 158, in gradients
    unconnected_gradients)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/gradients_util.py", line 731, in _GradientsHelper
    lambda: grad_fn(op, *out_grads))
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/gradients_util.py", line 403, in _MaybeCompile
    return grad_fn()  # Exit early
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/gradients_util.py", line 731, in <lambda>
    lambda: grad_fn(op, *out_grads))
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/math_grad.py", line 1197, in _PowGrad
    math_ops.reduce_sum(grad * y * math_ops.pow(x, y - 1), rx), sx)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/util/dispatch.py", line 180, in wrapper
    return target(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/math_ops.py", line 450, in pow
    return gen_math_ops._pow(x, y, name=name)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/gen_math_ops.py", line 6972, in _pow
    "Pow", x=x, y=y, name=name)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/op_def_library.py", line 788, in _apply_op_helper
    op_def=op_def)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/util/deprecation.py", line 507, in new_func
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/ops.py", line 3616, in create_op
    op_def=op_def)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/ops.py", line 2005, in __init__
    self._traceback = tf_stack.extract_stack()

...which was originally created as op 'model/featurizer/h6/mlp/Pow', defined at:
  File "/usr/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
[elided 28 identical lines from previous traceback]
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1146, in _call_model_fn
    model_fn_results = self._model_fn(features=features, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/finetune/model.py", line 128, in _model_fn
    explain=build_explain,
  File "/usr/local/lib/python3.6/dist-packages/finetune/base_models/__init__.py", line 28, in get_featurizer
    return cls.featurizer(X, encoder, config, train=train, reuse=reuse, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/finetune/base_models/gpt2/featurizer.py", line 206, in gpt2_featurizer
    h = block_fn(h)
  File "/usr/local/lib/python3.6/dist-packages/finetune/base_models/gpt2/featurizer.py", line 121, in block
    m = mlp(norm(x, "ln_2"), "mlp", nx * 4, hparams=hparams, train=train)
  File "/usr/local/lib/python3.6/dist-packages/finetune/base_models/gpt2/featurizer.py", line 108, in mlp
    h = gelu(conv1d(x, "c_fc", n_state))
  File "/usr/local/lib/python3.6/dist-packages/finetune/nn/activations.py", line 10, in gelu
    return 0.5 * x * (1 + tf.tanh(math.sqrt(2 / math.pi) * (x + 0.044715 * tf.pow(x, 3))))
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/util/dispatch.py", line 180, in wrapper
    return target(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/math_ops.py", line 450, in pow
    return gen_math_ops._pow(x, y, name=name)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/gen_math_ops.py", line 6972, in _pow
    "Pow", x=x, y=y, name=name)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/op_def_library.py", line 788, in _apply_op_helper
    op_def=op_def)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/ops.py", line 3616, in create_op
    op_def=op_def)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/ops.py", line 2005, in __init__
    self._traceback = tf_stack.extract_stack()
```

madisonmay commented 5 years ago

Hi @epetros,

Are you running GPT Large by chance? Currently that model is only trainable on quite large GPUs (I think the minimum memory requirement is 12GB or greater, even with low_memory_mode enabled).
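For reference, the OOM tensor shape [1, 512, 5120] is consistent with GPT-2 Large: 512 is finetune's default max_length, and 5120 is the 4 × 1280 MLP expansion of its 1280-dimensional hidden state. Enabling low_memory_mode is just a config flag at construction time; here is a rough sketch (the GPT2Large import name is an assumption, so check finetune.base_models for the exact identifier in your version):

```python
from finetune import LanguageModel
from finetune.base_models import GPT2Large  # assumed name; may differ by version

# low_memory_mode trades extra compute for a smaller activation footprint
model = LanguageModel(
    base_model=GPT2Large,
    low_memory_mode=True,
    batch_size=1,
)
model.finetune(corpus_text)
```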

I would recommend briefly reading through our Resource Management guide to see if any of those tips / tricks might help.

epetros commented 5 years ago

Yes, I was trying to fine-tune GPT2 Large on a 12GB Tesla K80. It seems a more powerful GPU is indeed needed.

madisonmay commented 5 years ago

I think you may be able to turn down your max_length to something a bit smaller and be OK? But then of course you're also losing a lot of the benefit of running a giant model like GPT2 Large in the first place.
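Concretely, activation memory scales with sequence length (and the attention maps grow quadratically with it), so something along these lines might squeeze under 12GB. A sketch only, with the same assumed import as above:

```python
from finetune import LanguageModel
from finetune.base_models import GPT2Large  # assumed name, as above

model = LanguageModel(
    base_model=GPT2Large,
    max_length=128,       # down from the 512 default; shrinks the [1, 512, 5120] activations 4x
    batch_size=1,
    low_memory_mode=True,
)
model.finetune(corpus_text)
```

The trade-off is that inputs longer than max_length tokens get cut down to that window, so long-range context is lost.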

madisonmay commented 5 years ago

Going to close for now, but feel free to reopen if you run into further issues! Happy hacking.