epetros closed this issue 5 years ago
Hi @epetros,
Are you running GPT2 Large by chance? Currently that model is only trainable on quite large GPUs (I think the minimum memory requirement is 12GB or more, even with low_memory_mode enabled).
I would recommend skimming our Resource Management guide to see if any of the tips and tricks there help.
Yes, I was trying to fine-tune GPT2 Large on a 12GB Tesla K80. It seems a more powerful GPU is indeed needed.
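For a sense of scale, here is a rough back-of-the-envelope check on the allocation that fails in the traceback from this issue. It assumes `type float` means 4-byte float32 and reads the shape [1, 512, 5120] as [batch_size, max_length, 4 * hidden width]; that reading is an inference from the `batch_size=1` call and the `mlp` frame in the trace, not something confirmed in this thread. `tensor_mib` is a helper name made up for this sketch.

```python
from math import prod

BYTES_PER_FLOAT32 = 4  # assumption: "type float" in the OOM message is float32


def tensor_mib(shape, bytes_per_element=BYTES_PER_FLOAT32):
    """Return the size in MiB of a dense tensor with the given shape."""
    return prod(shape) * bytes_per_element / 2**20


# Shape copied from the OOM message.
failed = tensor_mib((1, 512, 5120))
# Same tensor if the sequence dimension were halved (assumed to track max_length).
halved = tensor_mib((1, 256, 5120))

print(f"{failed:.1f} MiB at length 512, {halved:.1f} MiB at length 256")
# prints: 10.0 MiB at length 512, 5.0 MiB at length 256
```

A single 10 MiB allocation failing on a 12GB card suggests the memory was already almost entirely taken by weights, optimizer state and earlier activations, so shrinking this one tensor is unlikely to be enough on its own.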
I think you may be able to turn your max_length down to something a bit smaller and be OK? But then, of course, you also lose a lot of the benefit of running a giant model like GPT2 Large in the first place.
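One way to try the max_length suggestion without guessing a value up front is to halve it until training fits. The sketch below is only an illustration: `finetune_with_backoff`, `train_fn` and `oom_errors` are hypothetical names, not part of finetune, and `fake_train` stands in for a real training call so the example runs without a GPU.

```python
def finetune_with_backoff(train_fn, max_length=512, min_length=64,
                          oom_errors=(MemoryError,)):
    """Call train_fn(length), halving length after each out-of-memory error.

    Returns (length_that_worked, train_fn_result).
    """
    length = max_length
    while length >= min_length:
        try:
            return length, train_fn(length)
        except oom_errors:
            length //= 2  # retry with a shorter sequence length
    raise RuntimeError(f"no max_length >= {min_length} fit in memory")


def fake_train(length):
    """Stand-in for real training: pretends anything above 128 tokens OOMs."""
    if length > 128:
        raise MemoryError("simulated OOM")
    return "trained"


used_length, result = finetune_with_backoff(fake_train)
print(used_length, result)  # prints: 128 trained
```

With TensorFlow, `oom_errors` would presumably be `(tf.errors.ResourceExhaustedError,)`, and `train_fn` would need to build a fresh model with the given max_length on each attempt. Whether GPU memory is fully released between attempts in the same process is not something this sketch guarantees.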
Going to close for now, but feel free to reopen if you run into further issues! Happy hacking.
Greetings, I am unfortunately getting this error:
model.finetune(corpus_text, batch_size=1)
INFO:finetune: Visible GPUs: {0: Tesla K80}
Epoch 3/3: 0%| | 23/35897 [00:00<10:13, 58.46it/s]
ResourceExhaustedError Traceback (most recent call last)
/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py in _do_call(self, fn, *args)
1355 try:
-> 1356 return fn(*args)
1357 except errors.OpError as e:
21 frames
ResourceExhaustedError: 2 root error(s) found.
(0) Resource exhausted: OOM when allocating tensor with shape[1,512,5120] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[{{node OptimizeLoss/gradients/model/featurizer/h6/mlp/Pow_grad/Pow}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
(1) Resource exhausted: OOM when allocating tensor with shape[1,512,5120] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[{{node OptimizeLoss/gradients/model/featurizer/h6/mlp/Pow_grad/Pow}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
0 successful operations. 0 derived errors ignored.
During handling of the above exception, another exception occurred:
ResourceExhaustedError Traceback (most recent call last)
/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py in _do_call(self, fn, *args)
1368 pass
1369 message = error_interpolation.interpolate(message, self._graph)
-> 1370 raise type(e)(node_def, op, message)
1371
1372 def _extend_graph(self):
ResourceExhaustedError: 2 root error(s) found.
(0) Resource exhausted: OOM when allocating tensor with shape[1,512,5120] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[node OptimizeLoss/gradients/model/featurizer/h6/mlp/Pow_grad/Pow (defined at /usr/local/lib/python3.6/dist-packages/finetune/model.py:281) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
(1) Resource exhausted: OOM when allocating tensor with shape[1,512,5120] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[node OptimizeLoss/gradients/model/featurizer/h6/mlp/Pow_grad/Pow (defined at /usr/local/lib/python3.6/dist-packages/finetune/model.py:281) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
0 successful operations. 0 derived errors ignored.
Errors may have originated from an input operation.
Input Source operations connected to node OptimizeLoss/gradients/model/featurizer/h6/mlp/Pow_grad/Pow:
model/featurizer/h6/mlp/c_fc/Reshape_2 (defined at /usr/local/lib/python3.6/dist-packages/finetune/base_models/gpt2/featurizer.py:41)
Original stack trace for 'OptimizeLoss/gradients/model/featurizer/h6/mlp/Pow_grad/Pow':
File "/usr/lib/python3.6/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/usr/local/lib/python3.6/dist-packages/ipykernel_launcher.py", line 16, in <module>
app.launch_new_instance()
File "/usr/local/lib/python3.6/dist-packages/traitlets/config/application.py", line 658, in launch_instance
app.start()
File "/usr/local/lib/python3.6/dist-packages/ipykernel/kernelapp.py", line 477, in start
ioloop.IOLoop.instance().start()
File "/usr/local/lib/python3.6/dist-packages/tornado/ioloop.py", line 888, in start
handler_func(fd_obj, events)
File "/usr/local/lib/python3.6/dist-packages/tornado/stack_context.py", line 277, in null_wrapper
return fn(*args, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/zmq/eventloop/zmqstream.py", line 450, in _handle_events
self._handle_recv()
File "/usr/local/lib/python3.6/dist-packages/zmq/eventloop/zmqstream.py", line 480, in _handle_recv
self._run_callback(callback, msg)
File "/usr/local/lib/python3.6/dist-packages/zmq/eventloop/zmqstream.py", line 432, in _run_callback
callback(*args, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/tornado/stack_context.py", line 277, in null_wrapper
return fn(*args, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/ipykernel/kernelbase.py", line 283, in dispatcher
return self.dispatch_shell(stream, msg)
File "/usr/local/lib/python3.6/dist-packages/ipykernel/kernelbase.py", line 235, in dispatch_shell
handler(stream, idents, msg)
File "/usr/local/lib/python3.6/dist-packages/ipykernel/kernelbase.py", line 399, in execute_request
user_expressions, allow_stdin)
File "/usr/local/lib/python3.6/dist-packages/ipykernel/ipkernel.py", line 196, in do_execute
res = shell.run_cell(code, store_history=store_history, silent=silent)
File "/usr/local/lib/python3.6/dist-packages/ipykernel/zmqshell.py", line 533, in run_cell
return super(ZMQInteractiveShell, self).run_cell(*args, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/IPython/core/interactiveshell.py", line 2718, in run_cell
interactivity=interactivity, compiler=compiler, result=result)
File "/usr/local/lib/python3.6/dist-packages/IPython/core/interactiveshell.py", line 2828, in run_ast_nodes
if self.run_code(code, result):
File "/usr/local/lib/python3.6/dist-packages/IPython/core/interactiveshell.py", line 2882, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
File "", line 1, in
model.finetune(corpus_text, batch_size=1)
File "/usr/local/lib/python3.6/dist-packages/finetune/target_models/language_model.py", line 34, in finetune
return super().finetune(X, Y=None, batch_size=batch_size)
File "/usr/local/lib/python3.6/dist-packages/finetune/target_models/classifier.py", line 141, in finetune
return super().finetune(X, Y=Y, batch_size=batch_size)
File "/usr/local/lib/python3.6/dist-packages/finetune/base.py", line 283, in finetune
estimator.train(train_input_fn, hooks=train_hooks, steps=num_steps)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 367, in train
loss = self._train_model(input_fn, hooks, saving_listeners)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1156, in _train_model
return self._train_model_distributed(input_fn, hooks, saving_listeners)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1219, in _train_model_distributed
self._config._train_distribute, input_fn, hooks, saving_listeners)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1299, in _actual_train_model_distributed
self.config))
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/distribute/distribute_lib.py", line 1555, in call_for_each_replica
return self._call_for_each_replica(fn, args, kwargs)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/distribute/one_device_strategy.py", line 163, in _call_for_each_replica
return fn(*args, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1146, in _call_model_fn
model_fn_results = self._model_fn(features=features, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/finetune/model.py", line 281, in _model_fn
variables=params.trained_variables,
File "/usr/local/lib/python3.6/dist-packages/tensorflow/contrib/layers/python/layers/optimizers.py", line 239, in optimize_loss
colocate_gradients_with_ops=colocate_gradients_with_ops)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/optimizer.py", line 512, in compute_gradients
colocate_gradients_with_ops=colocate_gradients_with_ops)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/gradients_impl.py", line 158, in gradients
unconnected_gradients)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/gradients_util.py", line 731, in _GradientsHelper
lambda: grad_fn(op, out_grads))
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/gradients_util.py", line 403, in _MaybeCompile
return grad_fn() # Exit early
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/gradients_util.py", line 731, in <lambda>
lambda: grad_fn(op, out_grads))
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/math_grad.py", line 1197, in _PowGrad
math_ops.reduce_sum(grad * y * math_ops.pow(x, y - 1), rx), sx)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/util/dispatch.py", line 180, in wrapper
return target(*args, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/math_ops.py", line 450, in pow
return gen_math_ops._pow(x, y, name=name)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/gen_math_ops.py", line 6972, in _pow
"Pow", x=x, y=y, name=name)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/op_def_library.py", line 788, in _apply_op_helper
op_def=op_def)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/util/deprecation.py", line 507, in new_func
return func(*args, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/ops.py", line 3616, in create_op
op_def=op_def)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/ops.py", line 2005, in __init__
self._traceback = tf_stack.extract_stack()
...which was originally created as op 'model/featurizer/h6/mlp/Pow', defined at:
File "/usr/lib/python3.6/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
[elided 28 identical lines from previous traceback]
File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1146, in _call_model_fn
model_fn_results = self._model_fn(features=features, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/finetune/model.py", line 128, in _model_fn
explain=build_explain,
File "/usr/local/lib/python3.6/dist-packages/finetune/base_models/__init__.py", line 28, in get_featurizer
return cls.featurizer(X, encoder, config, train=train, reuse=reuse, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/finetune/base_models/gpt2/featurizer.py", line 206, in gpt2_featurizer
h = block_fn(h)
File "/usr/local/lib/python3.6/dist-packages/finetune/base_models/gpt2/featurizer.py", line 121, in block
m = mlp(norm(x, "ln_2"), "mlp", nx * 4, hparams=hparams, train=train)
File "/usr/local/lib/python3.6/dist-packages/finetune/base_models/gpt2/featurizer.py", line 108, in mlp
h = gelu(conv1d(x, "c_fc", n_state))
File "/usr/local/lib/python3.6/dist-packages/finetune/nn/activations.py", line 10, in gelu
return 0.5 * x * (1 + tf.tanh(math.sqrt(2 / math.pi) * (x + 0.044715 * tf.pow(x, 3))))
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/util/dispatch.py", line 180, in wrapper
return target(*args, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/math_ops.py", line 450, in pow
return gen_math_ops._pow(x, y, name=name)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/gen_math_ops.py", line 6972, in _pow
"Pow", x=x, y=y, name=name)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/op_def_library.py", line 788, in _apply_op_helper
op_def=op_def)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/util/deprecation.py", line 507, in new_func
return func(*args, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/ops.py", line 3616, in create_op
op_def=op_def)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/ops.py", line 2005, in __init__
self._traceback = tf_stack.extract_stack()