facebookresearch / ParlAI

A framework for training and evaluating AI models on a variety of openly available dialogue datasets.
https://parl.ai
MIT License

Is T5 3B training properly parallelizing? #4559

Open

rguan1 opened this issue 2 years ago

I am trying to train a T5 model on EmpatheticDialogues. I run into CUDA out-of-memory errors when training with the command below. When training the BlenderBot 3B model I hit the same problem until I parallelized training across two GPUs, but parallelizing T5 3B doesn't seem to resolve it. I have also reduced the batch size to 1 and the truncation to 128 (truncating at 64 doesn't work either). Any suggestions for resolving this?

Command

parlai train_model -t empathetic_dialogues -m hugging_face/t5 --t5-model-arch t5-3b --t5-model-parallel True --fp16 True --optimizer adam --batchsize 1 --skip-generation True -vmt ppl -tr 64 --model-file ./chatbot_models/3B/testdebugT5/model --tstep 100

Error message

/home/rg4312/ParlAI/parlai/utils/fp16.py:85: FutureWarning: Non-finite norm encountered in torch.nn.utils.clip_grad_norm_; continuing anyway. Note that the default behavior will change in a future release to error out if a non-finite total norm is encountered. At that point, setting error_if_nonfinite=false will be required to retain the old behavior.
  return torch.nn.utils.clip_grad_norm_(params, max_norm)
09:27:14 | Ran out of memory, skipping batch. if this happens frequently, decrease batchsize or truncate the inputs to the model.
Traceback (most recent call last):
  File "/home/rg4312/ParlAI/parlai/core/torch_generator_agent.py", line 603, in _fake_forward_backward_pass
    loss = 0 * self.compute_loss(self._dummy_batch)
  File "/home/rg4312/ParlAI/parlai/core/torch_generator_agent.py", line 693, in compute_loss
    model_output = self.model(*self._model_input(batch), ys=batch.label_vec)
  File "/ext3/miniconda3/envs/chatbot/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/rg4312/ParlAI/parlai/core/torch_generator_agent.py", line 312, in forward
    scores, preds = self.decode_forced(encoder_states, ys)
  File "/home/rg4312/ParlAI/parlai/core/torch_generator_agent.py", line 181, in decode_forced
    latent, _ = self.decoder(inputs, encoder_states)
  File "/ext3/miniconda3/envs/chatbot/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/rg4312/ParlAI/parlai/agents/hugging_face/t5.py", line 59, in wrap
    ret = func(*args, **kwargs)
  File "/home/rg4312/ParlAI/parlai/agents/hugging_face/t5.py", line 274, in forward
    outputs = self.stack(
  File "/ext3/miniconda3/envs/chatbot/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/ext3/miniconda3/envs/chatbot/lib/python3.8/site-packages/transformers/models/t5/modeling_t5.py", line 985, in forward
    layer_outputs = layer_module(
  File "/ext3/miniconda3/envs/chatbot/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/ext3/miniconda3/envs/chatbot/lib/python3.8/site-packages/transformers/models/t5/modeling_t5.py", line 663, in forward
    cross_attention_outputs = self.layer[1](
  File "/ext3/miniconda3/envs/chatbot/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/ext3/miniconda3/envs/chatbot/lib/python3.8/site-packages/transformers/models/t5/modeling_t5.py", line 578, in forward
    attention_output = self.EncDecAttention(
  File "/ext3/miniconda3/envs/chatbot/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/ext3/miniconda3/envs/chatbot/lib/python3.8/site-packages/transformers/models/t5/modeling_t5.py", line 470, in forward
    query_states = shape(self.q(hidden_states))  # (batch_size, n_heads, seq_length, dim_per_head)
  File "/ext3/miniconda3/envs/chatbot/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/ext3/miniconda3/envs/chatbot/lib/python3.8/site-packages/torch/nn/modules/linear.py", line 96, in forward
    return F.linear(input, self.weight, self.bias)
  File "/ext3/miniconda3/envs/chatbot/lib/python3.8/site-packages/torch/nn/functional.py", line 1847, in linear
    return torch._C._nn.linear(input, weight, bias)
RuntimeError: CUDA out of memory. Tried to allocate 2.00 MiB (GPU 0; 44.49 GiB total capacity; 43.46 GiB already allocated; 2.00 MiB free; 43.48 GiB reserved in total by PyTorch)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/ext3/miniconda3/envs/chatbot/bin/parlai", line 33, in <module>
    sys.exit(load_entry_point('parlai', 'console_scripts', 'parlai')())
  File "/home/rg4312/ParlAI/parlai/__main__.py", line 14, in main
    superscript_main()
  File "/home/rg4312/ParlAI/parlai/core/script.py", line 325, in superscript_main
    return SCRIPT_REGISTRY[cmd].klass._run_from_parser_and_opt(opt, parser)
  File "/home/rg4312/ParlAI/parlai/core/script.py", line 108, in _run_from_parser_and_opt
    return script.run()
  File "/home/rg4312/ParlAI/parlai/scripts/train_model.py", line 998, in run
    return self.train_loop.train()
  File "/home/rg4312/ParlAI/parlai/scripts/train_model.py", line 950, in train
    for _train_log in self.train_steps():
  File "/home/rg4312/ParlAI/parlai/scripts/train_model.py", line 857, in train_steps
    world.parley()
  File "/home/rg4312/ParlAI/parlai/core/worlds.py", line 370, in parley
    acts[1] = agents[1].act()
  File "/home/rg4312/ParlAI/parlai/core/torch_agent.py", line 2143, in act
    response = self.batch_act([self.observation])[0]
  File "/home/rg4312/ParlAI/parlai/core/torch_agent.py", line 2234, in batch_act
    output = self.train_step(batch)
  File "/home/rg4312/ParlAI/parlai/core/torch_generator_agent.py", line 759, in train_step
    self._fake_forward_backward_pass()
  File "/home/rg4312/ParlAI/parlai/core/torch_generator_agent.py", line 614, in _fake_forward_backward_pass
    raise RuntimeError(m)
RuntimeError: CUDA OOM: Lower batch size (-bs) from 1 or lower  max sequence length (-tr) from 128
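The OOM report above shows GPU 0 holding 43.46 GiB of its 44.49 GiB before the batch is even processed, so the first device is essentially full. Given the question in the title, one quick sanity check is to measure how much parameter memory actually lands on each device once the agent is built. This is a minimal sketch rather than ParlAI tooling: create_agent_from_model_file is a real ParlAI helper, but the agent.model attribute and the reuse of the model file path from the command above are assumptions to adapt to your setup.

from collections import defaultdict

import torch
from parlai.core.agents import create_agent_from_model_file


def params_per_device(module: torch.nn.Module) -> dict:
    """Sum parameter memory (GiB) per device for a loaded torch module."""
    usage = defaultdict(int)
    for p in module.parameters():
        usage[str(p.device)] += p.numel() * p.element_size()
    return {dev: round(nbytes / 2 ** 30, 2) for dev, nbytes in usage.items()}


if __name__ == "__main__":
    # Path taken from the --model-file flag in the command above.
    agent = create_agent_from_model_file("./chatbot_models/3B/testdebugT5/model")
    # ParlAI's torch agents keep the underlying module on `agent.model`
    # (an assumption to verify against your version).
    print(params_per_device(agent.model))
    # If all of the ~3B parameters report cuda:0, the model-parallel split is
    # not taking effect; if they are split, GPU 0 still has to hold the
    # embeddings, activations, and optimizer state on top of its share.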
klshuster commented 2 years ago

From a cursory examination, it looks like this might be failing because we can't fit the model activations on the first GPU. I'll need to investigate a bit more.
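For context on that diagnosis: HuggingFace's T5 model parallelism (the since-deprecated parallelize() API in the transformers versions of that era, which, as far as I can tell, is what ParlAI's --t5-model-parallel flag wraps) accepts an explicit device_map, so one possible workaround is to skew the split and keep fewer blocks on GPU 0, leaving headroom there for embeddings, activations, and optimizer state. I have not checked whether ParlAI exposes a custom map through its flag; the sketch below uses the raw transformers API, and the particular block split is an illustrative assumption rather than a tuned setting.

import torch
from transformers import T5ForConditionalGeneration

assert torch.cuda.device_count() >= 2, "this sketch assumes both GPUs are visible"

# t5-3b has 24 encoder and 24 decoder blocks; parallelize() applies the same
# device map to both stacks, and the default map splits the blocks evenly
# across visible GPUs. Skewing the map keeps fewer blocks on cuda:0 so that
# embeddings, activations, and optimizer state can still fit there.
device_map = {
    0: list(range(0, 8)),    # 8 blocks per stack on the first GPU
    1: list(range(8, 24)),   # 16 blocks per stack on the second GPU
}

model = T5ForConditionalGeneration.from_pretrained("t5-3b")
model.parallelize(device_map)
# Inputs should be placed on the first device in the map; parallelize()
# moves hidden states between GPUs during the forward pass.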

github-actions[bot] commented 2 years ago

This issue has not had activity in 30 days. Please feel free to reopen if you have more issues. You may apply the "never-stale" tag to prevent this from happening.