IBM / Grapher

Code that implements efficient knowledge graph extraction from textual descriptions
Apache License 2.0
135 stars, 30 forks

CUDA Out of Memory #3

Open f1amigo opened 1 year ago

f1amigo commented 1 year ago

Hi, I've been following the instructions in the README to train the model. However, I ran into CUDA out-of-memory errors even with a 16 GB GPU.

I have tried to solve this with the following workarounds, but none of them has worked so far:

  1. Decreasing batch_size to 4 -> 2 -> 1
  2. Decreasing num_data_workers to 2 -> 1
  3. Calling torch.cuda.empty_cache()
  4. Calling gc.collect() (a typical placement for 3 and 4 is sketched below)
  5. Running a notebook version on Google Colab and Kaggle

For point 5, training runs up to the end of validation in the first epoch before the entire session crashes.
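(For reference, a minimal sketch of where attempts 3 and 4 would typically be wired into a PyTorch Lightning run; the callback below is standard Lightning, not something Grapher-specific, and OOMMitigation is a hypothetical name:)

```python
import gc
import torch
import pytorch_lightning as pl

class OOMMitigation(pl.Callback):
    """Hypothetical callback: frees cached allocator blocks after each batch.

    This only mitigates fragmentation; it cannot shrink the activations the
    forward pass itself needs, which is why it did not help here.
    """
    def on_train_batch_end(self, trainer, pl_module, outputs, batch, batch_idx):
        gc.collect()
        torch.cuda.empty_cache()

# Usage: pass it to the trainer, e.g. Trainer(callbacks=[OOMMitigation()])
```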

The following is a more detailed log of the error I received:

```
-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/home/chenweiyi/.conda/envs/grapher/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
    fn(i, *args)
  File "/home/chenweiyi/.conda/envs/grapher/lib/python3.9/site-packages/pytorch_lightning/strategies/launchers/multiprocessing.py", line 139, in _wrapping_function
    results = function(*args, **kwargs)
  File "/home/chenweiyi/.conda/envs/grapher/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 624, in _fit_impl
    self._run(model, ckpt_path=self.ckpt_path)
  File "/home/chenweiyi/.conda/envs/grapher/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1061, in _run
    results = self._run_stage()
  File "/home/chenweiyi/.conda/envs/grapher/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1140, in _run_stage
    self._run_train()
  File "/home/chenweiyi/.conda/envs/grapher/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1163, in _run_train
    self.fit_loop.run()
  File "/home/chenweiyi/.conda/envs/grapher/lib/python3.9/site-packages/pytorch_lightning/loops/loop.py", line 199, in run
    self.advance(*args, **kwargs)
  File "/home/chenweiyi/.conda/envs/grapher/lib/python3.9/site-packages/pytorch_lightning/loops/fit_loop.py", line 267, in advance
    self._outputs = self.epoch_loop.run(self._data_fetcher)
  File "/home/chenweiyi/.conda/envs/grapher/lib/python3.9/site-packages/pytorch_lightning/loops/loop.py", line 199, in run
    self.advance(*args, **kwargs)
  File "/home/chenweiyi/.conda/envs/grapher/lib/python3.9/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 214, in advance
    batch_output = self.batch_loop.run(kwargs)
  File "/home/chenweiyi/.conda/envs/grapher/lib/python3.9/site-packages/pytorch_lightning/loops/loop.py", line 199, in run
    self.advance(*args, **kwargs)
  File "/home/chenweiyi/.conda/envs/grapher/lib/python3.9/site-packages/pytorch_lightning/loops/batch/training_batch_loop.py", line 88, in advance
    outputs = self.optimizer_loop.run(optimizers, kwargs)
  File "/home/chenweiyi/.conda/envs/grapher/lib/python3.9/site-packages/pytorch_lightning/loops/loop.py", line 199, in run
    self.advance(*args, **kwargs)
  File "/home/chenweiyi/.conda/envs/grapher/lib/python3.9/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 200, in advance
    result = self._run_optimization(kwargs, self._optimizers[self.optim_progress.optimizer_position])
  File "/home/chenweiyi/.conda/envs/grapher/lib/python3.9/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 239, in _run_optimization
    closure()
  File "/home/chenweiyi/.conda/envs/grapher/lib/python3.9/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 147, in __call__
    self._result = self.closure(*args, **kwargs)
  File "/home/chenweiyi/.conda/envs/grapher/lib/python3.9/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 133, in closure
    step_output = self._step_fn()
  File "/home/chenweiyi/.conda/envs/grapher/lib/python3.9/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 406, in _training_step
    training_step_output = self.trainer._call_strategy_hook("training_step", *kwargs.values())
  File "/home/chenweiyi/.conda/envs/grapher/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1443, in _call_strategy_hook
    output = fn(*args, **kwargs)
  File "/home/chenweiyi/.conda/envs/grapher/lib/python3.9/site-packages/pytorch_lightning/strategies/ddp_spawn.py", line 280, in training_step
    return self.model(*args, **kwargs)
  File "/home/chenweiyi/.conda/envs/grapher/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/chenweiyi/.conda/envs/grapher/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 1040, in forward
    output = self._run_ddp_forward(*inputs, **kwargs)
  File "/home/chenweiyi/.conda/envs/grapher/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 1000, in _run_ddp_forward
    return module_to_run(*inputs[0], **kwargs[0])
  File "/home/chenweiyi/.conda/envs/grapher/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/chenweiyi/.conda/envs/grapher/lib/python3.9/site-packages/pytorch_lightning/overrides/base.py", line 98, in forward
    output = self._forward_module.training_step(*inputs, **kwargs)
  File "/home/chenweiyi/Grapher/model/litgrapher.py", line 102, in training_step
    logits_nodes, logits_edges = self.model(text_input_ids,
  File "/home/chenweiyi/.conda/envs/grapher/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/chenweiyi/Grapher/model/grapher.py", line 58, in forward
    output = self.transformer(input_ids=text,
  File "/home/chenweiyi/.conda/envs/grapher/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/chenweiyi/.conda/envs/grapher/lib/python3.9/site-packages/transformers/models/t5/modeling_t5.py", line 1660, in forward
    decoder_outputs = self.decoder(
  File "/home/chenweiyi/.conda/envs/grapher/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/chenweiyi/.conda/envs/grapher/lib/python3.9/site-packages/transformers/models/t5/modeling_t5.py", line 1052, in forward
    layer_outputs = layer_module(
  File "/home/chenweiyi/.conda/envs/grapher/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/chenweiyi/.conda/envs/grapher/lib/python3.9/site-packages/transformers/models/t5/modeling_t5.py", line 684, in forward
    self_attention_outputs = self.layer[0](
  File "/home/chenweiyi/.conda/envs/grapher/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/chenweiyi/.conda/envs/grapher/lib/python3.9/site-packages/transformers/models/t5/modeling_t5.py", line 590, in forward
    attention_output = self.SelfAttention(
  File "/home/chenweiyi/.conda/envs/grapher/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/chenweiyi/.conda/envs/grapher/lib/python3.9/site-packages/transformers/models/t5/modeling_t5.py", line 520, in forward
    scores = torch.matmul(
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 10.92 GiB total capacity; 10.48 GiB already allocated; 11.50 MiB free; 10.68 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
```
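(The allocator hint at the very end of the trace can also be tried directly. A sketch, where the 128 MiB split size is an arbitrary assumption; the variable must be set before CUDA is first initialized, e.g. at the top of main.py:)

```python
import os

# Caps the size of blocks the caching allocator may split, reducing
# fragmentation. It does not create memory, so it mainly helps when
# "reserved" is much larger than "allocated" in the OOM report.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"
```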

imelnyk commented 1 year ago

We tested the code on a V100 (32 GB) and an A100 (40 GB), and it works fine. I think it might be challenging to fit the model and its training state on a 16 GB GPU.
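(Two standard Lightning-side levers that can stretch a smaller card, assuming the script's Trainer construction can be edited to pass them; this is a generic sketch, not a documented Grapher option:)

```python
from pytorch_lightning import Trainer

# Mixed precision roughly halves activation memory, and gradient
# accumulation keeps the effective batch size while shrinking the
# per-step batch that has to fit on the GPU at once.
trainer = Trainer(
    precision=16,               # fp16 autocast via native AMP
    accumulate_grad_batches=4,  # effective batch = batch_size * 4
)
```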

f1amigo commented 1 year ago

Thanks for the reply. In that case, do you have a pre-trained model available instead? From my understanding, the current model is not trained, but do correct me if I'm wrong.

imelnyk commented 1 year ago

Due to some internal policies, we have not released the trained models yet. For now, please use the provided code to train your own model. To reduce memory usage, try setting --pretrained_model t5-base instead of t5-large.
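(For scale: t5-base has roughly 220M parameters versus roughly 740M for t5-large, so weights, gradients, and optimizer state all shrink by about 3x. A standalone transformers snippet to check the counts, independent of the Grapher code:)

```python
from transformers import T5ForConditionalGeneration

# With fp32 weights, gradients, and Adam moments, training takes roughly
# 16 bytes per parameter before activations are even counted.
for name in ("t5-base", "t5-large"):
    model = T5ForConditionalGeneration.from_pretrained(name)
    print(name, f"{sum(p.numel() for p in model.parameters()):,} parameters")
```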

f1amigo commented 1 year ago

Thank you so much for your suggestion, I was able to train the model with t5-base.

realnuo commented 4 months ago

Hi, I am trying to use t5-base, but I run into this problem: IndexError: piece id is out of range. Maybe the input text contains a token that is not in the tokenizer vocabulary. Is there any way to solve it?
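(This IndexError usually means sentencepiece was asked to decode a token id it does not know. For T5, the model's embedding matrix is padded to 32128 entries while the tokenizer only covers 32100 ids, so raw argmax/generated ids can land in the gap. A hedged workaround sketch, not a fix from the Grapher authors; safe_decode is a hypothetical helper:)

```python
from transformers import T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-base")

def safe_decode(token_ids, tokenizer):
    # Drop ids outside the tokenizer's vocabulary before decoding, so
    # sentencepiece never sees a piece id it cannot map back to text.
    valid = [i for i in token_ids if 0 <= i < tokenizer.vocab_size]
    return tokenizer.decode(valid, skip_special_tokens=True)

print(safe_decode([100, 32127, 200], tokenizer))  # 32127 is silently dropped
```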

MiniKrishna commented 3 months ago

> Hi, I am trying to use t5-base, but I run into this problem: IndexError: piece id is out of range. Maybe the input text contains a token that is not in the tokenizer vocabulary. Is there any way to solve it?

Have you been able to solve this problem?

r3po-1s-Tr3e commented 3 months ago

> Thank you so much for your suggestion, I was able to train the model with t5-base.

Would you be able to share the trained model? It would really be a great help!

ZXMJ commented 1 week ago

> Hi, I am trying to use t5-base, but I run into this problem: IndexError: piece id is out of range. Maybe the input text contains a token that is not in the tokenizer vocabulary. Is there any way to solve it?
>
> Have you been able to solve this problem?

Have you been able to solve this problem? Thanks.