NVIDIA / tacotron2

Tacotron 2 - PyTorch implementation with faster-than-realtime inference
BSD 3-Clause "New" or "Revised" License

CUDA Runtime Error: Out of Memory #33

Closed: MrBreadWater closed this issue 6 years ago

MrBreadWater commented 6 years ago

I finally got all the errors resolved, but then this new one came up: RuntimeError: cuda runtime error (2) : out of memory at /pytorch/aten/src/THC/generic/THCStorage.cu:58

Here is the full log:

~/tacotron2$ python3 train.py --output_directory=outdir --log_directory=logdir
FP16 Run: False
Dynamic Loss Scaling: True
Distributed Run: False
cuDNN Enabled: True
cuDNN Benchmark: False
/home/mrbreadwater/tacotron2/layers.py:35: UserWarning: nn.init.xavier_uniform is now deprecated in favor of nn.init.xavier_uniform_.
  self.conv.weight, gain=torch.nn.init.calculate_gain(w_init_gain))
/home/mrbreadwater/tacotron2/layers.py:15: UserWarning: nn.init.xavier_uniform is now deprecated in favor of nn.init.xavier_uniform_.
  gain=torch.nn.init.calculate_gain(w_init_gain))
Epoch: 0
THCudaCheck FAIL file=/pytorch/aten/src/THC/generic/THCStorage.cu line=58 error=2 : out of memory
Traceback (most recent call last):
  File "train.py", line 291, in <module>
    args.warm_start, args.n_gpus, args.rank, args.group_name, hparams)
  File "train.py", line 216, in train
    y_pred = model(x)
  File "/home/mrbreadwater/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 491, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/mrbreadwater/tacotron2/model.py", line 510, in forward
    encoder_outputs, targets, memory_lengths=input_lengths)
  File "/home/mrbreadwater/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 491, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/mrbreadwater/tacotron2/model.py", line 403, in forward
    decoder_input)
  File "/home/mrbreadwater/tacotron2/model.py", line 363, in decode
    attention_weights_cat, self.mask)
  File "/home/mrbreadwater/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 491, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/mrbreadwater/tacotron2/model.py", line 77, in forward
    attention_hidden_state, processed_memory, attention_weights_cat)
  File "/home/mrbreadwater/tacotron2/model.py", line 60, in get_alignment_energies
    processed_query + processed_attention_weights + processed_memory))
RuntimeError: cuda runtime error (2) : out of memory at /pytorch/aten/src/THC/generic/THCStorage.cu:58

I'm using a GTX 1050 Ti. Anything I can do to fix it?

EDIT: I'm running Ubuntu 18.04

rafaelvalle commented 6 years ago

Try decreasing the batch size: approximately 3 samples per GB of GPU memory.
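
As a rough, hedged illustration of that rule of thumb (this helper is not part of the repo, and the numbers are only a starting point):

# Hedged sketch: estimate a starting batch size from the "~3 samples per GB"
# rule of thumb, using PyTorch's reported device memory. Adjust downward if
# you still hit OOM; this is only a heuristic, not code from train.py.
import torch

def suggested_batch_size(samples_per_gb=3, device=0):
    props = torch.cuda.get_device_properties(device)
    total_gb = props.total_memory / (1024 ** 3)  # total_memory is in bytes
    return max(1, int(total_gb * samples_per_gb))

if torch.cuda.is_available():
    print(suggested_batch_size())  # e.g. ~12 on a 4 GB GTX 1050 Ti

If your checkout's train.py accepts the README-style --hparams override (comma-separated name=value pairs), the result can be passed as, e.g., --hparams=batch_size=12; otherwise edit batch_size directly in hparams.py.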

MrBreadWater commented 6 years ago

I did, all the way down to 16, but that didn't change anything. I can go lower, but that seems suspiciously low.


rafaelvalle commented 6 years ago

I don't remember offhand, but I think the GTX 1050 Ti has 4 GB of memory, so you should use a batch of about 12 samples. Please try batch size 12 and let us know.

tatamyans commented 6 years ago

Possibly the same problem here, also a GTX 1050 Ti (4 GB); no luck with batch size 12: https://pastebin.com/di1j2jKQ

rafaelvalle commented 6 years ago

Try batch size 8.

tatamyans commented 6 years ago

The out of memory error is gone, but Python crashes after saving the model: Unhandled exception at 0x0000000076FAA0F2 (ntdll.dll) in python.exe: 0xC00000FD: Stack overflow (parameters: 0x0000000000000001, 0x0000000000123FF8).

Same thing on Python 3.5 and 3.6; it may be unrelated.

rafaelvalle commented 6 years ago

@tatamyans can you provide a full error trace?

tatamyans commented 6 years ago

Sorry, I can't provide the full trace right now; maybe later. Thanks.

gsoul commented 6 years ago

/home/soul/projects/nv-tacotron/tacotron2/layers.py:35: UserWarning: nn.init.xavier_uniform is now deprecated in favor of nn.init.xavier_uniform_.
  self.conv.weight, gain=torch.nn.init.calculate_gain(w_init_gain))
/home/soul/projects/nv-tacotron/tacotron2/layers.py:15: UserWarning: nn.init.xavier_uniform is now deprecated in favor of nn.init.xavier_uniform_.
  gain=torch.nn.init.calculate_gain(w_init_gain))
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-4-82530a2f6baf> in <module>()
      1 checkpoint_path = "/home/soul/projects/nv-tacotron/tacotron2/outdir/checkpoint_80000"
----> 2 model = load_model(hparams)
      3 try:
      4     model = model.module
      5 except:

/home/soul/projects/nv-tacotron/tacotron2/train.py in load_model(hparams)
     78 
     79 def load_model(hparams):
---> 80     model = Tacotron2(hparams).cuda()
     81     if hparams.fp16_run:
     82         model = batchnorm_to_float(model.half())

/home/soul/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py in cuda(self, device)
    247             Module: self
    248         """
--> 249         return self._apply(lambda t: t.cuda(device))
    250 
    251     def cpu(self):

/home/soul/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py in _apply(self, fn)
    174     def _apply(self, fn):
    175         for module in self.children():
--> 176             module._apply(fn)
    177 
    178         for param in self._parameters.values():

/home/soul/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py in _apply(self, fn)
    180                 # Tensors stored in modules are graph leaves, and we don't
    181                 # want to create copy nodes, so we have to unpack the data.
--> 182                 param.data = fn(param.data)
    183                 if param._grad is not None:
    184                     param._grad.data = fn(param._grad.data)

/home/soul/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py in <lambda>(t)
    247             Module: self
    248         """
--> 249         return self._apply(lambda t: t.cuda(device))
    250 
    251     def cpu(self):

RuntimeError: cuda runtime error (2) : out of memory at /pytorch/aten/src/THC/THCTensorRandom.cu:25

I get this when I try to do inference on a 1080 Ti.

Training works fine on a separate GPU with a batch size of 40; all other settings are default. The dataset is LJSpeech-1.1.

@rafaelvalle could you please advise here?

gsoul commented 6 years ago

Hm, perhaps I figured this out: PyTorch demanded GPU 0 for inference, and since training was happening on it at the time, it gave the OOM error. I could produce some speech after I stopped training.

rafaelvalle commented 6 years ago

@gsoul yes, one can run into OOM if multiple sources are requesting memory from the same GPU. If you want to train and do inference at the same time, you could do inference on the CPU...
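
For reference, a minimal CPU-inference sketch, assuming the repo's model.Tacotron2 / hparams.create_hparams layout and that checkpoints store weights under a 'state_dict' key (both worth verifying against your checkout); the checkpoint path is illustrative:

# Hedged sketch: load a checkpoint and run the model on the CPU instead of the GPU.
# Assumes this repo's module layout (model.Tacotron2, hparams.create_hparams) and a
# checkpoint dict with a 'state_dict' entry; the path below is only an example.
import torch
from hparams import create_hparams
from model import Tacotron2

hparams = create_hparams()
model = Tacotron2(hparams)  # constructed on the CPU; skip the .cuda() done by load_model()
checkpoint = torch.load("outdir/checkpoint_80000", map_location="cpu")
model.load_state_dict(checkpoint["state_dict"])
model.eval()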

gsoul commented 6 years ago

No, I have 2 × 1080 Ti in my machine and used CUDA_VISIBLE_DEVICES to put inference and training on separate GPUs, but I was getting the error above until I stopped training on GPU 0.

rafaelvalle commented 6 years ago

When running inference, can you confirm that the PyTorch code only has access to one of the GPUs?

gsoul commented 6 years ago

After thinking about it for some time, perhaps not. I ran:

CUDA_VISIBLE_DEVICE=1 /home/soul/anaconda3/bin/ipython notebook --no-browser --port=8889

But that command only limits the ipython process, rather than the Python process that actually communicates with the GPUs...
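
A hedged sketch of pinning the device from inside the notebook kernel instead (note the variable name is CUDA_VISIBLE_DEVICES, plural), so the restriction does not depend on how the server process was launched:

# Hedged sketch: restrict the notebook kernel to GPU 1 before torch initializes CUDA.
# The variable must be set before the first CUDA call and is named
# CUDA_VISIBLE_DEVICES (plural).
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "1"

import torch
print(torch.cuda.device_count())  # should now report a single visible device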

imirzadeh commented 6 years ago

I had the same problem and reduced the batch size to make it work.

rafaelvalle commented 6 years ago

Closing due to inactivity.

adjouama commented 5 years ago

I fixed this by reducing the batch size to 32 in hparams: batch_size=32

I use an NVIDIA GTX 1080 Ti with 11 GiB of memory.

ErfolgreichCharismatisch commented 3 years ago

Tutorial: Training on GPU with Colab, Inference with CPU on Server here.

one1ine commented 1 year ago

Hello everyone, I am having the same GPU memory issue.

  1. Using an NVIDIA A40 with 46 GB of memory.
  2. Using a batch size of 8!
  3. Using a custom dataset of 40 hours at a 22050 Hz sampling rate, which is about 6 GB of data. Training initially runs fine, but right around the end of the third epoch the memory error pops up and stops the training.

Since I'm already using a batch size of 8, I don't think lowering it any further would be beneficial. That said, I'm thinking of trying to clear the cache (torch.cuda.empty_cache()) at the end of every epoch, because I think something is accumulating and filling up memory, since the error consistently pops up after 3 epochs.
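
For reference, a hedged, self-contained sketch of what torch.cuda.empty_cache() can and cannot free (it releases cached, unused allocator blocks back to the driver, but not memory held by live tensors, so it only helps when caching or fragmentation is the problem):

# Hedged sketch: show what empty_cache() releases. Memory held by live tensors
# stays allocated; only cached-but-unused blocks are returned to the driver.
import torch

if torch.cuda.is_available():
    x = torch.randn(1024, 1024, device="cuda")   # live tensor: stays allocated
    y = torch.randn(4096, 4096, device="cuda")
    del y                                         # y's block goes back to the allocator cache
    print(torch.cuda.memory_allocated(), torch.cuda.memory_reserved())
    torch.cuda.empty_cache()                      # cached blocks released to the driver
    print(torch.cuda.memory_allocated(), torch.cuda.memory_reserved())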

Will let you know if it works out.

UPDATE (23/04/14): I'm running the training on a server using Slurm, and after doing the above, the Slurm job automatically gets killed. So clearing the cache after each epoch doesn't seem to work...