Try decreasing the batch size. A rough guideline is 3 samples per GB of GPU memory.
I did, all the way down to 16, but that didn't change anything. I can go lower, but that seems suspiciously low.
I don't recall offhand, but I think the GTX 1050 Ti has 4 GB of memory, so you should use a batch of about 12 samples. Please try with batch size 12 and let us know.
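For anyone unsure what that works out to on their card, here is a minimal sketch (assuming a working PyTorch install) that applies the rule of thumb above to the visible GPU; it is only a guideline, since real usage also depends on sequence lengths and other hyperparameters:

```python
import torch

if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    total_gb = props.total_memory / 1024 ** 3       # bytes -> GiB
    suggested = max(1, int(total_gb * 3))           # "~3 samples per GB" heuristic
    print(f"{props.name}: ~{total_gb:.1f} GB -> try batch_size={suggested}")
```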
Possibly the same problem here, also a GTX 1050 Ti (4 GB); no luck with batch size 12: https://pastebin.com/di1j2jKQ
Try batch size 8.
The out of memory error is gone, but Python crashes after saving the model:
Unhandled exception at 0x0000000076FAA0F2 (ntdll.dll) in python.exe: 0xC00000FD: Stack overflow (parameters: 0x0000000000000001, 0x0000000000123FF8).
Python 3.5 and 3.6 behave the same; this may be unrelated.
@tatamyans can you provide a full error trace?
sorry, can't provide full trace now, maybe later, thanks
/home/soul/projects/nv-tacotron/tacotron2/layers.py:35: UserWarning: nn.init.xavier_uniform is now deprecated in favor of nn.init.xavier_uniform_.
self.conv.weight, gain=torch.nn.init.calculate_gain(w_init_gain))
/home/soul/projects/nv-tacotron/tacotron2/layers.py:15: UserWarning: nn.init.xavier_uniform is now deprecated in favor of nn.init.xavier_uniform_.
gain=torch.nn.init.calculate_gain(w_init_gain))
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
<ipython-input-4-82530a2f6baf> in <module>()
1 checkpoint_path = "/home/soul/projects/nv-tacotron/tacotron2/outdir/checkpoint_80000"
----> 2 model = load_model(hparams)
3 try:
4 model = model.module
5 except:
/home/soul/projects/nv-tacotron/tacotron2/train.py in load_model(hparams)
78
79 def load_model(hparams):
---> 80 model = Tacotron2(hparams).cuda()
81 if hparams.fp16_run:
82 model = batchnorm_to_float(model.half())
/home/soul/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py in cuda(self, device)
247 Module: self
248 """
--> 249 return self._apply(lambda t: t.cuda(device))
250
251 def cpu(self):
/home/soul/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py in _apply(self, fn)
174 def _apply(self, fn):
175 for module in self.children():
--> 176 module._apply(fn)
177
178 for param in self._parameters.values():
/home/soul/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py in _apply(self, fn)
180 # Tensors stored in modules are graph leaves, and we don't
181 # want to create copy nodes, so we have to unpack the data.
--> 182 param.data = fn(param.data)
183 if param._grad is not None:
184 param._grad.data = fn(param._grad.data)
/home/soul/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py in <lambda>(t)
247 Module: self
248 """
--> 249 return self._apply(lambda t: t.cuda(device))
250
251 def cpu(self):
RuntimeError: cuda runtime error (2) : out of memory at /pytorch/aten/src/THC/THCTensorRandom.cu:25
I get this when I try to do inference on a 1080 Ti.
Training works fine on a separate GPU with a batch size of 40; all other settings are default. The dataset is LJSpeech-1.1.
@rafaelvalle could you please advise here?
Hm, perhaps I figured this out: PyTorch demanded GPU 0 for inference, and since training was running on it at the time, it gave an OOM error. I could produce some speech after I stopped training.
@gsoul yes, one can run into OOM if multiple sources are requesting memory from the same GPU. If you want to train and do inference at the same time, you could do inference on the CPU...
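For reference, a minimal sketch of CPU-side inference. It constructs the model directly instead of calling the repo's load_model() (which moves the model to .cuda()); the checkpoint path and the checkpoint key layout here are assumptions, so adapt them to your setup:

```python
import torch
from hparams import create_hparams
from model import Tacotron2

hparams = create_hparams()
model = Tacotron2(hparams)  # stays on the CPU; load_model() would call .cuda()

# map_location="cpu" keeps the checkpoint tensors off the GPU entirely.
ckpt = torch.load("outdir/checkpoint_80000", map_location="cpu")
model.load_state_dict(ckpt.get("state_dict", ckpt))  # "state_dict" key is an assumption
model.eval()
```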
No, I have 2 x 1080 Ti in my machine and used CUDA_VISIBLE_DEVICES to put inference and training on separate GPUs, but I was getting the error above until I stopped training on GPU 0.
When running inference, can you confirm that the PyTorch code only has access to one of the GPUs?
After thinking about it for some time: perhaps not. I ran:
CUDA_VISIBLE_DEVICE=1 /home/soul/anaconda3/bin/ipython notebook --no-browser --port=8889
But that command limits only the ipython server process, rather than the Python kernel process that actually talks to the GPUs...
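For reference, one way to make sure the kernel itself only sees one GPU is to set CUDA_VISIBLE_DEVICES (note the plural) from inside the notebook, before torch is imported; a minimal sketch:

```python
# Must run in a fresh kernel, before anything imports torch and initializes CUDA.
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "1"  # "1" is an example; pick the idle GPU

import torch
print(torch.cuda.device_count())      # should report 1
print(torch.cuda.get_device_name(0))  # should be the GPU you picked
```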
I had the same problem, and reducing the batch size made it work.
Closing due to inactivity.
I fixed this by reducing the batch size to 32 in hparams:
batch_size=32
I use an Nvidia GTX 1080 Ti with 11 GiB of memory.
Tutorial: Training on GPU with Colab, Inference with CPU on Server here.
Hello everyone, I am having the same GPU memory issue.
Since I'm already using a batch size of 8, I don't think lowering it any further would be beneficial. That said, I'm thinking of trying to clear the cache (torch.cuda.empty_cache()) at the end of every epoch, because I think allocations are accumulating and filling up the cache; the OOM consistently pops up after about 3 epochs.
Will let you know if it works out.
UPDATE (23/04/14): I'm running the training on a server using Slurm, and after doing the above the Slurm job still gets killed automatically. So clearing the cache after each epoch doesn't seem to work...
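For reference, a sketch of the per-epoch cache clearing described above; the training loop itself is illustrative, not the repo's actual train.py:

```python
import torch

def train(model, train_loader, optimizer, criterion, n_epochs):
    for epoch in range(n_epochs):
        for x, y in train_loader:
            optimizer.zero_grad()
            loss = criterion(model(x), y)
            loss.backward()
            optimizer.step()
        # Releases unused blocks from PyTorch's caching allocator back to the
        # driver. Tensors that are still referenced keep their memory, so this
        # cannot fix a genuine leak or an undersized GPU.
        torch.cuda.empty_cache()
```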
I finally got all the errors resolved, but then this new one came up:
RuntimeError: cuda runtime error (2) : out of memory at /pytorch/aten/src/THC/generic/THCStorage.cu:58
Here is the full log:
~/tacotron2$ python3 train.py --output_directory=outdir --log_directory=logdir
FP16 Run: False
Dynamic Loss Scaling: True
Distributed Run: False
cuDNN Enabled: True
cuDNN Benchmark: False
/home/mrbreadwater/tacotron2/layers.py:35: UserWarning: nn.init.xavier_uniform is now deprecated in favor of nn.init.xavier_uniform_.
  self.conv.weight, gain=torch.nn.init.calculate_gain(w_init_gain))
/home/mrbreadwater/tacotron2/layers.py:15: UserWarning: nn.init.xavier_uniform is now deprecated in favor of nn.init.xavier_uniform_.
  gain=torch.nn.init.calculate_gain(w_init_gain))
Epoch: 0
THCudaCheck FAIL file=/pytorch/aten/src/THC/generic/THCStorage.cu line=58 error=2 : out of memory
Traceback (most recent call last):
  File "train.py", line 291, in <module>
    args.warm_start, args.n_gpus, args.rank, args.group_name, hparams)
  File "train.py", line 216, in train
    y_pred = model(x)
  File "/home/mrbreadwater/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 491, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/mrbreadwater/tacotron2/model.py", line 510, in forward
    encoder_outputs, targets, memory_lengths=input_lengths)
  File "/home/mrbreadwater/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 491, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/mrbreadwater/tacotron2/model.py", line 403, in forward
    decoder_input)
  File "/home/mrbreadwater/tacotron2/model.py", line 363, in decode
    attention_weights_cat, self.mask)
  File "/home/mrbreadwater/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 491, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/mrbreadwater/tacotron2/model.py", line 77, in forward
    attention_hidden_state, processed_memory, attention_weights_cat)
  File "/home/mrbreadwater/tacotron2/model.py", line 60, in get_alignment_energies
    processed_query + processed_attention_weights + processed_memory))
RuntimeError: cuda runtime error (2) : out of memory at /pytorch/aten/src/THC/generic/THCStorage.cu:58
I'm using a GTX 1050 Ti. Anything I can do to fix it?
EDIT: I'm running Ubuntu 18.04
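For anyone landing here later: the fix that worked in this thread is lowering batch_size. A hedged sketch for a 4 GB card such as the 1050 Ti, assuming your copy of hparams.py still exposes create_hparams() with a comma-separated override string (12 follows the rough 3-samples-per-GB guideline discussed above; 8 was also tried):

```python
from hparams import create_hparams

# Override batch_size without editing hparams.py; the value is illustrative.
hparams = create_hparams("batch_size=12")
print(hparams.batch_size)  # 12
```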