NVIDIA / flowtron

Flowtron is an auto-regressive, flow-based generative network for text-to-speech synthesis with control over speech variation and style transfer.
https://nv-adlr.github.io/Flowtron
Apache License 2.0

docker fp16 CUDNN_STATUS_BAD_PARAM #95

Closed serg06 closed 3 years ago

serg06 commented 3 years ago

System:

Problem:

Whenever I try to run train.py with "fp16_run": true, it fails immediately in the first epoch:

Epoch: 0
Traceback (most recent call last):
  File "train.py", line 472, in <module>
    train(n_gpus, rank, **train_config)
  File "train.py", line 358, in train
    mel, speaker_vecs, text, in_lens, out_lens, attn_prior)
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 726, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/workspace/flowtron/flowtron.py", line 690, in forward
    text = self.encoder(text, in_lens)
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 726, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/workspace/flowtron/flowtron.py", line 356, in forward
    outputs, _ = self.lstm(x)
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 726, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/rnn.py", line 580, in forward
    self.num_layers, self.dropout, self.training, self.bidirectional)
RuntimeError: cuDNN error: CUDNN_STATUS_BAD_PARAM

What I've tried:

Workarounds:

Debugging:

kevjshih commented 3 years ago

I think the mixed-precision training isn't playing well with the LSTM call. Can you try wrapping it in a `with autocast(enabled=False):` context?

serg06 commented 3 years ago

@kevjshih I tried this:

        self.lstm.flatten_parameters()
        with amp.autocast(enabled=False):
            outputs, _ = self.lstm(x)

and this:

        with amp.autocast(enabled=False):
            self.lstm.flatten_parameters()
            outputs, _ = self.lstm(x)
and neither worked; I got the exact same error:

```
Epoch: 0
Traceback (most recent call last):
  File "train.py", line 472, in <module>
    train(n_gpus, rank, **train_config)
  File "train.py", line 358, in train
    mel, speaker_vecs, text, in_lens, out_lens, attn_prior)
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 726, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/workspace/flowtron/flowtron.py", line 692, in forward
    text = self.encoder(text, in_lens)
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 726, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/workspace/flowtron/flowtron.py", line 358, in forward
    outputs, _ = self.lstm(x)
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 726, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/rnn.py", line 580, in forward
    self.num_layers, self.dropout, self.training, self.bidirectional)
RuntimeError: cuDNN error: CUDNN_STATUS_BAD_PARAM
```
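One possible reason the workaround above still fails (an assumption on my part, not confirmed in this thread): `autocast(enabled=False)` only changes how ops *inside* the block are dispatched; it does not change the dtype of a tensor that was already produced in float16 upstream. So the LSTM still receives an fp16 input against fp32 weights, which cuDNN rejects. A minimal sketch of that dtype behavior, with an explicit `.float()` cast on the input (the tensor shapes and module here are hypothetical stand-ins, not Flowtron's actual encoder):

```python
import torch
import torch.nn as nn

# Stand-in for the encoder's LSTM; sizes are arbitrary for illustration.
lstm = nn.LSTM(input_size=8, hidden_size=8, batch_first=True)

# Stand-in for an activation produced in fp16 by an upstream autocast region.
x_fp16 = torch.randn(2, 5, 8).half()

with torch.autocast(device_type="cpu", enabled=False):
    # Disabling autocast does NOT convert tensors that already exist:
    # x_fp16 is still float16 inside this block.
    assert x_fp16.dtype == torch.float16
    # Casting the input explicitly makes it match the fp32 LSTM weights.
    outputs, _ = lstm(x_fp16.float())

print(outputs.dtype)   # float32
print(tuple(outputs.shape))
```

In other words, the fix would be `self.lstm(x.float())` inside the disabled-autocast block rather than `self.lstm(x)` alone, if this diagnosis is right.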
rafaelvalle commented 3 years ago

Make sure to update to the latest version of PyTorch. Closing due to inactivity.