NVIDIA / flowtron

Flowtron is an auto-regressive, flow-based generative network for text-to-speech synthesis with control over speech variation and style transfer.
https://nv-adlr.github.io/Flowtron
Apache License 2.0

docker fp16 CUDNN_STATUS_BAD_PARAM #95

Closed serg06 closed 3 years ago

serg06 commented 3 years ago

System:

Problem:

Whenever I try to run train.py with "fp16_run": true, it fails immediately in the first epoch:

Epoch: 0
Traceback (most recent call last):
  File "train.py", line 472, in <module>
    train(n_gpus, rank, **train_config)
  File "train.py", line 358, in train
    mel, speaker_vecs, text, in_lens, out_lens, attn_prior)
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 726, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/workspace/flowtron/flowtron.py", line 690, in forward
    text = self.encoder(text, in_lens)
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 726, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/workspace/flowtron/flowtron.py", line 356, in forward
    outputs, _ = self.lstm(x)
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 726, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/rnn.py", line 580, in forward
    self.num_layers, self.dropout, self.training, self.bidirectional)
RuntimeError: cuDNN error: CUDNN_STATUS_BAD_PARAM

What I've tried:

Workarounds:

Debugging:

kevjshih commented 3 years ago

I think the mixed-precision training isn't playing well with the LSTM call. Can you try wrapping it in a `with autocast(enabled=False):` context?

serg06 commented 3 years ago

@kevjshih I tried this:

        self.lstm.flatten_parameters()
        with amp.autocast(enabled=False):
            outputs, _ = self.lstm(x)

and this:

        with amp.autocast(enabled=False):
            self.lstm.flatten_parameters()
            outputs, _ = self.lstm(x)
and neither worked; I got the exact same error:

```
Epoch: 0
Traceback (most recent call last):
  File "train.py", line 472, in <module>
    train(n_gpus, rank, **train_config)
  File "train.py", line 358, in train
    mel, speaker_vecs, text, in_lens, out_lens, attn_prior)
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 726, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/workspace/flowtron/flowtron.py", line 692, in forward
    text = self.encoder(text, in_lens)
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 726, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/workspace/flowtron/flowtron.py", line 358, in forward
    outputs, _ = self.lstm(x)
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 726, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/rnn.py", line 580, in forward
    self.num_layers, self.dropout, self.training, self.bidirectional)
RuntimeError: cuDNN error: CUDNN_STATUS_BAD_PARAM
```
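One possible reason the workaround above still fails (an assumption on my part, not confirmed in this thread): `autocast(enabled=False)` only changes how ops *inside* the block are dispatched; it does not change the dtype of a tensor that was already produced in float16 upstream. So the LSTM still receives an fp16 input against fp32 weights, which cuDNN rejects. A minimal sketch of that dtype behavior, with an explicit `.float()` cast on the input (the tensor shapes and module here are hypothetical stand-ins, not Flowtron's actual encoder):

```python
import torch
import torch.nn as nn

# Stand-in for the encoder's LSTM; sizes are arbitrary for illustration.
lstm = nn.LSTM(input_size=8, hidden_size=8, batch_first=True)

# Stand-in for an activation produced in fp16 by an upstream autocast region.
x_fp16 = torch.randn(2, 5, 8).half()

with torch.autocast(device_type="cpu", enabled=False):
    # Disabling autocast does NOT convert tensors that already exist:
    # x_fp16 is still float16 inside this block.
    assert x_fp16.dtype == torch.float16
    # Casting the input explicitly makes it match the fp32 LSTM weights.
    outputs, _ = lstm(x_fp16.float())

print(outputs.dtype)   # float32
print(tuple(outputs.shape))
```

In other words, the fix would be `self.lstm(x.float())` inside the disabled-autocast block rather than `self.lstm(x)` alone, if this diagnosis is right.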
rafaelvalle commented 3 years ago

Make sure to update to the latest version of PyTorch. Closing due to inactivity.