NVIDIA / OpenSeq2Seq

Toolkit for efficient experimentation with Speech Recognition, Text2Speech and NLP
https://nvidia.github.io/OpenSeq2Seq
Apache License 2.0

Tacotron GST cuDNN launch failure #337

Closed stefbraun closed 5 years ago

stefbraun commented 5 years ago

Hi, I am trying to train tacotron-gst on a single GPU (11GB 2080 Ti). However, I get a cuDNN launch failure (see below). Can anyone help with this? Thanks!

Versions: Ubuntu 16.04, TensorFlow 1.10, CUDA 9.0.176, cuDNN 7.0.5

Training options: num_gpu=1, batch size: 16

    Epoch 0, global step 0: Train loss: 24.9268 time per step = 0:00:0.420
    WARNING: Train_mag audio was clipped at step 0
    WARNING: Train audio was clipped at step 0
    2019-01-14 13:39:53.899786: E tensorflow/stream_executor/cuda/cuda_dnn.cc:81] CUDNN_STATUS_EXECUTION_FAILED in tensorflow/stream_executor/cuda/cuda_dnn.cc(2454): 'cudnnConvolutionForward( cudnn.handle(), alpha, input_nd.handle(), input_data.opaque(), filter.handle(), filter_data.opaque(), conv.handle(), ToConvForwardAlgo(algo_desc), scratch.opaque(), scratch.size(), beta, output_nd.handle(), output_data->opaque())'
    Traceback (most recent call last):
      File "/home/brauns/anaconda3/envs/oseq/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1278, in _do_call
        return fn(*args)
      File "/home/brauns/anaconda3/envs/oseq/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1263, in _run_fn
        options, feed_dict, fetch_list, target_list, run_metadata)
      File "/home/brauns/anaconda3/envs/oseq/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1350, in _call_tf_sessionrun
        run_metadata)
    tensorflow.python.framework.errors_impl.InternalError: cuDNN launch failure : input shape([16,1,1,169]) filter shape([1,32,1,32])
      [[Node: ForwardPass/tacotron_2_decoder/decoder/while/BasicDecoderStep/decoder/attention_wrapper/location_attention/location_conv/conv1d/Conv2D = Conv2D[T=DT_FLOAT, class=["loc:@Loss...ad/Reshape"], data_format="NCHW", dilations=[1, 1, 1, 1], padding="SAME", strides=[1, 1, 1, 1], use_cudnn_on_gpu=true, _device="/job:localhost/replica:0/task:0/device:GPU:0"](ForwardPass/tacotron_2_decoder/decoder/while/BasicDecoderStep/decoder/attention_wrapper/location_attention/location_conv/conv1d/Conv2D-0-TransposeNHWCToNCHW-LayoutOptimizer, ForwardPass/tacotron_2_decoder/decoder/while/BasicDecoderStep/decoder/attention_wrapper/location_attention/location_conv/conv1d/ExpandDims_1)]]
      [[Node: ForwardPass/Loss/tacotron_loss/mean_squared_error/num_present/broadcast_weights/assert_broadcastable/AssertGuard/Assert/Switch_1/_2881 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge2840...t/Switch_1", tensor_type=DT_INT32, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]

blisc commented 5 years ago

Please update CUDA to 10.0 and cuDNN to 7.4, as support for Turing cards was added recently.

Afterwards, can you try running it inside our Docker container, so that we can eliminate issues between TensorFlow and the other library versions? See here for instructions on using the Docker image.

Let us know if the issue persists.
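
As a side check (not a required step), something along these lines should confirm, inside the container, that TensorFlow was built with CUDA and can actually see the GPU; it is only a minimal sketch using standard TF 1.x calls:

    # Quick environment sanity check; assumes a TF 1.x build inside the container.
    import tensorflow as tf
    from tensorflow.python.client import device_lib

    print("TF version:     ", tf.__version__)
    print("Built with CUDA:", tf.test.is_built_with_cuda())
    print("GPU available:  ", tf.test.is_gpu_available(cuda_only=True))
    print("Local devices:  ", [d.name for d in device_lib.list_local_devices()])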

trevor-m commented 5 years ago

@blisc I am encountering the same segfault issue. I am using the gitlab-master.nvidia.com:5005/dl/dgx/tensorflow:master-py3-devel container, but I have compiled TensorFlow from the upstream master branch. I have CUDA 10.0.130 and cuDNN 7.4 patch 2.

Edit: Looks like this is caused by DataLayer in upstream master being broken, so not an OpenSeq2Seq bug.

borisgin commented 5 years ago

Yes, we see this issue only with the latest TF master. The NVIDIA container "1.12" should be OK.

stefbraun commented 5 years ago

Thanks! I used the docker image, and the cuDNN launch failure went away. However, now I get a different error:

tensorflow.python.framework.errors_impl.UnknownError: FileNotFoundError: [Errno 2] No such file or directory: '/dataset/en_US/by_book/female/judy_bieber/the_master_key/wavs/the_master_key_05_f000135.wav'

I have followed the steps from the docs and downloaded the M-AILABS dataset from the link given there. I have re-downloaded it twice, but the file is not there.

Update: It looks like three files are missing when tacotron_gst_combine.py is creating train.csv (see the quick check sketched after this list):

  1. en_US/by_book/female/judy_bieber/the_master_key/wavs/the_master_key_05_f000135.wav
  2. en_US/by_book/female/mary_ann/northandsouth/wavs/northandsouth_40_f000069.wav
  3. en_US/by_book/female/mary_ann/midnight_passenger/wavs/midnight_passenger_05_f000269.wav
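
For reference, a quick way to verify whether these clips are actually on disk is a check like the following (a minimal sketch; data_root is assumed to be the /dataset mount from the error above):

    # Check whether the three reportedly missing clips exist under the dataset root.
    import os

    data_root = "/dataset"  # assumed mount point, taken from the error message above
    missing_candidates = [
        "en_US/by_book/female/judy_bieber/the_master_key/wavs/the_master_key_05_f000135.wav",
        "en_US/by_book/female/mary_ann/northandsouth/wavs/northandsouth_40_f000069.wav",
        "en_US/by_book/female/mary_ann/midnight_passenger/wavs/midnight_passenger_05_f000269.wav",
    ]
    for rel_path in missing_candidates:
        status = "found" if os.path.exists(os.path.join(data_root, rel_path)) else "MISSING"
        print(status, rel_path)
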
borisgin commented 5 years ago

What is the location of the downloaded dataset? Is it under root /dataset:...?

blisc commented 5 years ago

The MAILABS dataset is not well organized. There are files listed inside the csv that do not have audio clips. You will have to manually remove these from the csv.

I’ll add a note to the docs.

zhengwy888 commented 5 years ago

Adding this one line would filter the non-existing files out:

    # Keep only the rows whose referenced .wav file actually exists on disk
    cleaned = files[files.apply(lambda x: os.path.exists(os.path.join(data_root, x['1'] + '.wav')), axis=1)]
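
For context, here is a self-contained sketch of how that filter could be applied to a generated csv and written back out. The separator, the column index, and the file names below are assumptions based on the one-liner above, not the actual tacotron_gst_combine.py code:

    # Hypothetical standalone clean-up script; the column layout, separator, and
    # paths are assumptions, not the real tacotron_gst_combine.py implementation.
    import csv
    import os

    import pandas as pd

    data_root = "/dataset"                    # assumed dataset root (adjust as needed)
    csv_in, csv_out = "train.csv", "train_cleaned.csv"

    # header=None yields integer column labels; the one-liner above indexes the
    # second field, which is assumed to hold the clip path without ".wav".
    files = pd.read_csv(csv_in, sep="|", header=None, quoting=csv.QUOTE_NONE)

    exists = files.apply(
        lambda row: os.path.exists(os.path.join(data_root, str(row[1]) + ".wav")),
        axis=1,
    )
    cleaned = files[exists]
    print("dropped %d rows with missing audio" % (~exists).sum())

    cleaned.to_csv(csv_out, sep="|", header=False, index=False, quoting=csv.QUOTE_NONE)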