Rayhane-mamah / Tacotron-2

DeepMind's Tacotron-2 Tensorflow implementation
MIT License
2.27k stars 905 forks

Test "python3 train.py --model='WaveNet' ",get exception "Conv2DCustomBackpropInputOp only supports NHWC." #140

Open DayanJ opened 6 years ago

DayanJ commented 6 years ago

I used the LJSpeech-1.1 data to test this:
1. After running "python3 wavenet_preprocess.py", I get these files. [screenshot]
2. I modified hparams.py and set "train_with_GTA" to False.
3. After running "python3 train.py --model='WaveNet'", I got these errors. [screenshot]

My tensorflow version is 1.7.1 and I can't fix this error.

DanRuta commented 6 years ago

What are your:

OS version, CUDA version, Python version, and GPU? And do you have the GPU or CPU build of tensorflow installed (or both)?

You can also track this issue here: https://github.com/Rayhane-mamah/Tacotron-2/issues/73 https://github.com/Rayhane-mamah/Tacotron-2/issues/87

gloriouskilka commented 6 years ago

I have the same issue when I try to run this on CPU.

@DayanJ According to your log, the op is placed on the CPU, and I guess the CPU version of the op only supports the NHWC order.

If you're going to use your GPU, you should fix whatever prevents TensorFlow from placing the op on the GPU, probably by uninstalling the CPU version of TensorFlow.

If you're going to use the CPU, I guess you can quickly fix this by reordering to NHWC before this op with tf.transpose, like this: https://stackoverflow.com/questions/37689423/convert-between-nhwc-and-nchw-in-tensorflow

Update: I tried the quick fix and it didn't work; I'm a TensorFlow noob, so don't take my word for it. Update 2: something like that, probably: https://github.com/gloriouskilka/Tacotron-2-fork/commit/a56b3007e109fae439ec752d9a9ef5384732c789
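For reference, the reordering trick from that Stack Overflow answer boils down to something like this (TF 1.x, untested sketch; the tensor shape here is made up, not the one this repo uses):

```python
import tensorflow as tf

# Made-up NCHW tensor: (batch, channels, height, width).
x_nchw = tf.placeholder(tf.float32, shape=[None, 256, 1, 1000])

# NCHW -> NHWC so the CPU-only kernel accepts the input.
x_nhwc = tf.transpose(x_nchw, perm=[0, 2, 3, 1])   # (batch, height, width, channels)

# ...run the convolution here with data_format='NHWC' / 'channels_last'...

# NHWC -> NCHW to hand the result back to code that expects channels-first.
x_back = tf.transpose(x_nhwc, perm=[0, 3, 1, 2])
```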

Rayhane-mamah commented 6 years ago

Hello @DayanJ, as suggested by @gloriouskilka, please make sure you only have the GPU version of TensorFlow installed. This is most likely a bug that occurs when you try to run WaveNet on the CPU.
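If you are not sure which build is actually being used, a quick check like this (TF 1.x) shows whether any GPU is visible to TensorFlow:

```python
import tensorflow as tf
from tensorflow.python.client import device_lib

# Lists every device TensorFlow can place ops on; if no '/device:GPU:0'
# entry shows up, the CPU-only package is the one being imported.
print(device_lib.list_local_devices())

# An empty string here also means no GPU is visible to this process.
print(tf.test.gpu_device_name())
```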

Hayes515 commented 6 years ago

Hi @gloriouskilka, I am DayanJ; this is my new account. I hadn't installed tf-gpu before. I tried this advice (gloriouskilka@a56b300), and it works when using only the CPU. @Rayhane-mamah @DanRuta @gloriouskilka Thanks for your help.

gloriouskilka commented 6 years ago

@Hayes515 Hi! My quick fix is a bad idea, just a proof of concept. You should switch to an Nvidia GPU if you have one, because otherwise you will be training your network on the CPU until the end of days, I think.

Usually people install both tensorflow and tensorflow-gpu, and sometimes the CPU version of tensorflow prevents the GPU from being used, so the main advice is: uninstall tensorflow and install only tensorflow-gpu.

Hayes515 commented 6 years ago

@gloriouskilka Hi! You are right. I have switched to an Nvidia GPU, but it took me two days to finish. I installed the packages with Anaconda3 in a new environment named T3; this way is convenient. [screenshots]

The state of my GPU is shown below. [screenshot]

Thank you!

gloriouskilka commented 6 years ago

@Hayes515 Yay! You're welcome!

I guess we can close this issue, because it already contains all possible solutions, with nice screenshots.

Rayhane-mamah commented 6 years ago

One last thing before closing this: @Hayes515, you may want to keep your 2nd GPU free, as it is holding the model graph for no particular reason. To do this, please add os.environ["CUDA_VISIBLE_DEVICES"] = "0" at the following location: https://github.com/Rayhane-mamah/Tacotron-2/blob/e244457bf5a9fa8d308e97d61cf3dc1933575488/train.py#L36-L37

That will prevent the run from seeing your 2nd GPU; it seems your graphics display is handled by that GPU, so there you go :) Naturally, if you want to make multiple runs in parallel, you can follow my comment here.
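To spell it out, the addition is just something like this near the top of train.py (a sketch of the idea; see the linked lines for the exact spot):

```python
# train.py -- set this before any TensorFlow graph or session is created.
import os

# Expose only GPU 0 to this run, so the 2nd GPU (which drives the display)
# stays free.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
```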

Feel free to close the issue if there are no other problems related to it. Thanks for using our work ;)

yvt commented 6 years ago

I need to run the model on the CPU for testing purposes (because a machine with a GPU is currently occupied by another variation of this model), so I would be glad if it could run on the CPU.

It looks like the "channel" part of the transposed convolution input is temporarily inserted here:

https://github.com/Rayhane-mamah/Tacotron-2/blob/d13dbba16f0a434843916b5a8647a42fe34544f5/wavenet_vocoder/models/wavenet.py#L467-L475

And here:

https://github.com/Rayhane-mamah/Tacotron-2/blob/d13dbba16f0a434843916b5a8647a42fe34544f5/wavenet_vocoder/models/wavenet.py#L549-L554

I guess that this issue, the restriction of the CPU implementation of Conv2DTranspose, can be worked around by inserting a new dimension as the last dimension (axis=3, NHWC) instead of as the second dimension (axis=1, NCHW). (Also don't forget to change data_format to channels_last)

I'm not entirely sure, because this is based on the assumption that these are the only places where Conv2DTranspose is used, and I haven't gotten familiar with this code base yet. Also, I'm not sure how this is fundamentally different from the quick fixes mentioned by @gloriouskilka. I would really appreciate it if someone could confirm whether this is the right way to go.
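To make the idea concrete, here is an untested sketch of what I mean (TF 1.x; the variable names, shapes, and kernel sizes are placeholders, not the ones used in wavenet.py):

```python
import tensorflow as tf

# Placeholder conditioning features: (batch, mel_channels, time).
c = tf.placeholder(tf.float32, shape=[None, 80, 100])

# Current approach (roughly): dummy channel dim at axis=1 -> NCHW,
# which the CPU kernels reject.
# c_nchw = tf.expand_dims(c, axis=1)    # (batch, 1, mel_channels, time)

# Workaround: dummy channel dim at axis=3 -> NHWC, accepted on the CPU.
c_nhwc = tf.expand_dims(c, axis=3)      # (batch, mel_channels, time, 1)

upsample = tf.layers.Conv2DTranspose(
    filters=1, kernel_size=(3, 16), strides=(1, 16),  # upsample the time axis
    padding='same', data_format='channels_last')
c_up = upsample(c_nhwc)                 # (batch, mel_channels, time * 16, 1)

# Drop the dummy channel dim again before handing the features back.
c_up = tf.squeeze(c_up, axis=3)         # (batch, mel_channels, time * 16)
```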

KarolinaPondel commented 2 years ago

Hi guys,

I'm trying to get WaveNet training working and I keep running into this problem. I can't find where it comes from and don't know how to fix it. Or is there an update on this issue? I only have tensorflow-gpu installed. Tacotron training went through without any problems.

Exiting due to exception: Conv2DCustomBackpropInputOp only supports NHWC.
[[node WaveNet_model/optimizer_1/gradients/WaveNet_model/inference/final_convolution_2/final_convolution_2_1/final_convolution_2/conv1d/Conv2D_grad/Conv2DBackpropInput (defined at /notebooks/Tacotron-2/wavenet_vocoder/models/wavenet.py:557) = Conv2DBackpropInput[T=DT_FLOAT, data_format="NCHW", dilations=[1, 1, 1, 1], padding="VALID", strides=[1, 1, 1, 1], use_cudnn_on_gpu=true, _device="/job:localhost/replica:0/task:0/device:CPU:0"](WaveNet_model/optimizer_1/gradients/WaveNet_model/inference/final_convolution_2/final_convolution_2_1/final_convolution_2/conv1d/Conv2D_grad/ShapeN, WaveNet_model/inference/final_convolution_2/final_convolution_2_1/final_convolution_2/conv1d/ExpandDims_1, WaveNet_model/optimizer_1/gradients/WaveNet_model/inference/final_convolution_2/final_convolution_2_1/final_convolution_2/conv1d/Squeeze_grad/Reshape)]]

Caused by op 'WaveNet_model/optimizer_1/gradients/WaveNet_model/inference/final_convolution_2/final_convolution_2_1/final_convolution_2/conv1d/Conv2D_grad/Conv2DBackpropInput', defined at:
  File "train.py", line 138, in <module>
    main()
  File "train.py", line 130, in main
    wavenet_train(args, log_dir, hparams, args.wavenet_input)
  File "/notebooks/Tacotron-2/wavenet_vocoder/train.py", line 346, in wavenet_train
    return train(log_dir, args, hparams, input_path)
  File "/notebooks/Tacotron-2/wavenet_vocoder/train.py", line 230, in train
    model, stats = model_train_mode(args, feeder, hparams, global_step)
  File "/notebooks/Tacotron-2/wavenet_vocoder/train.py", line 178, in model_train_mode
    model.add_optimizer(global_step)
  File "/notebooks/Tacotron-2/wavenet_vocoder/models/wavenet.py", line 557, in add_optimizer
    gradients = optimizer.compute_gradients(self.tower_loss[i])
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/optimizer.py", line 519, in compute_gradients
    colocate_gradients_with_ops=colocate_gradients_with_ops)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/gradients_impl.py", line 630, in gradients
    gate_gradients, aggregation_method, stop_gradients)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/gradients_impl.py", line 814, in _GradientsHelper
    lambda: grad_fn(op, out_grads))
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/gradients_impl.py", line 408, in _MaybeCompile
    return grad_fn()  # Exit early
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/gradients_impl.py", line 814, in <lambda>
    lambda: grad_fn(op, out_grads))
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/nn_grad.py", line 517, in _Conv2DGrad
    data_format=data_format),
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/gen_nn_ops.py", line 1229, in conv2d_backprop_input
    dilations=dilations, name=name)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/util/deprecation.py", line 488, in new_func
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/ops.py", line 3274, in create_op
    op_def=op_def)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/ops.py", line 1770, in __init__
    self._traceback = tf_stack.extract_stack()

...which was originally created as op 'WaveNet_model/inference/final_convolution_2/final_convolution_2_1/final_convolution_2/conv1d/Conv2D', defined at:
  File "train.py", line 138, in <module>
    main()
  [elided 2 identical lines from previous traceback]
  File "/notebooks/Tacotron-2/wavenet_vocoder/train.py", line 230, in train
    model, stats = model_train_mode(args, feeder, hparams, global_step)
  File "/notebooks/Tacotron-2/wavenet_vocoder/train.py", line 176, in model_train_mode
    feeder.input_lengths, x=feeder.inputs)
  File "/notebooks/Tacotron-2/wavenet_vocoder/models/wavenet.py", line 277, in initialize
    y_hat_train = self.step(tower_x[i], tower_c[i], tower_g[i], softmax=False) #softmax is automatically computed inside softmax_cross_entropy if needed
  File "/notebooks/Tacotron-2/wavenet_vocoder/models/wavenet.py", line 719, in step
    x = conv(x)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/keras/engine/base_layer.py", line 757, in __call__
    outputs = self.call(inputs, *args, **kwargs)
  File "/notebooks/Tacotron-2/wavenet_vocoder/models/modules.py", line 382, in call
    return super(Conv1D1x1, self).call(inputs, incremental=incremental, convolution_queue=convolution_queue)
  File "/notebooks/Tacotron-2/wavenet_vocoder/models/modules.py", line 319, in call
    outputs = self.layer.call(inputs)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/keras/layers/convolutional.py", line 384, in call
    return super(Conv1D, self).call(inputs)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/keras/layers/convolutional.py", line 194, in call
    outputs = self._convolution_op(inputs, self.kernel)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/nn_ops.py", line 868, in __call__
    return self.conv_op(inp, filter)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/nn_ops.py", line 520, in __call__
    return self.call(inp, filter)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/nn_ops.py", line 204, in __call__
    name=self.name)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/nn_ops.py", line 193, in _conv1d
    name=name)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/util/deprecation.py", line 553, in new_func
    return func(*args, **kwargs)