WaveNet multi-GPU training encounter error

wan-wei commented 5 years ago

I want to train WaveNet in a multi-GPU setting, because the hyperparameters in paper_hparams.py:

wavenet_batch_size=8
layers=24
hidden_size=512

cause an out-of-memory (OOM) error on a single GPU unless I decrease at least one of them.

I also noticed that wavenet_vocoder/train.py already contains multi-GPU logic. However, when I start training with python train.py --hparams wavenet_batch_size=4,wavenet_num_gpus=2, I get the following error:

Traceback (most recent call last):
  File "train.py", line 138, in <module>
    main()
  File "train.py", line 132, in main
    train(args, log_dir, hparams)
  File "train.py", line 83, in train
    checkpoint = wavenet_train(args, log_dir, hparams, input_path)
  File "/home/weiwan/tts/Tacotron-2/wavenet_vocoder/train.py", line 346, in wavenet_train
    return train(log_dir, args, hparams, input_path)
  File "/home/weiwan/tts/Tacotron-2/wavenet_vocoder/train.py", line 231, in train
    eval_model = model_test_mode(args, feeder, hparams, global_step)
  File "/home/weiwan/tts/Tacotron-2/wavenet_vocoder/train.py", line 190, in model_test_mode
    feeder.eval_input_lengths)
  File "/home/weiwan/tts/Tacotron-2/wavenet_vocoder/models/wavenet.py", line 372, in initialize
    softmax=False, quantize=True, log_scale_min=hparams.log_scale_min, log_scale_min_gauss=hparams.log_scale_min_gauss)
  File "/home/weiwan/tts/Tacotron-2/wavenet_vocoder/models/wavenet.py", line 891, in incremental
    swap_memory=self._hparams.wavenet_swap_with_cpu)
  File "/home/weiwan/tf1.10/lib/python3.5/site-packages/tensorflow/python/ops/control_flow_ops.py", line 3232, in while_loop
    return_same_structure)
  File "/home/weiwan/tf1.10/lib/python3.5/site-packages/tensorflow/python/ops/control_flow_ops.py", line 2952, in BuildLoop
    pred, body, original_loop_vars, loop_vars, shape_invariants)
  File "/home/weiwan/tf1.10/lib/python3.5/site-packages/tensorflow/python/ops/control_flow_ops.py", line 2887, in _BuildLoop
    body_result = body(*packed_vars_for_body)
  File "/home/weiwan/tts/Tacotron-2/wavenet_vocoder/models/wavenet.py", line 822, in body
    x = self.first_conv.incremental_step(current_input)
  File "/home/weiwan/tts/Tacotron-2/wavenet_vocoder/models/modules.py", line 388, in incremental_step
    output = self(inputs, incremental=True, convolution_queue=unused_queue) #Drop unused queue
  File "/home/weiwan/tf1.10/lib/python3.5/site-packages/tensorflow/python/keras/engine/base_layer.py", line 736, in __call__
    outputs = self.call(inputs, *args, **kwargs)
  File "/home/weiwan/tts/Tacotron-2/wavenet_vocoder/models/modules.py", line 384, in call
    return super(Conv1D1x1, self).call(inputs, incremental=incremental, convolution_queue=convolution_queue)
  File "/home/weiwan/tts/Tacotron-2/wavenet_vocoder/models/modules.py", line 295, in call
    output = tf.matmul(tf.reshape(inputs, [batch_size, -1]), self.linearized_weights)
  File "/home/weiwan/tf1.10/lib/python3.5/site-packages/tensorflow/python/ops/math_ops.py", line 2018, in matmul
    a, b, transpose_a=transpose_a, transpose_b=transpose_b, name=name)
  File "/home/weiwan/tf1.10/lib/python3.5/site-packages/tensorflow/python/ops/gen_math_ops.py", line 4456, in mat_mul
    name=name)
  File "/home/weiwan/tf1.10/lib/python3.5/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "/home/weiwan/tf1.10/lib/python3.5/site-packages/tensorflow/python/util/deprecation.py", line 454, in new_func
    return func(*args, **kwargs)
  File "/home/weiwan/tf1.10/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 3155, in create_op
    op_def=op_def)
  File "/home/weiwan/tf1.10/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 1746, in __init__
    self._control_flow_post_processing()
  File "/home/weiwan/tf1.10/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 1755, in _control_flow_post_processing
    control_flow_util.CheckInputFromValidContext(self, input_tensor.op)
  File "/home/weiwan/tf1.10/lib/python3.5/site-packages/tensorflow/python/ops/control_flow_util.py", line 314, in CheckInputFromValidContext
    raise ValueError(error_msg + " See info log for more details.")
ValueError: Cannot use 'WaveNet_model_1/inference_1/while/input_convolution/input_convolution/input_convolution/MatMul' as input to 'WaveNet_model_1/inference/while/input_convolution/Reshape' because they are in different while loops. See info log for more details.
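
For reference, here is a rough sketch of the batch arithmetic behind the command above. It assumes that wavenet_batch_size is a global batch that gets split evenly across wavenet_num_gpus devices; I have not verified that this is how the repository divides the batch. The helper names parse_hparams and per_gpu_batch are hypothetical and exist only for this illustration.

```python
def parse_hparams(overrides: str) -> dict:
    """Parse a comma-separated 'name=value' override string into a dict of ints."""
    pairs = (item.split("=", 1) for item in overrides.split(",") if item)
    return {name.strip(): int(value) for name, value in pairs}


def per_gpu_batch(hparams: dict) -> int:
    """Return how many examples each GPU would receive per step.

    ASSUMPTION: the global batch is split evenly across GPUs. This is an
    illustration only, not the repository's actual code.
    """
    batch = hparams["wavenet_batch_size"]
    gpus = hparams.get("wavenet_num_gpus", 1)
    if batch % gpus != 0:
        raise ValueError("wavenet_batch_size must be divisible by wavenet_num_gpus")
    return batch // gpus


# The override string from the command above.
overrides = parse_hparams("wavenet_batch_size=4,wavenet_num_gpus=2")
print(per_gpu_batch(overrides))
```

Under that assumption, the command would place 2 examples on each GPU, compared with 8 on one GPU for the paper settings, which is why I expected it to avoid the OOM.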

mahdeto commented 5 years ago

can confirm the issue

venxca123 commented 5 years ago

same error here

liangwq commented 5 years ago

Same error here. When I try with 1 GPU, I get a different error:

Traceback (most recent call last):
  File "/home/cuimi/anaconda3/envs/tensorflow3/lib/python3.6/threading.py", line 916, in _bootstrap_inner
    self.run()
  File "/home/cuimi/anaconda3/envs/tensorflow3/lib/python3.6/threading.py", line 864, in run
    self._target(*self._args, **self._kwargs)
  File "/data/liangweiqi/Tacotron-2/wavenet_vocoder/feeder.py", line 230, in _enqueue_next_test_group
    self._session.run(self._eval_enqueue_op, feed_dict=feed_dict)
  File "/home/cuimi/anaconda3/envs/tensorflow3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 929, in run
    run_metadata_ptr)
  File "/home/cuimi/anaconda3/envs/tensorflow3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1152, in _run
    feed_dict_tensor, options, run_metadata)
  File "/home/cuimi/anaconda3/envs/tensorflow3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1328, in _do_run
    run_metadata)
  File "/home/cuimi/anaconda3/envs/tensorflow3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1348, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.CancelledError: Run call was cancelled

jkkj1630 commented 4 years ago

I got the same error when training on multiple GPUs.

Rayhane-mamah / Tacotron-2

WaveNet multi-GPU training encounter error #327