Rayhane-mamah / Tacotron-2

DeepMind's Tacotron-2 Tensorflow implementation
MIT License

Exiting due to exception: OOM when allocating tensor with shape #158

Closed: avivelor closed this issue 6 years ago

avivelor commented 6 years ago

Hi all,

I'm currently running Tacotron-2 with the following setup:

- OS: Ubuntu 16.04, 64-bit
- TensorFlow-GPU: 1.8 (built from source)
- Python: 3.6 (Anaconda)
- CUDA/cuDNN: 9.0/7.1
- GPU: Nvidia GTX 1080

I'm able to train tacotron fine, but when I hit wavenet I get the following error: "Exiting due to exception: OOM when allocating tensor with shape[9216,4,512] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc [[Node: model/inference/residual_block_conv_layer_ResidualConv1dGLU_29/transpose = Transpose[T=DT_FLOAT, Tperm=DT_INT32, _device="/job:localhost/replica:0/task:0/device:GPU:0"](model/inference/residual_block_conv_layer_ResidualConv1dGLU_29/Pad, model/optimizer/gradients/model/inference/residual_block_conv_layer_ResidualConv1dGLU_29/transpose_3_grad/InvertPermutation)]] Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info."

Does anyone know why I'm getting this error and/or how to fix it? Thank you for your time!

Full details can be found below:

```
#############################################################

Wavenet Train

###########################################################

Checkpoint_path: logs-Tacotron-2/wave_pretrained/wavenet_model.ckpt Loading training data from: tacotron_output/gta/map.txt Using model: Tacotron-2 Hyperparameters: allow_clipping_in_normalization: True attention_dim: 128 attention_filters: 32 attention_kernel: (31,) cin_channels: 80 cleaners: english_cleaners clip_mels_length: True cross_entropy_pos_weight: 1 cumulative_weights: True decoder_layers: 2 decoder_lstm_units: 1024 embedding_dim: 512 enc_conv_channels: 512 enc_conv_kernel_size: (5,) enc_conv_num_layers: 3 encoder_lstm_units: 256 fmax: 7600 fmin: 0 frame_shift_ms: None freq_axis_kernel_size: 3 gate_channels: 512 gin_channels: -1 griffin_lim_iters: 60 hop_size: 300 input_type: raw kernel_size: 3 layers: 30 leaky_alpha: 0.4 log_scale_min: -32.23619130191664 log_scale_min_gauss: -16.11809565095832 mask_decoder: False mask_encoder: False max_abs_value: 4.0 max_iters: 1000 max_mel_frames: 1300 max_time_sec: None max_time_steps: 8000 min_level_db: -100 n_fft: 2048 n_speakers: 5 natural_eval: False normalize_for_wavenet: True num_freq: 1025 num_mels: 80 out_channels: 2 outputs_per_step: 2 postnet_channels: 512 postnet_kernel_size: (5,) postnet_num_layers: 5 power: 1.5 predict_linear: True prenet_layers: [256, 256] quantize_channels: 65536 ref_level_db: 20 rescale: True rescaling_max: 0.999 residual_channels: 512 sample_rate: 24000 signal_normalization: True silence_threshold: 2 skip_out_channels: 256 smoothing: False stacks: 3 stop_at_any: True symmetric_mels: False tacotron_adam_beta1: 0.9 tacotron_adam_beta2: 0.999 tacotron_adam_epsilon: 1e-06 tacotron_batch_size: 32 tacotron_clip_gradients: False tacotron_data_random_state: 1234 tacotron_decay_learning_rate: True tacotron_decay_rate: 0.4 tacotron_decay_steps: 50000 tacotron_dropout_rate: 0.5 tacotron_final_learning_rate: 1e-05 tacotron_initial_learning_rate: 0.001 tacotron_random_seed: 5339 tacotron_reg_weight: 1e-06 tacotron_scale_regularization: True tacotron_start_decay: 50000 tacotron_swap_with_cpu: False tacotron_synthesis_batch_size: 512 tacotron_teacher_forcing_decay_alpha: 0.0 tacotron_teacher_forcing_decay_steps: 280000 tacotron_teacher_forcing_final_ratio: 0.0 tacotron_teacher_forcing_init_ratio: 1.0 tacotron_teacher_forcing_mode: constant tacotron_teacher_forcing_ratio: 1.0 tacotron_teacher_forcing_start_decay: 10000 tacotron_test_batches: 48 tacotron_test_size: None tacotron_zoneout_rate: 0.1 train_with_GTA: True trim_fft_size: 512 trim_hop_size: 128 trim_silence: True trim_top_db: 23 upsample_conditional_features: True upsample_scales: [15, 20] use_bias: True use_lws: False use_speaker_embedding: True wavenet_adam_beta1: 0.9 wavenet_adam_beta2: 0.999 wavenet_adam_epsilon: 1e-08 wavenet_batch_size: 4 wavenet_data_random_state: 1234 wavenet_dropout: 0.05 wavenet_ema_decay: 0.9999 wavenet_learning_rate: 0.001 wavenet_random_seed: 5339 wavenet_swap_with_cpu: False wavenet_synthesis_batch_size: 4 wavenet_test_batches: None wavenet_test_size: 0.0441 win_size: 1200 Initializing Wavenet model. Dimensions (? = dynamic shape): Train mode: True Eval mode: False Synthesis mode: False inputs: (?, 1, ?) local_condition: (?, 80, ?) targets: (?, ?) outputs: (?, ?) Initializing Wavenet model. Dimensions (? = dynamic shape): Train mode: False Eval mode: True Synthesis mode: False local_condition: (1, 80, ?) targets: (?,) outputs: (?,) Wavenet training set to a maximum of 1300000 steps

Generated 32 train batches of size 4 in 0.105 sec

Generated 578 test batches of size 1 in 0.376 sec Exiting due to exception: OOM when allocating tensor with shape[9216,4,512] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc [[Node: model/inference/residual_block_conv_layer_ResidualConv1dGLU_29/transpose = Transpose[T=DT_FLOAT, Tperm=DT_INT32, _device="/job:localhost/replica:0/task:0/device:GPU:0"](model/inference/residual_block_conv_layer_ResidualConv1dGLU_29/Pad, model/optimizer/gradients/model/inference/residual_block_conv_layer_ResidualConv1dGLU_29/transpose_3_grad/InvertPermutation)]] Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

 [[Node: model/inference/residual_block_skip_conv_layer_ResidualConv1dGLU_29/strided_slice_2/_1509 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_10290...ed_slice_2", tensor_type=DT_INT32, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]

Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

Caused by op 'model/inference/residual_block_conv_layer_ResidualConv1dGLU_29/transpose', defined at: File "train.py", line 133, in main() File "train.py", line 127, in main train(args, log_dir, hparams) File "train.py", line 80, in train checkpoint = wavenet_train(args, log_dir, hparams, input_path) File "/home/wbemergingtech/Desktop/Tacotron2-Rayhane-Implenetation/Tacotron-2/wavenet_vocoder/train.py", line 251, in wavenet_train return train(log_dir, args, hparams, input_path) File "/home/wbemergingtech/Desktop/Tacotron2-Rayhane-Implenetation/Tacotron-2/wavenet_vocoder/train.py", line 175, in train model, stats = model_train_mode(args, feeder, hparams, global_step) File "/home/wbemergingtech/Desktop/Tacotron2-Rayhane-Implenetation/Tacotron-2/wavenet_vocoder/train.py", line 123, in model_train_mode feeder.input_lengths, x=feeder.inputs) File "/home/wbemergingtech/Desktop/Tacotron2-Rayhane-Implenetation/Tacotron-2/wavenet_vocoder/models/wavenet.py", line 176, in initialize y_hat = self.step(x, c, g, softmax=False) #softmax is automatically computed inside softmax_cross_entropy if needed File "/home/wbemergingtech/Desktop/Tacotron2-Rayhane-Implenetation/Tacotron-2/wavenet_vocoder/models/wavenet.py", line 481, in step x, h = conv(x, c, g_bct) File "/home/wbemergingtech/Desktop/Tacotron2-Rayhane-Implenetation/Tacotron-2/wavenetvocoder/models/modules.py", line 277, in call x, s, = self.step(x, c, g, False) File "/home/wbemergingtech/Desktop/Tacotron2-Rayhane-Implenetation/Tacotron-2/wavenet_vocoder/models/modules.py", line 301, in step x = self.conv(x) File "/home/wbemergingtech/Desktop/Tacotron2-Rayhane-Implenetation/Tacotron-2/wavenetvocoder/models/modules.py", line 162, in call inputs = self._to_dilation(inputs) File "/home/wbemergingtech/Desktop/Tacotron2-Rayhane-Implenetation/Tacotron-2/wavenet_vocoder/models/modules.py", line 117, in _to_dilation inputs_transposed = tf.transpose(inputs_padded, [2, 0, 1]) File "/home/wbemergingtech/anaconda3/lib/python3.6/site-packages/tensorflow/python/ops/array_ops.py", line 1408, in transpose ret = transpose_fn(a, perm, name=name) File "/home/wbemergingtech/anaconda3/lib/python3.6/site-packages/tensorflow/python/ops/gen_array_ops.py", line 8636, in transpose "Transpose", x=x, perm=perm, name=name) File "/home/wbemergingtech/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper op_def=op_def) File "/home/wbemergingtech/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 3414, in create_op op_def=op_def) File "/home/wbemergingtech/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1740, in init self._traceback = self._graph._extract_stack() # pylint: disable=protected-access

ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[9216,4,512] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc [[Node: model/inference/residual_block_conv_layer_ResidualConv1dGLU_29/transpose = Transpose[T=DT_FLOAT, Tperm=DT_INT32, _device="/job:localhost/replica:0/task:0/device:GPU:0"](model/inference/residual_block_conv_layer_ResidualConv1dGLU_29/Pad, model/optimizer/gradients/model/inference/residual_block_conv_layer_ResidualConv1dGLU_29/transpose_3_grad/InvertPermutation)]] Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

 [[Node: model/inference/residual_block_skip_conv_layer_ResidualConv1dGLU_29/strided_slice_2/_1509 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_10290...ed_slice_2", tensor_type=DT_INT32, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]

Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

Traceback (most recent call last): File "/home/wbemergingtech/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1322, in _do_call return fn(*args) File "/home/wbemergingtech/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1307, in _run_fn options, feed_dict, fetch_list, target_list, run_metadata) File "/home/wbemergingtech/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1409, in _call_tf_sessionrun run_metadata) tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[9216,4,512] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc [[Node: model/inference/residual_block_conv_layer_ResidualConv1dGLU_29/transpose = Transpose[T=DT_FLOAT, Tperm=DT_INT32, _device="/job:localhost/replica:0/task:0/device:GPU:0"](model/inference/residual_block_conv_layer_ResidualConv1dGLU_29/Pad, model/optimizer/gradients/model/inference/residual_block_conv_layer_ResidualConv1dGLU_29/transpose_3_grad/InvertPermutation)]] Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

 [[Node: model/inference/residual_block_skip_conv_layer_ResidualConv1dGLU_29/strided_slice_2/_1509 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_10290...ed_slice_2", tensor_type=DT_INT32, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]

Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

During handling of the above exception, another exception occurred:

Traceback (most recent call last): File "/home/wbemergingtech/Desktop/Tacotron2-Rayhane-Implenetation/Tacotron-2/wavenet_vocoder/train.py", line 217, in train step, y_hat, loss, opt = sess.run([global_step, model.y_hat, model.loss, model.optimize]) File "/home/wbemergingtech/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 900, in run run_metadata_ptr) File "/home/wbemergingtech/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1135, in _run feed_dict_tensor, options, run_metadata) File "/home/wbemergingtech/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1316, in _do_run run_metadata) File "/home/wbemergingtech/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1335, in _do_call raise type(e)(node_def, op, message) tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[9216,4,512] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc [[Node: model/inference/residual_block_conv_layer_ResidualConv1dGLU_29/transpose = Transpose[T=DT_FLOAT, Tperm=DT_INT32, _device="/job:localhost/replica:0/task:0/device:GPU:0"](model/inference/residual_block_conv_layer_ResidualConv1dGLU_29/Pad, model/optimizer/gradients/model/inference/residual_block_conv_layer_ResidualConv1dGLU_29/transpose_3_grad/InvertPermutation)]] Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

 [[Node: model/inference/residual_block_skip_conv_layer_ResidualConv1dGLU_29/strided_slice_2/_1509 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_10290...ed_slice_2", tensor_type=DT_INT32, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]

Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

Caused by op 'model/inference/residual_block_conv_layer_ResidualConv1dGLU_29/transpose', defined at: File "train.py", line 133, in main() File "train.py", line 127, in main train(args, log_dir, hparams) File "train.py", line 80, in train checkpoint = wavenet_train(args, log_dir, hparams, input_path) File "/home/wbemergingtech/Desktop/Tacotron2-Rayhane-Implenetation/Tacotron-2/wavenet_vocoder/train.py", line 251, in wavenet_train return train(log_dir, args, hparams, input_path) File "/home/wbemergingtech/Desktop/Tacotron2-Rayhane-Implenetation/Tacotron-2/wavenet_vocoder/train.py", line 175, in train model, stats = model_train_mode(args, feeder, hparams, global_step) File "/home/wbemergingtech/Desktop/Tacotron2-Rayhane-Implenetation/Tacotron-2/wavenet_vocoder/train.py", line 123, in model_train_mode feeder.input_lengths, x=feeder.inputs) File "/home/wbemergingtech/Desktop/Tacotron2-Rayhane-Implenetation/Tacotron-2/wavenet_vocoder/models/wavenet.py", line 176, in initialize y_hat = self.step(x, c, g, softmax=False) #softmax is automatically computed inside softmax_cross_entropy if needed File "/home/wbemergingtech/Desktop/Tacotron2-Rayhane-Implenetation/Tacotron-2/wavenet_vocoder/models/wavenet.py", line 481, in step x, h = conv(x, c, g_bct) File "/home/wbemergingtech/Desktop/Tacotron2-Rayhane-Implenetation/Tacotron-2/wavenetvocoder/models/modules.py", line 277, in call x, s, = self.step(x, c, g, False) File "/home/wbemergingtech/Desktop/Tacotron2-Rayhane-Implenetation/Tacotron-2/wavenet_vocoder/models/modules.py", line 301, in step x = self.conv(x) File "/home/wbemergingtech/Desktop/Tacotron2-Rayhane-Implenetation/Tacotron-2/wavenetvocoder/models/modules.py", line 162, in call inputs = self._to_dilation(inputs) File "/home/wbemergingtech/Desktop/Tacotron2-Rayhane-Implenetation/Tacotron-2/wavenet_vocoder/models/modules.py", line 117, in _to_dilation inputs_transposed = tf.transpose(inputs_padded, [2, 0, 1]) File "/home/wbemergingtech/anaconda3/lib/python3.6/site-packages/tensorflow/python/ops/array_ops.py", line 1408, in transpose ret = transpose_fn(a, perm, name=name) File "/home/wbemergingtech/anaconda3/lib/python3.6/site-packages/tensorflow/python/ops/gen_array_ops.py", line 8636, in transpose "Transpose", x=x, perm=perm, name=name) File "/home/wbemergingtech/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper op_def=op_def) File "/home/wbemergingtech/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 3414, in create_op op_def=op_def) File "/home/wbemergingtech/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1740, in init self._traceback = self._graph._extract_stack() # pylint: disable=protected-access

ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[9216,4,512] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc [[Node: model/inference/residual_block_conv_layer_ResidualConv1dGLU_29/transpose = Transpose[T=DT_FLOAT, Tperm=DT_INT32, _device="/job:localhost/replica:0/task:0/device:GPU:0"](model/inference/residual_block_conv_layer_ResidualConv1dGLU_29/Pad, model/optimizer/gradients/model/inference/residual_block_conv_layer_ResidualConv1dGLU_29/transpose_3_grad/InvertPermutation)]] Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

 [[Node: model/inference/residual_block_skip_conv_layer_ResidualConv1dGLU_29/strided_slice_2/_1509 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_10290...ed_slice_2", tensor_type=DT_INT32, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]

Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

Traceback (most recent call last): File "train.py", line 133, in main() File "train.py", line 127, in main train(args, log_dir, hparams) File "train.py", line 82, in train raise ('Error occured while training Wavenet, Exiting!') TypeError: exceptions must derive from BaseException
```
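The hint about `report_tensor_allocations_upon_oom` in the log refers to TensorFlow's `RunOptions`. As an illustration only (not a patch from this repository), the flag could be passed to the failing `sess.run` call in `wavenet_vocoder/train.py` roughly like this, using the fetches named in the traceback:

```python
import tensorflow as tf

# Ask TensorFlow to dump the list of live tensor allocations if an OOM occurs,
# which helps identify which ops are consuming GPU memory.
run_options = tf.RunOptions(report_tensor_allocations_upon_oom=True)

# The training step from the traceback, with the extra options argument.
step, y_hat, loss, opt = sess.run(
    [global_step, model.y_hat, model.loss, model.optimize],
    options=run_options)
```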

Rayhane-mamah commented 6 years ago

Hello,

That's a GPU memory error. It happens when your model uses all available memory on the GPU. The GTX 1080 has 8 GB of VRAM, I believe. To avoid this, you can reduce wavenet_batch_size from 4 to 2 (in hparams.py). That should clear the error.

If you want to take this a step further, you can increase max_time_steps to the largest value you can use before the model hits OOM again (after reducing the batch size, of course). That will make training go more smoothly. A sketch of the relevant hparams.py entries follows.
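For illustration, a minimal sketch of the two entries in hparams.py that drive WaveNet's GPU memory use (the parameter names match the hyperparameter dump above; the values shown are suggested starting points, not the repository defaults):

```python
import tensorflow as tf

# Excerpt of the WaveNet training hyperparameters relevant to GPU memory.
hparams = tf.contrib.training.HParams(
    # ... other hyperparameters unchanged ...
    wavenet_batch_size=2,   # halved from 4 to roughly halve per-step memory use
    max_time_steps=8000,    # raise gradually once batch_size=2 trains without OOM
)
```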

avivelor commented 6 years ago

@Rayhane-mamah thanks for the speedy reply, It worked!

Rayhane-mamah commented 6 years ago

Glad that worked out :)