NVIDIA / OpenSeq2Seq

Toolkit for efficient experimentation with Speech Recognition, Text2Speech and NLP
https://nvidia.github.io/OpenSeq2Seq
Apache License 2.0
1.55k stars 369 forks

Tacotron GST hangs on a single GPU #476

Open lorinczb opened 5 years ago

lorinczb commented 5 years ago

I was trying to run Tacotron GST on a single GPU, but it hangs: after printing "Successfully opened dynamic library libcublas.so.10.0", nothing happens.

I am running the project on Google Colaboratory and did not downgrade or upgrade any of the components that Colab provides.

I have copied the full output below. Running pip list and nvidia-smi gives the following versions and info: tensorflow 1.14.0
NVIDIA-SMI 418.67 Driver Version: 410.79 CUDA Version: 10.0
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla K80 Off | 00000000:00:04.0 Off | 0 |
| N/A 33C P0 69W / 149W | 1281MiB / 11441MiB | 0% Default

`WARNING: Logging before flag parsing goes to stderr. W0718 11:59:23.139119 140435192031104 deprecation_wrapper.py:119] From /content/drive/My Drive/Colab Notebooks/NVIDIA/OpenSeq2Seq/open_seq2seq/utils/hooks.py:15: The name tf.train.SessionRunHook is deprecated. Please use tf.estimator.SessionRunHook instead.

W0718 11:59:23.142410 140435192031104 deprecation_wrapper.py:119] From /content/drive/My Drive/Colab Notebooks/NVIDIA/OpenSeq2Seq/open_seq2seq/utils/helpers.py:181: The name tf.train.SessionCreator is deprecated. Please use tf.compat.v1.train.SessionCreator instead.

W0718 11:59:23.142707 140435192031104 deprecation_wrapper.py:119] From /content/drive/My Drive/Colab Notebooks/NVIDIA/OpenSeq2Seq/open_seq2seq/utils/helpers.py:240: The name tf.train.Scaffold is deprecated. Please use tf.compat.v1.train.Scaffold instead.

W0718 11:59:23.142882 140435192031104 deprecation_wrapper.py:119] From /content/drive/My Drive/Colab Notebooks/NVIDIA/OpenSeq2Seq/open_seq2seq/utils/helpers.py:285: The name tf.train.SessionManager is deprecated. Please use tf.compat.v1.train.SessionManager instead.

W0718 11:59:24.736594 140435192031104 deprecation_wrapper.py:119] From /content/drive/My Drive/Colab Notebooks/NVIDIA/OpenSeq2Seq/open_seq2seq/optimizers/mp_wrapper.py:27: The name tf.train.Optimizer is deprecated. Please use tf.compat.v1.train.Optimizer instead.

W0718 11:59:25.409647 140435192031104 lazy_loader.py:50] The TensorFlow contrib module will not be included in TensorFlow 2.0. For more information, please see:

W0718 11:59:26.535829 140435192031104 deprecation_wrapper.py:119] From /content/drive/My Drive/Colab Notebooks/NVIDIA/OpenSeq2Seq/open_seq2seq/parts/transformer/attention_layer.py:24: The name tf.layers.Layer is deprecated. Please use tf.compat.v1.layers.Layer instead.

W0718 11:59:26.557999 140435192031104 deprecation_wrapper.py:119] From /content/drive/My Drive/Colab Notebooks/NVIDIA/OpenSeq2Seq/open_seq2seq/parts/cnns/tcn.py:8: The name tf.layers.Conv1D is deprecated. Please use tf.compat.v1.layers.Conv1D instead.

Starting training from scratch Training config: {'batch_size_per_gpu': 32, 'data_layer': <class 'open_seq2seq.data.text2speech.text2speech.Text2SpeechDataLayer'>, 'data_layer_params': {'data_min': {'magnitude': 1e-05, 'mel': 0.01}, 'dataset': 'MAILABS', 'dataset_files': ['/content/drive/My Drive/Colab ' 'Notebooks/NVIDIA/OpenSeq2Seq/train.csv'], 'dataset_location': '/content/drive/My Drive/Colab ' 'Notebooks/NVIDIA/OpenSeq2Seq', 'duration_max': 1024, 'duration_min': 24, 'exp_mag': True, 'feature_normalize': False, 'feature_normalize_mean': 0.0, 'feature_normalize_std': 1.0, 'mag_power': 1, 'mel_type': 'htk', 'num_audio_features': {'magnitude': 401, 'mel': 80}, 'output_type': 'both', 'pad_EOS': True, 'shuffle': True, 'style_input': 'wav', 'trim': True, 'vocab_file': '/content/drive/My Drive/Colab ' 'Notebooks/NVIDIA/OpenSeq2Seq/open_seq2seq/test_utils/vocab_tts.txt'}, 'decoder': <class 'open_seq2seq.decoders.tacotron2_decoder.Tacotron2Decoder'>, 'decoder_params': {'attention_bias': True, 'attention_layer_size': 128, 'attention_type': 'location', 'decoder_cell_type': <class 'tensorflow.python.ops.rnn_cell_impl.LSTMCell'>, 'decoder_cell_units': 1024, 'decoder_layers': 2, 'dropout_prob': 0.1, 'enable_postnet': True, 'enable_prenet': True, 'mask_decoder_sequence': True, 'parallel_iterations': 32, 'postnet_conv_layers': [{'activation_fn': <function tanh at 0x7fb9751b2ae8>, 'kernel_size': [5], 'num_channels': 512, 'padding': 'SAME', 'stride': [1]}, {'activation_fn': <function tanh at 0x7fb9751b2ae8>, 'kernel_size': [5], 'num_channels': 512, 'padding': 'SAME', 'stride': [1]}, {'activation_fn': <function tanh at 0x7fb9751b2ae8>, 'kernel_size': [5], 'num_channels': 512, 'padding': 'SAME', 'stride': [1]}, {'activation_fn': <function tanh at 0x7fb9751b2ae8>, 'kernel_size': [5], 'num_channels': 512, 'padding': 'SAME', 'stride': [1]}, {'activation_fn': None, 'kernel_size': [5], 'num_channels': -1, 'padding': 'SAME', 'stride': [1]}], 'postnet_data_format': 'channels_last', 
'postnet_keep_dropout_prob': 0.5, 'prenet_layers': 2, 'prenet_units': 256, 'zoneout_prob': 0.0}, 'dtype': tf.float32, 'encoder': <class 'open_seq2seq.encoders.tacotron2_encoder.Tacotron2Encoder'>, 'encoder_params': {'activation_fn': <function relu at 0x7fb974f480d0>, 'cnn_dropout_prob': 0.5, 'conv_layers': [{'kernel_size': [5], 'num_channels': 512, 'padding': 'SAME', 'stride': [1]}, {'kernel_size': [5], 'num_channels': 512, 'padding': 'SAME', 'stride': [1]}, {'kernel_size': [5], 'num_channels': 512, 'padding': 'SAME', 'stride': [1]}], 'data_format': 'channels_last', 'num_rnn_layers': 1, 'rnn_cell_dim': 256, 'rnn_dropout_prob': 0.0, 'rnn_type': <class 'tensorflow.contrib.cudnn_rnn.python.layers.cudnn_rnn.CudnnLSTM'>, 'rnn_unidirectional': False, 'src_emb_size': 512, 'style_embedding_enable': True, 'style_embedding_params': {'attention_layer_size': 512, 'conv_layers': [{'kernel_size': [3, 3], 'num_channels': 32, 'padding': 'SAME', 'stride': [2, 2]}, {'kernel_size': [3, 3], 'num_channels': 32, 'padding': 'SAME', 'stride': [2, 2]}, {'kernel_size': [3, 3], 'num_channels': 64, 'padding': 'SAME', 'stride': [2, 2]}, {'kernel_size': [3, 3], 'num_channels': 64, 'padding': 'SAME', 'stride': [2, 2]}, {'kernel_size': [3, 3], 'num_channels': 128, 'padding': 'SAME', 'stride': [2, 2]}, {'kernel_size': [3, 3], 'num_channels': 128, 'padding': 'SAME', 'stride': [2, 2]}], 'emb_size': 512, 'num_heads': 8, 'num_rnn_layers': 1, 'num_tokens': 32, 'rnn_cell_dim': 128, 'rnn_type': <class 'tensorflow.python.ops.rnn_cell_impl.GRUCell'>, 'rnn_unidirectional': True}, 'use_cudnn_rnn': True, 'zoneout_prob': 0.0}, 'eval_steps': 500, 'initializer': <function xavier_initializer at 0x7fb94be01598>, 'load_model': '', 'logdir': 'result/tacotron-gst-8gpu_7', 'loss': <class 'open_seq2seq.losses.text2speech_loss.Text2SpeechLoss'>, 'loss_params': {'use_mask': True}, 'lr_policy': <function exp_decay at 0x7fb944c38ea0>, 'lr_policy_params': {'begin_decay_at': 20000, 'decay_rate': 0.1, 'decay_steps': 10000, 
'learning_rate': 0.001, 'min_lr': 1e-05, 'use_staircase_decay': False}, 'max_grad_norm': 1.0, 'num_epochs': 25, 'num_gpus': 1, 'optimizer': 'Adam', 'optimizer_params': {}, 'print_loss_steps': 50, 'print_samples_steps': 500, 'random_seed': 0, 'regularizer': <function l2_regularizer at 0x7fb94be51268>, 'regularizer_params': {'scale': 1e-06}, 'save_checkpoint_steps': 2500, 'save_summaries_steps': 50, 'save_to_tensorboard': True, 'summaries': ['learning_rate', 'variables', 'gradients', 'larc_summaries', 'variable_norm', 'gradient_norm', 'global_gradient_norm'], 'use_horovod': False, 'use_xla_jit': False} W0718 11:59:26.699374 140435192031104 deprecation_wrapper.py:119] From /content/drive/My Drive/Colab Notebooks/NVIDIA/OpenSeq2Seq/open_seq2seq/models/model.py:312: The name tf.set_random_seed is deprecated. Please use tf.compat.v1.set_random_seed instead.

W0718 11:59:27.308000 140435192031104 deprecation_wrapper.py:119] From /content/drive/My Drive/Colab Notebooks/NVIDIA/OpenSeq2Seq/open_seq2seq/models/model.py:390: The name tf.variable_scope is deprecated. Please use tf.compat.v1.variable_scope instead.

W0718 11:59:27.308259 140435192031104 deprecation_wrapper.py:119] From /content/drive/My Drive/Colab Notebooks/NVIDIA/OpenSeq2Seq/open_seq2seq/models/model.py:391: The name tf.get_variable_scope is deprecated. Please use tf.compat.v1.get_variable_scope instead.

*** Building graph on GPU:0 W0718 11:59:27.324142 140435192031104 deprecation.py:323] From /usr/local/lib/python3.6/dist-packages/tensorflow/python/data/util/random_seed.py:58: add_dispatch_support..wrapper (from tensorflow.python.ops.array_ops) is deprecated and will be removed in a future version. Instructions for updating: Use tf.where in 2.0, which has the same broadcast rule as np.where W0718 11:59:27.332295 140435192031104 deprecation.py:323] From /content/drive/My Drive/Colab Notebooks/NVIDIA/OpenSeq2Seq/open_seq2seq/data/text2speech/text2speech.py:318: py_func (from tensorflow.python.ops.script_ops) is deprecated and will be removed in a future version. Instructions for updating: tf.py_func is deprecated in TF V2. Instead, there are two options available in V2.

W0718 11:59:27.374233 140435192031104 deprecation.py:323] From /content/drive/My Drive/Colab Notebooks/NVIDIA/OpenSeq2Seq/open_seq2seq/data/text2speech/text2speech.py:368: DatasetV1.make_initializable_iterator (from tensorflow.python.data.ops.dataset_ops) is deprecated and will be removed in a future version. Instructions for updating: Use for ... in dataset: to iterate over a dataset. If using tf.estimator, return the Dataset object directly from your input function. As a last resort, you can use tf.compat.v1.data.make_initializable_iterator(dataset). W0718 11:59:27.386699 140435192031104 deprecation_wrapper.py:119] From /content/drive/My Drive/Colab Notebooks/NVIDIA/OpenSeq2Seq/open_seq2seq/encoders/tacotron2_encoder.py:139: The name tf.get_variable is deprecated. Please use tf.compat.v1.get_variable instead.

W0718 11:59:27.404452 140435192031104 deprecation.py:323] From /content/drive/My Drive/Colab Notebooks/NVIDIA/OpenSeq2Seq/open_seq2seq/parts/cnns/conv_blocks.py:159: conv2d (from tensorflow.python.layers.convolutional) is deprecated and will be removed in a future version. Instructions for updating: Use tf.keras.layers.Conv2D instead. W0718 11:59:27.658360 140435192031104 deprecation.py:323] From /content/drive/My Drive/Colab Notebooks/NVIDIA/OpenSeq2Seq/open_seq2seq/parts/cnns/conv_blocks.py:177: batch_normalization (from tensorflow.python.layers.normalization) is deprecated and will be removed in a future version. Instructions for updating: Use keras.layers.BatchNormalization instead. In particular, tf.control_dependencies(tf.GraphKeys.UPDATE_OPS) should not be used (consult the tf.keras.layers.batch_normalization documentation). W0718 11:59:28.175011 140435192031104 deprecation.py:323] From /content/drive/My Drive/Colab Notebooks/NVIDIA/OpenSeq2Seq/open_seq2seq/parts/rnns/utils.py:68: GRUCell.init (from tensorflow.python.ops.rnn_cell_impl) is deprecated and will be removed in a future version. Instructions for updating: This class is equivalent as tf.keras.layers.GRUCell, and will be replaced by that in Tensorflow 2.0. W0718 11:59:28.175675 140435192031104 deprecation.py:323] From /content/drive/My Drive/Colab Notebooks/NVIDIA/OpenSeq2Seq/open_seq2seq/encoders/tacotron2_encoder.py:416: MultiRNNCell.init (from tensorflow.python.ops.rnn_cell_impl) is deprecated and will be removed in a future version. Instructions for updating: This class is equivalent as tf.keras.layers.StackedRNNCells, and will be replaced by that in Tensorflow 2.0. W0718 11:59:28.176146 140435192031104 deprecation.py:323] From /content/drive/My Drive/Colab Notebooks/NVIDIA/OpenSeq2Seq/open_seq2seq/encoders/tacotron2_encoder.py:426: dynamic_rnn (from tensorflow.python.ops.rnn) is deprecated and will be removed in a future version. 
Instructions for updating: Please use keras.layers.RNN(cell), which is equivalent to this API W0718 11:59:28.576258 140435192031104 deprecation.py:506] From /usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/rnn_cell_impl.py:564: calling Constant.init (from tensorflow.python.ops.init_ops) with dtype is deprecated and will be removed in a future version. Instructions for updating: Call initializer instance with the dtype argument instead of passing it to the constructor W0718 11:59:28.591121 140435192031104 deprecation.py:506] From /usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/rnn_cell_impl.py:574: calling Zeros.init (from tensorflow.python.ops.init_ops) with dtype is deprecated and will be removed in a future version. Instructions for updating: Call initializer instance with the dtype argument instead of passing it to the constructor W0718 11:59:28.846639 140435192031104 deprecation.py:323] From /content/drive/My Drive/Colab Notebooks/NVIDIA/OpenSeq2Seq/open_seq2seq/encoders/tacotron2_encoder.py:459: dense (from tensorflow.python.layers.core) is deprecated and will be removed in a future version. Instructions for updating: Use keras.layers.dense instead. W0718 11:59:29.186807 140435192031104 deprecation.py:506] From /content/drive/My Drive/Colab Notebooks/NVIDIA/OpenSeq2Seq/open_seq2seq/encoders/tacotron2_encoder.py:486: calling RandomUniform.init (from tensorflow.python.ops.init_ops) with dtype is deprecated and will be removed in a future version. Instructions for updating: Call initializer instance with the dtype argument instead of passing it to the constructor W0718 11:59:29.196803 140435192031104 deprecation_wrapper.py:119] From /content/drive/My Drive/Colab Notebooks/NVIDIA/OpenSeq2Seq/open_seq2seq/parts/transformer/attention_layer.py:54: The name tf.layers.Dense is deprecated. Please use tf.compat.v1.layers.Dense instead.

W0718 11:59:30.426726 140435192031104 deprecation.py:506] From /usr/local/lib/python3.6/dist-packages/tensorflow/python/autograph/impl/api.py:253: calling dropout (from tensorflow.python.ops.nn_ops) with keep_prob is deprecated and will be removed in a future version. Instructions for updating: Please use rate instead of keep_prob. Rate should be set to rate = 1 - keep_prob. W0718 11:59:30.470444 140435192031104 deprecation.py:323] From /content/drive/My Drive/Colab Notebooks/NVIDIA/OpenSeq2Seq/open_seq2seq/parts/cnns/conv_blocks.py:159: conv1d (from tensorflow.python.layers.convolutional) is deprecated and will be removed in a future version. Instructions for updating: Use tf.keras.layers.Conv1D instead. W0718 11:59:30.619745 140435192031104 deprecation.py:323] From /content/drive/My Drive/Colab Notebooks/NVIDIA/OpenSeq2Seq/open_seq2seq/encoders/tacotron2_encoder.py:209: dropout (from tensorflow.python.layers.core) is deprecated and will be removed in a future version. Instructions for updating: Use keras.layers.dropout instead. W0718 11:59:30.842733 140435192031104 deprecation.py:506] From /usr/local/lib/python3.6/dist-packages/tensorflow/contrib/cudnn_rnn/python/layers/cudnn_rnn.py:342: calling GlorotUniform.init (from tensorflow.python.ops.init_ops) with dtype is deprecated and will be removed in a future version. Instructions for updating: Call initializer instance with the dtype argument instead of passing it to the constructor W0718 11:59:30.843044 140435192031104 deprecation.py:506] From /usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/init_ops.py:1251: calling VarianceScaling.init (from tensorflow.python.ops.init_ops) with dtype is deprecated and will be removed in a future version. 
Instructions for updating: Call initializer instance with the dtype argument instead of passing it to the constructor W0718 11:59:31.373543 140435192031104 deprecation_wrapper.py:119] From /content/drive/My Drive/Colab Notebooks/NVIDIA/OpenSeq2Seq/open_seq2seq/encoders/tacotron2_encoder.py:323: The name tf.add_to_collection is deprecated. Please use tf.compat.v1.add_to_collection instead.

W0718 11:59:31.384380 140435192031104 deprecation.py:323] From /content/drive/My Drive/Colab Notebooks/NVIDIA/OpenSeq2Seq/open_seq2seq/parts/rnns/utils.py:68: LSTMCell.init (from tensorflow.python.ops.rnn_cell_impl) is deprecated and will be removed in a future version. Instructions for updating: This class is equivalent as tf.keras.layers.LSTMCell, and will be replaced by that in Tensorflow 2.0. W0718 11:59:34.936816 140435192031104 deprecation_wrapper.py:119] From /content/drive/My Drive/Colab Notebooks/NVIDIA/OpenSeq2Seq/open_seq2seq/losses/text2speech_loss.py:117: The name tf.losses.mean_squared_error is deprecated. Please use tf.compat.v1.losses.mean_squared_error instead.

*** Building graph on GPU:1 W0718 11:59:37.586050 140435192031104 deprecation_wrapper.py:119] From /content/drive/My Drive/Colab Notebooks/NVIDIA/OpenSeq2Seq/open_seq2seq/optimizers/optimizers.py:69: The name tf.losses.get_regularization_losses is deprecated. Please use tf.compat.v1.losses.get_regularization_losses instead.

W0718 11:59:37.611474 140435192031104 deprecation_wrapper.py:119] From /content/drive/My Drive/Colab Notebooks/NVIDIA/OpenSeq2Seq/open_seq2seq/optimizers/optimizers.py:175: The name tf.summary.scalar is deprecated. Please use tf.compat.v1.summary.scalar instead.

W0718 11:59:43.413983 140435192031104 deprecation.py:323] From /content/drive/My Drive/Colab Notebooks/NVIDIA/OpenSeq2Seq/open_seq2seq/optimizers/optimizers.py:472: _colocate_with (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version. Instructions for updating: Colocations handled automatically by placer. W0718 11:59:43.550502 140435192031104 deprecation_wrapper.py:119] From /content/drive/My Drive/Colab Notebooks/NVIDIA/OpenSeq2Seq/open_seq2seq/optimizers/optimizers.py:318: The name tf.summary.histogram is deprecated. Please use tf.compat.v1.summary.histogram instead.

W0718 11:59:45.348784 140435192031104 deprecation.py:323] From /usr/local/lib/python3.6/dist-packages/tensorflow/python/training/slot_creator.py:193: Variable.initialized_value (from tensorflow.python.ops.variables) is deprecated and will be removed in a future version. Instructions for updating: Use Variable.read_value. Variables in 2.X are initialized automatically both in eager and graph (inside tf.defun) contexts. Trainable variables: ForwardPass/tacotron2_encoder/EncoderEmbeddingMatrix:0 shape: (103, 512), <dtype: 'float32_ref'> ForwardPass/tacotron2_encoder/style_encoder/conv1/kernel:0 shape: (3, 3, 1, 32), <dtype: 'float32_ref'> ForwardPass/tacotron2_encoder/style_encoder/conv1/bn/gamma:0 shape: (32,), <dtype: 'float32_ref'> ForwardPass/tacotron2_encoder/style_encoder/conv1/bn/beta:0 shape: (32,), <dtype: 'float32_ref'> ForwardPass/tacotron2_encoder/style_encoder/conv2/kernel:0 shape: (3, 3, 32, 32), <dtype: 'float32_ref'> ForwardPass/tacotron2_encoder/style_encoder/conv2/bn/gamma:0 shape: (32,), <dtype: 'float32_ref'> ForwardPass/tacotron2_encoder/style_encoder/conv2/bn/beta:0 shape: (32,), <dtype: 'float32_ref'> ForwardPass/tacotron2_encoder/style_encoder/conv3/kernel:0 shape: (3, 3, 32, 64), <dtype: 'float32_ref'> ForwardPass/tacotron2_encoder/style_encoder/conv3/bn/gamma:0 shape: (64,), <dtype: 'float32_ref'> ForwardPass/tacotron2_encoder/style_encoder/conv3/bn/beta:0 shape: (64,), <dtype: 'float32_ref'> ForwardPass/tacotron2_encoder/style_encoder/conv4/kernel:0 shape: (3, 3, 64, 64), <dtype: 'float32_ref'> ForwardPass/tacotron2_encoder/style_encoder/conv4/bn/gamma:0 shape: (64,), <dtype: 'float32_ref'> ForwardPass/tacotron2_encoder/style_encoder/conv4/bn/beta:0 shape: (64,), <dtype: 'float32_ref'> ForwardPass/tacotron2_encoder/style_encoder/conv5/kernel:0 shape: (3, 3, 64, 128), <dtype: 'float32_ref'> ForwardPass/tacotron2_encoder/style_encoder/conv5/bn/gamma:0 shape: (128,), <dtype: 'float32_ref'> 
ForwardPass/tacotron2_encoder/style_encoder/conv5/bn/beta:0 shape: (128,), <dtype: 'float32_ref'> ForwardPass/tacotron2_encoder/style_encoder/conv6/kernel:0 shape: (3, 3, 128, 128), <dtype: 'float32_ref'> ForwardPass/tacotron2_encoder/style_encoder/conv6/bn/gamma:0 shape: (128,), <dtype: 'float32_ref'> ForwardPass/tacotron2_encoder/style_encoder/conv6/bn/beta:0 shape: (128,), <dtype: 'float32_ref'> ForwardPass/tacotron2_encoder/style_encoder/rnn/multi_rnn_cell/cell_0/gru_cell/gates/kernel:0 shape: (384, 256), <dtype: 'float32_ref'> ForwardPass/tacotron2_encoder/style_encoder/rnn/multi_rnn_cell/cell_0/gru_cell/gates/bias:0 shape: (256,), <dtype: 'float32_ref'> ForwardPass/tacotron2_encoder/style_encoder/rnn/multi_rnn_cell/cell_0/gru_cell/candidate/kernel:0 shape: (384, 128), <dtype: 'float32_ref'> ForwardPass/tacotron2_encoder/style_encoder/rnn/multi_rnn_cell/cell_0/gru_cell/candidate/bias:0 shape: (128,), <dtype: 'float32_ref'> ForwardPass/tacotron2_encoder/style_encoder/reference_activation/kernel:0 shape: (128, 128), <dtype: 'float32_ref'> ForwardPass/tacotron2_encoder/style_encoder/reference_activation/bias:0 shape: (128,), <dtype: 'float32_ref'> ForwardPass/tacotron2_encoder/style_encoder/attention/q/kernel:0 shape: (128, 512), <dtype: 'float32_ref'> ForwardPass/tacotron2_encoder/style_encoder/attention/k/kernel:0 shape: (512, 512), <dtype: 'float32_ref'> ForwardPass/tacotron2_encoder/style_encoder/attention/v/kernel:0 shape: (512, 512), <dtype: 'float32_ref'> ForwardPass/tacotron2_encoder/style_encoder/attention/attention_v:0 shape: (64,), <dtype: 'float32_ref'> ForwardPass/tacotron2_encoder/style_encoder/attention/output_transform/kernel:0 shape: (512, 512), <dtype: 'float32_ref'> ForwardPass/tacotron2_encoder/conv1/kernel:0 shape: (5, 512, 512), <dtype: 'float32_ref'> ForwardPass/tacotron2_encoder/conv1/bn/gamma:0 shape: (512,), <dtype: 'float32_ref'> ForwardPass/tacotron2_encoder/conv1/bn/beta:0 shape: (512,), <dtype: 'float32_ref'> 
ForwardPass/tacotron2_encoder/conv2/kernel:0 shape: (5, 512, 512), <dtype: 'float32_ref'> ForwardPass/tacotron2_encoder/conv2/bn/gamma:0 shape: (512,), <dtype: 'float32_ref'> ForwardPass/tacotron2_encoder/conv2/bn/beta:0 shape: (512,), <dtype: 'float32_ref'> ForwardPass/tacotron2_encoder/conv3/kernel:0 shape: (5, 512, 512), <dtype: 'float32_ref'> ForwardPass/tacotron2_encoder/conv3/bn/gamma:0 shape: (512,), <dtype: 'float32_ref'> ForwardPass/tacotron2_encoder/conv3/bn/beta:0 shape: (512,), <dtype: 'float32_ref'> ForwardPass/tacotron2_encoder/cudnn_rnn/opaque_kernel:0 shape: , <dtype: 'float32_ref'> ForwardPass/tacotron_2_decoder/AttentionMechanism/memory_layer/kernel:0 shape: (1, 1024, 128), <dtype: 'float32_ref'> ForwardPass/tacotron_2_decoder/decoder/prenet_1/kernel:0 shape: (80, 256), <dtype: 'float32_ref'> ForwardPass/tacotron_2_decoder/decoder/prenet_1/bias:0 shape: (256,), <dtype: 'float32_ref'> ForwardPass/tacotron_2_decoder/decoder/prenet_2/kernel:0 shape: (256, 256), <dtype: 'float32_ref'> ForwardPass/tacotron_2_decoder/decoder/prenet_2/bias:0 shape: (256,), <dtype: 'float32_ref'> ForwardPass/tacotron_2_decoder/decoder/attention_wrapper/multi_rnn_cell/cell_0/lstm_cell/kernel:0 shape: (2304, 4096), <dtype: 'float32_ref'> ForwardPass/tacotron_2_decoder/decoder/attention_wrapper/multi_rnn_cell/cell_0/lstm_cell/bias:0 shape: (4096,), <dtype: 'float32_ref'> ForwardPass/tacotron_2_decoder/decoder/attention_wrapper/multi_rnn_cell/cell_1/lstm_cell/kernel:0 shape: (2048, 4096), <dtype: 'float32_ref'> ForwardPass/tacotron_2_decoder/decoder/attention_wrapper/multi_rnn_cell/cell_1/lstm_cell/bias:0 shape: (4096,), <dtype: 'float32_ref'> ForwardPass/tacotron_2_decoder/decoder/attention_wrapper/location_attention/query_layer/kernel:0 shape: (1024, 128), <dtype: 'float32_ref'> ForwardPass/tacotron_2_decoder/decoder/attention_wrapper/location_attention/location_conv/kernel:0 shape: (32, 1, 32), <dtype: 'float32_ref'> 
ForwardPass/tacotron_2_decoder/decoder/attention_wrapper/location_attention/location_conv/bias:0 shape: (32,), <dtype: 'float32_ref'> ForwardPass/tacotron_2_decoder/decoder/attention_wrapper/location_attention/location_dense/kernel:0 shape: (1, 32, 128), <dtype: 'float32_ref'> ForwardPass/tacotron_2_decoder/decoder/attention_wrapper/location_attention/attention_v:0 shape: (128,), <dtype: 'float32_ref'> ForwardPass/tacotron_2_decoder/decoder/attention_wrapper/location_attention/attention_bias:0 shape: (128,), <dtype: 'float32_ref'> ForwardPass/tacotron_2_decoder/decoder/output_proj/kernel:0 shape: (2048, 80), <dtype: 'float32_ref'> ForwardPass/tacotron_2_decoder/decoder/output_proj/bias:0 shape: (80,), <dtype: 'float32_ref'> ForwardPass/tacotron_2_decoder/decoder/stop_token_proj/kernel:0 shape: (80, 1), <dtype: 'float32_ref'> ForwardPass/tacotron_2_decoder/decoder/stop_token_proj/bias:0 shape: (1,), <dtype: 'float32_ref'> ForwardPass/tacotron_2_decoder/conv1/kernel:0 shape: (5, 80, 512), <dtype: 'float32_ref'> ForwardPass/tacotron_2_decoder/conv1/bn/gamma:0 shape: (512,), <dtype: 'float32_ref'> ForwardPass/tacotron_2_decoder/conv1/bn/beta:0 shape: (512,), <dtype: 'float32_ref'> ForwardPass/tacotron_2_decoder/conv2/kernel:0 shape: (5, 512, 512), <dtype: 'float32_ref'> ForwardPass/tacotron_2_decoder/conv2/bn/gamma:0 shape: (512,), <dtype: 'float32_ref'> ForwardPass/tacotron_2_decoder/conv2/bn/beta:0 shape: (512,), <dtype: 'float32_ref'> ForwardPass/tacotron_2_decoder/conv3/kernel:0 shape: (5, 512, 512), <dtype: 'float32_ref'> ForwardPass/tacotron_2_decoder/conv3/bn/gamma:0 shape: (512,), <dtype: 'float32_ref'> ForwardPass/tacotron_2_decoder/conv3/bn/beta:0 shape: (512,), <dtype: 'float32_ref'> ForwardPass/tacotron_2_decoder/conv4/kernel:0 shape: (5, 512, 512), <dtype: 'float32_ref'> ForwardPass/tacotron_2_decoder/conv4/bn/gamma:0 shape: (512,), <dtype: 'float32_ref'> ForwardPass/tacotron_2_decoder/conv4/bn/beta:0 shape: (512,), <dtype: 'float32_ref'> 
ForwardPass/tacotron_2_decoder/conv5/kernel:0 shape: (5, 512, 80), <dtype: 'float32_ref'> ForwardPass/tacotron_2_decoder/conv5/bn/gamma:0 shape: (80,), <dtype: 'float32_ref'> ForwardPass/tacotron_2_decoder/conv5/bn/beta:0 shape: (80,), <dtype: 'float32_ref'> ForwardPass/tacotron_2_decoder/conv_0/kernel:0 shape: (4, 80, 256), <dtype: 'float32_ref'> ForwardPass/tacotron_2_decoder/conv_0/bn/gamma:0 shape: (256,), <dtype: 'float32_ref'> ForwardPass/tacotron_2_decoder/conv_0/bn/beta:0 shape: (256,), <dtype: 'float32_ref'> ForwardPass/tacotron_2_decoder/conv_1/kernel:0 shape: (4, 256, 512), <dtype: 'float32_ref'> ForwardPass/tacotron_2_decoder/conv_1/bn/gamma:0 shape: (512,), <dtype: 'float32_ref'> ForwardPass/tacotron_2_decoder/conv_1/bn/beta:0 shape: (512,), <dtype: 'float32_ref'> ForwardPass/tacotron_2_decoder/post_net_proj/kernel:0 shape: (1, 512, 401), <dtype: 'float32_ref'> Encountered unknown variable shape, can't compute total number of parameters. *** WARNING: Can't compute number of objects per step, since train model does not define get_num_objects_per_step method. 2019-07-18 11:59:49.376572: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2300000000 Hz 2019-07-18 11:59:49.376828: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x642cd80 executing computations on platform Host. Devices: 2019-07-18 11:59:49.376867: I tensorflow/compiler/xla/service/service.cc:175] StreamExecutor device (0): , 2019-07-18 11:59:49.379229: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcuda.so.1 2019-07-18 11:59:49.470651: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero 2019-07-18 11:59:49.471243: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x642b180 executing computations on platform CUDA. 
Devices: 2019-07-18 11:59:49.471281: I tensorflow/compiler/xla/service/service.cc:175] StreamExecutor device (0): Tesla K80, Compute Capability 3.7 2019-07-18 11:59:49.471536: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero 2019-07-18 11:59:49.471949: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 0 with properties: name: Tesla K80 major: 3 minor: 7 memoryClockRate(GHz): 0.8235 pciBusID: 0000:00:04.0 2019-07-18 11:59:49.472371: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudart.so.10.0 2019-07-18 11:59:49.473836: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcublas.so.10.0 2019-07-18 11:59:49.475361: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcufft.so.10.0 2019-07-18 11:59:49.475941: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcurand.so.10.0 2019-07-18 11:59:49.477655: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcusolver.so.10.0 2019-07-18 11:59:49.478846: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcusparse.so.10.0 2019-07-18 11:59:49.482273: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudnn.so.7 2019-07-18 11:59:49.482427: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero 2019-07-18 11:59:49.482878: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one 
NUMA node, so returning NUMA node zero 2019-07-18 11:59:49.483252: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1763] Adding visible gpu devices: 0 2019-07-18 11:59:49.483338: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudart.so.10.0 2019-07-18 11:59:49.484686: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1181] Device interconnect StreamExecutor with strength 1 edge matrix: 2019-07-18 11:59:49.484719: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1187] 0 2019-07-18 11:59:49.484736: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1200] 0: N 2019-07-18 11:59:49.485068: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero 2019-07-18 11:59:49.485561: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero 2019-07-18 11:59:49.486010: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1326] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10802 MB memory) -> physical GPU (device: 0, name: Tesla K80, pci bus id: 0000:00:04.0, compute capability: 3.7) 2019-07-18 11:59:51.693068: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudnn.so.7 2019-07-18 11:59:52.729262: W tensorflow/compiler/jit/mark_for_compilation_pass.cc:1412] (One-time warning): Not using XLA:CPU for cluster because envvar TF_XLA_FLAGS=--tf_xla_cpu_global_jit was not set. If you want XLA:CPU, either set that envvar, or use experimental_jit_scope to enable XLA:CPU. To confirm that XLA is active, pass --vmodule=xla_compilation_cache=1 (as a proper command-line flag, not via TF_XLA_FLAGS) or set the envvar XLA_FLAGS=--xla_hlo_profile. 
2019-07-18 12:00:11.236970: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcublas.so.10.0`
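The log above stops after the libcublas line without any error, so it does not show where the process is waiting. One way to find out, sketched here with Python's standard `faulthandler` module, is to ask the interpreter to dump every thread's stack at a fixed interval. The helper names `enable_hang_dump` and `disable_hang_dump` and the 300-second default are illustrative choices, not part of OpenSeq2Seq.

```python
import faulthandler
import sys


def enable_hang_dump(timeout_s=300.0, out=sys.stderr):
    """Dump the Python stack of every thread to `out` every `timeout_s` seconds.

    `out` must be a real file object (it needs a file descriptor).
    The dump keeps repeating until disable_hang_dump() is called.
    """
    faulthandler.dump_traceback_later(timeout_s, repeat=True, file=out)


def disable_hang_dump():
    """Stop the periodic stack dumps started by enable_hang_dump()."""
    faulthandler.cancel_dump_traceback_later()
```

Calling `enable_hang_dump()` near the top of the training entry script would then print a traceback every five minutes. If the same frames keep appearing (for example inside a session run call or the data layer), that is where the run is waiting. This only shows Python frames, so a stall inside native CUDA code appears as the Python call that entered it.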

lorinczb commented 5 years ago

I have tried setting num_gpus to 1, 2, 3 and 4, but none of these worked; every run got stuck at the same spot as in the log from the previous post.
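For reference, the nvidia-smi output and the "Adding visible gpu devices: 0" line in the log show a single Tesla K80, so values of num_gpus above 1 exceed what this runtime exposes. The sketch below builds a small "debug" variant of the logged settings so that the first training step, if it completes, is reported quickly. The parameter names and original values are copied from the "Training config" in the log. The helper `make_debug_params`, the batch size of 4, and the reading of `print_loss_steps` as "report the loss every N steps" (judging by its name) are assumptions.

```python
import copy

# Values copied from the "Training config" printed in the log above.
logged_params = {
    "num_gpus": 1,
    "batch_size_per_gpu": 32,
    "print_loss_steps": 50,
}


def make_debug_params(params, visible_gpus=1, debug_batch_size=4):
    """Return a copy of `params` tuned for a quick smoke test.

    Hypothetical helper: caps num_gpus at the number of visible GPUs,
    shrinks the batch, and requests a loss report on every step.
    """
    debug = copy.deepcopy(params)
    debug["num_gpus"] = min(debug["num_gpus"], visible_gpus)
    debug["batch_size_per_gpu"] = min(debug["batch_size_per_gpu"], debug_batch_size)
    debug["print_loss_steps"] = 1
    return debug


debug_params = make_debug_params(logged_params)
```

If a run with these settings prints a loss after a while, the original run may have been slow rather than deadlocked. If it still prints nothing, the stall is more likely in start-up or data loading.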

borisgin commented 5 years ago

I noticed "Driver Version: 410.79 CUDA Version: 10.0". Are you using a local computer or the cloud? Which OS are you on? Have you checked for the latest driver for your configuration (NVIDIA/drivers)?
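When comparing environments (for example the Colab runtime against a local machine where the model does run), it can help to pull the version numbers out of the nvidia-smi banner programmatically. The sketch below parses a banner of the form pasted in the original report. The function name `parse_smi_header` is made up for illustration, and the regular expression assumes the banner layout shown in this thread.

```python
import re

# Matches a banner such as:
#   NVIDIA-SMI 418.67 Driver Version: 410.79 CUDA Version: 10.0
_HEADER = re.compile(
    r"NVIDIA-SMI\s+(?P<smi>[\d.]+)\s+"
    r"Driver Version:\s+(?P<driver>[\d.]+)\s+"
    r"CUDA Version:\s+(?P<cuda>[\d.]+)"
)


def parse_smi_header(text):
    """Return {'smi', 'driver', 'cuda'} version strings, or None if not found."""
    match = _HEADER.search(text)
    return match.groupdict() if match else None


# Banner copied from the original report.
reported = parse_smi_header(
    "NVIDIA-SMI 418.67 Driver Version: 410.79 CUDA Version: 10.0"
)
tool_and_driver_differ = reported["smi"] != reported["driver"]
```

For the banner in the report, the nvidia-smi tool version (418.67) and the driver version (410.79) differ. Whether that mismatch is related to the hang is not established in this thread.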

lorinczb commented 5 years ago

I ran it in the cloud, on Google Colaboratory, connected to a hosted runtime rather than a local one. I don't think I have a way to change the driver version. It does run on a local GPU; I was just wondering whether it is possible to run it in the cloud as well, so that I could use more resources.

astricks commented 4 years ago

Did this issue ever get resolved? I'm facing the same issue as @lorinczb. I'm running on the cloud with the nvidia docker container "nvcr.io/nvidia/tensorflow".

lorinczb commented 4 years ago

I have managed to run it locally, but did not get it to work on the cloud.

astricks commented 4 years ago

After enabling/disabling logging, fiddling around with the batch_size, adding some print statements in open_seq2seq/utils/funcs.py and running it for a few hours, I now see CPU and GPU usage. This is running inside an NVIDIA Docker container on Azure.
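To check for GPU activity like this without watching the full nvidia-smi table, the utilization and memory figures can be sampled from a script. The sketch below uses nvidia-smi's `--query-gpu` CSV output. The helper names are illustrative, and it assumes `nvidia-smi` is on the PATH and reports numeric values for both fields.

```python
import subprocess

# CSV query: one line per GPU, e.g. "0, 1281" (utilization %, memory used in MiB).
QUERY = [
    "nvidia-smi",
    "--query-gpu=utilization.gpu,memory.used",
    "--format=csv,noheader,nounits",
]


def parse_gpu_line(line):
    """Parse one CSV line into (utilization_percent, memory_used_mib)."""
    util, mem = (field.strip() for field in line.split(","))
    return int(util), int(mem)


def sample_gpus():
    """Run nvidia-smi once and return a list of (util, mem) tuples, one per GPU."""
    out = subprocess.check_output(QUERY, universal_newlines=True)
    return [parse_gpu_line(line) for line in out.splitlines() if line.strip()]
```

Calling `sample_gpus()` every few seconds during training gives a rough signal: utilization that stays at 0% for many minutes, as in the nvidia-smi output from the original report, suggests the job has not reached the GPU compute phase yet.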