muntasir2000 opened this issue 5 years ago
Can you attach the log file, please?
```
Starting training from scratch
Training config:
{'batch_size_per_gpu': 20,
'data_layer': <class 'open_seq2seq.data.speech2text.speech2text.Speech2TextDataLayer'>,
'data_layer_params': {'augmentation': {'noise_level_max': -60,
'noise_level_min': -90,
'speed_perturbation_ratio': 0.1},
'dataset_files': ['/hdd/stt-16k-seq2seq-train.csv'],
'input_type': 'spectrogram',
'max_duration': 16.7,
'num_audio_features': 160,
'shuffle': True,
'vocab_file': 'open_seq2seq/test_utils/toy_speech_data/vocab.txt'},
'decoder': <class 'open_seq2seq.decoders.fc_decoders.FullyConnectedCTCDecoder'>,
'decoder_params': {'alpha': 2.0,
'alphabet_config_path': 'open_seq2seq/test_utils/toy_speech_data/vocab.txt',
'beam_width': 512,
'beta': 1.0,
'decoder_library_path': 'ctc_decoder_with_lm/libctc_decoder_with_kenlm.so',
'lm_path': '/lm/lm.binary',
'trie_path': 'language_model/trie.binary',
'use_language_model': False},
'dtype': tf.float32,
'encoder': <class 'open_seq2seq.encoders.ds2_encoder.DeepSpeech2Encoder'>,
'encoder_params': {'activation_fn': <function relu at 0x7f2f9e611bf8>,
'conv_layers': [{'kernel_size': [11, 41],
'num_channels': 32,
'padding': 'SAME',
'stride': [2, 2]},
{'kernel_size': [11, 21],
'num_channels': 64,
'padding': 'SAME',
'stride': [1, 2]},
{'kernel_size': [11, 21],
'num_channels': 96,
'padding': 'SAME',
'stride': [1, 2]}],
'data_format': 'channels_first',
'dropout_keep_prob': 0.5,
'n_hidden': 1600,
'num_rnn_layers': 5,
'rnn_cell_dim': 800,
'rnn_type': 'cudnn_gru',
'rnn_unidirectional': False,
'row_conv': False,
'use_cudnn_rnn': True},
'eval_steps': 500,
'initializer': <function xavier_initializer at 0x7f2f800be9d8>,
'larc_params': {'larc_eta': 0.001},
'load_model': '',
'logdir': 'experiments/2-mfi/logs',
'loss': <class 'open_seq2seq.losses.ctc_loss.CTCLoss'>,
'loss_params': {},
'lr_policy': <function poly_decay at 0x7f2f79918840>,
'lr_policy_params': {'learning_rate': 0.0001, 'power': 0.5},
'num_epochs': 50,
'num_gpus': 3,
'optimizer': 'Adam',
'print_loss_steps': 10,
'print_samples_steps': 500,
'random_seed': 0,
'regularizer': <function l2_regularizer at 0x7f2f80022c80>,
'regularizer_params': {'scale': 0.0005},
'save_checkpoint_steps': 1000,
'save_summaries_steps': 100,
'summaries': ['learning_rate',
'variables',
'gradients',
'larc_summaries',
'variable_norm',
'gradient_norm',
'global_gradient_norm'],
'use_horovod': False,
'use_xla_jit': False}
Evaluation config:
{'batch_size_per_gpu': 20,
'data_layer': <class 'open_seq2seq.data.speech2text.speech2text.Speech2TextDataLayer'>,
'data_layer_params': {'dataset_files': ['/hdd/stt-16k-seq2seq-dev.csv'],
'input_type': 'spectrogram',
'num_audio_features': 160,
'shuffle': False,
'vocab_file': 'open_seq2seq/test_utils/toy_speech_data/vocab.txt'},
'decoder': <class 'open_seq2seq.decoders.fc_decoders.FullyConnectedCTCDecoder'>,
'decoder_params': {'alpha': 2.0,
'alphabet_config_path': 'open_seq2seq/test_utils/toy_speech_data/vocab.txt',
'beam_width': 512,
'beta': 1.0,
'decoder_library_path': 'ctc_decoder_with_lm/libctc_decoder_with_kenlm.so',
'lm_path': '/lm/lm.binary',
'trie_path': 'language_model/trie.binary',
'use_language_model': False},
'dtype': tf.float32,
'encoder': <class 'open_seq2seq.encoders.ds2_encoder.DeepSpeech2Encoder'>,
'encoder_params': {'activation_fn': <function relu at 0x7f2f9e611bf8>,
'conv_layers': [{'kernel_size': [11, 41],
'num_channels': 32,
'padding': 'SAME',
'stride': [2, 2]},
{'kernel_size': [11, 21],
'num_channels': 64,
'padding': 'SAME',
'stride': [1, 2]},
{'kernel_size': [11, 21],
'num_channels': 96,
'padding': 'SAME',
'stride': [1, 2]}],
'data_format': 'channels_first',
'dropout_keep_prob': 0.5,
'n_hidden': 1600,
'num_rnn_layers': 5,
'rnn_cell_dim': 800,
'rnn_type': 'cudnn_gru',
'rnn_unidirectional': False,
'row_conv': False,
'use_cudnn_rnn': True},
'eval_steps': 500,
'initializer': <function xavier_initializer at 0x7f2f800be9d8>,
'larc_params': {'larc_eta': 0.001},
'load_model': '',
'logdir': 'experiments/2-mfi/logs',
'loss': <class 'open_seq2seq.losses.ctc_loss.CTCLoss'>,
'loss_params': {},
'lr_policy': <function poly_decay at 0x7f2f79918840>,
'lr_policy_params': {'learning_rate': 0.0001, 'power': 0.5},
'num_epochs': 50,
'num_gpus': 3,
'optimizer': 'Adam',
'print_loss_steps': 10,
'print_samples_steps': 500,
'random_seed': 0,
'regularizer': <function l2_regularizer at 0x7f2f80022c80>,
'regularizer_params': {'scale': 0.0005},
'save_checkpoint_steps': 1000,
'save_summaries_steps': 100,
'summaries': ['learning_rate',
'variables',
'gradients',
'larc_summaries',
'variable_norm',
'gradient_norm',
'global_gradient_norm'],
'use_horovod': False,
'use_xla_jit': False}
Building graph on GPU:0
Building graph on GPU:1
Building graph on GPU:2
Trainable variables:
ForwardPass/ds2_encoder/conv1/kernel:0
shape: (11, 41, 1, 32), <dtype: 'float32_ref'>
ForwardPass/ds2_encoder/conv1/bn/gamma:0
shape: (32,), <dtype: 'float32_ref'>
ForwardPass/ds2_encoder/conv1/bn/beta:0
shape: (32,), <dtype: 'float32_ref'>
ForwardPass/ds2_encoder/conv2/kernel:0
shape: (11, 21, 32, 64), <dtype: 'float32_ref'>
ForwardPass/ds2_encoder/conv2/bn/gamma:0
shape: (64,), <dtype: 'float32_ref'>
ForwardPass/ds2_encoder/conv2/bn/beta:0
shape: (64,), <dtype: 'float32_ref'>
ForwardPass/ds2_encoder/conv3/kernel:0
shape: (11, 21, 64, 96), <dtype: 'float32_ref'>
ForwardPass/ds2_encoder/conv3/bn/gamma:0
shape: (96,), <dtype: 'float32_ref'>
ForwardPass/ds2_encoder/conv3/bn/beta:0
shape: (96,), <dtype: 'float32_ref'>
ForwardPass/ds2_encoder/cudnn_gru/opaque_kernel:0
shape:
```
This is the log file. Note that this log was generated running without Docker, but the problem is the same with Docker: it just gets stuck there. I can't even kill the process without restarting the PC.
Here is the output of nvidia-smi, if it helps. Thanks
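For a hang like this, one way to narrow things down is to look at who holds the GPUs and where the Python process is actually blocked. A diagnostic sketch (py-spy is a third-party tool, and the `pgrep` pattern assumes the trainer was started via `run.py` — both are assumptions, not from this thread):

```shell
# Show which processes hold the GPUs and their utilization
nvidia-smi

# Dump the Python stack of the hung trainer to see where it is blocked
# (pip install py-spy; pattern assumes the trainer was launched as "python run.py ...")
py-spy dump --pid "$(pgrep -f 'python.*run.py' | head -n1)"
```

If the stack dump shows every thread waiting inside an NCCL/collective call, that points at a communication problem rather than a data-pipeline one.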
This may be a mismatch between the CUDA version/driver and the TF container. Can you try the latest container, tensorflow:19.04-py3 or tensorflow:19.05-py3, please?
I also tried without a Docker container. Anyway, I'll try the tensorflow:19.05-py3 image.
I tried the tensorflow:19.05-py3 Docker image. Same issue: training hangs. The log file follows:
```
WARNING: Please update time_stretch_ratio to speed_perturbation_ratio
WARNING: Please update time_stretch_ratio to speed_perturbation_ratio
*** Building graph on GPU:0
WARNING:tensorflow:From /workspace/OpenSeq2Seq/open_seq2seq/data/speech2text/speech2text.py:216: py_func (from tensorflow.python.ops.script_ops) is deprecated and will be removed in a future version.
Instructions for updating:
tf.py_func is deprecated in TF V2. Instead, use tf.py_function, which takes a python function which manipulates tf eager tensors instead of numpy arrays. It's easy to convert a tf eager tensor to an ndarray (just call tensor.numpy()) but having access to eager tensors means tf.py_functions can use accelerators such as GPUs as well as being differentiable using a gradient tape.
WARNING:tensorflow:From /usr/local/lib/python3.5/dist-packages/tensorflow/python/data/ops/dataset_ops.py:1419: colocate_with (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version.
Instructions for updating:
Colocations handled automatically by placer.
WARNING:tensorflow:From /workspace/OpenSeq2Seq/open_seq2seq/parts/cnns/conv_blocks.py:159: conv2d (from tensorflow.python.layers.convolutional) is deprecated and will be removed in a future version.
Instructions for updating:
Use keras.layers.conv2d instead.
WARNING:tensorflow:From /workspace/OpenSeq2Seq/open_seq2seq/parts/cnns/conv_blocks.py:177: batch_normalization (from tensorflow.python.layers.normalization) is deprecated and will be removed in a future version.
Instructions for updating:
Use keras.layers.batch_normalization instead.
WARNING:tensorflow:From /workspace/OpenSeq2Seq/open_seq2seq/encoders/ds2_encoder.py:387: dense (from tensorflow.python.layers.core) is deprecated and will be removed in a future version.
Instructions for updating:
Use keras.layers.dense instead.
WARNING:tensorflow:From /workspace/OpenSeq2Seq/open_seq2seq/encoders/ds2_encoder.py:389: calling dropout (from tensorflow.python.ops.nn_ops) with keep_prob is deprecated and will be removed in a future version.
Instructions for updating:
Please use `rate` instead of `keep_prob`. Rate should be set to `rate = 1 - keep_prob`.
Building graph on GPU:1
WARNING:tensorflow:From /usr/local/lib/python3.5/dist-packages/tensorflow/python/training/learning_rate_decay_v2.py:321: div (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Deprecated in favor of operator or tf.math.divide.
Trainable variables:
ForwardPass/ds2_encoder/conv1/kernel:0
shape: (11, 41, 1, 32), <dtype: 'float32_ref'>
ForwardPass/ds2_encoder/conv1/bn/gamma:0
shape: (32,), <dtype: 'float32_ref'>
ForwardPass/ds2_encoder/conv1/bn/beta:0
shape: (32,), <dtype: 'float32_ref'>
ForwardPass/ds2_encoder/conv2/kernel:0
shape: (11, 21, 32, 64), <dtype: 'float32_ref'>
ForwardPass/ds2_encoder/conv2/bn/gamma:0
shape: (64,), <dtype: 'float32_ref'>
ForwardPass/ds2_encoder/conv2/bn/beta:0
shape: (64,), <dtype: 'float32_ref'>
ForwardPass/ds2_encoder/conv3/kernel:0
shape: (11, 21, 64, 96), <dtype: 'float32_ref'>
ForwardPass/ds2_encoder/conv3/bn/gamma:0
shape: (96,), <dtype: 'float32_ref'>
ForwardPass/ds2_encoder/conv3/bn/beta:0
shape: (96,), <dtype: 'float32_ref'>
ForwardPass/ds2_encoder/cudnn_gru/opaque_kernel:0
shape:
```
Output from nvidia-smi (using GPUs 1 and 2; GPU 0 is being used by another process):
Thanks, this looks like a bug. I will check with our TF team for a possible cause and solution.
Can you check whether you can successfully run these NCCL tests on that machine? https://github.com/nvidia/nccl-tests
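For reference, a typical way to build and run the NCCL tests, following the nccl-tests README (the GPU count of 3 matches the setup described above):

```shell
# Build the NCCL performance tests
git clone https://github.com/NVIDIA/nccl-tests.git
cd nccl-tests
make

# All-reduce benchmark across 3 GPUs:
# message sizes from 8 bytes to 256 MB, doubling each step
./build/all_reduce_perf -b 8 -e 256M -f 2 -g 3
```

If this benchmark completes and reports bandwidth numbers, NCCL itself is healthy; if it hangs, the problem is below the framework.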
I tried to run nccl-tests, but they hang the same way OpenSeq2Seq does: all GPUs show 100% usage constantly, but nothing completes. I'm trying to follow this: https://github.com/NVIDIA/caffe/issues/10
I'll post the result. Thanks.
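The linked Caffe issue points at IOMMU/PCIe peer-to-peer problems as a common cause of exactly this symptom (all GPUs pinned at 100% while collectives never complete). A few diagnostic commands, as a sketch — whether the IOMMU is the culprit on this particular machine is an assumption:

```shell
# Show the GPU interconnect topology (how the training GPUs reach each other)
nvidia-smi topo -m

# Check whether an IOMMU is enabled on the kernel command line
grep -i iommu /proc/cmdline

# Kernel messages about IOMMU/DMAR initialization
dmesg | grep -iE 'iommu|dmar'
</ shell sketch continues above
```

If the IOMMU is enabled, the Caffe issue suggests disabling it in the BIOS (or via a kernel parameter such as intel_iommu=off, depending on the platform) and re-running the NCCL tests.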
I am trying to run tacotron-gst on a single GPU, but it hangs at the same spot: it does not get past the line "Successfully opened dynamic library libcublas.so.10.0". Was this issue resolved? I am running it on Colaboratory.
Since this is not related to multi-GPU, can you open a new issue "Tacotron hangs on single GPU", please? Please attach the following
Was this problem ever resolved? I am facing the same issue as @lorinczb
I have the same issue, any new idea?
Facing a similar issue with tacotron-GST. Any idea how to resolve it?
When I try to train DeepSpeech2 with the example configs on 3 GPUs, training hangs indefinitely, but single-GPU training works well with the same config file. I also tried Horovod; same problem. I'm using the nvcr.io/nvidia/tensorflow:18.12-py3 Docker image.
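For comparison, the two launch modes being described might look like the following. This is a sketch: the config path is hypothetical, `--config_file` and `--mode` are the standard OpenSeq2Seq run.py flags, and the Horovod run assumes the config sets 'use_horovod': True as described in the OpenSeq2Seq docs:

```shell
# Single process driving all 3 GPUs ('num_gpus': 3 in the config) -- the mode that hangs
python run.py --config_file=example_configs/speech2text/ds2_config.py --mode=train_eval

# Horovod: one process per GPU, launched via MPI
# (requires 'use_horovod': True in the config; also reported to hang here)
mpiexec -np 3 python run.py --config_file=example_configs/speech2text/ds2_config.py --mode=train_eval
```

That both launch modes hang while a single GPU works is consistent with the inter-GPU communication layer (NCCL) being the failing piece, which is why the nccl-tests result above matters.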