NVIDIA / OpenSeq2Seq

Toolkit for efficient experimentation with Speech Recognition, Text2Speech and NLP
https://nvidia.github.io/OpenSeq2Seq
Apache License 2.0

Multi-GPU training hangs #448

Open muntasir2000 opened 5 years ago

muntasir2000 commented 5 years ago

When I try to train DeepSpeech2 with the example configs on 3 GPUs, training hangs indefinitely, but single-GPU training works fine with the same config file. I also tried using Horovod; same problem. I'm using the nvcr.io/nvidia/tensorflow:18.12-py3 Docker image.
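For context, a sketch of how such a run is typically launched with OpenSeq2Seq (the config path below is one of the repo's example configs and is only an illustration; the actual config used here differs):

```bash
# Multi-GPU in-graph replication: "num_gpus" is set inside the config file
python run.py --config_file=example_configs/speech2text/ds2_large_8gpus.py \
              --mode=train_eval

# Horovod: set "use_horovod": True in the config, then launch one process per GPU
mpiexec --allow-run-as-root -np 3 \
    python run.py --config_file=example_configs/speech2text/ds2_large_8gpus.py \
                  --mode=train_eval
```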

borisgin commented 5 years ago

Can you attach the log file, please?

muntasir2000 commented 5 years ago

` Starting training from scratch Training config: {'batch_size_per_gpu': 20, 'data_layer': <class 'open_seq2seq.data.speech2text.speech2text.Speech2TextDataLayer'>, 'data_layer_params': {'augmentation': {'noise_level_max': -60, 'noise_level_min': -90, 'speed_perturbation_ratio': 0.1}, 'dataset_files': ['/hdd/stt-16k-seq2seq-train.csv'], 'input_type': 'spectrogram', 'max_duration': 16.7, 'num_audio_features': 160, 'shuffle': True, 'vocab_file': 'open_seq2seq/test_utils/toy_speech_data/vocab.txt'}, 'decoder': <class 'open_seq2seq.decoders.fc_decoders.FullyConnectedCTCDecoder'>, 'decoder_params': {'alpha': 2.0, 'alphabet_config_path': 'open_seq2seq/test_utils/toy_speech_data/vocab.txt', 'beam_width': 512, 'beta': 1.0, 'decoder_library_path': 'ctc_decoder_with_lm/libctc_decoder_with_kenlm.so', 'lm_path': '/lm/lm.binary', 'trie_path': 'language_model/trie.binary', 'use_language_model': False}, 'dtype': tf.float32, 'encoder': <class 'open_seq2seq.encoders.ds2_encoder.DeepSpeech2Encoder'>, 'encoder_params': {'activation_fn': <function relu at 0x7f2f9e611bf8>, 'conv_layers': [{'kernel_size': [11, 41], 'num_channels': 32, 'padding': 'SAME', 'stride': [2, 2]}, {'kernel_size': [11, 21], 'num_channels': 64, 'padding': 'SAME', 'stride': [1, 2]}, {'kernel_size': [11, 21], 'num_channels': 96, 'padding': 'SAME', 'stride': [1, 2]}], 'data_format': 'channels_first', 'dropout_keep_prob': 0.5, 'n_hidden': 1600, 'num_rnn_layers': 5, 'rnn_cell_dim': 800, 'rnn_type': 'cudnn_gru', 'rnn_unidirectional': False, 'row_conv': False, 'use_cudnn_rnn': True}, 'eval_steps': 500, 'initializer': <function xavier_initializer at 0x7f2f800be9d8>, 'larc_params': {'larc_eta': 0.001}, 'load_model': '', 'logdir': 'experiments/2-mfi/logs', 'loss': <class 'open_seq2seq.losses.ctc_loss.CTCLoss'>, 'loss_params': {}, 'lr_policy': <function poly_decay at 0x7f2f79918840>, 'lr_policy_params': {'learning_rate': 0.0001, 'power': 0.5}, 'num_epochs': 50, 'num_gpus': 3, 'optimizer': 'Adam', 'print_loss_steps': 10, 'print_samples_steps': 500, 'random_seed': 0, 'regularizer': <function l2_regularizer at 0x7f2f80022c80>, 'regularizer_params': {'scale': 0.0005}, 'save_checkpoint_steps': 1000, 'save_summaries_steps': 100, 'summaries': ['learning_rate', 'variables', 'gradients', 'larc_summaries', 'variable_norm', 'gradient_norm', 'global_gradient_norm'], 'use_horovod': False, 'use_xla_jit': False} Evaluation config: {'batch_size_per_gpu': 20, 'data_layer': <class 'open_seq2seq.data.speech2text.speech2text.Speech2TextDataLayer'>, 'data_layer_params': {'dataset_files': ['/hdd/stt-16k-seq2seq-dev.csv'], 'input_type': 'spectrogram', 'num_audio_features': 160, 'shuffle': False, 'vocab_file': 'open_seq2seq/test_utils/toy_speech_data/vocab.txt'}, 'decoder': <class 'open_seq2seq.decoders.fc_decoders.FullyConnectedCTCDecoder'>, 'decoder_params': {'alpha': 2.0, 'alphabet_config_path': 'open_seq2seq/test_utils/toy_speech_data/vocab.txt', 'beam_width': 512, 'beta': 1.0, 'decoder_library_path': 'ctc_decoder_with_lm/libctc_decoder_with_kenlm.so', 'lm_path': '/lm/lm.binary', 'trie_path': 'language_model/trie.binary', 'use_language_model': False}, 'dtype': tf.float32, 'encoder': <class 'open_seq2seq.encoders.ds2_encoder.DeepSpeech2Encoder'>, 'encoder_params': {'activation_fn': <function relu at 0x7f2f9e611bf8>, 'conv_layers': [{'kernel_size': [11, 41], 'num_channels': 32, 'padding': 'SAME', 'stride': [2, 2]}, {'kernel_size': [11, 21], 'num_channels': 64, 'padding': 'SAME', 'stride': [1, 2]}, {'kernel_size': [11, 21], 'num_channels': 96, 'padding': 'SAME', 'stride': 
[1, 2]}], 'data_format': 'channels_first', 'dropout_keep_prob': 0.5, 'n_hidden': 1600, 'num_rnn_layers': 5, 'rnn_cell_dim': 800, 'rnn_type': 'cudnn_gru', 'rnn_unidirectional': False, 'row_conv': False, 'use_cudnn_rnn': True}, 'eval_steps': 500, 'initializer': <function xavier_initializer at 0x7f2f800be9d8>, 'larc_params': {'larc_eta': 0.001}, 'load_model': '', 'logdir': 'experiments/2-mfi/logs', 'loss': <class 'open_seq2seq.losses.ctc_loss.CTCLoss'>, 'loss_params': {}, 'lr_policy': <function poly_decay at 0x7f2f79918840>, 'lr_policy_params': {'learning_rate': 0.0001, 'power': 0.5}, 'num_epochs': 50, 'num_gpus': 3, 'optimizer': 'Adam', 'print_loss_steps': 10, 'print_samples_steps': 500, 'random_seed': 0, 'regularizer': <function l2_regularizer at 0x7f2f80022c80>, 'regularizer_params': {'scale': 0.0005}, 'save_checkpoint_steps': 1000, 'save_summaries_steps': 100, 'summaries': ['learning_rate', 'variables', 'gradients', 'larc_summaries', 'variable_norm', 'gradient_norm', 'global_gradient_norm'], 'use_horovod': False, 'use_xla_jit': False} Building graph on GPU:0 Building graph on GPU:1 Building graph on GPU:2 Trainable variables: ForwardPass/ds2_encoder/conv1/kernel:0 shape: (11, 41, 1, 32), <dtype: 'float32_ref'> ForwardPass/ds2_encoder/conv1/bn/gamma:0 shape: (32,), <dtype: 'float32_ref'> ForwardPass/ds2_encoder/conv1/bn/beta:0 shape: (32,), <dtype: 'float32_ref'> ForwardPass/ds2_encoder/conv2/kernel:0 shape: (11, 21, 32, 64), <dtype: 'float32_ref'> ForwardPass/ds2_encoder/conv2/bn/gamma:0 shape: (64,), <dtype: 'float32_ref'> ForwardPass/ds2_encoder/conv2/bn/beta:0 shape: (64,), <dtype: 'float32_ref'> ForwardPass/ds2_encoder/conv3/kernel:0 shape: (11, 21, 64, 96), <dtype: 'float32_ref'> ForwardPass/ds2_encoder/conv3/bn/gamma:0 shape: (96,), <dtype: 'float32_ref'> ForwardPass/ds2_encoder/conv3/bn/beta:0 shape: (96,), <dtype: 'float32_ref'> ForwardPass/ds2_encoder/cudnn_gru/opaque_kernel:0 shape: , <dtype: 'float32_ref'> ForwardPass/ds2_encoder/fully_connected/kernel:0 shape: (1600, 1600), <dtype: 'float32_ref'> ForwardPass/ds2_encoder/fully_connected/bias:0 shape: (1600,), <dtype: 'float32_ref'> ForwardPass/fully_connected_ctc_decoder/fully_connected/kernel:0 shape: (1600, 66), <dtype: 'float32_ref'> ForwardPass/fully_connected_ctc_decoder/fully_connected/bias:0 shape: (66,), <dtype: 'float32_ref'> Encountered unknown variable shape, can't compute total number of parameters. 
Building graph on GPU:0 Building graph on GPU:1 *** Building graph on GPU:2 2019-05-25 01:56:26.454436: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA 2019-05-25 01:56:26.644988: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 0 with properties: name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.721 pciBusID: 0000:09:00.0 totalMemory: 10.92GiB freeMemory: 10.77GiB 2019-05-25 01:56:26.767056: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 1 with properties: name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.6325 pciBusID: 0000:0a:00.0 totalMemory: 10.92GiB freeMemory: 10.77GiB 2019-05-25 01:56:26.855649: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 2 with properties: name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.645 pciBusID: 0000:41:00.0 totalMemory: 10.91GiB freeMemory: 10.63GiB 2019-05-25 01:56:26.859528: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0, 1, 2 2019-05-25 01:56:28.743446: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix: 2019-05-25 01:56:28.743487: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0 1 2 2019-05-25 01:56:28.743499: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N Y Y 2019-05-25 01:56:28.743508: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 1: Y N Y 2019-05-25 01:56:28.743517: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 2: Y Y N 2019-05-25 01:56:28.744812: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10419 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:09:00.0, compute capability: 6.1) 2019-05-25 01:56:28.746374: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 10419 MB memory) -> physical GPU (device: 1, name: GeForce GTX 1080 Ti, pci bus id: 0000:0a:00.0, compute capability: 6.1) 2019-05-25 01:56:28.746584: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:2 with 10280 MB memory) -> physical GPU (device: 2, name: GeForce GTX 1080 Ti, pci bus id: 0000:41:00.0, compute capability: 6.1)

`

This is the log file. Please note, this log was generated running without Docker, but the problem is the same with Docker. It just gets stuck there, and I can't even kill the process without restarting the PC.

Here is the output of nvidia-smi, if it helps. Thanks

[screenshot: nvidia-smi output]
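As a side note on the unkillable process: a hang like this usually leaves the training process in uninterruptible sleep inside the driver, which a couple of standard commands can confirm (a generic diagnostic sketch, not specific to OpenSeq2Seq):

```bash
# A "D" in the STAT column means uninterruptible sleep, typically stuck in a driver call
ps -o pid,stat,wchan,cmd -C python

# Kernel/driver messages often show NVIDIA Xid errors or IOMMU faults when a GPU job wedges
sudo dmesg | tail -n 50
```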
borisgin commented 5 years ago

Maybe there is a mismatch between the CUDA version/driver and the TF container. Can you try the latest container, tensorflow:19.04-py3 or tensorflow:19.05-py3, please?
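For anyone reproducing this, a hedged sketch of pulling and entering the suggested container (mount paths are placeholders based on the paths in the log above; on Docker >= 19.03, `--gpus all` can replace `--runtime=nvidia`):

```bash
docker pull nvcr.io/nvidia/tensorflow:19.05-py3

# Mount the OpenSeq2Seq checkout and the data directory into the container
docker run --runtime=nvidia -it --rm \
    -v /path/to/OpenSeq2Seq:/workspace/OpenSeq2Seq \
    -v /hdd:/hdd \
    nvcr.io/nvidia/tensorflow:19.05-py3
```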

muntasir2000 commented 5 years ago

I also tried without a Docker container. Anyway, I'll try the tensorflow:19.05-py3 image.

muntasir2000 commented 5 years ago

I tried the tensorflow:19.05-py3 Docker image. Same issue; training hangs. Log file follows:


WARNING: Please update time_stretch_ratio to speed_perturbation_ratio WARNING: Please update time_stretch_ratio to speed_perturbation_ratio *** Building graph on GPU:0 WARNING:tensorflow:From /workspace/OpenSeq2Seq/open_seq2seq/data/speech2text/speech2text.py:216: py_func (from tensorflow.python.ops.script_ops) is deprecated and will be removed in a future version. Instructions for updating: tf.py_func is deprecated in TF V2. Instead, use tf.py_function, which takes a python function which manipulates tf eager tensors instead of numpy arrays. It's easy to convert a tf eager tensor to an ndarray (just call tensor.numpy()) but having access to eager tensors means tf.py_functions can use accelerators such as GPUs as well as being differentiable using a gradient tape.

WARNING:tensorflow:From /usr/local/lib/python3.5/dist-packages/tensorflow/python/data/ops/dataset_ops.py:1419: colocate_with (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version. Instructions for updating: Colocations handled automatically by placer. WARNING:tensorflow:From /workspace/OpenSeq2Seq/open_seq2seq/parts/cnns/conv_blocks.py:159: conv2d (from tensorflow.python.layers.convolutional) is deprecated and will be removed in a future version. Instructions for updating: Use keras.layers.conv2d instead. WARNING:tensorflow:From /workspace/OpenSeq2Seq/open_seq2seq/parts/cnns/conv_blocks.py:177: batch_normalization (from tensorflow.python.layers.normalization) is deprecated and will be removed in a future version. Instructions for updating: Use keras.layers.batch_normalization instead. WARNING:tensorflow:From /workspace/OpenSeq2Seq/open_seq2seq/encoders/ds2_encoder.py:387: dense (from tensorflow.python.layers.core) is deprecated and will be removed in a future version. Instructions for updating: Use keras.layers.dense instead. WARNING:tensorflow:From /workspace/OpenSeq2Seq/open_seq2seq/encoders/ds2_encoder.py:389: calling dropout (from tensorflow.python.ops.nn_ops) with keep_prob is deprecated and will be removed in a future version. Instructions for updating: Please use rate instead of keep_prob. Rate should be set to rate = 1 - keep_prob. Building graph on GPU:1 WARNING:tensorflow:From /usr/local/lib/python3.5/dist-packages/tensorflow/python/training/learning_rate_decay_v2.py:321: div (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version. Instructions for updating: Deprecated in favor of operator or tf.math.divide. Trainable variables: ForwardPass/ds2_encoder/conv1/kernel:0 shape: (11, 41, 1, 32), <dtype: 'float32_ref'> ForwardPass/ds2_encoder/conv1/bn/gamma:0 shape: (32,), <dtype: 'float32_ref'> ForwardPass/ds2_encoder/conv1/bn/beta:0 shape: (32,), <dtype: 'float32_ref'> ForwardPass/ds2_encoder/conv2/kernel:0 shape: (11, 21, 32, 64), <dtype: 'float32_ref'> ForwardPass/ds2_encoder/conv2/bn/gamma:0 shape: (64,), <dtype: 'float32_ref'> ForwardPass/ds2_encoder/conv2/bn/beta:0 shape: (64,), <dtype: 'float32_ref'> ForwardPass/ds2_encoder/conv3/kernel:0 shape: (11, 21, 64, 96), <dtype: 'float32_ref'> ForwardPass/ds2_encoder/conv3/bn/gamma:0 shape: (96,), <dtype: 'float32_ref'> ForwardPass/ds2_encoder/conv3/bn/beta:0 shape: (96,), <dtype: 'float32_ref'> ForwardPass/ds2_encoder/cudnn_gru/opaque_kernel:0 shape: , <dtype: 'float32_ref'> ForwardPass/ds2_encoder/fully_connected/kernel:0 shape: (1600, 1600), <dtype: 'float32_ref'> ForwardPass/ds2_encoder/fully_connected/bias:0 shape: (1600,), <dtype: 'float32_ref'> ForwardPass/fully_connected_ctc_decoder/fully_connected/kernel:0 shape: (1600, 66), <dtype: 'float32_ref'> ForwardPass/fully_connected_ctc_decoder/fully_connected/bias:0 shape: (66,), <dtype: 'float32_ref'> Encountered unknown variable shape, can't compute total number of parameters. Building graph on GPU:0 *** Building graph on GPU:1 2019-05-27 21:05:16.629783: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 3792535000 Hz 2019-05-27 21:05:16.631137: I tensorflow/compiler/xla/service/service.cc:161] XLA service 0x12990420 executing computations on platform Host. 
Devices: 2019-05-27 21:05:16.631165: I tensorflow/compiler/xla/service/service.cc:168] StreamExecutor device (0): , 2019-05-27 21:05:16.865771: I tensorflow/compiler/xla/service/service.cc:161] XLA service 0x12aacfb0 executing computations on platform CUDA. Devices: 2019-05-27 21:05:16.865821: I tensorflow/compiler/xla/service/service.cc:168] StreamExecutor device (0): GeForce GTX 1080 Ti, Compute Capability 6.1 2019-05-27 21:05:16.865832: I tensorflow/compiler/xla/service/service.cc:168] StreamExecutor device (1): GeForce GTX 1080 Ti, Compute Capability 6.1 2019-05-27 21:05:16.866588: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 0 with properties: name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.6325 pciBusID: 0000:0a:00.0 totalMemory: 10.92GiB freeMemory: 10.77GiB 2019-05-27 21:05:16.867113: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 1 with properties: name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.645 pciBusID: 0000:41:00.0 totalMemory: 10.91GiB freeMemory: 10.38GiB 2019-05-27 21:05:16.868351: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1512] Adding visible gpu devices: 0, 1 2019-05-27 21:05:18.512599: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] Device interconnect StreamExecutor with strength 1 edge matrix: 2019-05-27 21:05:18.512643: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990] 0 1 2019-05-27 21:05:18.512655: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 0: N Y 2019-05-27 21:05:18.512659: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 1: Y N 2019-05-27 21:05:18.513611: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10413 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:0a:00.0, compute capability: 6.1) 2019-05-27 21:05:18.514182: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 10034 MB memory) -> physical GPU (device: 1, name: GeForce GTX 1080 Ti, pci bus id: 0000:41:00.0, compute capability: 6.1) 2019-05-27 21:05:38.831218: I tensorflow/stream_executor/dso_loader.cc:153] successfully opened CUDA library libcublas.so.10 locally


Output from nvidia-smi (using GPUs 1 and 2; GPU 0 is being used by another process):

[screenshot: nvidia-smi output]
borisgin commented 5 years ago

Thanks, this looks like a bug. I will check with our TF team for a possible cause and solution.

borisgin commented 5 years ago

Can you check whether you can successfully run these NCCL tests on that machine? https://github.com/nvidia/nccl-tests
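For reference, a sketch of building and running the NCCL tests, following the nccl-tests README (the `NCCL_HOME` variable is only needed if NCCL is not installed in a default location):

```bash
git clone https://github.com/NVIDIA/nccl-tests.git
cd nccl-tests
make                       # or: make NCCL_HOME=/path/to/nccl

# All-reduce benchmark on 3 GPUs, message sizes from 8 bytes to 128 MB
./build/all_reduce_perf -b 8 -e 128M -f 2 -g 3
```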

muntasir2000 commented 5 years ago

I tried to run nccl-tests, but the test also hangs the same way OpenSeq2Seq does. All GPUs constantly show 100% utilization, yet it hangs. I'm trying to follow this: https://github.com/NVIDIA/caffe/issues/10

I'll post the result. Thanks.
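Since the NCCL test itself hangs, one plausible suspect on consumer motherboards is broken GPU peer-to-peer access (for example due to IOMMU/ACS settings); a hedged diagnostic sketch, assuming the nccl-tests binaries built above:

```bash
# Show the PCIe topology and which GPU pairs would use P2P
nvidia-smi topo -m

# Re-run the all-reduce test with verbose NCCL logging to see where it stalls
NCCL_DEBUG=INFO ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 3

# If it still hangs, disable P2P; if the test then passes, the P2P path is the culprit
NCCL_P2P_DISABLE=1 ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 3
```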

lorinczb commented 5 years ago

I am trying to run tacotron-gst on a single GPU, but it hangs at the same spot; it never gets past the line `Successfully opened dynamic library libcublas.so.10.0`. Was this issue resolved? I am running it on Colaboratory.
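A couple of generic checks that can help narrow down a hang at library-loading time (a sketch only, not specific to tacotron-gst; prefix each line with "!" when running in a Colab cell):

```bash
nvidia-smi                                    # driver version and GPU visibility
python -c "import tensorflow as tf; print(tf.__version__, tf.test.is_gpu_available())"
ldconfig -p | grep libcublas                  # which cuBLAS builds are on the loader path
```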

borisgin commented 5 years ago

Since this is not related to multi-GPU, can you open a new issue, "Tacotron hangs on single GPU", please? Please attach the following (a sketch of commands for collecting this information follows the list):

  1. System information: Ubuntu version, GPU, and driver version (nvidia-smi)
  2. TF container information
  3. Log file
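A minimal sketch of commands that collect the requested details (assuming an Ubuntu host and an NGC TF container; adjust as needed):

```bash
lsb_release -a                                      # 1. Ubuntu version
nvidia-smi                                          # 1. GPU model and driver version
docker images | grep 'nvcr.io/nvidia/tensorflow'    # 2. which NGC TF container tag is in use
# 3. attach the console output or the log files under the configured logdir
```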
Shikherneo2 commented 4 years ago

Was this problem ever resolved? I am facing the same issue as @lorinczb

MinaJf commented 3 years ago

I have the same issue; any new ideas?

swarajdalmia commented 3 years ago

Facing a similar issue with tacotron-GST. Any idea how to resolve it?