NVIDIA / OpenSeq2Seq

Toolkit for efficient experimentation with Speech Recognition, Text2Speech and NLP
https://nvidia.github.io/OpenSeq2Seq
Apache License 2.0

Process finished with exit code 139 (interrupted by signal 11: SIGSEGV) #342

Closed · vaibhav0195 closed this issue 5 years ago

vaibhav0195 commented 5 years ago

Hi, thanks for this great repository. I am trying to run it on a system with the following setup:
OS: Ubuntu 16.04
CUDA: 10.0
cuDNN: 7.4.2
TensorFlow: 1.12
Graphics card: NVIDIA RTX 2080 Ti

I have some audio files segmented into 1-3 second clips, and I also have CSV files with the headers [wav_filename,wav_filesize,transcript] for train, test, and evaluation (a small sketch of building such a manifest is at the end of this comment). I followed the installation instructions at https://nvidia.github.io/OpenSeq2Seq/html/installation.html, but when trying to start training with the DS2 config I always get the error with the following stack trace:


/home/yoda/.virtualenvs/openSeqSeqP3/bin/python /home/yoda/ML/OpenSeq2Seq/run.py --config_file=/home/yoda/ML/OpenSeq2Seq/example_configs/speech2text/ds2_small_1gpu.py --enable_logs --mode=train_eval Starting training from scratch Training config: {'batch_size_per_gpu': 32, 'data_layer': <class 'open_seq2seq.data.speech2text.speech2text.Speech2TextDataLayer'>, 'data_layer_params': {'augmentation': {'noise_level_max': -60, 'noise_level_min': -90, 'time_stretch_ratio': 0.05}, 'dataset_files': ['/media/yoda/gargantua/data_pb/data/voiceassistant/trainData/audioData/transcripts/agent/trainDataNov23_withOsSize_withoutlongsentence.csv'], 'input_type': 'spectrogram', 'num_audio_features': 96, 'shuffle': True, 'vocab_file': 'open_seq2seq/test_utils/toy_speech_data/vocab.txt'}, 'decoder': <class 'open_seq2seq.decoders.fc_decoders.FullyConnectedCTCDecoder'>, 'decoder_params': {'alpha': 2.0, 'alphabet_config_path': 'open_seq2seq/test_utils/toy_speech_data/vocab.txt', 'beam_width': 512, 'beta': 1.0, 'decoder_library_path': 'ctc_decoder_with_lm/libctc_decoder_with_kenlm.so', 'lm_path': 'language_model/4-gram.binary', 'trie_path': 'language_model/trie.binary', 'use_language_model': False}, 'dtype': tf.float32, 'encoder': <class 'open_seq2seq.encoders.ds2_encoder.DeepSpeech2Encoder'>, 'encoder_params': {'activation_fn': <function relu at 0x7f44faf44510>, 'conv_layers': [{'kernel_size': [11, 41], 'num_channels': 32, 'padding': 'SAME', 'stride': [2, 2]}, {'kernel_size': [11, 21], 'num_channels': 32, 'padding': 'SAME', 'stride': [1, 2]}], 'data_format': 'channels_first', 'dropout_keep_prob': 0.5, 'n_hidden': 1024, 'num_rnn_layers': 2, 'rnn_cell_dim': 512, 'rnn_type': 'cudnn_gru', 'rnn_unidirectional': False, 'row_conv': False, 'use_cudnn_rnn': True}, 'eval_steps': 5000, 'initializer': <function xavier_initializer at 0x7f44c5ebb950>, 'load_model': '', 'logdir': 'experiments/librispeech-quick/logs', 'loss': <class 'open_seq2seq.losses.ctc_loss.CTCLoss'>, 'loss_params': {}, 'lr_policy': <function exp_decay at 0x7f44bfa1abf8>, 'lr_policy_params': {'begin_decay_at': 0, 'decay_rate': 0.9, 'decay_steps': 5000, 'learning_rate': 0.0001, 'min_lr': 0.0, 'use_staircase_decay': True}, 'num_epochs': 12, 'num_gpus': 1, 'optimizer': 'Adam', 'optimizer_params': {}, 'print_loss_steps': 10, 'print_samples_steps': 5000, 'random_seed': 0, 'regularizer': <function l2_regularizer at 0x7f44c7ced488>, 'regularizer_params': {'scale': 0.0005}, 'save_checkpoint_steps': 1000, 'save_summaries_steps': 100, 'summaries': ['learning_rate', 'variables', 'gradients', 'larc_summaries', 'variable_norm', 'gradient_norm', 'global_gradient_norm'], 'use_horovod': False} Evaluation config: {'batch_size_per_gpu': 32, 'data_layer': <class 'open_seq2seq.data.speech2text.speech2text.Speech2TextDataLayer'>, 'data_layer_params': {'dataset_files': ['/media/yoda/gargantua/data_pb/data/voiceassistant/trainData/audioData/transcripts/agent/valDataNov23_withOsSize_withoutlongsentence.csv'], 'input_type': 'spectrogram', 'num_audio_features': 96, 'shuffle': False, 'vocab_file': 'open_seq2seq/test_utils/toy_speech_data/vocab.txt'}, 'decoder': <class 'open_seq2seq.decoders.fc_decoders.FullyConnectedCTCDecoder'>, 'decoder_params': {'alpha': 2.0, 'alphabet_config_path': 'open_seq2seq/test_utils/toy_speech_data/vocab.txt', 'beam_width': 512, 'beta': 1.0, 'decoder_library_path': 'ctc_decoder_with_lm/libctc_decoder_with_kenlm.so', 'lm_path': 'language_model/4-gram.binary', 'trie_path': 'language_model/trie.binary', 'use_language_model': False}, 'dtype': tf.float32, 
'encoder': <class 'open_seq2seq.encoders.ds2_encoder.DeepSpeech2Encoder'>, 'encoder_params': {'activation_fn': <function relu at 0x7f44faf44510>, 'conv_layers': [{'kernel_size': [11, 41], 'num_channels': 32, 'padding': 'SAME', 'stride': [2, 2]}, {'kernel_size': [11, 21], 'num_channels': 32, 'padding': 'SAME', 'stride': [1, 2]}], 'data_format': 'channels_first', 'dropout_keep_prob': 0.5, 'n_hidden': 1024, 'num_rnn_layers': 2, 'rnn_cell_dim': 512, 'rnn_type': 'cudnn_gru', 'rnn_unidirectional': False, 'row_conv': False, 'use_cudnn_rnn': True}, 'eval_steps': 5000, 'initializer': <function xavier_initializer at 0x7f44c5ebb950>, 'load_model': '', 'logdir': 'experiments/librispeech-quick/logs', 'loss': <class 'open_seq2seq.losses.ctc_loss.CTCLoss'>, 'loss_params': {}, 'lr_policy': <function exp_decay at 0x7f44bfa1abf8>, 'lr_policy_params': {'begin_decay_at': 0, 'decay_rate': 0.9, 'decay_steps': 5000, 'learning_rate': 0.0001, 'min_lr': 0.0, 'use_staircase_decay': True}, 'num_epochs': 12, 'num_gpus': 1, 'optimizer': 'Adam', 'optimizer_params': {}, 'print_loss_steps': 10, 'print_samples_steps': 5000, 'random_seed': 0, 'regularizer': <function l2_regularizer at 0x7f44c7ced488>, 'regularizer_params': {'scale': 0.0005}, 'save_checkpoint_steps': 1000, 'save_summaries_steps': 100, 'summaries': ['learning_rate', 'variables', 'gradients', 'larc_summaries', 'variable_norm', 'gradient_norm', 'global_gradient_norm'], 'use_horovod': False} Building graph on GPU:0 WARNING:tensorflow:From /media/yoda/jobs/ML/OpenSeq2Seq/open_seq2seq/data/speech2text/speech2text.py:156: py_func (from tensorflow.python.ops.script_ops) is deprecated and will be removed in a future version. Instructions for updating: tf.py_func is deprecated in TF V2. Instead, use tf.py_function, which takes a python function which manipulates tf eager tensors instead of numpy arrays. It's easy to convert a tf eager tensor to an ndarray (just call tensor.numpy()) but having access to eager tensors means tf.py_functions can use accelerators such as GPUs as well as being differentiable using a gradient tape.

WARNING:tensorflow:From /media/yoda/jobs/ML/OpenSeq2Seq/open_seq2seq/data/speech2text/speech2text.py:210: DatasetV1.make_initializable_iterator (from tensorflow.python.data.ops.dataset_ops) is deprecated and will be removed in a future version. Instructions for updating: Use for ... in dataset: to iterate over a dataset. If using tf.estimator, return the Dataset object directly from your input function. As a last resort, you can use tf.compat.v1.data.make_initializable_iterator(dataset). WARNING:tensorflow:From /home/yoda/.virtualenvs/openSeqSeqP3/lib/python3.5/site-packages/tensorflow/python/data/ops/dataset_ops.py:1458: colocate_with (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version. Instructions for updating: Colocations handled automatically by placer. WARNING:tensorflow:From /media/yoda/jobs/ML/OpenSeq2Seq/open_seq2seq/parts/cnns/conv_blocks.py:147: conv2d (from tensorflow.python.layers.convolutional) is deprecated and will be removed in a future version. Instructions for updating: Use tf.keras.layers.Conv2D instead. WARNING:tensorflow:From /media/yoda/jobs/ML/OpenSeq2Seq/open_seq2seq/parts/cnns/conv_blocks.py:165: batch_normalization (from tensorflow.python.layers.normalization) is deprecated and will be removed in a future version. Instructions for updating: Use keras.layers.batch_normalization instead. WARNING:tensorflow:From /home/yoda/.virtualenvs/openSeqSeqP3/lib/python3.5/site-packages/tensorflow/contrib/cudnn_rnn/python/layers/cudnn_rnn.py:340: calling GlorotUniform.init (from tensorflow.python.ops.init_ops) with dtype is deprecated and will be removed in a future version. Instructions for updating: Call initializer instance with the dtype argument instead of passing it to the constructor WARNING:tensorflow:From /home/yoda/.virtualenvs/openSeqSeqP3/lib/python3.5/site-packages/tensorflow/python/ops/init_ops.py:1253: calling VarianceScaling.init (from tensorflow.python.ops.init_ops) with dtype is deprecated and will be removed in a future version. Instructions for updating: Call initializer instance with the dtype argument instead of passing it to the constructor WARNING:tensorflow:From /home/yoda/.virtualenvs/openSeqSeqP3/lib/python3.5/site-packages/tensorflow/contrib/cudnn_rnn/python/layers/cudnn_rnn.py:343: calling Constant.init (from tensorflow.python.ops.init_ops) with dtype is deprecated and will be removed in a future version. Instructions for updating: Call initializer instance with the dtype argument instead of passing it to the constructor WARNING:tensorflow:From /media/yoda/jobs/ML/OpenSeq2Seq/open_seq2seq/encoders/ds2_encoder.py:331: dense (from tensorflow.python.layers.core) is deprecated and will be removed in a future version. Instructions for updating: Use keras.layers.dense instead. WARNING:tensorflow:From /media/yoda/jobs/ML/OpenSeq2Seq/open_seq2seq/encoders/ds2_encoder.py:333: calling dropout (from tensorflow.python.ops.nn_ops) with keep_prob is deprecated and will be removed in a future version. Instructions for updating: Please use rate instead of keep_prob. Rate should be set to rate = 1 - keep_prob. WARNING:tensorflow:From /home/yoda/.virtualenvs/openSeqSeqP3/lib/python3.5/site-packages/tensorflow/python/training/slot_creator.py:187: calling Zeros.init (from tensorflow.python.ops.init_ops) with dtype is deprecated and will be removed in a future version. 
Instructions for updating: Call initializer instance with the dtype argument instead of passing it to the constructor WARNING:tensorflow:From /home/yoda/.virtualenvs/openSeqSeqP3/lib/python3.5/site-packages/tensorflow/python/training/slot_creator.py:193: Variable.initialized_value (from tensorflow.python.ops.variables) is deprecated and will be removed in a future version. Instructions for updating: Use Variable.read_value. Variables in 2.X are initialized automatically both in eager and graph (inside tf.defun) contexts. Trainable variables: ForwardPass/ds2_encoder/conv1/kernel:0 shape: (11, 41, 1, 32), <dtype: 'float32_ref'> ForwardPass/ds2_encoder/conv1/bn/gamma:0 shape: (32,), <dtype: 'float32_ref'> ForwardPass/ds2_encoder/conv1/bn/beta:0 shape: (32,), <dtype: 'float32_ref'> ForwardPass/ds2_encoder/conv2/kernel:0 shape: (11, 21, 32, 32), <dtype: 'float32_ref'> ForwardPass/ds2_encoder/conv2/bn/gamma:0 shape: (32,), <dtype: 'float32_ref'> ForwardPass/ds2_encoder/conv2/bn/beta:0 shape: (32,), <dtype: 'float32_ref'> ForwardPass/ds2_encoder/cudnn_gru/opaque_kernel:0 shape: , <dtype: 'float32_ref'> ForwardPass/ds2_encoder/fully_connected/kernel:0 shape: (1024, 1024), <dtype: 'float32_ref'> ForwardPass/ds2_encoder/fully_connected/bias:0 shape: (1024,), <dtype: 'float32_ref'> ForwardPass/fully_connected_ctc_decoder/fully_connected/kernel:0 shape: (1024, 30), <dtype: 'float32_ref'> ForwardPass/fully_connected_ctc_decoder/fully_connected/bias:0 shape: (30,), <dtype: 'float32_ref'> Encountered unknown variable shape, can't compute total number of parameters. *** Building graph on GPU:0 2019-01-18 15:43:21.868949: I tensorflow/stream_executor/platform/default/dso_loader.cc:154] successfully opened CUDA library libcuda.so.1 locally 2019-01-18 15:43:21.985261: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1003] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero 2019-01-18 15:43:21.985771: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1434] Found device 0 with properties: name: GeForce RTX 2080 Ti major: 7 minor: 5 memoryClockRate(GHz): 1.65 pciBusID: 0000:03:00.0 totalMemory: 10.73GiB freeMemory: 10.01GiB 2019-01-18 15:43:21.985785: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1513] Adding visible gpu devices: 0 2019-01-18 15:43:21.986483: I tensorflow/core/common_runtime/gpu/gpu_device.cc:985] Device interconnect StreamExecutor with strength 1 edge matrix: 2019-01-18 15:43:21.986491: I tensorflow/core/common_runtime/gpu/gpu_device.cc:991] 0 2019-01-18 15:43:21.986495: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1004] 0: N 2019-01-18 15:43:21.986658: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1116] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 9735 MB memory) -> physical GPU (device: 0, name: GeForce RTX 2080 Ti, pci bus id: 0000:03:00.0, compute capability: 7.5) 2019-01-18 15:43:22.437205: I tensorflow/stream_executor/platform/default/dso_loader.cc:154] successfully opened CUDA library libcudnn.so.7 locally 2019-01-18 15:43:25.626130: I tensorflow/stream_executor/platform/default/dso_loader.cc:154] successfully opened CUDA library libcublas.so.10.0 locally

Process finished with exit code 139 (interrupted by signal 11: SIGSEGV)


Note: I think the error is in the initialization of feed_dict, since on my initial iteration it is set to feed_dict={}, which in turn seems to trigger this error. Any help is appreciated. Thanks :)
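For reference, here is a minimal sketch of how a manifest CSV with those headers can be generated; the clip paths and transcripts below are placeholders, not my actual data:

```python
# Minimal sketch: build a manifest CSV with the headers the Speech2Text
# data layer expects: wav_filename, wav_filesize, transcript.
# The clip paths and transcripts here are placeholders for illustration.
import csv
import os

clips = [
    ("/data/clips/utt_0001.wav", "turn on the lights"),
    ("/data/clips/utt_0002.wav", "what is the weather today"),
]

with open("train_manifest.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["wav_filename", "wav_filesize", "transcript"])
    for path, transcript in clips:
        # wav_filesize is the file size in bytes of the audio clip
        writer.writerow([path, os.path.getsize(path), transcript])
```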

borisgin commented 5 years ago

If you work inside the NVIDIA container (see the new instructions at https://nvidia.github.io/OpenSeq2Seq/html/installation.html), do you still get the SIGSEGV?
Note: if you want to use your dataset inside Docker, you have to add "-v /mydata:/mydata" to mount your data inside the container: nvidia-docker run -it --rm -v /mydata:/mydata nvcr.io/nvidia/tensorflow:18.12-py3

vaibhav0195 commented 5 years ago

Hi @borisgin, thanks for your response. I can try installing the whole environment using the NVIDIA container; so far I have not been running the NVIDIA container, and I built TensorFlow from source using Bazel. Can you suggest anything for that setup?

Edit: When I used the nvidia-docker instructions, the training works perfectly. Thanks :)

So is the issue with my TensorFlow installation? And what mistake am I making when I run it without Docker?

Thanks

borisgin commented 5 years ago

You can build TensorFlow from source using the r1.11 branch instead of master.
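As a quick sanity check after rebuilding (this snippet assumes a TF 1.x install in the active virtualenv), you can confirm which TensorFlow build is actually being imported and whether it was built with CUDA:

```python
# Sanity check: print the exact TF version/commit and CUDA support of
# the wheel that is actually imported from the active environment.
import tensorflow as tf

print("TF version:   ", tf.VERSION)        # e.g. 1.11.0
print("Git version:  ", tf.GIT_VERSION)    # commit the wheel was built from
print("Built w/ CUDA:", tf.test.is_built_with_cuda())
print("GPU available:", tf.test.is_gpu_available())
```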

vaibhav0195 commented 5 years ago

Ok, thanks, I will try it.

Edit: Thanks @borisgin, it worked. I ended up installing TF 1.10 instead, but it worked, thanks :) Closing this issue. But can you tell me why this behaviour was happening?

Thanks :)