NVIDIA / OpenSeq2Seq

Toolkit for efficient experimentation with Speech Recognition, Text2Speech and NLP
https://nvidia.github.io/OpenSeq2Seq
Apache License 2.0

Out of memory during toy speech example? #329

Closed Bancherd-DeLong closed 5 years ago

Bancherd-DeLong commented 5 years ago

Hi: I managed to build TensorFlow with the CTC decoder, then tried to run the toy speech example and then the unittest examples; both resulted in a Segmentation Fault. Shortly afterwards, Ubuntu reported an application error in `__init__.py`. System: Ubuntu 16.04, RAM = 32 GB, single GTX 1080 GPU (8 GB), Python 3.6, CUDA 9.2, cuDNN 7.4, Bazel 0.21, TensorFlow 1.12.
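If it helps to narrow down where the crash happens, I can re-run with the standard-library faulthandler enabled so that a Python-level traceback is dumped at the moment of the segfault. A minimal sketch (added at the very top of run.py, before TensorFlow is imported):

```python
# Minimal sketch: enable faulthandler before importing TensorFlow so that a
# Python traceback is written to stderr when the process receives SIGSEGV.
# This does not fix the crash; it only shows which Python call triggered it.
import faulthandler
faulthandler.enable()

import tensorflow as tf  # the rest of run.py follows unchanged
```

(Equivalently, launching with `python -X faulthandler run.py ...` enables the same handler without editing the script.)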

### Here is the output from the toy speech example:

bancherd2@bancherd2-desktop:~/OpenSeq2Seq$ CUDA_VISIBLE_DEVICES=0 python run.py --config_file=example_configs/speech2text/ds2_toy_config.py --mode=train_eval /home/bancherd2/anaconda3/lib/python3.6/site-packages/h5py/__init__.py:34: FutureWarning: Conversion of the second argument of issubdtype fromfloattonp.floatingis deprecated. In future, it will be treated asnp.float64 == np.dtype(float).type. from ._conv import register_converters as _register_converters *** Starting training from scratch *** Training config: {'batch_size_per_gpu': 2, 'data_layer': <class 'open_seq2seq.data.speech2text.speech2text.Speech2TextDataLayer'>, 'data_layer_params': {'dataset_files': ['open_seq2seq/test_utils/toy_speech_data/toy_data.csv'], 'input_type': 'spectrogram', 'num_audio_features': 160, 'shuffle': True, 'vocab_file': 'open_seq2seq/test_utils/toy_speech_data/vocab.txt'}, 'decoder': <class 'open_seq2seq.decoders.fc_decoders.FullyConnectedCTCDecoder'>, 'decoder_params': {'alpha': 1.0, 'alphabet_config_path': 'open_seq2seq/test_utils/toy_speech_data/vocab.txt', 'beam_width': 64, 'beta': 1.5, 'decoder_library_path': 'ctc_decoder_with_lm/libctc_decoder_with_kenlm.so', 'initializer': <function xavier_initializer at 0x7f0a794b8a60>, 'lm_path': 'language_model/4-gram.binary', 'trie_path': 'language_model/trie.binary', 'use_language_model': False}, 'dtype': tf.float32, 'encoder': <class 'open_seq2seq.encoders.ds2_encoder.DeepSpeech2Encoder'>, 'encoder_params': {'activation_fn': <function <lambda> at 0x7f0a8be8b400>, 'conv_layers': [{'kernel_size': [11, 41], 'num_channels': 32, 'padding': 'SAME', 'stride': [2, 2]}, {'kernel_size': [11, 21], 'num_channels': 64, 'padding': 'SAME', 'stride': [1, 2]}, {'kernel_size': [11, 21], 'num_channels': 96, 'padding': 'SAME', 'stride': [1, 2]}], 'data_format': 'channels_first', 'dropout_keep_prob': 1.0, 'initializer': <function xavier_initializer at 0x7f0a794b8a60>, 'initializer_params': {'uniform': False}, 'n_hidden': 256, 'num_rnn_layers': 1, 'rnn_cell_dim': 256, 'rnn_type': 'gru', 'rnn_unidirectional': False, 'row_conv': False, 'row_conv_width': 8, 'use_cudnn_rnn': True}, 'eval_steps': 50, 'larc_params': {'larc_eta': 0.001}, 'load_model': '', 'logdir': 'tmp_log_folder', 'loss': <class 'open_seq2seq.losses.ctc_loss.CTCLoss'>, 'loss_params': {}, 'lr_policy': <function poly_decay at 0x7f0a71ffe158>, 'lr_policy_params': {'learning_rate': 0.001, 'power': 2}, 'num_epochs': 100, 'num_gpus': 2, 'optimizer': 'Momentum', 'optimizer_params': {'momentum': 0.9}, 'print_loss_steps': 10, 'print_samples_steps': 20, 'random_seed': 0, 'save_checkpoint_steps': 50, 'save_summaries_steps': 10, 'summaries': ['learning_rate', 'variables', 'gradients', 'larc_summaries', 'variable_norm', 'gradient_norm', 'global_gradient_norm'], 'use_horovod': False} *** Evaluation config: {'batch_size_per_gpu': 2, 'data_layer': <class 'open_seq2seq.data.speech2text.speech2text.Speech2TextDataLayer'>, 'data_layer_params': {'dataset_files': ['open_seq2seq/test_utils/toy_speech_data/toy_data.csv'], 'input_type': 'spectrogram', 'num_audio_features': 160, 'shuffle': False, 'vocab_file': 'open_seq2seq/test_utils/toy_speech_data/vocab.txt'}, 'decoder': <class 'open_seq2seq.decoders.fc_decoders.FullyConnectedCTCDecoder'>, 'decoder_params': {'alpha': 1.0, 'alphabet_config_path': 'open_seq2seq/test_utils/toy_speech_data/vocab.txt', 'beam_width': 64, 'beta': 1.5, 'decoder_library_path': 'ctc_decoder_with_lm/libctc_decoder_with_kenlm.so', 'initializer': <function xavier_initializer at 0x7f0a794b8a60>, 'lm_path': 
'language_model/4-gram.binary', 'trie_path': 'language_model/trie.binary', 'use_language_model': False}, 'dtype': tf.float32, 'encoder': <class 'open_seq2seq.encoders.ds2_encoder.DeepSpeech2Encoder'>, 'encoder_params': {'activation_fn': <function <lambda> at 0x7f0a8be8b400>, 'conv_layers': [{'kernel_size': [11, 41], 'num_channels': 32, 'padding': 'SAME', 'stride': [2, 2]}, {'kernel_size': [11, 21], 'num_channels': 64, 'padding': 'SAME', 'stride': [1, 2]}, {'kernel_size': [11, 21], 'num_channels': 96, 'padding': 'SAME', 'stride': [1, 2]}], 'data_format': 'channels_first', 'dropout_keep_prob': 1.0, 'initializer': <function xavier_initializer at 0x7f0a794b8a60>, 'initializer_params': {'uniform': False}, 'n_hidden': 256, 'num_rnn_layers': 1, 'rnn_cell_dim': 256, 'rnn_type': 'gru', 'rnn_unidirectional': False, 'row_conv': False, 'row_conv_width': 8, 'use_cudnn_rnn': True}, 'eval_steps': 50, 'larc_params': {'larc_eta': 0.001}, 'load_model': '', 'logdir': 'tmp_log_folder', 'loss': <class 'open_seq2seq.losses.ctc_loss.CTCLoss'>, 'loss_params': {}, 'lr_policy': <function poly_decay at 0x7f0a71ffe158>, 'lr_policy_params': {'learning_rate': 0.001, 'power': 2}, 'num_epochs': 100, 'num_gpus': 2, 'optimizer': 'Momentum', 'optimizer_params': {'momentum': 0.9}, 'print_loss_steps': 10, 'print_samples_steps': 20, 'random_seed': 0, 'save_checkpoint_steps': 50, 'save_summaries_steps': 10, 'summaries': ['learning_rate', 'variables', 'gradients', 'larc_summaries', 'variable_norm', 'gradient_norm', 'global_gradient_norm'], 'use_horovod': False} *** Building graph on GPU:0 WARNING:tensorflow:From /home/bancherd2/OpenSeq2Seq/open_seq2seq/data/speech2text/speech2text.py:156: py_func (from tensorflow.python.ops.script_ops) is deprecated and will be removed in a future version. Instructions for updating: tf.py_func is deprecated in TF V2. Instead, use tf.py_function, which takes a python function which manipulates tf eager tensors instead of numpy arrays. It's easy to convert a tf eager tensor to an ndarray (just call tensor.numpy()) but having access to eager tensors meanstf.py_function`s can use accelerators such as GPUs as well as being differentiable using a gradient tape.

WARNING:tensorflow:From /home/bancherd2/OpenSeq2Seq/open_seq2seq/data/speech2text/speech2text.py:210: DatasetV1.make_initializable_iterator (from tensorflow.python.data.ops.dataset_ops) is deprecated and will be removed in a future version. Instructions for updating: Use for ... in dataset: to iterate over a dataset. If using tf.estimator, return the Dataset object directly from your input function. As a last resort, you can use tf.compat.v1.data.make_initializable_iterator(dataset). WARNING:tensorflow:From /home/bancherd2/anaconda3/lib/python3.6/site-packages/tensorflow/python/data/ops/dataset_ops.py:1458: colocate_with (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version. Instructions for updating: Colocations handled automatically by placer. WARNING:tensorflow:From /home/bancherd2/OpenSeq2Seq/open_seq2seq/parts/cnns/conv_blocks.py:147: conv2d (from tensorflow.python.layers.convolutional) is deprecated and will be removed in a future version. Instructions for updating: Use keras.layers.conv2d instead. WARNING:tensorflow:From /home/bancherd2/OpenSeq2Seq/open_seq2seq/parts/cnns/conv_blocks.py:165: batch_normalization (from tensorflow.python.layers.normalization) is deprecated and will be removed in a future version. Instructions for updating: Use keras.layers.batch_normalization instead. WARNING:tensorflow:From /home/bancherd2/anaconda3/lib/python3.6/site-packages/tensorflow/contrib/cudnn_rnn/python/layers/cudnn_rnn.py:340: calling GlorotUniform.init (from tensorflow.python.ops.init_ops) with dtype is deprecated and will be removed in a future version. Instructions for updating: Call initializer instance with the dtype argument instead of passing it to the constructor WARNING:tensorflow:From /home/bancherd2/anaconda3/lib/python3.6/site-packages/tensorflow/python/ops/init_ops.py:1253: calling VarianceScaling.init (from tensorflow.python.ops.init_ops) with dtype is deprecated and will be removed in a future version. Instructions for updating: Call initializer instance with the dtype argument instead of passing it to the constructor WARNING:tensorflow:From /home/bancherd2/anaconda3/lib/python3.6/site-packages/tensorflow/contrib/cudnn_rnn/python/layers/cudnn_rnn.py:343: calling Constant.init (from tensorflow.python.ops.init_ops) with dtype is deprecated and will be removed in a future version. Instructions for updating: Call initializer instance with the dtype argument instead of passing it to the constructor WARNING:tensorflow:From /home/bancherd2/OpenSeq2Seq/open_seq2seq/encoders/ds2_encoder.py:331: dense (from tensorflow.python.layers.core) is deprecated and will be removed in a future version. Instructions for updating: Use keras.layers.dense instead. WARNING:tensorflow:From /home/bancherd2/OpenSeq2Seq/open_seq2seq/encoders/ds2_encoder.py:333: calling dropout (from tensorflow.python.ops.nn_ops) with keep_prob is deprecated and will be removed in a future version. Instructions for updating: Please use rate instead of keep_prob. Rate should be set to rate = 1 - keep_prob. Building graph on GPU:1 WARNING:tensorflow:From /home/bancherd2/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/learning_rate_decay_v2.py:321: div (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version. Instructions for updating: Deprecated in favor of operator or tf.math.divide. 
WARNING:tensorflow:From /home/bancherd2/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/slot_creator.py:189: calling Zeros.init (from tensorflow.python.ops.init_ops) with dtype is deprecated and will be removed in a future version. Instructions for updating: Call initializer instance with the dtype argument instead of passing it to the constructor WARNING:tensorflow:From /home/bancherd2/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/slot_creator.py:195: Variable.initialized_value (from tensorflow.python.ops.variables) is deprecated and will be removed in a future version. Instructions for updating: Use Variable.read_value. Variables in 2.X are initialized automatically both in eager and graph (inside tf.defun) contexts. Trainable variables: ForwardPass/ds2_encoder/conv1/kernel:0 shape: (11, 41, 1, 32), <dtype: 'float32_ref'> ForwardPass/ds2_encoder/conv1/bn/gamma:0 shape: (32,), <dtype: 'float32_ref'> ForwardPass/ds2_encoder/conv1/bn/beta:0 shape: (32,), <dtype: 'float32_ref'> ForwardPass/ds2_encoder/conv2/kernel:0 shape: (11, 21, 32, 64), <dtype: 'float32_ref'> ForwardPass/ds2_encoder/conv2/bn/gamma:0 shape: (64,), <dtype: 'float32_ref'> ForwardPass/ds2_encoder/conv2/bn/beta:0 shape: (64,), <dtype: 'float32_ref'> ForwardPass/ds2_encoder/conv3/kernel:0 shape: (11, 21, 64, 96), <dtype: 'float32_ref'> ForwardPass/ds2_encoder/conv3/bn/gamma:0 shape: (96,), <dtype: 'float32_ref'> ForwardPass/ds2_encoder/conv3/bn/beta:0 shape: (96,), <dtype: 'float32_ref'> ForwardPass/ds2_encoder/cudnn_gru/opaque_kernel:0 shape: , <dtype: 'float32_ref'> ForwardPass/ds2_encoder/fully_connected/kernel:0 shape: (512, 256), <dtype: 'float32_ref'> ForwardPass/ds2_encoder/fully_connected/bias:0 shape: (256,), <dtype: 'float32_ref'> ForwardPass/fully_connected_ctc_decoder/fully_connected/kernel:0 shape: (256, 29), <dtype: 'float32_ref'> ForwardPass/fully_connected_ctc_decoder/fully_connected/bias:0 shape: (29,), <dtype: 'float32_ref'> Encountered unknown variable shape, can't compute total number of parameters. Building graph on GPU:0 *** Building graph on GPU:1 2019-01-07 20:24:56.292675: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 3598710000 Hz 2019-01-07 20:24:56.292978: I tensorflow/compiler/xla/service/service.cc:161] XLA service 0x557297ad6480 executing computations on platform Host. Devices: 2019-01-07 20:24:56.292994: I tensorflow/compiler/xla/service/service.cc:168] StreamExecutor device (0): , 2019-01-07 20:24:56.414853: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:998] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero 2019-01-07 20:24:56.415296: I tensorflow/compiler/xla/service/service.cc:161] XLA service 0x557297aade40 executing computations on platform CUDA. 
Devices: 2019-01-07 20:24:56.415314: I tensorflow/compiler/xla/service/service.cc:168] StreamExecutor device (0): GeForce GTX 1080, Compute Capability 6.1 2019-01-07 20:24:56.415539: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1434] Found device 0 with properties: name: GeForce GTX 1080 major: 6 minor: 1 memoryClockRate(GHz): 1.8225 pciBusID: 0000:01:00.0 totalMemory: 7.93GiB freeMemory: 7.61GiB 2019-01-07 20:24:56.415551: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1513] Adding visible gpu devices: 0 2019-01-07 20:24:56.616598: I tensorflow/core/common_runtime/gpu/gpu_device.cc:985] Device interconnect StreamExecutor with strength 1 edge matrix: 2019-01-07 20:24:56.616631: I tensorflow/core/common_runtime/gpu/gpu_device.cc:991] 0 2019-01-07 20:24:56.616640: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1004] 0: N 2019-01-07 20:24:56.616797: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1116] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 7332 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080, pci bus id: 0000:01:00.0, compute capability: 6.1) 2019-01-07 20:24:56.751173: I tensorflow/stream_executor/platform/default/dso_loader.cc:154] successfully opened CUDA library libcudnn.so.7 locally 2019-01-07 20:24:57.273935: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1513] Adding visible gpu devices: 0 2019-01-07 20:24:57.273987: I tensorflow/core/common_runtime/gpu/gpu_device.cc:985] Device interconnect StreamExecutor with strength 1 edge matrix: 2019-01-07 20:24:57.273994: I tensorflow/core/common_runtime/gpu/gpu_device.cc:991] 0 2019-01-07 20:24:57.273999: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1004] 0: N 2019-01-07 20:24:57.274134: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1116] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 7332 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080, pci bus id: 0000:01:00.0, compute capability: 6.1) 2019-01-07 20:24:57.274527: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1513] Adding visible gpu devices: 0 2019-01-07 20:24:57.274561: I tensorflow/core/common_runtime/gpu/gpu_device.cc:985] Device interconnect StreamExecutor with strength 1 edge matrix: 2019-01-07 20:24:57.274579: I tensorflow/core/common_runtime/gpu/gpu_device.cc:991] 0 2019-01-07 20:24:57.274583: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1004] 0: N 2019-01-07 20:24:57.274788: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1116] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 7332 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080, pci bus id: 0000:01:00.0, compute capability: 6.1) 2019-01-07 20:24:58.788262: I tensorflow/stream_executor/platform/default/dso_loader.cc:154] successfully opened CUDA library libcublas.so.9.2 locally Segmentation fault (core dumped)`


### Here is output from unittests example: bancherd2@bancherd2-desktop:~/OpenSeq2Seq$ CUDA_VISIBLE_DEVICES=0 python -m unittest discover -s open_seq2seq -p '*_test.py' /home/bancherd2/anaconda3/lib/python3.6/site-packages/h5py/__init__.py:34: FutureWarning: Conversion of the second argument of issubdtype fromfloattonp.floatingis deprecated. In future, it will be treated asnp.float64 == np.dtype(float).type. from ._conv import register_converters as _register_converters /home/bancherd2/anaconda3/lib/python3.6/site-packages/scipy/io/wavfile.py:129: DeprecationWarning: The binary mode of fromstring is deprecated, as it behaves surprisingly on unicode inputs. Use frombuffer instead data = numpy.fromstring(fid.read(size), dtype=dtype) ./home/bancherd2/anaconda3/lib/python3.6/site-packages/scipy/io/wavfile.py:129: DeprecationWarning: The binary mode of fromstring is deprecated, as it behaves surprisingly on unicode inputs. Use frombuffer instead data = numpy.fromstring(fid.read(size), dtype=dtype) ...sWARNING:tensorflow:From /home/bancherd2/OpenSeq2Seq/open_seq2seq/data/text2text/text2text.py:199: py_func (from tensorflow.python.ops.script_ops) is deprecated and will be removed in a future version. Instructions for updating: tf.py_func is deprecated in TF V2. Instead, use tf.py_function, which takes a python function which manipulates tf eager tensors instead of numpy arrays. It's easy to convert a tf eager tensor to an ndarray (just call tensor.numpy()) but having access to eager tensors meanstf.py_function`s can use accelerators such as GPUs as well as being differentiable using a gradient tape.

WARNING:tensorflow:From /home/bancherd2/OpenSeq2Seq/open_seq2seq/data/text2text/text2text.py:237: DatasetV1.make_initializable_iterator (from tensorflow.python.data.ops.dataset_ops) is deprecated and will be removed in a future version. Instructions for updating: Use for ... in dataset: to iterate over a dataset. If using tf.estimator, return the Dataset object directly from your input function. As a last resort, you can use tf.compat.v1.data.make_initializable_iterator(dataset). WARNING:tensorflow:From /home/bancherd2/anaconda3/lib/python3.6/site-packages/tensorflow/python/data/ops/dataset_ops.py:1458: colocate_with (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version. Instructions for updating: Colocations handled automatically by placer. 10 10 WARNING:tensorflow:From /home/bancherd2/anaconda3/lib/python3.6/contextlib.py:60: TensorFlowTestCase.test_session (from tensorflow.python.framework.test_util) is deprecated and will be removed in a future version. Instructions for updating: Use self.session() or self.cached_session() instead. 2019-01-07 20:33:33.164661: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 3598710000 Hz 2019-01-07 20:33:33.164956: I tensorflow/compiler/xla/service/service.cc:161] XLA service 0x55ef049e58a0 executing computations on platform Host. Devices: 2019-01-07 20:33:33.164979: I tensorflow/compiler/xla/service/service.cc:168] StreamExecutor device (0): , 2019-01-07 20:33:33.275604: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:998] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero 2019-01-07 20:33:33.276066: I tensorflow/compiler/xla/service/service.cc:161] XLA service 0x55ef04ea2130 executing computations on platform CUDA. 
Devices: 2019-01-07 20:33:33.276086: I tensorflow/compiler/xla/service/service.cc:168] StreamExecutor device (0): GeForce GTX 1080, Compute Capability 6.1 2019-01-07 20:33:33.276336: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1434] Found device 0 with properties: name: GeForce GTX 1080 major: 6 minor: 1 memoryClockRate(GHz): 1.8225 pciBusID: 0000:01:00.0 totalMemory: 7.93GiB freeMemory: 7.61GiB 2019-01-07 20:33:33.276360: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1513] Adding visible gpu devices: 0 2019-01-07 20:33:33.477898: I tensorflow/core/common_runtime/gpu/gpu_device.cc:985] Device interconnect StreamExecutor with strength 1 edge matrix: 2019-01-07 20:33:33.477931: I tensorflow/core/common_runtime/gpu/gpu_device.cc:991] 0 2019-01-07 20:33:33.477936: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1004] 0: N 2019-01-07 20:33:33.478113: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1116] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 2435 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080, pci bus id: 0000:01:00.0, compute capability: 6.1) 2019-01-07 20:33:33.489837: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1513] Adding visible gpu devices: 0 2019-01-07 20:33:33.489879: I tensorflow/core/common_runtime/gpu/gpu_device.cc:985] Device interconnect StreamExecutor with strength 1 edge matrix: 2019-01-07 20:33:33.489886: I tensorflow/core/common_runtime/gpu/gpu_device.cc:991] 0 2019-01-07 20:33:33.489891: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1004] 0: N 2019-01-07 20:33:33.490034: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1116] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 2435 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080, pci bus id: 0000:01:00.0, compute capability: 6.1) .2019-01-07 20:33:33.502969: W tensorflow/core/kernels/data/cache_dataset_ops.cc:810] The calling iterator did not fully read the dataset being cached. In order to avoid unexpected truncation of the dataset, the partially cached contents of the datasetwill be discarded. This can happen if you have an input pipeline similar to dataset.cache().take(k).repeat(). You should use dataset.take(k).cache().repeat() instead. 
10 10 2019-01-07 20:33:33.711064: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1513] Adding visible gpu devices: 0 2019-01-07 20:33:33.711095: I tensorflow/core/common_runtime/gpu/gpu_device.cc:985] Device interconnect StreamExecutor with strength 1 edge matrix: 2019-01-07 20:33:33.711101: I tensorflow/core/common_runtime/gpu/gpu_device.cc:991] 0 2019-01-07 20:33:33.711116: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1004] 0: N 2019-01-07 20:33:33.711269: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1116] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 2435 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080, pci bus id: 0000:01:00.0, compute capability: 6.1) 2019-01-07 20:33:33.746137: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1513] Adding visible gpu devices: 0 2019-01-07 20:33:33.746185: I tensorflow/core/common_runtime/gpu/gpu_device.cc:985] Device interconnect StreamExecutor with strength 1 edge matrix: 2019-01-07 20:33:33.746193: I tensorflow/core/common_runtime/gpu/gpu_device.cc:991] 0 2019-01-07 20:33:33.746199: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1004] 0: N 2019-01-07 20:33:33.746317: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1116] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 2435 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080, pci bus id: 0000:01:00.0, compute capability: 6.1) .10 10 {'η': 0, 'κ': 1, 'ε': 2, 'θ': 3, 'γ': 4, 'ι': 5, 'α': 6, 'δ': 7, 'β': 8, 'ζ': 9} {0: 'η', 1: 'κ', 2: 'ε', 3: 'θ', 4: 'γ', 5: 'ι', 6: 'α', 7: 'δ', 8: 'β', 9: 'ζ'} 2019-01-07 20:33:34.097048: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1513] Adding visible gpu devices: 0 2019-01-07 20:33:34.097088: I tensorflow/core/common_runtime/gpu/gpu_device.cc:985] Device interconnect StreamExecutor with strength 1 edge matrix: 2019-01-07 20:33:34.097095: I tensorflow/core/common_runtime/gpu/gpu_device.cc:991] 0 2019-01-07 20:33:34.097098: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1004] 0: N 2019-01-07 20:33:34.097259: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1116] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 2435 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080, pci bus id: 0000:01:00.0, compute capability: 6.1) 2019-01-07 20:33:34.109067: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1513] Adding visible gpu devices: 0 2019-01-07 20:33:34.109110: I tensorflow/core/common_runtime/gpu/gpu_device.cc:985] Device interconnect StreamExecutor with strength 1 edge matrix: 2019-01-07 20:33:34.109132: I tensorflow/core/common_runtime/gpu/gpu_device.cc:991] 0 2019-01-07 20:33:34.109145: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1004] 0: N 2019-01-07 20:33:34.109309: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1116] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 2435 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080, pci bus id: 0000:01:00.0, compute capability: 6.1) .2019-01-07 20:33:34.122101: W tensorflow/core/kernels/data/cache_dataset_ops.cc:810] The calling iterator did not fully read the dataset being cached. In order to avoid unexpected truncation of the dataset, the partially cached contents of the datasetwill be discarded. This can happen if you have an input pipeline similar to dataset.cache().take(k).repeat(). You should use dataset.take(k).cache().repeat() instead. 
.Setting Up CrossEntropyWithSmoothingEqualsBasicSequenceLoss Test WARNING:tensorflow:From /home/bancherd2/OpenSeq2Seq/open_seq2seq/losses/sequence_loss.py:75: to_int32 (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version. Instructions for updating: Use tf.cast instead. 2019-01-07 20:33:34.437775: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1513] Adding visible gpu devices: 0 2019-01-07 20:33:34.437822: I tensorflow/core/common_runtime/gpu/gpu_device.cc:985] Device interconnect StreamExecutor with strength 1 edge matrix: 2019-01-07 20:33:34.437837: I tensorflow/core/common_runtime/gpu/gpu_device.cc:991] 0 2019-01-07 20:33:34.437842: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1004] 0: N 2019-01-07 20:33:34.437999: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1116] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 2435 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080, pci bus id: 0000:01:00.0, compute capability: 6.1) Loss: 31.36478042602539 Loss: 30.108806610107422 Loss: 30.074817657470703 Loss: 31.95147705078125 Loss: 30.073707580566406 Loss: 30.288209915161133 Loss: 31.124608993530273 Loss: 32.88024139404297 Loss: 30.553098678588867 Loss: 29.794387817382812 Loss: 30.176124572753906 Loss: 32.60198974609375 Loss: 22.505043029785156 Loss: 28.499502182006836 Loss: 23.019561767578125 Loss: 32.64523696899414 Loss: 22.60356330871582 Loss: 30.02012825012207 Loss: 22.19864273071289 Loss: 32.74278259277344 Loss: 21.97657012939453 Loss: 30.654129028320312 Loss: 23.141904830932617 Loss: 32.88565444946289 Loss: 19.59693717956543 Loss: 29.191438674926758 Loss: 20.269641876220703 Loss: 34.73362350463867 Loss: 20.396310806274414 Loss: 29.79734992980957 Loss: 19.520790100097656 Loss: 33.88055419921875 Loss: 20.312211990356445 Loss: 30.78615379333496 Loss: 20.244449615478516 Loss: 32.56880187988281 Tear down CrossEntropyWithSmoothingEqualsBasicSequenceLoss Test .Setting Up CrossEntropyWithSmoothingEqualsBasicSequenceLoss Test Tear down CrossEntropyWithSmoothingEqualsBasicSequenceLoss Test . Building graph on GPU:0 WARNING:tensorflow:From /home/bancherd2/OpenSeq2Seq/open_seq2seq/parts/cnns/conv_blocks.py:147: conv2d (from tensorflow.python.layers.convolutional) is deprecated and will be removed in a future version. Instructions for updating: Use keras.layers.conv2d instead. WARNING:tensorflow:From /home/bancherd2/OpenSeq2Seq/open_seq2seq/parts/cnns/conv_blocks.py:165: batch_normalization (from tensorflow.python.layers.normalization) is deprecated and will be removed in a future version. Instructions for updating: Use keras.layers.batch_normalization instead. WARNING:tensorflow:From /home/bancherd2/anaconda3/lib/python3.6/site-packages/tensorflow/contrib/cudnn_rnn/python/layers/cudnn_rnn.py:340: calling GlorotUniform.init (from tensorflow.python.ops.init_ops) with dtype is deprecated and will be removed in a future version. Instructions for updating: Call initializer instance with the dtype argument instead of passing it to the constructor WARNING:tensorflow:From /home/bancherd2/anaconda3/lib/python3.6/site-packages/tensorflow/python/ops/init_ops.py:1253: calling VarianceScaling.init (from tensorflow.python.ops.init_ops) with dtype is deprecated and will be removed in a future version. 
Instructions for updating: Call initializer instance with the dtype argument instead of passing it to the constructor WARNING:tensorflow:From /home/bancherd2/anaconda3/lib/python3.6/site-packages/tensorflow/contrib/cudnn_rnn/python/layers/cudnn_rnn.py:343: calling Constant.init (from tensorflow.python.ops.init_ops) with dtype is deprecated and will be removed in a future version. Instructions for updating: Call initializer instance with the dtype argument instead of passing it to the constructor WARNING:tensorflow:From /home/bancherd2/OpenSeq2Seq/open_seq2seq/encoders/ds2_encoder.py:331: dense (from tensorflow.python.layers.core) is deprecated and will be removed in a future version. Instructions for updating: Use keras.layers.dense instead. WARNING:tensorflow:From /home/bancherd2/OpenSeq2Seq/open_seq2seq/encoders/ds2_encoder.py:333: calling dropout (from tensorflow.python.ops.nn_ops) with keep_prob is deprecated and will be removed in a future version. Instructions for updating: Please use rate instead of keep_prob. Rate should be set to rate = 1 - keep_prob. WARNING:tensorflow:From /home/bancherd2/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/learning_rate_decay_v2.py:321: div (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version. Instructions for updating: Deprecated in favor of operator or tf.math.divide. WARNING:tensorflow:From /home/bancherd2/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/slot_creator.py:189: calling Zeros.init (from tensorflow.python.ops.init_ops) with dtype is deprecated and will be removed in a future version. Instructions for updating: Call initializer instance with the dtype argument instead of passing it to the constructor WARNING:tensorflow:From /home/bancherd2/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/slot_creator.py:195: Variable.initialized_value (from tensorflow.python.ops.variables) is deprecated and will be removed in a future version. Instructions for updating: Use Variable.read_value. Variables in 2.X are initialized automatically both in eager and graph (inside tf.defun) contexts. Trainable variables: ForwardPass/ds2_encoder/conv1/kernel:0 shape: (5, 11, 1, 32), <dtype: 'float32_ref'> ForwardPass/ds2_encoder/conv1/bn/gamma:0 shape: (32,), <dtype: 'float32_ref'> ForwardPass/ds2_encoder/conv1/bn/beta:0 shape: (32,), <dtype: 'float32_ref'> ForwardPass/ds2_encoder/conv2/kernel:0 shape: (5, 11, 32, 64), <dtype: 'float32_ref'> ForwardPass/ds2_encoder/conv2/bn/gamma:0 shape: (64,), <dtype: 'float32_ref'> ForwardPass/ds2_encoder/conv2/bn/beta:0 shape: (64,), <dtype: 'float32_ref'> ForwardPass/ds2_encoder/cudnn_gru/opaque_kernel:0 shape: , <dtype: 'float32_ref'> ForwardPass/ds2_encoder/row_conv/w:0 shape: (8, 1, 256, 1), <dtype: 'float32_ref'> ForwardPass/ds2_encoder/row_conv/bn/gamma:0 shape: (256,), <dtype: 'float32_ref'> ForwardPass/ds2_encoder/row_conv/bn/beta:0 shape: (256,), <dtype: 'float32_ref'> ForwardPass/ds2_encoder/fully_connected/kernel:0 shape: (256, 128), <dtype: 'float32_ref'> ForwardPass/ds2_encoder/fully_connected/bias:0 shape: (128,), <dtype: 'float32_ref'> ForwardPass/fully_connected_ctc_decoder/fully_connected/kernel:0 shape: (128, 29), <dtype: 'float32_ref'> ForwardPass/fully_connected_ctc_decoder/fully_connected/bias:0 shape: (29,), <dtype: 'float32_ref'> Encountered unknown variable shape, can't compute total number of parameters. 
Building graph on GPU:0 2019-01-07 20:33:40.315180: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1513] Adding visible gpu devices: 0 2019-01-07 20:33:40.315218: I tensorflow/core/common_runtime/gpu/gpu_device.cc:985] Device interconnect StreamExecutor with strength 1 edge matrix: 2019-01-07 20:33:40.315225: I tensorflow/core/common_runtime/gpu/gpu_device.cc:991] 0 2019-01-07 20:33:40.315239: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1004] 0: N 2019-01-07 20:33:40.315374: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1116] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 2435 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080, pci bus id: 0000:01:00.0, compute capability: 6.1) 2019-01-07 20:33:40.407125: I tensorflow/stream_executor/platform/default/dso_loader.cc:154] successfully opened CUDA library libcudnn.so.7 locally 2019-01-07 20:33:40.917435: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1513] Adding visible gpu devices: 0 2019-01-07 20:33:40.917481: I tensorflow/core/common_runtime/gpu/gpu_device.cc:985] Device interconnect StreamExecutor with strength 1 edge matrix: 2019-01-07 20:33:40.917489: I tensorflow/core/common_runtime/gpu/gpu_device.cc:991] 0 2019-01-07 20:33:40.917494: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1004] 0: N 2019-01-07 20:33:40.917626: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1116] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 2435 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080, pci bus id: 0000:01:00.0, compute capability: 6.1) 2019-01-07 20:33:42.089752: I tensorflow/stream_executor/platform/default/dso_loader.cc:154] successfully opened CUDA library libcublas.so.9.2 locally Segmentation fault (core dumped)

Please advise, thank you very much!

borisgin commented 5 years ago

Based on the log, this looks like TensorFlow built from the latest master branch. Porting OpenSeq2Seq to that TF version is still in progress.

Can you rebuild with TF 1.12, please? Before rebuilding, I also recommend: 1) updating to the latest CUDA 10 from https://developer.nvidia.com/cuda-downloads?target_os=Linux&target_arch=x86_64&target_distro=Ubuntu&target_version=1604, and 2) updating cuDNN to 7.4 from https://developer.nvidia.com/cudnn.
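After the rebuild, a quick sanity check like the one below (plain TF 1.x API, not specific to OpenSeq2Seq) should report the expected version, CUDA support, and a visible GPU:

```python
# Sanity check for the rebuilt wheel (TF 1.x API): confirm the version,
# that it was built against CUDA, and that the GPU is visible.
import tensorflow as tf
from tensorflow.python.client import device_lib

print(tf.__version__)                # expect 1.12.x after the rebuild
print(tf.test.is_built_with_cuda())  # True if the wheel was built with CUDA
print([d.name for d in device_lib.list_local_devices()])  # should include /device:GPU:0
```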

Bancherd-DeLong commented 5 years ago

Sorry, still a no-go for both the toy speech and unittest examples. The CUDA and cuDNN files used were "cuda_10.0.130_410.48_linux.run" and "libcudnn7_7.4.1.5-1+cuda10.0_amd64.deb". TF is 1.12.0, as printed by `python -c "import tensorflow as tf; print(tf.__version__)"`.

I tried setting "batch_size_per_gpu" in ds2_toy_config.py to 1 (from 2); same results. While the output log was stalled, before the Segmentation Fault message showed up, I checked the GPU via nvidia-smi: "Volatile GPU-Util" was 0%.
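For reference, the change was along these lines (key name as printed in the training config above; only the batch size value was edited, the surrounding keys are unchanged):

```python
# Sketch of the edit in example_configs/speech2text/ds2_toy_config.py;
# the rest of the dict matches the config dump shown in the log above.
base_params = {
    # ...
    "batch_size_per_gpu": 1,  # was 2
    # ...
}
```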

borisgin commented 5 years ago

This is after you rebuilt TF with the new CUDA and the new cuDNN, right? Can you attach a complete log, please?
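If it is convenient, turning up TensorFlow's own log verbosity before the model is built may also put more detail into that log right before the crash (TF 1.x logging API):

```python
# Raise TF's log verbosity so the captured log contains more detail
# leading up to the segfault (TF 1.x logging API).
import tensorflow as tf
tf.logging.set_verbosity(tf.logging.DEBUG)
```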

Bancherd-DeLong commented 5 years ago

Yes, I rebuilt TF with the new CUDA/cuDNN (10.0 & 7.4.1.5). Here are the log files:

***** for toy speech example *****

bancherd2@bancherd2-desktop:~/OpenSeq2Seq$ CUDA_VISIBLE_DEVICES=0 python run.py --config_file=example_configs/speech2text/ds2_toy_config.py --mode=train_eval /home/bancherd2/anaconda3/lib/python3.6/site-packages/h5py/__init__.py:34: FutureWarning: Conversion of the second argument of issubdtype fromfloattonp.floatingis deprecated. In future, it will be treated asnp.float64 == np.dtype(float).type. from ._conv import register_converters as _register_converters *** Starting training from scratch *** Training config: {'batch_size_per_gpu': 2, 'data_layer': <class 'open_seq2seq.data.speech2text.speech2text.Speech2TextDataLayer'>, 'data_layer_params': {'dataset_files': ['open_seq2seq/test_utils/toy_speech_data/toy_data.csv'], 'input_type': 'spectrogram', 'num_audio_features': 160, 'shuffle': True, 'vocab_file': 'open_seq2seq/test_utils/toy_speech_data/vocab.txt'}, 'decoder': <class 'open_seq2seq.decoders.fc_decoders.FullyConnectedCTCDecoder'>, 'decoder_params': {'alpha': 1.0, 'alphabet_config_path': 'open_seq2seq/test_utils/toy_speech_data/vocab.txt', 'beam_width': 64, 'beta': 1.5, 'decoder_library_path': 'ctc_decoder_with_lm/libctc_decoder_with_kenlm.so', 'initializer': <function xavier_initializer at 0x7f16d5fb6a60>, 'lm_path': 'language_model/4-gram.binary', 'trie_path': 'language_model/trie.binary', 'use_language_model': False}, 'dtype': tf.float32, 'encoder': <class 'open_seq2seq.encoders.ds2_encoder.DeepSpeech2Encoder'>, 'encoder_params': {'activation_fn': <function <lambda> at 0x7f16e89e3400>, 'conv_layers': [{'kernel_size': [11, 41], 'num_channels': 32, 'padding': 'SAME', 'stride': [2, 2]}, {'kernel_size': [11, 21], 'num_channels': 64, 'padding': 'SAME', 'stride': [1, 2]}, {'kernel_size': [11, 21], 'num_channels': 96, 'padding': 'SAME', 'stride': [1, 2]}], 'data_format': 'channels_first', 'dropout_keep_prob': 1.0, 'initializer': <function xavier_initializer at 0x7f16d5fb6a60>, 'initializer_params': {'uniform': False}, 'n_hidden': 256, 'num_rnn_layers': 1, 'rnn_cell_dim': 256, 'rnn_type': 'gru', 'rnn_unidirectional': False, 'row_conv': False, 'row_conv_width': 8, 'use_cudnn_rnn': True}, 'eval_steps': 50, 'larc_params': {'larc_eta': 0.001}, 'load_model': '', 'logdir': 'tmp_log_folder', 'loss': <class 'open_seq2seq.losses.ctc_loss.CTCLoss'>, 'loss_params': {}, 'lr_policy': <function poly_decay at 0x7f16cea34268>, 'lr_policy_params': {'learning_rate': 0.001, 'power': 2}, 'num_epochs': 100, 'num_gpus': 2, 'optimizer': 'Momentum', 'optimizer_params': {'momentum': 0.9}, 'print_loss_steps': 10, 'print_samples_steps': 20, 'random_seed': 0, 'save_checkpoint_steps': 50, 'save_summaries_steps': 10, 'summaries': ['learning_rate', 'variables', 'gradients', 'larc_summaries', 'variable_norm', 'gradient_norm', 'global_gradient_norm'], 'use_horovod': False} *** Evaluation config: {'batch_size_per_gpu': 2, 'data_layer': <class 'open_seq2seq.data.speech2text.speech2text.Speech2TextDataLayer'>, 'data_layer_params': {'dataset_files': ['open_seq2seq/test_utils/toy_speech_data/toy_data.csv'], 'input_type': 'spectrogram', 'num_audio_features': 160, 'shuffle': False, 'vocab_file': 'open_seq2seq/test_utils/toy_speech_data/vocab.txt'}, 'decoder': <class 'open_seq2seq.decoders.fc_decoders.FullyConnectedCTCDecoder'>, 'decoder_params': {'alpha': 1.0, 'alphabet_config_path': 'open_seq2seq/test_utils/toy_speech_data/vocab.txt', 'beam_width': 64, 'beta': 1.5, 'decoder_library_path': 'ctc_decoder_with_lm/libctc_decoder_with_kenlm.so', 'initializer': <function xavier_initializer at 0x7f16d5fb6a60>, 'lm_path': 
'language_model/4-gram.binary', 'trie_path': 'language_model/trie.binary', 'use_language_model': False}, 'dtype': tf.float32, 'encoder': <class 'open_seq2seq.encoders.ds2_encoder.DeepSpeech2Encoder'>, 'encoder_params': {'activation_fn': <function <lambda> at 0x7f16e89e3400>, 'conv_layers': [{'kernel_size': [11, 41], 'num_channels': 32, 'padding': 'SAME', 'stride': [2, 2]}, {'kernel_size': [11, 21], 'num_channels': 64, 'padding': 'SAME', 'stride': [1, 2]}, {'kernel_size': [11, 21], 'num_channels': 96, 'padding': 'SAME', 'stride': [1, 2]}], 'data_format': 'channels_first', 'dropout_keep_prob': 1.0, 'initializer': <function xavier_initializer at 0x7f16d5fb6a60>, 'initializer_params': {'uniform': False}, 'n_hidden': 256, 'num_rnn_layers': 1, 'rnn_cell_dim': 256, 'rnn_type': 'gru', 'rnn_unidirectional': False, 'row_conv': False, 'row_conv_width': 8, 'use_cudnn_rnn': True}, 'eval_steps': 50, 'larc_params': {'larc_eta': 0.001}, 'load_model': '', 'logdir': 'tmp_log_folder', 'loss': <class 'open_seq2seq.losses.ctc_loss.CTCLoss'>, 'loss_params': {}, 'lr_policy': <function poly_decay at 0x7f16cea34268>, 'lr_policy_params': {'learning_rate': 0.001, 'power': 2}, 'num_epochs': 100, 'num_gpus': 2, 'optimizer': 'Momentum', 'optimizer_params': {'momentum': 0.9}, 'print_loss_steps': 10, 'print_samples_steps': 20, 'random_seed': 0, 'save_checkpoint_steps': 50, 'save_summaries_steps': 10, 'summaries': ['learning_rate', 'variables', 'gradients', 'larc_summaries', 'variable_norm', 'gradient_norm', 'global_gradient_norm'], 'use_horovod': False} *** Building graph on GPU:0 WARNING:tensorflow:From /home/bancherd2/OpenSeq2Seq/open_seq2seq/data/speech2text/speech2text.py:156: py_func (from tensorflow.python.ops.script_ops) is deprecated and will be removed in a future version. Instructions for updating: tf.py_func is deprecated in TF V2. Instead, use tf.py_function, which takes a python function which manipulates tf eager tensors instead of numpy arrays. It's easy to convert a tf eager tensor to an ndarray (just call tensor.numpy()) but having access to eager tensors meanstf.py_function`s can use accelerators such as GPUs as well as being differentiable using a gradient tape.

WARNING:tensorflow:From /home/bancherd2/OpenSeq2Seq/open_seq2seq/data/speech2text/speech2text.py:210: DatasetV1.make_initializable_iterator (from tensorflow.python.data.ops.dataset_ops) is deprecated and will be removed in a future version. Instructions for updating: Use for ... in dataset: to iterate over a dataset. If using tf.estimator, return the Dataset object directly from your input function. As a last resort, you can use tf.compat.v1.data.make_initializable_iterator(dataset). WARNING:tensorflow:From /home/bancherd2/anaconda3/lib/python3.6/site-packages/tensorflow/python/data/ops/dataset_ops.py:1458: colocate_with (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version. Instructions for updating: Colocations handled automatically by placer. WARNING:tensorflow:From /home/bancherd2/OpenSeq2Seq/open_seq2seq/parts/cnns/conv_blocks.py:147: conv2d (from tensorflow.python.layers.convolutional) is deprecated and will be removed in a future version. Instructions for updating: Use keras.layers.conv2d instead. WARNING:tensorflow:From /home/bancherd2/OpenSeq2Seq/open_seq2seq/parts/cnns/conv_blocks.py:165: batch_normalization (from tensorflow.python.layers.normalization) is deprecated and will be removed in a future version. Instructions for updating: Use keras.layers.batch_normalization instead. WARNING:tensorflow:From /home/bancherd2/anaconda3/lib/python3.6/site-packages/tensorflow/contrib/cudnn_rnn/python/layers/cudnn_rnn.py:340: calling GlorotUniform.init (from tensorflow.python.ops.init_ops) with dtype is deprecated and will be removed in a future version. Instructions for updating: Call initializer instance with the dtype argument instead of passing it to the constructor WARNING:tensorflow:From /home/bancherd2/anaconda3/lib/python3.6/site-packages/tensorflow/python/ops/init_ops.py:1253: calling VarianceScaling.init (from tensorflow.python.ops.init_ops) with dtype is deprecated and will be removed in a future version. Instructions for updating: Call initializer instance with the dtype argument instead of passing it to the constructor WARNING:tensorflow:From /home/bancherd2/anaconda3/lib/python3.6/site-packages/tensorflow/contrib/cudnn_rnn/python/layers/cudnn_rnn.py:343: calling Constant.init (from tensorflow.python.ops.init_ops) with dtype is deprecated and will be removed in a future version. Instructions for updating: Call initializer instance with the dtype argument instead of passing it to the constructor WARNING:tensorflow:From /home/bancherd2/OpenSeq2Seq/open_seq2seq/encoders/ds2_encoder.py:331: dense (from tensorflow.python.layers.core) is deprecated and will be removed in a future version. Instructions for updating: Use keras.layers.dense instead. WARNING:tensorflow:From /home/bancherd2/OpenSeq2Seq/open_seq2seq/encoders/ds2_encoder.py:333: calling dropout (from tensorflow.python.ops.nn_ops) with keep_prob is deprecated and will be removed in a future version. Instructions for updating: Please use rate instead of keep_prob. Rate should be set to rate = 1 - keep_prob. Building graph on GPU:1 WARNING:tensorflow:From /home/bancherd2/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/learning_rate_decay_v2.py:321: div (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version. Instructions for updating: Deprecated in favor of operator or tf.math.divide. 
WARNING:tensorflow:From /home/bancherd2/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/slot_creator.py:189: calling Zeros.init (from tensorflow.python.ops.init_ops) with dtype is deprecated and will be removed in a future version. Instructions for updating: Call initializer instance with the dtype argument instead of passing it to the constructor WARNING:tensorflow:From /home/bancherd2/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/slot_creator.py:195: Variable.initialized_value (from tensorflow.python.ops.variables) is deprecated and will be removed in a future version. Instructions for updating: Use Variable.read_value. Variables in 2.X are initialized automatically both in eager and graph (inside tf.defun) contexts. Trainable variables: ForwardPass/ds2_encoder/conv1/kernel:0 shape: (11, 41, 1, 32), <dtype: 'float32_ref'> ForwardPass/ds2_encoder/conv1/bn/gamma:0 shape: (32,), <dtype: 'float32_ref'> ForwardPass/ds2_encoder/conv1/bn/beta:0 shape: (32,), <dtype: 'float32_ref'> ForwardPass/ds2_encoder/conv2/kernel:0 shape: (11, 21, 32, 64), <dtype: 'float32_ref'> ForwardPass/ds2_encoder/conv2/bn/gamma:0 shape: (64,), <dtype: 'float32_ref'> ForwardPass/ds2_encoder/conv2/bn/beta:0 shape: (64,), <dtype: 'float32_ref'> ForwardPass/ds2_encoder/conv3/kernel:0 shape: (11, 21, 64, 96), <dtype: 'float32_ref'> ForwardPass/ds2_encoder/conv3/bn/gamma:0 shape: (96,), <dtype: 'float32_ref'> ForwardPass/ds2_encoder/conv3/bn/beta:0 shape: (96,), <dtype: 'float32_ref'> ForwardPass/ds2_encoder/cudnn_gru/opaque_kernel:0 shape: , <dtype: 'float32_ref'> ForwardPass/ds2_encoder/fully_connected/kernel:0 shape: (512, 256), <dtype: 'float32_ref'> ForwardPass/ds2_encoder/fully_connected/bias:0 shape: (256,), <dtype: 'float32_ref'> ForwardPass/fully_connected_ctc_decoder/fully_connected/kernel:0 shape: (256, 29), <dtype: 'float32_ref'> ForwardPass/fully_connected_ctc_decoder/fully_connected/bias:0 shape: (29,), <dtype: 'float32_ref'> Encountered unknown variable shape, can't compute total number of parameters. Building graph on GPU:0 *** Building graph on GPU:1 2019-01-09 11:12:42.456437: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 3598890000 Hz 2019-01-09 11:12:42.493817: I tensorflow/compiler/xla/service/service.cc:162] XLA service 0x55eeb5fdf180 executing computations on platform Host. Devices: 2019-01-09 11:12:42.493864: I tensorflow/compiler/xla/service/service.cc:169] StreamExecutor device (0): , 2019-01-09 11:12:43.015080: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:998] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero 2019-01-09 11:12:43.025180: I tensorflow/compiler/xla/service/service.cc:162] XLA service 0x55eeb60b60f0 executing computations on platform CUDA. 
Devices: 2019-01-09 11:12:43.025235: I tensorflow/compiler/xla/service/service.cc:169] StreamExecutor device (0): GeForce GTX 1080, Compute Capability 6.1 2019-01-09 11:12:43.025529: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1434] Found device 0 with properties: name: GeForce GTX 1080 major: 6 minor: 1 memoryClockRate(GHz): 1.8225 pciBusID: 0000:01:00.0 totalMemory: 7.93GiB freeMemory: 7.61GiB 2019-01-09 11:12:43.025556: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1513] Adding visible gpu devices: 0 2019-01-09 11:12:47.531656: I tensorflow/core/common_runtime/gpu/gpu_device.cc:985] Device interconnect StreamExecutor with strength 1 edge matrix: 2019-01-09 11:12:47.531697: I tensorflow/core/common_runtime/gpu/gpu_device.cc:991] 0 2019-01-09 11:12:47.531707: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1004] 0: N 2019-01-09 11:12:47.532025: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1116] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 7332 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080, pci bus id: 0000:01:00.0, compute capability: 6.1) 2019-01-09 11:12:49.303027: I tensorflow/stream_executor/platform/default/dso_loader.cc:154] successfully opened CUDA library libcudnn.so.7 locally 2019-01-09 11:12:57.985011: I tensorflow/stream_executor/platform/default/dso_loader.cc:154] successfully opened CUDA library libcublas.so.10.0 locally Segmentation fault (core dumped)

* for unittests ***** bancherd2@bancherd2-desktop:~/OpenSeq2Seq$ CUDA_VISIBLE_DEVICES=0 python -m unittest discover -s open_seq2seq -p '*_test.py' /home/bancherd2/anaconda3/lib/python3.6/site-packages/h5py/__init__.py:34: FutureWarning: Conversion of the second argument of issubdtype fromfloattonp.floatingis deprecated. In future, it will be treated asnp.float64 == np.dtype(float).type. from ._conv import register_converters as _register_converters /home/bancherd2/anaconda3/lib/python3.6/site-packages/scipy/io/wavfile.py:129: DeprecationWarning: The binary mode of fromstring is deprecated, as it behaves surprisingly on unicode inputs. Use frombuffer instead data = numpy.fromstring(fid.read(size), dtype=dtype) ./home/bancherd2/anaconda3/lib/python3.6/site-packages/scipy/io/wavfile.py:129: DeprecationWarning: The binary mode of fromstring is deprecated, as it behaves surprisingly on unicode inputs. Use frombuffer instead data = numpy.fromstring(fid.read(size), dtype=dtype) ...sWARNING:tensorflow:From /home/bancherd2/OpenSeq2Seq/open_seq2seq/data/text2text/text2text.py:199: py_func (from tensorflow.python.ops.script_ops) is deprecated and will be removed in a future version. Instructions for updating: tf.py_func is deprecated in TF V2. Instead, use tf.py_function, which takes a python function which manipulates tf eager tensors instead of numpy arrays. It's easy to convert a tf eager tensor to an ndarray (just call tensor.numpy()) but having access to eager tensors meanstf.py_function`s can use accelerators such as GPUs as well as being differentiable using a gradient tape.

WARNING:tensorflow:From /home/bancherd2/OpenSeq2Seq/open_seq2seq/data/text2text/text2text.py:237: DatasetV1.make_initializable_iterator (from tensorflow.python.data.ops.dataset_ops) is deprecated and will be removed in a future version. Instructions for updating: Use for ... in dataset: to iterate over a dataset. If using tf.estimator, return the Dataset object directly from your input function. As a last resort, you can use tf.compat.v1.data.make_initializable_iterator(dataset). WARNING:tensorflow:From /home/bancherd2/anaconda3/lib/python3.6/site-packages/tensorflow/python/data/ops/dataset_ops.py:1458: colocate_with (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version. Instructions for updating: Colocations handled automatically by placer. 10 10 WARNING:tensorflow:From /home/bancherd2/anaconda3/lib/python3.6/contextlib.py:60: TensorFlowTestCase.test_session (from tensorflow.python.framework.test_util) is deprecated and will be removed in a future version. Instructions for updating: Use self.session() or self.cached_session() instead. 2019-01-09 11:16:23.488354: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 3598890000 Hz 2019-01-09 11:16:23.488600: I tensorflow/compiler/xla/service/service.cc:162] XLA service 0x55987ab5cf20 executing computations on platform Host. Devices: 2019-01-09 11:16:23.488617: I tensorflow/compiler/xla/service/service.cc:169] StreamExecutor device (0): , 2019-01-09 11:16:23.593627: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:998] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero 2019-01-09 11:16:23.594095: I tensorflow/compiler/xla/service/service.cc:162] XLA service 0x55987b01a170 executing computations on platform CUDA. Devices: 2019-01-09 11:16:23.594114: I tensorflow/compiler/xla/service/service.cc:169] StreamExecutor device (0): GeForce GTX 1080, Compute Capability 6.1 2019-01-09 11:16:23.594336: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1434] Found device 0 with properties: name: GeForce GTX 1080 major: 6 minor: 1 memoryClockRate(GHz): 1.8225 pciBusID: 0000:01:00.0 totalMemory: 7.93GiB freeMemory: 7.61GiB 2019-01-09 11:16:23.594352: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1513] Adding visible gpu devices: 0 2019-01-09 11:16:23.802252: I tensorflow/core/common_runtime/gpu/gpu_device.cc:985] Device interconnect StreamExecutor with strength 1 edge matrix: 2019-01-09 11:16:23.802280: I tensorflow/core/common_runtime/gpu/gpu_device.cc:991] 0 2019-01-09 11:16:23.802287: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1004] 0: N 2019-01-09 11:16:23.802475: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1116] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 2435 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080, pci bus id: 0000:01:00.0, compute capability: 6.1) .2019-01-09 11:16:23.931316: W tensorflow/core/kernels/data/cache_dataset_ops.cc:812] The calling iterator did not fully read the dataset being cached. In order to avoid unexpected truncation of the dataset, the partially cached contents of the datasetwill be discarded. This can happen if you have an input pipeline similar to dataset.cache().take(k).repeat(). You should use dataset.take(k).cache().repeat() instead. 
10 10
2019-01-09 11:16:24.145555: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1513] Adding visible gpu devices: 0
2019-01-09 11:16:24.145598: I tensorflow/core/common_runtime/gpu/gpu_device.cc:985] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-01-09 11:16:24.145608: I tensorflow/core/common_runtime/gpu/gpu_device.cc:991] 0
2019-01-09 11:16:24.145617: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1004] 0: N
2019-01-09 11:16:24.145771: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1116] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 2435 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080, pci bus id: 0000:01:00.0, compute capability: 6.1)
.10 10
{'η': 0, 'κ': 1, 'ε': 2, 'θ': 3, 'γ': 4, 'ι': 5, 'α': 6, 'δ': 7, 'β': 8, 'ζ': 9}
{0: 'η', 1: 'κ', 2: 'ε', 3: 'θ', 4: 'γ', 5: 'ι', 6: 'α', 7: 'δ', 8: 'β', 9: 'ζ'}
2019-01-09 11:16:24.525184: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1513] Adding visible gpu devices: 0
2019-01-09 11:16:24.525234: I tensorflow/core/common_runtime/gpu/gpu_device.cc:985] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-01-09 11:16:24.525241: I tensorflow/core/common_runtime/gpu/gpu_device.cc:991] 0
2019-01-09 11:16:24.525246: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1004] 0: N
2019-01-09 11:16:24.525385: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1116] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 2435 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080, pci bus id: 0000:01:00.0, compute capability: 6.1)
.2019-01-09 11:16:24.549992: W tensorflow/core/kernels/data/cache_dataset_ops.cc:812] The calling iterator did not fully read the dataset being cached. In order to avoid unexpected truncation of the dataset, the partially cached contents of the dataset will be discarded. This can happen if you have an input pipeline similar to dataset.cache().take(k).repeat(). You should use dataset.take(k).cache().repeat() instead.
.Setting Up CrossEntropyWithSmoothingEqualsBasicSequenceLoss Test
WARNING:tensorflow:From /home/bancherd2/OpenSeq2Seq/open_seq2seq/losses/sequence_loss.py:75: to_int32 (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.cast instead.
2019-01-09 11:16:24.912225: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1513] Adding visible gpu devices: 0
2019-01-09 11:16:24.912278: I tensorflow/core/common_runtime/gpu/gpu_device.cc:985] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-01-09 11:16:24.912290: I tensorflow/core/common_runtime/gpu/gpu_device.cc:991] 0
2019-01-09 11:16:24.912298: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1004] 0: N
2019-01-09 11:16:24.912432: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1116] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 2435 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080, pci bus id: 0000:01:00.0, compute capability: 6.1)
Loss: 31.36478042602539
Loss: 30.108806610107422
Loss: 30.074817657470703
Loss: 31.95147705078125
Loss: 30.073707580566406
Loss: 30.288209915161133
Loss: 31.124608993530273
Loss: 32.88024139404297
Loss: 30.553098678588867
Loss: 29.794387817382812
Loss: 30.176124572753906
Loss: 32.60198974609375
Loss: 22.505043029785156
Loss: 28.499502182006836
Loss: 23.019561767578125
Loss: 32.64523696899414
Loss: 22.60356330871582
Loss: 30.02012825012207
Loss: 22.19864273071289
Loss: 32.74278259277344
Loss: 21.97657012939453
Loss: 30.654129028320312
Loss: 23.141904830932617
Loss: 32.88565444946289
Loss: 19.59693717956543
Loss: 29.191438674926758
Loss: 20.269641876220703
Loss: 34.73362350463867
Loss: 20.396310806274414
Loss: 29.79734992980957
Loss: 19.520790100097656
Loss: 33.88055419921875
Loss: 20.312211990356445
Loss: 30.78615379333496
Loss: 20.244449615478516
Loss: 32.56880187988281
Tear down CrossEntropyWithSmoothingEqualsBasicSequenceLoss Test
.Setting Up CrossEntropyWithSmoothingEqualsBasicSequenceLoss Test
Tear down CrossEntropyWithSmoothingEqualsBasicSequenceLoss Test
.
Building graph on GPU:0
WARNING:tensorflow:From /home/bancherd2/OpenSeq2Seq/open_seq2seq/parts/cnns/conv_blocks.py:147: conv2d (from tensorflow.python.layers.convolutional) is deprecated and will be removed in a future version.
Instructions for updating:
Use keras.layers.conv2d instead.
WARNING:tensorflow:From /home/bancherd2/OpenSeq2Seq/open_seq2seq/parts/cnns/conv_blocks.py:165: batch_normalization (from tensorflow.python.layers.normalization) is deprecated and will be removed in a future version.
Instructions for updating:
Use keras.layers.batch_normalization instead.
WARNING:tensorflow:From /home/bancherd2/anaconda3/lib/python3.6/site-packages/tensorflow/contrib/cudnn_rnn/python/layers/cudnn_rnn.py:340: calling GlorotUniform.__init__ (from tensorflow.python.ops.init_ops) with dtype is deprecated and will be removed in a future version.
Instructions for updating:
Call initializer instance with the dtype argument instead of passing it to the constructor
WARNING:tensorflow:From /home/bancherd2/anaconda3/lib/python3.6/site-packages/tensorflow/python/ops/init_ops.py:1253: calling VarianceScaling.__init__ (from tensorflow.python.ops.init_ops) with dtype is deprecated and will be removed in a future version.
Instructions for updating:
Call initializer instance with the dtype argument instead of passing it to the constructor
WARNING:tensorflow:From /home/bancherd2/anaconda3/lib/python3.6/site-packages/tensorflow/contrib/cudnn_rnn/python/layers/cudnn_rnn.py:343: calling Constant.__init__ (from tensorflow.python.ops.init_ops) with dtype is deprecated and will be removed in a future version.
Instructions for updating:
Call initializer instance with the dtype argument instead of passing it to the constructor
WARNING:tensorflow:From /home/bancherd2/OpenSeq2Seq/open_seq2seq/encoders/ds2_encoder.py:331: dense (from tensorflow.python.layers.core) is deprecated and will be removed in a future version.
Instructions for updating:
Use keras.layers.dense instead.
WARNING:tensorflow:From /home/bancherd2/OpenSeq2Seq/open_seq2seq/encoders/ds2_encoder.py:333: calling dropout (from tensorflow.python.ops.nn_ops) with keep_prob is deprecated and will be removed in a future version.
Instructions for updating:
Please use rate instead of keep_prob. Rate should be set to rate = 1 - keep_prob.
WARNING:tensorflow:From /home/bancherd2/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/learning_rate_decay_v2.py:321: div (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Deprecated in favor of operator or tf.math.divide.
WARNING:tensorflow:From /home/bancherd2/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/slot_creator.py:189: calling Zeros.__init__ (from tensorflow.python.ops.init_ops) with dtype is deprecated and will be removed in a future version.
Instructions for updating:
Call initializer instance with the dtype argument instead of passing it to the constructor
WARNING:tensorflow:From /home/bancherd2/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/slot_creator.py:195: Variable.initialized_value (from tensorflow.python.ops.variables) is deprecated and will be removed in a future version.
Instructions for updating:
Use Variable.read_value. Variables in 2.X are initialized automatically both in eager and graph (inside tf.defun) contexts.
Trainable variables:
ForwardPass/ds2_encoder/conv1/kernel:0 shape: (5, 11, 1, 32), <dtype: 'float32_ref'>
ForwardPass/ds2_encoder/conv1/bn/gamma:0 shape: (32,), <dtype: 'float32_ref'>
ForwardPass/ds2_encoder/conv1/bn/beta:0 shape: (32,), <dtype: 'float32_ref'>
ForwardPass/ds2_encoder/conv2/kernel:0 shape: (5, 11, 32, 64), <dtype: 'float32_ref'>
ForwardPass/ds2_encoder/conv2/bn/gamma:0 shape: (64,), <dtype: 'float32_ref'>
ForwardPass/ds2_encoder/conv2/bn/beta:0 shape: (64,), <dtype: 'float32_ref'>
ForwardPass/ds2_encoder/cudnn_gru/opaque_kernel:0 shape: <unknown>, <dtype: 'float32_ref'>
ForwardPass/ds2_encoder/row_conv/w:0 shape: (8, 1, 256, 1), <dtype: 'float32_ref'>
ForwardPass/ds2_encoder/row_conv/bn/gamma:0 shape: (256,), <dtype: 'float32_ref'>
ForwardPass/ds2_encoder/row_conv/bn/beta:0 shape: (256,), <dtype: 'float32_ref'>
ForwardPass/ds2_encoder/fully_connected/kernel:0 shape: (256, 128), <dtype: 'float32_ref'>
ForwardPass/ds2_encoder/fully_connected/bias:0 shape: (128,), <dtype: 'float32_ref'>
ForwardPass/fully_connected_ctc_decoder/fully_connected/kernel:0 shape: (128, 29), <dtype: 'float32_ref'>
ForwardPass/fully_connected_ctc_decoder/fully_connected/bias:0 shape: (29,), <dtype: 'float32_ref'>
Encountered unknown variable shape, can't compute total number of parameters.
Building graph on GPU:0
2019-01-09 11:16:31.311594: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1513] Adding visible gpu devices: 0
2019-01-09 11:16:31.311642: I tensorflow/core/common_runtime/gpu/gpu_device.cc:985] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-01-09 11:16:31.311650: I tensorflow/core/common_runtime/gpu/gpu_device.cc:991] 0
2019-01-09 11:16:31.311657: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1004] 0: N
2019-01-09 11:16:31.311789: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1116] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 2435 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080, pci bus id: 0000:01:00.0, compute capability: 6.1)
2019-01-09 11:16:31.407940: I tensorflow/stream_executor/platform/default/dso_loader.cc:154] successfully opened CUDA library libcudnn.so.7 locally
2019-01-09 11:16:33.212008: I tensorflow/stream_executor/platform/default/dso_loader.cc:154] successfully opened CUDA library libcublas.so.10.0 locally
Segmentation fault (core dumped)


Thank you.

Bancherd-DeLong commented 5 years ago

These are 5 of the 6 files from "tmp_log_folder" after running the toy speech example (the last file, "model.ckpt-0.data-00000-of-00001", is 56M and too big to upload): tmp_log_folder.tar.gz tmp_log_folder2.tar.gz

borisgin commented 5 years ago

I can't reproduce the issue on my machine. A simple workaround may be to use a pre-built Docker image, as described here: https://docs.nvidia.com/deeplearning/dgx/tensorflow-user-guide/index.html

  1. install nvidia-docker: https://github.com/NVIDIA/nvidia-docker
  2. pull the latest TensorFlow image from NGC (https://ngc.nvidia.com/signin/email) and clone the latest OpenSeq2Seq (see the command sketch just below)
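
For example, a minimal sketch of those two steps, assuming the 18.12-py3 tag and that your machine is already logged in to nvcr.io:

```bash
# pull the TensorFlow container from NGC (nvcr.io) and grab the latest OpenSeq2Seq
docker pull nvcr.io/nvidia/tensorflow:18.12-py3
git clone https://github.com/NVIDIA/OpenSeq2Seq
```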
chuckcho commented 5 years ago

I have exactly the same issue. System: ubuntu 16.04, RAM=24GBytes, dual GeForce GTX TITAN gpus (6GBytes), python=3.5.2, cuda=9.0, cudnn=7.4.2, bazel=0.21, tensorflow=1.12

borisgin commented 5 years ago

We reproduced the issue with the latest TF master. Started debugging...

borisgin commented 5 years ago

Until we find a fix, I would suggest using the Nvidia TF container. Here are detailed instructions:

  1. install docker (see https://docs.docker.com/install/linux/docker-ce/ubuntu/#prerequisites); use a version compatible with nvidia-docker, e.g. $ sudo apt-get install docker-ce=5:18.09.0~3-0~ubuntu-xenial

  2. verify the installation: $ sudo docker container run hello-world

  3. add yourself to the docker group: $ sudo usermod -a -G docker $USER (log out and back in after that)

  4. install nvidia-docker2 (see https://github.com/nvidia/nvidia-docker/wiki/Installation-(version-2.0)): $ sudo apt-get install nvidia-docker2, then $ sudo pkill -SIGHUP dockerd

  5. pull the latest Nvidia TensorFlow container from the Nvidia GPU Cloud (see https://docs.nvidia.com/deeplearning/dgx/tensorflow-user-guide/index.html): $ docker pull nvcr.io/nvidia/tensorflow:18.12-py3

  6. run the container: $ nvidia-docker run -it --rm nvcr.io/nvidia/tensorflow:18.12-py3

  7. pull OpenSeq2Seq when inside the container: git clone https://github.com/NVIDIA/OpenSeq2Seq (an end-to-end sketch follows below)
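
Putting steps 6 and 7 together, a rough end-to-end sketch; the -v mount, the pip install step, and the exact toy config path are illustrative assumptions, so adjust them to your setup:

```bash
# start the NGC TensorFlow container with GPU access; -v mounts a host work
# directory so anything you clone or train survives the --rm cleanup
nvidia-docker run -it --rm -v $HOME/work:/workspace/host nvcr.io/nvidia/tensorflow:18.12-py3

# inside the container:
cd /workspace/host
git clone https://github.com/NVIDIA/OpenSeq2Seq
cd OpenSeq2Seq
pip install -r requirements.txt   # assumed dependency install step
python run.py --config_file=example_configs/speech2text/ds2_toy_config.py --mode=train_eval
```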

Bancherd-DeLong commented 5 years ago

OK, playing around with it. Thank you.

borisgin commented 5 years ago

The issue is fixed; the fix will be released with the latest Nvidia TensorFlow container, 19.01.
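
Once that container ships, picking it up should just be a matter of pulling the newer tag; the exact tag below is an assumption based on the usual YY.MM-py3 naming:

```bash
# pull and run the 19.01 NGC TensorFlow container (tag name assumed)
docker pull nvcr.io/nvidia/tensorflow:19.01-py3
nvidia-docker run -it --rm nvcr.io/nvidia/tensorflow:19.01-py3
```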