HawkAaron / warp-transducer

A fast parallel implementation of RNN Transducer.
Apache License 2.0

Getting segmentation fault on CentOS #88

Open stefan-falk opened 3 years ago

stefan-falk commented 3 years ago

I am using warp-transducer successfully on other machines (Ubuntu 18.04), but on one machine running CentOS I get a segmentation fault right at the beginning of training.

I am not sure what is causing this. The only difference I can point to is that the CentOS machine uses gcc/g++ 4.8.5 (I also tried 5.3.1) instead of the 5.4.x on my other machines. Could the compiler version be the cause?
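
One way to narrow this down is to compare the C++ ABI setting that the installed TensorFlow reports with the one the bindings were compiled against, since the setup.py warning in the build output mentions `TF_CXX11_ABI`. The following is only a minimal sketch: the helper name `cxx11_abi_from_flags` is made up for illustration, and a mismatch would be a hint, not proof, that the ABI is related to the crash.

```python
import re


def cxx11_abi_from_flags(flags):
    """Return 0 or 1 if a -D_GLIBCXX_USE_CXX11_ABI=<n> flag is present, else None.

    Hypothetical helper for illustration; it is not part of warp-transducer.
    """
    for flag in flags:
        match = re.fullmatch(r"-D_GLIBCXX_USE_CXX11_ABI=([01])", flag)
        if match:
            return int(match.group(1))
    return None


if __name__ == "__main__":
    try:
        import tensorflow as tf
    except ImportError:
        print("tensorflow is not installed in this environment")
    else:
        # tf.sysconfig.get_compile_flags() lists the flags TensorFlow
        # expects custom ops to be compiled with.
        flags = tf.sysconfig.get_compile_flags()
        print("TF compile flags:", flags)
        print("TF CXX11 ABI:", cxx11_abi_from_flags(flags))
```

Running this on both the Ubuntu and the CentOS machine and comparing the printed values would show whether the two environments differ in this respect.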

Compilation Output

$ CUDA_HOME=/usr/local/cuda ./scripts/build_rnnt.sh
Removing existing build/ directory ..
#################################################################
Running cmake for warp-transducer ..
-- The C compiler identification is GNU 4.8.5
-- The CXX compiler identification is GNU 4.8.5
-- Check for working C compiler: /usr/lib64/ccache/cc
-- Check for working C compiler: /usr/lib64/ccache/cc -- works
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working CXX compiler: /usr/lib64/ccache/c++
-- Check for working CXX compiler: /usr/lib64/ccache/c++ -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Found CUDA: /usr/local/cuda (found version "11.0") 
-- cuda found TRUE
-- Building shared library with GPU support
-- Configuring done
-- Generating done
CMake Warning:
  Manually-specified variables were not used by the project:

    CMAKE_CXX_COMPILER_LAUNCHER
    CMAKE_C_COMPILER_LAUNCHER

-- Build files have been written to: /home/sfalk/workspaces/git/speech-v2/warp-transducer/build
#################################################################
Running make ..
[ 11%] Building NVCC (Device) object CMakeFiles/warprnnt.dir/src/./warprnnt_generated_rnnt_entrypoint.cu.o
nvcc warning : The 'compute_35', 'compute_37', 'compute_50', 'sm_35', 'sm_37' and 'sm_50' architectures are deprecated, and may be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning).
nvcc warning : The 'compute_35', 'compute_37', 'compute_50', 'sm_35', 'sm_37' and 'sm_50' architectures are deprecated, and may be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning).
Scanning dependencies of target warprnnt
Linking CXX shared library libwarprnnt.so
[ 11%] Built target warprnnt
Scanning dependencies of target test_cpu
[ 22%] Building CXX object CMakeFiles/test_cpu.dir/tests/test_cpu.cpp.o
[ 33%] Building CXX object CMakeFiles/test_cpu.dir/tests/random.cpp.o
Linking CXX executable test_cpu
[ 33%] Built target test_cpu
[ 44%] Building NVCC (Device) object CMakeFiles/test_gpu.dir/tests/./test_gpu_generated_test_gpu.cu.o
nvcc warning : The 'compute_35', 'compute_37', 'compute_50', 'sm_35', 'sm_37' and 'sm_50' architectures are deprecated, and may be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning).
nvcc warning : The 'compute_35', 'compute_37', 'compute_50', 'sm_35', 'sm_37' and 'sm_50' architectures are deprecated, and may be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning).
Scanning dependencies of target test_gpu
[ 55%] Building CXX object CMakeFiles/test_gpu.dir/tests/random.cpp.o
Linking CXX executable test_gpu
[ 55%] Built target test_gpu
Scanning dependencies of target test_time
[ 66%] Building CXX object CMakeFiles/test_time.dir/tests/test_time.cpp.o
[ 77%] Building CXX object CMakeFiles/test_time.dir/tests/random.cpp.o
Linking CXX executable test_time
[ 77%] Built target test_time
[ 88%] Building NVCC (Device) object CMakeFiles/test_time_gpu.dir/tests/./test_time_gpu_generated_test_time.cu.o
nvcc warning : The 'compute_35', 'compute_37', 'compute_50', 'sm_35', 'sm_37' and 'sm_50' architectures are deprecated, and may be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning).
nvcc warning : The 'compute_35', 'compute_37', 'compute_50', 'sm_35', 'sm_37' and 'sm_50' architectures are deprecated, and may be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning).
Scanning dependencies of target test_time_gpu
[100%] Building CXX object CMakeFiles/test_time_gpu.dir/tests/random.cpp.o
Linking CXX executable test_time_gpu
[100%] Built target test_time_gpu
#################################################################
Running setup.py for tensorflow bindings ..
2021-03-11 08:32:27.494442: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
setup.py:63: UserWarning: Assuming tensorflow was compiled without C++11 ABI. It is generally true if you are using binary pip package. If you compiled tensorflow from source with gcc >= 5 and didn't set -D_GLIBCXX_USE_CXX11_ABI=0 during compilation, you need to set environment variable TF_CXX11_ABI=1 when compiling this bindings. Also be sure to touch some files in src to trigger recompilation. Also, you need to set (or unsed) this environment variable if getting undefined symbol: _ZN10tensorflow... errors
  warnings.warn("Assuming tensorflow was compiled without C++11 ABI. "
running install
running bdist_egg
running egg_info
writing warprnnt_tensorflow.egg-info/PKG-INFO
writing dependency_links to warprnnt_tensorflow.egg-info/dependency_links.txt
writing top-level names to warprnnt_tensorflow.egg-info/top_level.txt
reading manifest file 'warprnnt_tensorflow.egg-info/SOURCES.txt'
writing manifest file 'warprnnt_tensorflow.egg-info/SOURCES.txt'
installing library code to build/bdist.linux-x86_64/egg
running install_lib
running build_py
running build_ext
creating build/bdist.linux-x86_64/egg
creating build/bdist.linux-x86_64/egg/warprnnt_tensorflow
copying build/lib.linux-x86_64-3.8/warprnnt_tensorflow/__init__.py -> build/bdist.linux-x86_64/egg/warprnnt_tensorflow
copying build/lib.linux-x86_64-3.8/warprnnt_tensorflow/kernels.cpython-38-x86_64-linux-gnu.so -> build/bdist.linux-x86_64/egg/warprnnt_tensorflow
byte-compiling build/bdist.linux-x86_64/egg/warprnnt_tensorflow/__init__.py to __init__.cpython-38.pyc
creating stub loader for warprnnt_tensorflow/kernels.cpython-38-x86_64-linux-gnu.so
byte-compiling build/bdist.linux-x86_64/egg/warprnnt_tensorflow/kernels.py to kernels.cpython-38.pyc
creating build/bdist.linux-x86_64/egg/EGG-INFO
copying warprnnt_tensorflow.egg-info/PKG-INFO -> build/bdist.linux-x86_64/egg/EGG-INFO
copying warprnnt_tensorflow.egg-info/SOURCES.txt -> build/bdist.linux-x86_64/egg/EGG-INFO
copying warprnnt_tensorflow.egg-info/dependency_links.txt -> build/bdist.linux-x86_64/egg/EGG-INFO
copying warprnnt_tensorflow.egg-info/top_level.txt -> build/bdist.linux-x86_64/egg/EGG-INFO
writing build/bdist.linux-x86_64/egg/EGG-INFO/native_libs.txt
zip_safe flag not set; analyzing archive contents...
warprnnt_tensorflow.__pycache__.__init__.cpython-38: module references __path__
warprnnt_tensorflow.__pycache__.kernels.cpython-38: module references __file__
creating 'dist/warprnnt_tensorflow-0.1-py3.8-linux-x86_64.egg' and adding 'build/bdist.linux-x86_64/egg' to it
removing 'build/bdist.linux-x86_64/egg' (and everything under it)
Processing warprnnt_tensorflow-0.1-py3.8-linux-x86_64.egg
creating /home/sfalk/miniconda3/envs/asr2/lib/python3.8/site-packages/warprnnt_tensorflow-0.1-py3.8-linux-x86_64.egg
Extracting warprnnt_tensorflow-0.1-py3.8-linux-x86_64.egg to /home/sfalk/miniconda3/envs/asr2/lib/python3.8/site-packages
Adding warprnnt-tensorflow 0.1 to easy-install.pth file

Installed /home/sfalk/miniconda3/envs/asr2/lib/python3.8/site-packages/warprnnt_tensorflow-0.1-py3.8-linux-x86_64.egg
Processing dependencies for warprnnt-tensorflow==0.1
Finished processing dependencies for warprnnt-tensorflow==0.1
(asr2) [sfalk@everestspeech-v2]$ python -c "from warprnnt_tensorflow import rnnt_loss"
2021-03-11 08:32:42.757357: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
2021-03-11 08:32:44.642952: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0

Segmentation Fault

Epoch 1/5000
Fatal Python error: Segmentation fault

Current thread 0x00007f8ea1ffa700 (most recent call first):
  File "/home/sfalk/miniconda3/envs/asr2/lib/python3.8/site-packages/tensorflow/python/framework/ops.py", line 1853 in _create_c_op
  File "/home/sfalk/miniconda3/envs/asr2/lib/python3.8/site-packages/tensorflow/python/framework/ops.py", line 2015 in __init__
  File "/home/sfalk/miniconda3/envs/asr2/lib/python3.8/site-packages/tensorflow/python/framework/ops.py", line 3528 in _create_op_internal
  File "/home/sfalk/miniconda3/envs/asr2/lib/python3.8/site-packages/tensorflow/python/framework/func_graph.py", line 590 in _create_op_internal
  File "/home/sfalk/miniconda3/envs/asr2/lib/python3.8/site-packages/tensorflow/python/framework/op_def_library.py", line 748 in _apply_op_helper
  File "<string>", line 80 in warp_rnnt
  File "/home/sfalk/miniconda3/envs/asr2/lib/python3.8/site-packages/warprnnt_tensorflow-0.1-py3.8-linux-x86_64.egg/warprnnt_tensorflow/__init__.py", line 32 in rnnt_loss
  File "/home/sfalk/workspaces/git/speech-v2/asr/model/transducer/__init__.py", line 252 in rnnt_loss_wrapper
  File "/home/sfalk/workspaces/git/speech-v2/asr/model/transducer/__init__.py", line 209 in rnnt_gradient
  File "/home/sfalk/workspaces/git/speech-v2/asr/model/transducer/__init__.py", line 163 in train_step
  File "/home/sfalk/miniconda3/envs/asr2/lib/python3.8/site-packages/tensorflow/python/keras/engine/training.py", line 788 in run_step
  File "/home/sfalk/miniconda3/envs/asr2/lib/python3.8/site-packages/tensorflow/python/autograph/impl/api.py", line 478 in _call_unconverted
  File "/home/sfalk/miniconda3/envs/asr2/lib/python3.8/site-packages/tensorflow/python/autograph/impl/api.py", line 396 in converted_call
  File "/home/sfalk/miniconda3/envs/asr2/lib/python3.8/site-packages/tensorflow/python/autograph/impl/api.py", line 667 in wrapper
  File "/home/sfalk/miniconda3/envs/asr2/lib/python3.8/site-packages/tensorflow/python/distribute/mirrored_run.py", line 323 in run
  File "/home/sfalk/miniconda3/envs/asr2/lib/python3.8/threading.py", line 932 in _bootstrap_inner
  File "/home/sfalk/miniconda3/envs/asr2/lib/python3.8/threading.py", line 890 in _bootstrap

Thread 0x00007f9389663740 (most recent call first):
  File "/home/sfalk/miniconda3/envs/asr2/lib/python3.8/threading.py", line 302 in wait
  File "/home/sfalk/miniconda3/envs/asr2/lib/python3.8/threading.py", line 558 in wait
  File "/home/sfalk/miniconda3/envs/asr2/lib/python3.8/site-packages/tensorflow/python/distribute/mirrored_run.py", line 196 in _call_for_each_replica
  File "/home/sfalk/miniconda3/envs/asr2/lib/python3.8/site-packages/tensorflow/python/distribute/mirrored_run.py", line 93 in call_for_each_replica
  File "/home/sfalk/miniconda3/envs/asr2/lib/python3.8/site-packages/tensorflow/python/distribute/mirrored_strategy.py", line 628 in _call_for_each_replica
  File "/home/sfalk/miniconda3/envs/asr2/lib/python3.8/site-packages/tensorflow/python/distribute/distribute_lib.py", line 2730 in call_for_each_replica
  File "/home/sfalk/miniconda3/envs/asr2/lib/python3.8/site-packages/tensorflow/python/distribute/distribute_lib.py", line 1259 in run
  File "/home/sfalk/miniconda3/envs/asr2/lib/python3.8/site-packages/tensorflow/python/keras/engine/training.py", line 795 in step_function
  File "/home/sfalk/miniconda3/envs/asr2/lib/python3.8/site-packages/tensorflow/python/autograph/impl/api.py", line 479 in _call_unconverted
  File "/home/sfalk/miniconda3/envs/asr2/lib/python3.8/site-packages/tensorflow/python/autograph/impl/api.py", line 396 in converted_call
  File "/tmp/tmpembj6sob.py", line 16 in tf__train_function
  File "/home/sfalk/miniconda3/envs/asr2/lib/python3.8/site-packages/tensorflow/python/autograph/impl/api.py", line 459 in converted_call
  File "/home/sfalk/miniconda3/envs/asr2/lib/python3.8/site-packages/tensorflow/python/framework/func_graph.py", line 966 in wrapper
  File "/home/sfalk/miniconda3/envs/asr2/lib/python3.8/site-packages/tensorflow/python/eager/def_function.py", line 634 in wrapped_fn
  File "/home/sfalk/miniconda3/envs/asr2/lib/python3.8/site-packages/tensorflow/python/framework/func_graph.py", line 990 in func_graph_from_py_func
  File "/home/sfalk/miniconda3/envs/asr2/lib/python3.8/site-packages/tensorflow/python/eager/function.py", line 3196 in _create_graph_function
  File "/home/sfalk/miniconda3/envs/asr2/lib/python3.8/site-packages/tensorflow/python/eager/function.py", line 3361 in _maybe_define_function
  File "/home/sfalk/miniconda3/envs/asr2/lib/python3.8/site-packages/tensorflow/python/eager/function.py", line 2969 in _get_concrete_function_internal_garbage_collected
  File "/home/sfalk/miniconda3/envs/asr2/lib/python3.8/site-packages/tensorflow/python/eager/def_function.py", line 725 in _initialize
  File "/home/sfalk/miniconda3/envs/asr2/lib/python3.8/site-packages/tensorflow/python/eager/def_function.py", line 871 in _call
  File "/home/sfalk/miniconda3/envs/asr2/lib/python3.8/site-packages/tensorflow/python/eager/def_function.py", line 828 in __call__
  File "/home/sfalk/miniconda3/envs/asr2/lib/python3.8/site-packages/tensorflow/python/keras/engine/training.py", line 1100 in fit
  File "asr/bin/train_keras.py", line 256 in run_training
  File "asr/bin/train_keras.py", line 292 in main
  File "/home/sfalk/miniconda3/envs/asr2/lib/python3.8/site-packages/absl/app.py", line 251 in _run_main
  File "/home/sfalk/miniconda3/envs/asr2/lib/python3.8/site-packages/absl/app.py", line 300 in run
  File "asr/bin/train_keras.py", line 381 in <module>
Segmentation fault (core dumped)
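
To check whether the crash can be reproduced outside the training script, a small snippet can be run in a child interpreter and its exit status inspected. This is a sketch under assumptions: `run_isolated` is a hypothetical helper, and it relies on the POSIX convention that a negative return code means the child was killed by that signal.

```python
import signal
import subprocess
import sys


def run_isolated(snippet):
    """Run a Python snippet in a child interpreter.

    Returns a (status, stderr) tuple. Hypothetical helper for illustration.
    """
    proc = subprocess.run(
        # -X faulthandler makes the child dump a Python traceback on a crash.
        [sys.executable, "-X", "faulthandler", "-c", snippet],
        capture_output=True,
        text=True,
    )
    if proc.returncode < 0:
        # On POSIX, a negative return code is the number of the fatal signal.
        status = "killed by " + signal.Signals(-proc.returncode).name
    else:
        status = "exit code %d" % proc.returncode
    return status, proc.stderr


if __name__ == "__main__":
    # Deliberately crash the child to show what a segfault looks like.
    crash = "import os, signal; os.kill(os.getpid(), signal.SIGSEGV)"
    status, stderr = run_isolated(crash)
    print(status)
```

Passing `"from warprnnt_tensorflow import rnnt_loss"` as the snippet, followed by progressively larger pieces of the loss call, would indicate at which step the child process dies.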