HawkAaron / warp-transducer

A fast parallel implementation of RNN Transducer.
Apache License 2.0

build against pip-installed tensorflow-gpu gives segfault #6

Closed fginter closed 6 years ago

fginter commented 6 years ago

I use a pip-installed tensorflow-gpu. To complete the build, I had to make the following changes (in addition to #5):

1) Fixed the nsync include path:

../../external/nsync/public  ->  external/nsync/public

2) Used tf.sysconfig.get_lib() to locate and link libtensorflow_framework.so:

if os.path.exists(os.path.join(tf.sysconfig.get_lib(), 'libtensorflow_framework.so')):
    extra_link_args = ['-L' + tf.sysconfig.get_lib(), '-ltensorflow_framework']
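
For context, here is a minimal sketch of how those two changes might fit together in the binding's setup.py. This is a hypothetical layout, not the repo's actual file; it only assumes the standard tf.sysconfig.get_include() / tf.sysconfig.get_lib() helpers:

```python
# Hypothetical sketch: wire the pip-installed TensorFlow headers and
# framework library into the extension build.
import os
import tensorflow as tf

tf_include = tf.sysconfig.get_include()
tf_lib = tf.sysconfig.get_lib()

include_dirs = [
    tf_include,
    # change 1): with a pip install, the nsync headers sit under the
    # TF include directory itself, not two directories above it.
    os.path.join(tf_include, 'external/nsync/public'),
]

extra_link_args = []
# change 2): link against the framework library shipped in the pip package.
if os.path.exists(os.path.join(tf_lib, 'libtensorflow_framework.so')):
    extra_link_args = ['-L' + tf_lib, '-ltensorflow_framework']
```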

But I still get a segfault:

$ python3 test_warprnnt_op.py 
2018-09-03 12:51:15.051736: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2018-09-03 12:51:15.145283: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:897] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2018-09-03 12:51:15.145685: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1405] Found device 0 with properties: 
name: Tesla K80 major: 3 minor: 7 memoryClockRate(GHz): 0.8235
pciBusID: 0000:00:04.0
totalMemory: 11.17GiB freeMemory: 11.09GiB
2018-09-03 12:51:15.145721: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1484] Adding visible gpu devices: 0
2018-09-03 12:51:15.468689: I tensorflow/core/common_runtime/gpu/gpu_device.cc:965] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-09-03 12:51:15.468750: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971]      0 
2018-09-03 12:51:15.468759: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] 0:   N 
2018-09-03 12:51:15.469071: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1097] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3432 MB memory) -> physical GPU (device: 0, name: Tesla K80, pci bus id: 0000:00:04.0, compute capability: 3.7)
*** Received signal 11 ***
*** BEGIN MANGLED STACK TRACE ***
/home/ginter/venv-mozds-gpu/lib/python3.5/site-packages/tensorflow/python/../libtensorflow_framework.so(+0x6a2c5b)[0x7fe9c3251c5b]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x11390)[0x7fe9fc009390]
/home/ginter/venv-mozds-gpu/lib/python3.5/site-packages/warprnnt_tensorflow-0.1-py3.5-linux-x86_64.egg/warprnnt_tensorflow/kernels.cpython-35m-x86_64-linux-gnu.so(_ZN9warp_rnnt14WarpRNNTOpBase7ComputeEPN10tensorflow15OpKernelContextE+0x2b7)[0x7fe98b6d2d37]
/home/ginter/venv-mozds-gpu/lib/python3.5/site-packages/tensorflow/python/../libtensorflow_framework.so(_ZN10tensorflow13BaseGPUDevice13ComputeHelperEPNS_8OpKernelEPNS_15OpKernelContextE+0x37d)[0x7fe9c317ecdd]
/home/ginter/venv-mozds-gpu/lib/python3.5/site-packages/tensorflow/python/../libtensorflow_framework.so(_ZN10tensorflow13BaseGPUDevice7ComputeEPNS_8OpKernelEPNS_15OpKernelContextE+0x8d)[0x7fe9c317f14d]
/home/ginter/venv-mozds-gpu/lib/python3.5/site-packages/tensorflow/python/../libtensorflow_framework.so(+0x61b061)[0x7fe9c31ca061]
/home/ginter/venv-mozds-gpu/lib/python3.5/site-packages/tensorflow/python/../libtensorflow_framework.so(+0x61b87a)[0x7fe9c31ca87a]
/home/ginter/venv-mozds-gpu/lib/python3.5/site-packages/tensorflow/python/../libtensorflow_framework.so(_ZN5Eigen26NonBlockingThreadPoolTemplIN10tensorflow6thread16EigenEnvironmentEE10WorkerLoopEi+0x21a)[0x7fe9c322c22a]
/home/ginter/venv-mozds-gpu/lib/python3.5/site-packages/tensorflow/python/../libtensorflow_framework.so(_ZNSt17_Function_handlerIFvvEZN10tensorflow6thread16EigenEnvironment12CreateThreadESt8functionIS0_EEUlvE_E9_M_invokeERKSt9_Any_data+0x32)[0x7fe9c322b2d2]
/usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xb8c80)[0x7fe9b99d6c80]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x76ba)[0x7fe9fbfff6ba]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x6d)[0x7fe9fbd3541d]
*** END MANGLED STACK TRACE ***

*** Begin stack trace ***
    tensorflow::CurrentStackTrace()

    warp_rnnt::WarpRNNTOpBase::Compute(tensorflow::OpKernelContext*)
    tensorflow::BaseGPUDevice::ComputeHelper(tensorflow::OpKernel*, tensorflow::OpKernelContext*)
    tensorflow::BaseGPUDevice::Compute(tensorflow::OpKernel*, tensorflow::OpKernelContext*)

    Eigen::NonBlockingThreadPoolTempl<tensorflow::thread::EigenEnvironment>::WorkerLoop(int)
    std::_Function_handler<void (), tensorflow::thread::EigenEnvironment::CreateThread(std::function<void ()>)::{lambda()#1}>::_M_invoke(std::_Any_data const&)

    clone
*** End stack trace ***
Aborted (core dumped)
fginter commented 6 years ago

The library itself seems to be fine, so the problem is in the TensorFlow bindings. Any thoughts on this?


(venv-mozds-gpu) ginter@speech-gpu-preempt:~/warp-transducer/build$ ./test_gpu 
Running gpu tests
finish small_test 1
finish options_test 1
finish inf_test 1
finished 1
Tests pass
fginter commented 6 years ago

Tried with TensorFlow 1.9. There, setup.py install works out of the box, but the same segmentation fault occurs.

fginter commented 6 years ago

The segfault happens in this loop: https://github.com/HawkAaron/warp-transducer/blob/master/tensorflow_binding/src/warprnnt_op.cc#L82. When I comment it out, setup.py test finishes without error.

HawkAaron commented 6 years ago

Thank you very much! This error happened because all tensors are placed on the GPU except "costs", so I have temporarily removed the batch-size checking code; it works well now.
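
For anyone who wants to see that placement for themselves, here is a quick sketch using device placement logging. It assumes the binding exposes warprnnt_tensorflow.rnnt_loss with an (acts, labels, input_lengths, label_lengths) signature and a [B, T, U+1, V] activation layout; both are assumptions inferred from the stack trace above, not confirmed details:

```python
# Sketch only: tiny dummy problem (batch=1, T=2, U=1, vocab=3) to see where
# TensorFlow places the RNNT op and its "costs" output.
import numpy as np
import tensorflow as tf
from warprnnt_tensorflow import rnnt_loss  # assumed import path and signature

acts = tf.constant(np.random.rand(1, 2, 2, 3).astype(np.float32))  # [B, T, U+1, V] (assumed layout)
labels = tf.constant([[1]], dtype=tf.int32)
input_lengths = tf.constant([2], dtype=tf.int32)
label_lengths = tf.constant([1], dtype=tf.int32)

costs = rnnt_loss(acts, labels, input_lengths, label_lengths)

# log_device_placement prints the device chosen for every node, which is where
# a GPU-vs-host mismatch like the one described above becomes visible.
with tf.Session(config=tf.ConfigProto(log_device_placement=True)) as sess:
    print(sess.run(costs))
```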

YimengZhu commented 4 years ago

Hi,

Is there a similar fix for the PyTorch binding?

I can run the built binaries (test_cpu, test_gpu) without problems, but with the PyTorch binding I get a segfault.

Best regards, Yimeng
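
A minimal repro sketch that may help narrow this down, assuming the PyTorch binding exposes warprnnt_pytorch.RNNTLoss as in this repo's pytorch_binding. Which device each tensor must live on is exactly the kind of placement detail this issue is about, so treat the .cuda() call and the CPU-side labels below as assumptions and check the binding's README:

```python
# Sketch only: tiny dummy problem (batch=1, T=2, U=1, vocab=3).
import torch
from warprnnt_pytorch import RNNTLoss  # assumed import path

rnnt_loss = RNNTLoss()

# Raw activations here; check the README for whether log_softmax must be applied first.
acts = torch.randn(1, 2, 2, 3).cuda().requires_grad_()  # [B, T, U+1, V] (assumed layout)
labels = torch.IntTensor([[1]])                         # kept on CPU here (assumption)
act_lens = torch.IntTensor([2])
label_lens = torch.IntTensor([1])

loss = rnnt_loss(acts, labels, act_lens, label_lens)
loss.sum().backward()  # sum() in case the binding returns per-sample costs
print(loss)
```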