HawkAaron / warp-transducer

A fast parallel implementation of RNN Transducer.
Apache License 2.0

TensorFlow test fails at test_multiple_batches_gpu with mismatched results #30

Closed qppp558 closed 5 years ago

qppp558 commented 5 years ago

OS: Ubuntu 16.04.3 LTS
CUDA version: 9.0
GPU: Tesla P100

I built and installed the TensorFlow binding without any apparent errors. However, when I ran the unit tests with python setup.py test, they failed with the following output:

setup.py:63: UserWarning: Assuming tensorflow was compiled without C++11 ABI. It is generally true if you are using binary pip package. If you compiled tensorflow from source with gcc >= 5 and didn't set -D_GLIBCXX_USE_CXX11_ABI=0 during compilation, you need to set environment variable TF_CXX11_ABI=1 when compiling this bindings. Also be sure to touch some files in src to trigger recompilation. Also, you need to set (or unsed) this environment variable if getting undefined symbol: _ZN10tensorflow... errors
  warnings.warn("Assuming tensorflow was compiled without C++11 ABI. "
running test
running egg_info
writing warprnnt_tensorflow.egg-info/PKG-INFO
writing top-level names to warprnnt_tensorflow.egg-info/top_level.txt
writing dependency_links to warprnnt_tensorflow.egg-info/dependency_links.txt
reading manifest file 'warprnnt_tensorflow.egg-info/SOURCES.txt'
writing manifest file 'warprnnt_tensorflow.egg-info/SOURCES.txt'
running build_ext
copying build/lib.linux-x86_64-3.5/warprnnt_tensorflow/kernels.cpython-35m-x86_64-linux-gnu.so -> warprnnt_tensorflow
/data/gengjie/workspace/warp-transducer/tensorflow_binding/setup.py:63: UserWarning: Assuming tensorflow was compiled without C++11 ABI. It is generally true if you are using binary pip package. If you compiled tensorflow from source with gcc >= 5 and didn't set -D_GLIBCXX_USE_CXX11_ABI=0 during compilation, you need to set environment variable TF_CXX11_ABI=1 when compiling this bindings. Also be sure to touch some files in src to trigger recompilation. Also, you need to set (or unsed) this environment variable if getting undefined symbol: _ZN10tensorflow... errors
  warnings.warn("Assuming tensorflow was compiled without C++11 ABI. "
running test
running egg_info
writing warprnnt_tensorflow.egg-info/PKG-INFO
writing top-level names to warprnnt_tensorflow.egg-info/top_level.txt
writing dependency_links to warprnnt_tensorflow.egg-info/dependency_links.txt
reading manifest file 'warprnnt_tensorflow.egg-info/SOURCES.txt'
writing manifest file 'warprnnt_tensorflow.egg-info/SOURCES.txt'
running build_ext
copying build/lib.linux-x86_64-3.5/warprnnt_tensorflow/kernels.cpython-35m-x86_64-linux-gnu.so -> warprnnt_tensorflow
2019-07-16 08:09:39.373288: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2019-07-16 08:09:39.811372: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 0 with properties:
name: Tesla P100-PCIE-16GB major: 6 minor: 0 memoryClockRate(GHz): 1.3285
pciBusID: 0000:22:00.0
totalMemory: 15.90GiB freeMemory: 15.34GiB
2019-07-16 08:09:39.811462: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-07-16 08:09:40.169127: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-07-16 08:09:40.169209: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0
2019-07-16 08:09:40.169237: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N
2019-07-16 08:09:40.169744: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 14862 MB memory) -> physical GPU (device: 0, name: Tesla P100-PCIE-16GB, pci bus id: 0000:22:00.0, compute capability: 6.0)
[4.280653 3.938437]
[array([[[[-1.86843961e-01, -6.25548363e-02,  2.49398738e-01],
     [-2.03376651e-01,  2.02399358e-01,  9.77352262e-04],
     [-1.41016066e-01,  7.91234598e-02,  6.18926175e-02]],

    [[-1.15517527e-02, -8.12802464e-02,  9.28320065e-02],
     [-1.54257044e-01,  2.29432672e-01, -7.51756430e-02],
     [-2.46593103e-01,  1.46404594e-01,  1.00188486e-01]],

    [[-1.29182935e-02, -6.15932457e-02,  7.45115280e-02],
     [-5.59856892e-02,  2.19830751e-01, -1.63845122e-01],
     [-4.97626871e-01,  2.09239930e-01,  2.88386971e-01]],

    [[ 1.36048598e-02, -3.02196294e-02,  1.66147705e-02],
     [ 1.13924518e-01,  6.27812073e-02, -1.76705718e-01],
     [-6.67078257e-01,  3.67658854e-01,  2.99419403e-01]]],

   [[[-3.56343776e-01, -5.53474724e-02,  4.11691159e-01],
     [-9.69219282e-02,  2.94590741e-02,  6.74628317e-02],
     [-6.35175705e-02,  2.76544970e-02,  3.58630754e-02]],

    [[-1.54498994e-01, -7.39420503e-02,  2.28441045e-01],
     [-1.66789889e-01, -8.79168510e-05,  1.66877776e-01],
     [-1.72369659e-01,  1.05565324e-01,  6.68043196e-02]],

    [[ 2.38748863e-02, -1.18255839e-01,  9.43809301e-02],
     [-1.04707092e-01, -1.08934462e-01,  2.13641584e-01],
     [-3.69844258e-01,  1.80118084e-01,  1.89726144e-01]],

    [[ 2.57137045e-02, -7.94617534e-02,  5.37480488e-02],
     [ 1.22328229e-01, -2.38788679e-01,  1.16460443e-01],
     [-5.98686934e-01,  3.02203149e-01,  2.96483815e-01]]]],
  dtype=float32)]

test_forward (test_warprnnt_op.WarpRNNTTest) ...
2019-07-16 08:09:40.466216: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-07-16 08:09:40.466285: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-07-16 08:09:40.466299: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0
2019-07-16 08:09:40.466309: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N
2019-07-16 08:09:40.466501: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 14862 MB memory) -> physical GPU (device: 0, name: Tesla P100-PCIE-16GB, pci bus id: 0000:22:00.0, compute capability: 6.0)
[4.4956665]
ok
test_multiple_batches_cpu (test_warprnnt_op.WarpRNNTTest) ... /data/gengjie/workspace/warp-transducer/tensorflow_binding/tests/test_warprnnt_op.py:14: DeprecationWarning: Please use assertEqual instead.
  self.assertEquals(acts.shape, expected_grads.shape)
2019-07-16 08:09:40.505122: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-07-16 08:09:40.505221: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-07-16 08:09:40.505236: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0
2019-07-16 08:09:40.505245: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N
2019-07-16 08:09:40.505559: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 14862 MB memory) -> physical GPU (device: 0, name: Tesla P100-PCIE-16GB, pci bus id: 0000:22:00.0, compute capability: 6.0)
ok
test_multiple_batches_gpu (test_warprnnt_op.WarpRNNTTest) ...
2019-07-16 08:09:40.522426: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-07-16 08:09:40.522482: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-07-16 08:09:40.522506: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0
2019-07-16 08:09:40.522522: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N
2019-07-16 08:09:40.522782: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/device:GPU:0 with 14862 MB memory) -> physical GPU (device: 0, name: Tesla P100-PCIE-16GB, pci bus id: 0000:22:00.0, compute capability: 6.0)
2019-07-16 08:09:40.538106: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-07-16 08:09:40.538146: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-07-16 08:09:40.538159: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0
2019-07-16 08:09:40.538169: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N
2019-07-16 08:09:40.538347: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 14862 MB memory) -> physical GPU (device: 0, name: Tesla P100-PCIE-16GB, pci bus id: 0000:22:00.0, compute capability: 6.0)
FAIL
test_session (test_warprnnt_op.WarpRNNTTest)
Use cached_session instead. ... ok

======================================================================
FAIL: test_multiple_batches_gpu (test_warprnnt_op.WarpRNNTTest)

Traceback (most recent call last):
  File "/data/gengjie/workspace/warp-transducer/tensorflow_binding/tests/test_warprnnt_op.py", line 92, in test_multiple_batches_gpu
    self._test_multiple_batches(use_gpu=True)
  File "/data/gengjie/workspace/warp-transducer/tensorflow_binding/tests/test_warprnnt_op.py", line 85, in _test_multiple_batches
    self._run_rnnt(acts, labels, input_lengths, label_lengths, expected_costs, expected_grads, 0, use_gpu)
  File "/data/gengjie/workspace/warp-transducer/tensorflow_binding/tests/test_warprnnt_op.py", line 27, in _run_rnnt
    self.assertAllClose(tf_costs, expected_costs, atol=1e-6)
  File "/data/gengjie/env/lib/python3.5/site-packages/tensorflow/python/framework/test_util.py", line 1591, in assertAllClose
    self._assertAllCloseRecursive(a, b, rtol=rtol, atol=atol, msg=msg)
  File "/data/gengjie/env/lib/python3.5/site-packages/tensorflow/python/framework/test_util.py", line 1561, in _assertAllCloseRecursive
    (path_str, path_str, msg)))
  File "/data/gengjie/env/lib/python3.5/site-packages/tensorflow/python/framework/test_util.py", line 1496, in _assertArrayLikeAllClose
    a, b, rtol=rtol, atol=atol, err_msg="\n".join(msgs), equal_nan=True)
  File "/data/gengjie/env/lib/python3.5/site-packages/numpy/testing/_private/utils.py", line 1501, in assert_allclose
    verbose=verbose, header=header, equal_nan=equal_nan)
  File "/data/gengjie/env/lib/python3.5/site-packages/numpy/testing/_private/utils.py", line 827, in assert_array_compare
    raise AssertionError(msg)
AssertionError:
Not equal to tolerance rtol=1e-06, atol=1e-06
Mismatched value: a is different from b.
not close where = (array([0, 1]),)
not close lhs = [-5.3799906 -5.5812006]
not close rhs = [4.28065 3.93844]
not close dif = [9.660641 9.519641]
not close tol = [5.28065e-06 4.93844e-06]
dtype = float32, shape = (2,)
Mismatch: 100%
Max absolute difference: 9.660641
Max relative difference: 2.4171095
 x: array([-5.379991, -5.581201], dtype=float32)
 y: array([4.28065, 3.93844], dtype=float32)


Ran 4 tests in 0.095s

FAILED (failures=1)
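
For readers decoding the assertion output: the "not close tol" row appears to follow the usual allclose rule, where a pair counts as close when |a - b| <= atol + rtol * |b|. The following is a minimal pure-Python sketch of that arithmetic using the values printed in the log above. The helper name tolerance_report is hypothetical and is not part of TensorFlow, NumPy, or this repository.

```python
def tolerance_report(lhs, rhs, rtol=1e-6, atol=1e-6):
    """Return (dif, tol, close) per element, using |a - b| <= atol + rtol * |b|."""
    report = []
    for a, b in zip(lhs, rhs):
        dif = abs(a - b)            # absolute difference, the "dif" row
        tol = atol + rtol * abs(b)  # allowed difference, the "tol" row
        report.append((dif, tol, dif <= tol))
    return report


# Values copied from the failing GPU run in the log.
gpu_costs = [-5.3799906, -5.5812006]
expected_costs = [4.28065, 3.93844]

for dif, tol, close in tolerance_report(gpu_costs, expected_costs):
    print("dif=%.6f tol=%.5e close=%s" % (dif, tol, close))
```

The differences are roughly 9.6 against an allowed tolerance of about 5e-6, so this is not a rounding problem. The GPU costs are also negative, which a negative log-likelihood loss should never be, so the GPU path seems to be returning invalid values rather than slightly imprecise ones.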

qppp558 commented 5 years ago

It seems that I did not set $CUDA_HOME when building the package.
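
A quick pre-build check along these lines can rule out the missing-variable case. This is only a sketch: the helper name check_cuda_home is hypothetical, the bin/nvcc location assumes a standard Linux CUDA toolkit layout, and whether the binding's setup.py actually skips GPU support when CUDA_HOME is absent is an assumption based on the comment above, not something confirmed in this thread.

```python
import os


def check_cuda_home(environ=None):
    """Return (ok, message) describing whether CUDA_HOME looks usable."""
    environ = os.environ if environ is None else environ
    cuda_home = environ.get("CUDA_HOME")
    if not cuda_home:
        return False, "CUDA_HOME is not set"
    if not os.path.isdir(cuda_home):
        return False, "CUDA_HOME does not point to a directory: %s" % cuda_home
    # Assumption: the toolkit keeps its compiler at <CUDA_HOME>/bin/nvcc.
    nvcc = os.path.join(cuda_home, "bin", "nvcc")
    if not os.path.isfile(nvcc):
        return False, "nvcc not found at %s" % nvcc
    return True, "CUDA_HOME looks usable: %s" % cuda_home


if __name__ == "__main__":
    ok, message = check_cuda_home()
    print(message)
```

If the check passes, rebuild the binding from a clean tree so the extension is recompiled; the build warning in the log already notes that source files may need to be touched to trigger recompilation.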

chuikova-e commented 5 years ago

I set $CUDA_HOME, but I have the same problem.

kobenaxie commented 4 years ago

Same error with Python 3.6.0 and CUDA 9.0.

yjiangling commented 4 years ago

I set $CUDA_HOME, but I have the same problem.

So, how did you solve it?