FduJyy opened 6 years ago
I installed Caffe2 via the pre-built binaries using
conda install -c caffe2 caffe2-cuda9.0-cudnn7
and ran into a problem: a file called "libnccl.so.2" appears to be missing. I cloned the NCCL library and compiled it myself, but did not find any file called "libnccl.so.2". This problem is still unsolved.
Which NCCL library did you clone? This is the script we use to install the NCCL that we build against: https://github.com/caffe2/caffe2/blob/master/docker/jenkins/common/install_nccl.sh . This should be the library you need: http://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1604/x86_64/nvidia-machine-learning-repo-ubuntu1404_4.0-2_amd64.deb . If you call that script with UBUNTU_VERSION=16.04 and CUDA_VERSION=9.0, it should install correctly.
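Concretely, that invocation looks something like this (a sketch, assuming the caffe2 repo is cloned into ./caffe2; only the two env-var names come from the comment above):

import os
import subprocess

# Pass the Ubuntu and CUDA versions to the install script via the environment.
env = dict(os.environ, UBUNTU_VERSION="16.04", CUDA_VERSION="9.0")
subprocess.check_call(["bash", "docker/jenkins/common/install_nccl.sh"],
                      cwd="caffe2", env=env)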
@pjh5 Thanks for your help! Now I can run
from caffe2.python import workspace
without errors.
Next I tried to set up the Detectron platform. However, after installing its dependencies and running the SpatialNarrowAsOp test, I hit another problem:
Encountered CUDA error: no kernel image is available for execution on the device Error from operator: input: "A" input: "B" input: "C_grad" output: "A_grad" name: "" type: "SpatialNarrowAsGradient" device_option { device_type: 1 cuda_gpu_id: 0 } is_gradient_op: true
Do you know what might be causing this? The full output:
(caffe) jyy@jyy-OptiPlex-9020:~/Detectron$ python ./tests/test_spatial_narrow_as_op.py
E0309 14:17:00.375676 3086 init_intrinsics_check.cc:59] CPU feature avx is present on your machine, but the Caffe2 binary is not compiled with it. It means you may not get the full speed of your CPU.
E0309 14:17:00.375697 3086 init_intrinsics_check.cc:59] CPU feature avx2 is present on your machine, but the Caffe2 binary is not compiled with it. It means you may not get the full speed of your CPU.
E0309 14:17:00.375700 3086 init_intrinsics_check.cc:59] CPU feature fma is present on your machine, but the Caffe2 binary is not compiled with it. It means you may not get the full speed of your CPU.
Found Detectron ops lib: /home/jyy/anaconda3/envs/caffe/lib/libcaffe2_detectron_ops_gpu.so
F.E
======================================================================
ERROR: test_small_forward_and_gradient (__main__.SpatialNarrowAsOpTest)
----------------------------------------------------------------------
Traceback (most recent call last):
File "./tests/test_spatial_narrow_as_op.py", line 59, in test_small_forward_and_gradient
self._run_test(A, B, check_grad=True)
File "./tests/test_spatial_narrow_as_op.py", line 49, in _run_test
res, grad, grad_estimated = gc.CheckSimple(op, [A, B], 0, [0])
File "/home/jyy/anaconda3/envs/caffe/lib/python2.7/site-packages/caffe2/python/gradient_checker.py", line 284, in CheckSimple
outputs_with_grads
File "/home/jyy/anaconda3/envs/caffe/lib/python2.7/site-packages/caffe2/python/gradient_checker.py", line 201, in GetLossAndGrad
workspace.RunOperatorsOnce(grad_ops)
File "/home/jyy/anaconda3/envs/caffe/lib/python2.7/site-packages/caffe2/python/workspace.py", line 184, in RunOperatorsOnce
success = RunOperatorOnce(op)
File "/home/jyy/anaconda3/envs/caffe/lib/python2.7/site-packages/caffe2/python/workspace.py", line 179, in RunOperatorOnce
return C.run_operator_once(StringifyProto(operator))
RuntimeError: [enforce fail at context_gpu.h:171] . Encountered CUDA error: no kernel image is available for execution on the device Error from operator:
input: "A" input: "B" input: "C_grad" output: "A_grad" name: "" type: "SpatialNarrowAsGradient" device_option { device_type: 1 cuda_gpu_id: 0 } is_gradient_op: true
======================================================================
FAIL: test_large_forward (__main__.SpatialNarrowAsOpTest)
----------------------------------------------------------------------
Traceback (most recent call last):
File "./tests/test_spatial_narrow_as_op.py", line 68, in test_large_forward
self._run_test(A, B)
File "./tests/test_spatial_narrow_as_op.py", line 54, in _run_test
np.testing.assert_allclose(C, C_ref, rtol=1e-5, atol=1e-08)
File "/home/jyy/anaconda3/envs/caffe/lib/python2.7/site-packages/numpy/testing/nose_tools/utils.py", line 1396, in assert_allclose
verbose=verbose, header=header, equal_nan=equal_nan)
File "/home/jyy/anaconda3/envs/caffe/lib/python2.7/site-packages/numpy/testing/nose_tools/utils.py", line 779, in assert_array_compare
raise AssertionError(msg)
AssertionError:
Not equal to tolerance rtol=1e-05, atol=1e-08
(mismatch 100.0%)
x: array([[[[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],...
y: array([[[[ 1.707480e+00, 1.710607e+00, 1.279160e+00, ...,
-9.014695e-01, -1.781531e+00, 4.036736e-01],
[ 1.895508e+00, -3.324545e-01, 3.578335e-01, ...,...
----------------------------------------------------------------------
Ran 3 tests in 0.557s
FAILED (failures=1, errors=1)
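For anyone debugging this: here is a minimal sketch for checking whether this Caffe2 binary can run any CUDA kernel at all, independent of Detectron's custom ops (it uses only stock operators; "Relu" is just an arbitrary GPU op chosen for the check):

import numpy as np
from caffe2.proto import caffe2_pb2
from caffe2.python import core, workspace

# Build a device option matching the one in the failing operator
# (device_type: 1 is CUDA, cuda_gpu_id: 0).
device = core.DeviceOption(caffe2_pb2.CUDA, 0)

# Feed a small tensor to GPU 0 and run a stock CUDA op on it.
workspace.FeedBlob("x", np.random.rand(4, 4).astype(np.float32),
                   device_option=device)
op = core.CreateOperator("Relu", ["x"], ["y"], device_option=device)
workspace.RunOperatorOnce(op)
print(workspace.FetchBlob("y"))

If even this fails with "no kernel image is available for execution on the device", the prebuilt binary was not compiled for this GPU's compute capability, which is consistent with the fix reported further down (building from source).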
@FduJyy can you try running this on CUDA 8?
@orionr, should this work on CUDA 9 right now?
I also encountered this problem, but after I compiled Caffe2 from source, the problem went away.
(Quoting the same two tracebacks shown in the test output above.)
What does your CUDA installation look like? Can you ls -lah the folder where CUDA is installed? You can probably find it with find / -name 'libcuda*'
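(A Python equivalent of that find, restricted to the usual install prefixes so it returns quickly; the two patterns are assumptions about common install locations:)

import glob

patterns = [
    "/usr/local/cuda*/lib64/libcud*",      # toolkit libraries (libcudart, libcudnn, ...)
    "/usr/lib/x86_64-linux-gnu/libcuda*",  # driver library on Ubuntu
]
for pattern in patterns:
    for path in sorted(glob.glob(pattern)):
        print(path)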
@pjh5 Do you mean my CUDA installation? Here it is.
jyy@jyy:/usr/local/cuda-9.0$ ls -lah
drwxr-xr-x 18 root root 4.0K 3月 8 22:05 .
drwxr-xr-x 13 root root 4.0K 3月 8 22:02 ..
drwxr-xr-x 3 root root 4.0K 3月 8 22:02 bin
drwxr-xr-x 5 root root 4.0K 3月 8 22:02 doc
drwxr-xr-x 5 root root 4.0K 3月 8 22:02 extras
drwxr-xr-x 5 root root 4.0K 3月 9 21:00 include
drwxr-xr-x 5 root root 4.0K 3月 8 22:02 jre
drwxr-xr-x 3 root root 4.0K 3月 9 22:59 lib64
drwxr-xr-x 8 root root 4.0K 3月 8 22:02 libnsight
drwxr-xr-x 7 root root 4.0K 3月 8 22:02 libnvvp
drwxr-xr-x 2 root root 4.0K 3月 8 22:02 nsightee_plugins
-r--r--r-- 1 root root 39K 3月 8 22:59 NVIDIA_SLA_cuDNN_Support.txt
drwxr-xr-x 3 root root 4.0K 3月 8 22:02 nvml
drwxr-xr-x 7 root root 4.0K 3月 8 22:02 nvvm
drwxr-xr-x 2 root root 4.0K 3月 8 22:02 pkgconfig
drwxr-xr-x 11 root root 4.0K 3月 8 22:02 samples
drwxr-xr-x 3 root root 4.0K 3月 8 22:02 share
drwxr-xr-x 2 root root 4.0K 3月 8 22:02 src
drwxr-xr-x 2 root root 4.0K 3月 8 22:02 tools
-rw-r--r-- 1 root root 21 3月 8 22:02 version.txt
jyy@jyy:/usr/local/cuda-9.0$ ls -lah lib64
drwxr-xr-x 3 root root 4.0K 3月 9 22:59 .
drwxr-xr-x 18 root root 4.0K 3月 8 22:05 ..
lrwxrwxrwx 1 root root 18 3月 8 22:02 libaccinj64.so -> libaccinj64.so.9.0
lrwxrwxrwx 1 root root 22 3月 8 22:02 libaccinj64.so.9.0 -> libaccinj64.so.9.0.176
-rwxr-xr-x 1 root root 6.6M 3月 8 22:02 libaccinj64.so.9.0.176
-rw-r--r-- 1 root root 67M 3月 8 22:02 libcublas_device.a
lrwxrwxrwx 1 root root 16 3月 8 22:02 libcublas.so -> libcublas.so.9.0
lrwxrwxrwx 1 root root 20 3月 8 22:02 libcublas.so.9.0 -> libcublas.so.9.0.176
-rwxr-xr-x 1 root root 51M 3月 8 22:02 libcublas.so.9.0.176
-rw-r--r-- 1 root root 57M 3月 8 22:02 libcublas_static.a
-rw-r--r-- 1 root root 624K 3月 8 22:02 libcudadevrt.a
lrwxrwxrwx 1 root root 16 3月 8 22:02 libcudart.so -> libcudart.so.9.0
lrwxrwxrwx 1 root root 20 3月 8 22:02 libcudart.so.9.0 -> libcudart.so.9.0.176
-rwxr-xr-x 1 root root 433K 3月 8 22:02 libcudart.so.9.0.176
-rw-r--r-- 1 root root 812K 3月 8 22:02 libcudart_static.a
-rwxr-xr-x 1 root root 306M 3月 9 22:59 libcudnn.so
-rwxr-xr-x 1 root root 306M 3月 9 22:59 libcudnn.so.7
-rwxr-xr-x 1 root root 275M 3月 9 21:00 libcudnn.so.7.0.5
-rwxr-xr-x 1 root root 306M 3月 9 22:59 libcudnn.so.7.1.1
-rw-r--r-- 1 root root 302M 3月 9 23:00 libcudnn_static.a
lrwxrwxrwx 1 root root 15 3月 8 22:02 libcufft.so -> libcufft.so.9.0
lrwxrwxrwx 1 root root 19 3月 8 22:02 libcufft.so.9.0 -> libcufft.so.9.0.176
-rwxr-xr-x 1 root root 127M 3月 8 22:02 libcufft.so.9.0.176
-rw-r--r-- 1 root root 131M 3月 8 22:02 libcufft_static.a
lrwxrwxrwx 1 root root 16 3月 8 22:02 libcufftw.so -> libcufftw.so.9.0
lrwxrwxrwx 1 root root 20 3月 8 22:02 libcufftw.so.9.0 -> libcufftw.so.9.0.176
-rwxr-xr-x 1 root root 496K 3月 8 22:02 libcufftw.so.9.0.176
-rw-r--r-- 1 root root 41K 3月 8 22:02 libcufftw_static.a
lrwxrwxrwx 1 root root 17 3月 8 22:02 libcuinj64.so -> libcuinj64.so.9.0
lrwxrwxrwx 1 root root 21 3月 8 22:02 libcuinj64.so.9.0 -> libcuinj64.so.9.0.176
-rwxr-xr-x 1 root root 6.9M 3月 8 22:02 libcuinj64.so.9.0.176
-rw-r--r-- 1 root root 1.6M 3月 8 22:02 libculibos.a
lrwxrwxrwx 1 root root 16 3月 8 22:02 libcurand.so -> libcurand.so.9.0
lrwxrwxrwx 1 root root 20 3月 8 22:02 libcurand.so.9.0 -> libcurand.so.9.0.176
-rwxr-xr-x 1 root root 57M 3月 8 22:02 libcurand.so.9.0.176
-rw-r--r-- 1 root root 57M 3月 8 22:02 libcurand_static.a
lrwxrwxrwx 1 root root 18 3月 8 22:02 libcusolver.so -> libcusolver.so.9.0
lrwxrwxrwx 1 root root 22 3月 8 22:02 libcusolver.so.9.0 -> libcusolver.so.9.0.176
-rwxr-xr-x 1 root root 74M 3月 8 22:02 libcusolver.so.9.0.176
-rw-r--r-- 1 root root 34M 3月 8 22:02 libcusolver_static.a
lrwxrwxrwx 1 root root 18 3月 8 22:02 libcusparse.so -> libcusparse.so.9.0
lrwxrwxrwx 1 root root 22 3月 8 22:02 libcusparse.so.9.0 -> libcusparse.so.9.0.176
-rwxr-xr-x 1 root root 54M 3月 8 22:02 libcusparse.so.9.0.176
-rw-r--r-- 1 root root 62M 3月 8 22:02 libcusparse_static.a
lrwxrwxrwx 1 root root 14 3月 8 22:02 libnppc.so -> libnppc.so.9.0
lrwxrwxrwx 1 root root 18 3月 8 22:02 libnppc.so.9.0 -> libnppc.so.9.0.176
-rwxr-xr-x 1 root root 478K 3月 8 22:02 libnppc.so.9.0.176
-rw-r--r-- 1 root root 24K 3月 8 22:02 libnppc_static.a
lrwxrwxrwx 1 root root 16 3月 8 22:02 libnppial.so -> libnppial.so.9.0
lrwxrwxrwx 1 root root 20 3月 8 22:02 libnppial.so.9.0 -> libnppial.so.9.0.176
-rwxr-xr-x 1 root root 11M 3月 8 22:02 libnppial.so.9.0.176
-rw-r--r-- 1 root root 16M 3月 8 22:02 libnppial_static.a
lrwxrwxrwx 1 root root 16 3月 8 22:02 libnppicc.so -> libnppicc.so.9.0
lrwxrwxrwx 1 root root 20 3月 8 22:02 libnppicc.so.9.0 -> libnppicc.so.9.0.176
-rwxr-xr-x 1 root root 4.1M 3月 8 22:02 libnppicc.so.9.0.176
-rw-r--r-- 1 root root 4.8M 3月 8 22:02 libnppicc_static.a
lrwxrwxrwx 1 root root 17 3月 8 22:02 libnppicom.so -> libnppicom.so.9.0
lrwxrwxrwx 1 root root 21 3月 8 22:02 libnppicom.so.9.0 -> libnppicom.so.9.0.176
-rwxr-xr-x 1 root root 1.3M 3月 8 22:02 libnppicom.so.9.0.176
-rw-r--r-- 1 root root 1011K 3月 8 22:02 libnppicom_static.a
lrwxrwxrwx 1 root root 17 3月 8 22:02 libnppidei.so -> libnppidei.so.9.0
lrwxrwxrwx 1 root root 21 3月 8 22:02 libnppidei.so.9.0 -> libnppidei.so.9.0.176
-rwxr-xr-x 1 root root 7.5M 3月 8 22:02 libnppidei.so.9.0.176
-rw-r--r-- 1 root root 11M 3月 8 22:02 libnppidei_static.a
lrwxrwxrwx 1 root root 15 3月 8 22:02 libnppif.so -> libnppif.so.9.0
lrwxrwxrwx 1 root root 19 3月 8 22:02 libnppif.so.9.0 -> libnppif.so.9.0.176
-rwxr-xr-x 1 root root 55M 3月 8 22:02 libnppif.so.9.0.176
-rw-r--r-- 1 root root 60M 3月 8 22:02 libnppif_static.a
lrwxrwxrwx 1 root root 15 3月 8 22:02 libnppig.so -> libnppig.so.9.0
lrwxrwxrwx 1 root root 19 3月 8 22:02 libnppig.so.9.0 -> libnppig.so.9.0.176
-rwxr-xr-x 1 root root 27M 3月 8 22:02 libnppig.so.9.0.176
-rw-r--r-- 1 root root 30M 3月 8 22:02 libnppig_static.a
lrwxrwxrwx 1 root root 15 3月 8 22:02 libnppim.so -> libnppim.so.9.0
lrwxrwxrwx 1 root root 19 3月 8 22:02 libnppim.so.9.0 -> libnppim.so.9.0.176
-rwxr-xr-x 1 root root 4.9M 3月 8 22:02 libnppim.so.9.0.176
-rw-r--r-- 1 root root 4.9M 3月 8 22:02 libnppim_static.a
lrwxrwxrwx 1 root root 16 3月 8 22:02 libnppist.so -> libnppist.so.9.0
lrwxrwxrwx 1 root root 20 3月 8 22:02 libnppist.so.9.0 -> libnppist.so.9.0.176
-rwxr-xr-x 1 root root 15M 3月 8 22:02 libnppist.so.9.0.176
-rw-r--r-- 1 root root 20M 3月 8 22:02 libnppist_static.a
lrwxrwxrwx 1 root root 16 3月 8 22:02 libnppisu.so -> libnppisu.so.9.0
lrwxrwxrwx 1 root root 20 3月 8 22:02 libnppisu.so.9.0 -> libnppisu.so.9.0.176
-rwxr-xr-x 1 root root 467K 3月 8 22:02 libnppisu.so.9.0.176
-rw-r--r-- 1 root root 11K 3月 8 22:02 libnppisu_static.a
lrwxrwxrwx 1 root root 16 3月 8 22:02 libnppitc.so -> libnppitc.so.9.0
lrwxrwxrwx 1 root root 20 3月 8 22:02 libnppitc.so.9.0 -> libnppitc.so.9.0.176
-rwxr-xr-x 1 root root 2.9M 3月 8 22:02 libnppitc.so.9.0.176
-rw-r--r-- 1 root root 3.9M 3月 8 22:02 libnppitc_static.a
lrwxrwxrwx 1 root root 14 3月 8 22:02 libnpps.so -> libnpps.so.9.0
lrwxrwxrwx 1 root root 18 3月 8 22:02 libnpps.so.9.0 -> libnpps.so.9.0.176
-rwxr-xr-x 1 root root 8.9M 3月 8 22:02 libnpps.so.9.0.176
-rw-r--r-- 1 root root 12M 3月 8 22:02 libnpps_static.a
lrwxrwxrwx 1 root root 16 3月 8 22:02 libnvblas.so -> libnvblas.so.9.0
lrwxrwxrwx 1 root root 20 3月 8 22:02 libnvblas.so.9.0 -> libnvblas.so.9.0.176
-rwxr-xr-x 1 root root 519K 3月 8 22:02 libnvblas.so.9.0.176
lrwxrwxrwx 1 root root 17 3月 8 22:02 libnvgraph.so -> libnvgraph.so.9.0
lrwxrwxrwx 1 root root 21 3月 8 22:02 libnvgraph.so.9.0 -> libnvgraph.so.9.0.176
-rwxr-xr-x 1 root root 23M 3月 8 22:02 libnvgraph.so.9.0.176
-rw-r--r-- 1 root root 53M 3月 8 22:02 libnvgraph_static.a
lrwxrwxrwx 1 root root 24 3月 8 22:02 libnvrtc-builtins.so -> libnvrtc-builtins.so.9.0
lrwxrwxrwx 1 root root 28 3月 8 22:02 libnvrtc-builtins.so.9.0 -> libnvrtc-builtins.so.9.0.176
-rwxr-xr-x 1 root root 3.2M 3月 8 22:02 libnvrtc-builtins.so.9.0.176
lrwxrwxrwx 1 root root 15 3月 8 22:02 libnvrtc.so -> libnvrtc.so.9.0
lrwxrwxrwx 1 root root 19 3月 8 22:02 libnvrtc.so.9.0 -> libnvrtc.so.9.0.176
-rwxr-xr-x 1 root root 22M 3月 8 22:02 libnvrtc.so.9.0.176
lrwxrwxrwx 1 root root 18 3月 8 22:02 libnvToolsExt.so -> libnvToolsExt.so.1
lrwxrwxrwx 1 root root 22 3月 8 22:02 libnvToolsExt.so.1 -> libnvToolsExt.so.1.0.0
-rwxr-xr-x 1 root root 37K 3月 8 22:02 libnvToolsExt.so.1.0.0
lrwxrwxrwx 1 root root 14 3月 8 22:02 libOpenCL.so -> libOpenCL.so.1
lrwxrwxrwx 1 root root 16 3月 8 22:02 libOpenCL.so.1 -> libOpenCL.so.1.0
lrwxrwxrwx 1 root root 18 3月 8 22:02 libOpenCL.so.1.0 -> libOpenCL.so.1.0.0
-rw-r--r-- 1 root root 26K 3月 8 22:02 libOpenCL.so.1.0.0
drwxr-xr-x 2 root root 4.0K 3月 8 22:02 stubs
Do these run if you use the following? (The variable needs to be set on its own line; as a one-shot env prefix it would not expand in the --ignore arguments on the same command line.)
CAFFE2_PYPATH=/home/jyy/anaconda3/envs/caffe/lib/python2.7/site-packages/caffe2/python
python \
-m pytest \
-x \
-v \
-s \
--ignore "$CAFFE2_PYPATH/test/executor_test.py" \
--ignore "$CAFFE2_PYPATH/operator_test/matmul_op_test.py" \
--ignore "$CAFFE2_PYPATH/operator_test/pack_ops_test.py" \
--ignore "$CAFFE2_PYPATH/mkl/mkl_sbn_speed_test.py" \
"$CAFFE2_PYPATH"
I have the same problem as @FduJyy on Detectron.
When I run the tests you suggested, I get this:
============================= test session starts =============================
platform linux2 -- Python 2.7.14, pytest-3.5.0, py-1.5.3, pluggy-0.6.0 -- /home/joaofayad/anaconda3/envs/detectron/bin/python
cachedir: .pytest_cache
rootdir: /home/joaofayad/detectron, inifile:
collected 34 items
lib/core/test_engine.py::test_net_on_dataset ERROR [ 2%]
=================================== ERRORS ====================================
____ ERROR at setup of test_net_on_dataset ____
file /home/joaofayad/detectron/lib/core/test_engine.py, line 126
def test_net_on_dataset(
E fixture 'weights_file' not found
> available fixtures: cache, capfd, capfdbinary, caplog, capsys, capsysbinary, doctest_namespace, monkeypatch, pytestconfig, record_property, record_xml_attribute, record_xml_property, recwarn, tmpdir, tmpdir_factory
> use 'pytest --fixtures [testpath]' for help on them.
@pjh5 @joaofayad Sorry for the late reply. Thanks to @NovenBae's advice, I solved this problem by compiling Caffe2 from source with conda build (following the official website). Now everything is OK and Detectron runs well.
@pjh5 @joaofayad @FduJyy I am having the same problem. I installed my Caffe2 from the pre-built binaries. Everything works and the GPU test returns 1, but when I run the Detectron installation test I hit the same FAILED (failures=1, errors=1) error. I used CUDA 8 and cuDNN 7 for the installation. When I ran pjh5's tests, they failed with an ImportError. I am using an Azure DSVM, and the X2Go interface will not let me copy-paste, so I took a screenshot instead. Now I am going to try reinstalling Caffe2 from the main website (build from source). @FduJyy, is that what you meant? Without using conda install -c caffe2 caffe2-cuda8.0-cudnn7 and instead using a list of pip install commands? Thank you!
@FduJyy I ran into exactly the same problem and was able to solve it with your method. Thanks!