NVIDIA / caffe

Caffe: a fast open framework for deep learning.
http://caffe.berkeleyvision.org/

Caffe 0.17.2 compilation still fails with the same NCCL errors #545

Status: Closed (becauseofAI closed this issue 5 years ago)

becauseofAI commented 5 years ago

System details: Ubuntu 16.04.5, CUDA 9.0, NCCL libnccl2 (2.3.7), Caffe 0.17.2

Error details:

caffe-0.17.2$ make clean
caffe-0.17.2$ make all -j32
PROTOC src/caffe/proto/caffe.proto
CXX src/caffe/internal_thread.cpp
CXX src/caffe/parallel.cpp
CXX src/caffe/layer_factory.cpp
CXX src/caffe/solvers/rmsprop_solver.cpp
CXX src/caffe/solvers/sgd_solver.cpp
CXX src/caffe/solvers/adadelta_solver.cpp
CXX src/caffe/solvers/adam_solver.cpp
CXX src/caffe/solvers/sag_solver.cpp
CXX src/caffe/solvers/adagrad_solver.cpp
CXX src/caffe/solvers/nesterov_solver.cpp
CXX src/caffe/common.cpp
CXX src/caffe/blob.cpp
CXX src/caffe/util/detectnet_coverage_rectangular.cpp
CXX src/caffe/util/insert_splits.cpp
CXX src/caffe/util/im2col.cpp
CXX src/caffe/util/hdf5.cpp
CXX src/caffe/util/im_transforms.cpp
CXX src/caffe/util/blocking_queue.cpp
CXX src/caffe/util/io.cpp
CXX src/caffe/util/benchmark.cpp
CXX src/caffe/util/sampler.cpp
CXX src/caffe/util/bbox_util.cpp
CXX src/caffe/util/upgrade_proto.cpp
CXX src/caffe/util/db_lmdb.cpp
CXX src/caffe/util/db_leveldb.cpp
CXX src/caffe/util/cudnn.cpp
CXX src/caffe/util/signal_handler.cpp
CXX src/caffe/util/db.cpp
CXX src/caffe/util/math_functions.cpp
CXX src/caffe/util/gpu_memory.cpp
CXX src/caffe/solver.cpp
CXX src/caffe/data_transformer.cpp
CXX src/caffe/type.cpp
CXX src/caffe/net.cpp
CXX src/caffe/syncedmem.cpp
In file included from src/caffe/parallel.cpp:2:0:
src/caffe/parallel.cpp: In constructor ‘caffe::P2PManager::P2PManager(boost::shared_ptr<caffe::Solver>, int, int, const caffe::SolverParameter&)’:
src/caffe/parallel.cpp:11:25: error: ‘NCCL_MAJOR’ was not declared in this scope
 #define CAFFE_NCCL_VER (NCCL_MAJOR*10000 + NCCL_MINOR*100)
                         ^
src/caffe/parallel.cpp:61:17: note: in expansion of macro ‘CAFFE_NCCL_VER’
   LOG_IF(FATAL, CAFFE_NCCL_VER < 20200) << "NCCL 2.2 or higher is required";
                 ^
src/caffe/parallel.cpp:11:44: error: ‘NCCL_MINOR’ was not declared in this scope
 #define CAFFE_NCCL_VER (NCCL_MAJOR*10000 + NCCL_MINOR*100)
                                            ^
src/caffe/parallel.cpp:61:17: note: in expansion of macro ‘CAFFE_NCCL_VER’
   LOG_IF(FATAL, CAFFE_NCCL_VER < 20200) << "NCCL 2.2 or higher is required";
                 ^
In file included from ./include/caffe/parallel.hpp:23:0,
                 from ./include/caffe/caffe.hpp:13,
                 from src/caffe/parallel.cpp:6:
src/caffe/parallel.cpp: In member function ‘virtual void caffe::P2PSync::on_start(const std::vector<boost::shared_ptr<caffe::Blob> >&)’:
src/caffe/parallel.cpp:248:31: error: ‘ncclGroupStart’ was not declared in this scope
     NCCL_CHECK(ncclGroupStart());
                               ^
./include/caffe/util/nccl.hpp:10:25: note: in definition of macro ‘NCCL_CHECK’
   ncclResult_t result = condition; \
                         ^
src/caffe/parallel.cpp:255:29: error: ‘ncclGroupEnd’ was not declared in this scope
     NCCL_CHECK(ncclGroupEnd());
                             ^
./include/caffe/util/nccl.hpp:10:25: note: in definition of macro ‘NCCL_CHECK’
   ncclResult_t result = condition; \
                         ^
Makefile:610: recipe for target '.build_release/src/caffe/parallel.o' failed
make: *** [.build_release/src/caffe/parallel.o] Error 1
make: *** Waiting for unfinished jobs....
In file included from ./include/caffe/parallel.hpp:23:0,
                 from src/caffe/net.cpp:14:
src/caffe/net.cpp: In member function ‘void caffe::Net::Reduce(int)’:
src/caffe/net.cpp:955:29: error: ‘ncclGroupStart’ was not declared in this scope
   NCCL_CHECK(ncclGroupStart());
                             ^
./include/caffe/util/nccl.hpp:10:25: note: in definition of macro ‘NCCL_CHECK’
   ncclResult_t result = condition; \
                         ^
src/caffe/net.cpp:957:27: error: ‘ncclGroupEnd’ was not declared in this scope
   NCCL_CHECK(ncclGroupEnd());
                           ^
./include/caffe/util/nccl.hpp:10:25: note: in definition of macro ‘NCCL_CHECK’
   ncclResult_t result = condition; \
                         ^
src/caffe/net.cpp: In member function ‘void caffe::Net::ReduceBucket(size_t, caffe::Type, void*)’:
src/caffe/net.cpp:969:29: error: ‘ncclGroupStart’ was not declared in this scope
   NCCL_CHECK(ncclGroupStart());
                             ^
./include/caffe/util/nccl.hpp:10:25: note: in definition of macro ‘NCCL_CHECK’
   ncclResult_t result = condition; \
                         ^
src/caffe/net.cpp:971:27: error: ‘ncclGroupEnd’ was not declared in this scope
   NCCL_CHECK(ncclGroupEnd());
                           ^
./include/caffe/util/nccl.hpp:10:25: note: in definition of macro ‘NCCL_CHECK’
   ncclResult_t result = condition; \
                         ^
Makefile:610: recipe for target '.build_release/src/caffe/net.o' failed
make: *** [.build_release/src/caffe/net.o] Error 1

This error was fixed in issue #540, so why does the build still hit the same NCCL errors? Any help would be appreciated.
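The `NCCL_MAJOR` and `NCCL_MINOR` errors above suggest that the compiler may be reading an `nccl.h` that does not define those version macros, which NCCL 2.x headers do. A minimal sketch for checking each candidate header follows. `header_has_version_macros` is a hypothetical helper, and the listed paths are assumptions about where a header might live on a system like this one.

```shell
# header_has_version_macros is a hypothetical helper, not part of Caffe or NCCL.
# It succeeds only if the header "$1" defines both NCCL_MAJOR and NCCL_MINOR,
# the two macros that CAFFE_NCCL_VER expands in src/caffe/parallel.cpp.
header_has_version_macros() {
  grep -Eq '^[[:space:]]*#[[:space:]]*define[[:space:]]+NCCL_MAJOR' "$1" &&
    grep -Eq '^[[:space:]]*#[[:space:]]*define[[:space:]]+NCCL_MINOR' "$1"
}

# Candidate locations are assumptions; adjust them for your system.
for header in /usr/include/nccl.h \
              /usr/local/cuda/include/nccl.h \
              /usr/local/include/nccl.h; do
  [ -f "$header" ] || continue
  if header_has_version_macros "$header"; then
    echo "$header: defines NCCL_MAJOR and NCCL_MINOR"
  else
    echo "$header: no version macros found"
  fi
done
```

A header reported without version macros would be consistent with the errors in the log, although this check alone does not show which header the build actually includes.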

drnikolaev commented 5 years ago

Hi @becauseofAI, this is strange. Did you install NCCL using the DEB package? Does /usr/include/nccl.h exist? If nothing works, could you try CMake?

becauseofAI commented 5 years ago

I installed NCCL as follows:

sudo dpkg -i nvidia-machine-learning-repo-ubuntu1604_1.0.0-1_amd64.deb

or

sudo dpkg -i nccl-repo-ubuntu1604-2.3.7-ga-cuda9.0_1-1_amd64.deb

then

sudo apt update
sudo apt install libnccl2=2.3.7-1+cuda9.0 libnccl-dev=2.3.7-1+cuda9.0

The installation succeeded. Details:

Reading package lists... Done
Building dependency tree
Reading state information... Done
The following packages will be upgraded:
  libnccl-dev libnccl2
2 upgraded, 0 newly installed, 0 to remove and 70 not upgraded.
Need to get 0 B/51.8 MB of archives.
After this operation, 17.4 kB of additional disk space will be used.
Get:1 file:/var/nccl-repo-2.3.7-ga-cuda9.0  libnccl-dev 2.3.7-1+cuda9.0 [22.6 MB]
Get:2 file:/var/nccl-repo-2.3.7-ga-cuda9.0  libnccl2 2.3.7-1+cuda9.0 [29.1 MB]
(Reading database ... 262931 files and directories currently installed.)
Preparing to unpack .../libnccl-dev_2.3.7-1+cuda9.0_amd64.deb ...
Unpacking libnccl-dev (2.3.7-1+cuda9.0) over (2.3.5-2+cuda9.0) ...
Preparing to unpack .../libnccl2_2.3.7-1+cuda9.0_amd64.deb ...
Unpacking libnccl2 (2.3.7-1+cuda9.0) over (2.3.5-2+cuda9.0) ...
Processing triggers for libc-bin (2.23-0ubuntu10) ...
Setting up libnccl2 (2.3.7-1+cuda9.0) ...
Setting up libnccl-dev (2.3.7-1+cuda9.0) ...
Processing triggers for libc-bin (2.23-0ubuntu10) ...

And /usr/include/nccl.h exists:

ls /usr/include/nccl*
/usr/include/nccl.h

I also tried CMake:

caffe-0.17.2$ mkdir build
caffe-0.17.2$ cd build
caffe-0.17.2/build$ cmake ..
caffe-0.17.2/build$ make -j32

But the NCCL errors persist. Details as follows:

... ...
[ 44%] Building CXX object src/caffe/CMakeFiles/caffe.dir/util/sampler.cpp.o
In file included from /home/dev/caffe-0.17.2/include/caffe/parallel.hpp:23:0,
                 from /home/dev/caffe-0.17.2/include/caffe/caffe.hpp:13,
                 from /home/dev/caffe-0.17.2/src/caffe/parallel.cpp:6:
/home/dev/caffe-0.17.2/src/caffe/parallel.cpp: In member function ‘virtual void caffe::P2PSync::on_start(const std::vector<boost::shared_ptr<caffe::Blob> >&)’:
/home/dev/caffe-0.17.2/src/caffe/parallel.cpp:252:31: error: ‘ncclGroupStart’ was not declared in this scope
     NCCL_CHECK(ncclGroupStart());
                               ^
/home/dev/caffe-0.17.2/include/caffe/util/nccl.hpp:10:25: note: in definition of macro ‘NCCL_CHECK’
   ncclResult_t result = condition; \
                         ^
/home/dev/caffe-0.17.2/src/caffe/parallel.cpp:259:29: error: ‘ncclGroupEnd’ was not declared in this scope
     NCCL_CHECK(ncclGroupEnd());
                             ^
/home/dev/caffe-0.17.2/include/caffe/util/nccl.hpp:10:25: note: in definition of macro ‘NCCL_CHECK’
   ncclResult_t result = condition; \
                         ^
[ 44%] Building CXX object src/caffe/CMakeFiles/caffe.dir/type.cpp.o
[ 45%] Building CXX object src/caffe/CMakeFiles/caffe.dir/net.cpp.o
src/caffe/CMakeFiles/caffe.dir/build.make:625: recipe for target 'src/caffe/CMakeFiles/caffe.dir/parallel.cpp.o' failed
make[2]: *** [src/caffe/CMakeFiles/caffe.dir/parallel.cpp.o] Error 1
make[2]: *** Waiting for unfinished jobs....
In file included from /home/dev/caffe-0.17.2/include/caffe/parallel.hpp:23:0,
                 from /home/dev/caffe-0.17.2/src/caffe/net.cpp:14:
/home/dev/caffe-0.17.2/src/caffe/net.cpp: In member function ‘void caffe::Net::Reduce(int)’:
/home/dev/caffe-0.17.2/src/caffe/net.cpp:955:29: error: ‘ncclGroupStart’ was not declared in this scope
   NCCL_CHECK(ncclGroupStart());
                             ^
/home/dev/caffe-0.17.2/include/caffe/util/nccl.hpp:10:25: note: in definition of macro ‘NCCL_CHECK’
   ncclResult_t result = condition; \
                         ^
/home/dev/caffe-0.17.2/src/caffe/net.cpp:957:27: error: ‘ncclGroupEnd’ was not declared in this scope
   NCCL_CHECK(ncclGroupEnd());
                           ^
/home/dev/caffe-0.17.2/include/caffe/util/nccl.hpp:10:25: note: in definition of macro ‘NCCL_CHECK’
   ncclResult_t result = condition; \
                         ^
/home/dev/caffe-0.17.2/src/caffe/net.cpp: In member function ‘void caffe::Net::ReduceBucket(size_t, caffe::Type, void*)’:
/home/dev/caffe-0.17.2/src/caffe/net.cpp:969:29: error: ‘ncclGroupStart’ was not declared in this scope
   NCCL_CHECK(ncclGroupStart());
                             ^
/home/dev/caffe-0.17.2/include/caffe/util/nccl.hpp:10:25: note: in definition of macro ‘NCCL_CHECK’
   ncclResult_t result = condition; \
                         ^
/home/dev/caffe-0.17.2/src/caffe/net.cpp:971:27: error: ‘ncclGroupEnd’ was not declared in this scope
   NCCL_CHECK(ncclGroupEnd());
                           ^
/home/dev/caffe-0.17.2/include/caffe/util/nccl.hpp:10:25: note: in definition of macro ‘NCCL_CHECK’
   ncclResult_t result = condition; \
                         ^
src/caffe/CMakeFiles/caffe.dir/build.make:1393: recipe for target 'src/caffe/CMakeFiles/caffe.dir/net.cpp.o' failed
make[2]: *** [src/caffe/CMakeFiles/caffe.dir/net.cpp.o] Error 1
CMakeFiles/Makefile2:272: recipe for target 'src/caffe/CMakeFiles/caffe.dir/all' failed
make[1]: *** [src/caffe/CMakeFiles/caffe.dir/all] Error 2
Makefile:127: recipe for target 'all' failed
make: *** [all] Error 2

So what could the problem be now? @drnikolaev

drnikolaev commented 5 years ago

@becauseofAI could you please paste here the output you get from the cmake .. command? I tried to reproduce this issue but couldn't so far.

Also, could you run and compare:

$ grep 'ncclResult_t ncclGroupStart' /usr/include/nccl.h 
ncclResult_t ncclGroupStart();
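The grep above confirms that /usr/include/nccl.h declares the function, but not that the compiler reads that file. One way to see which `nccl.h` is actually included is to run the preprocessor and read its line markers. This sketch assumes g++ is the compiler and that the build passes `-I/usr/local/cuda/include`; `nccl_headers_from_markers` is a hypothetical helper name.

```shell
# nccl_headers_from_markers is a hypothetical helper.
# It reads preprocessor output on stdin and prints the unique paths of any
# nccl.h named in line markers such as: # 1 "/usr/include/nccl.h" 1 3 4
nccl_headers_from_markers() {
  grep -E '^# [0-9]+ "[^"]*nccl\.h"' |
    sed -E 's/^# [0-9]+ "([^"]*)".*/\1/' |
    sort -u
}

# The -I path is an assumption; use the include directories of your own build.
if command -v g++ >/dev/null 2>&1; then
  echo '#include <nccl.h>' |
    g++ -E -x c++ -I/usr/local/cuda/include - 2>/dev/null |
    nccl_headers_from_markers || true
fi
```

Directories given with `-I` are searched before the standard system directories, so a header in the CUDA include directory would take precedence over /usr/include/nccl.h in this setup.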

becauseofAI commented 5 years ago

@drnikolaev Thank you for your patience.

The output of the cmake .. command is as follows:

caffe-0.17.2/build$ cmake ..
-- The C compiler identification is GNU 5.4.0
-- The CXX compiler identification is GNU 5.4.0
-- Check for working C compiler: /usr/bin/cc
-- Check for working C compiler: /usr/bin/cc -- works
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Detecting C compile features
-- Detecting C compile features - done
-- Check for working CXX compiler: /usr/bin/c++
-- Check for working CXX compiler: /usr/bin/c++ -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Looking for pthread.h
-- Looking for pthread.h - found
-- Looking for pthread_create
-- Looking for pthread_create - not found
-- Looking for pthread_create in pthreads
-- Looking for pthread_create in pthreads - not found
-- Looking for pthread_create in pthread
-- Looking for pthread_create in pthread - found
-- Found Threads: TRUE
-- Boost version: 1.58.0
-- Found the following Boost libraries:
--   system
--   thread
--   filesystem
--   regex
--   chrono
--   date_time
--   atomic
-- Found GFlags: /usr/include
-- Found gflags  (include: /usr/include, library: /usr/lib/x86_64-linux-gnu/libgflags.so)
-- Found Glog: /usr/include
-- Found glog    (include: /usr/include, library: /usr/lib/x86_64-linux-gnu/libglog.so)
-- Found Protobuf: /usr/local/lib/libprotobuf.a
-- Found PROTOBUF Compiler: /usr/local/bin/protoc
-- Found HDF5: /usr/lib/x86_64-linux-gnu/hdf5/serial/lib/libhdf5_hl.so;/usr/lib/x86_64-linux-gnu/hdf5/serial/lib/libhdf5.so;/usr/lib/x86_64-linux-gnu/libpthread.so;/usr/lib/x86_64-linux-gnu/libsz.so;/usr/lib/x86_64-linux-gnu/libz.so;/usr/lib/x86_64-linux-gnu/libdl.so;/usr/lib/x86_64-linux-gnu/libm.so (found version "1.8.16")
-- Found LMDB: /usr/include
-- Found lmdb    (include: /usr/include, library: /usr/lib/x86_64-linux-gnu/liblmdb.so)
-- Found LevelDB: /usr/include
-- Found LevelDB (include: /usr/include, library: /usr/lib/x86_64-linux-gnu/libleveldb.so)
-- Found Snappy: /usr/include
-- Found Snappy  (include: /usr/include, library: /usr/lib/x86_64-linux-gnu/libsnappy.so)
-- Found JPEGTurbo: /usr/include
-- Found JPEGTurbo  (include: /usr/include, library: /usr/lib/x86_64-linux-gnu/libturbojpeg.so.0)
-- CUDA detected: 9.0
-- Found CUDNN: /usr/local/cuda/lib64/libcudnn.so (found version "7.2")
-- Added CUDA NVCC flags for: sm_70
-- Found OpenCV 2.x: /usr/share/OpenCV
-- Found OpenBLAS libraries: /usr/lib/libopenblas.so
-- Found OpenBLAS include: /usr/include
-- Found PythonInterp: /usr/bin/python2 (found suitable version "2.7.12", minimum required is "2")
-- Found Boost Python Library /usr/lib/x86_64-linux-gnu/libboost_python-py27.so
-- Found PythonLibs: /usr/lib/x86_64-linux-gnu/libpython2.7.so (found suitable version "2.7.12", minimum required is "2")
-- Found NumPy: /home/dev/.local/lib/python2.7/site-packages/numpy/core/include (found suitable version "1.15.1", minimum required is "1.7.1")
-- NumPy ver. 1.15.1 found (include: /home/dev/.local/lib/python2.7/site-packages/numpy/core/include)
-- Could NOT find Doxygen (missing:  DOXYGEN_EXECUTABLE)
-- Found NCCL: /usr/local/cuda/include
-- Found NCCL (include: /usr/local/cuda/include, library: /usr/local/cuda/lib/libnccl.so)
-- Found NVML: /usr/local/cuda/include
-- Found NVML (include: /usr/local/cuda/include, library: /usr/lib/x86_64-linux-gnu/libnvidia-ml.so)
-- Found Git: /usr/bin/git (found version "2.7.4")
--
-- ******************* Caffe Configuration Summary *******************
-- General:
--   Version           :   0.17.2
--   Git               :   unknown
--   System            :   Linux
--   C++ compiler      :   /usr/bin/c++
--   Release CXX flags :   -O3 -DNDEBUG -fPIC -Wall -std=c++11 -Wno-sign-compare -Wno-uninitialized
--   Debug CXX flags   :   -g -DDEBUG -fPIC -Wall -std=c++11 -Wno-sign-compare -Wno-uninitialized
--   Build type        :   Release
--
--   BUILD_SHARED_LIBS :   ON
--   BUILD_python      :   ON
--   BUILD_matlab      :   OFF
--   BUILD_docs        :   ON
--   USE_LEVELDB       :   ON
--   USE_LMDB          :   ON
--   TEST_FP16         :   OFF
--
-- Dependencies:
--   BLAS              :   Yes (Open)
--   Boost             :   Yes (ver. 1.58)
--   glog              :   Yes
--   gflags            :   Yes
--   protobuf          :   Yes (ver. 3.5.0)
--   lmdb              :   Yes (ver. 0.9.17)
--   LevelDB           :   Yes (ver. 1.18)
--   Snappy            :   Yes (ver. 1.1.3)
--   OpenCV            :   Yes (ver. 2.4.9.1)
--   JPEGTurbo         :   Yes
--   CUDA              :   Yes (ver. 9.0)
--
-- NVIDIA CUDA:
--   Target GPU(s)     :   Auto
--   GPU arch(s)       :   sm_70
--   cuDNN             :   Yes (ver. 7.2)
--   NCCL              :   Yes (ver. ..)
--   USE_MPI           :   OFF
--   NVML              :   /usr/lib/x86_64-linux-gnu/libnvidia-ml.so
--
-- Python:
--   Interpreter       :   /usr/bin/python2 (ver. 2.7.12)
--   Libraries         :   /usr/lib/x86_64-linux-gnu/libpython2.7.so (ver 2.7.12)
--   NumPy             :   /home/dev/.local/lib/python2.7/site-packages/numpy/core/include (ver 1.15.1)
--
-- Documentaion:
--   Doxygen           :   No
--   config_file       :
--
-- Install:
--   Install path      :   /home/dev/caffe-0.17.2/build/install
--
-- Configuring done
-- Generating done
-- Build files have been written to: /home/dev/caffe-0.17.2/build

I also checked for ncclResult_t and ncclGroupStart; the output is the same as yours:

~$ grep 'ncclResult_t ncclGroupStart' /usr/include/nccl.h
ncclResult_t ncclGroupStart();

drnikolaev commented 5 years ago

@becauseofAI it seems like you have another NCCL installed here:

-- Found NCCL: /usr/local/cuda/include
-- Found NCCL (include: /usr/local/cuda/include, library: /usr/local/cuda/lib/libnccl.so)

So you might need to clean it up first.
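To locate every NCCL copy before cleaning up, a sketch along these lines lists candidate files and asks dpkg which package, if any, owns each one. The directory list is an assumption based on the paths seen in this thread, and `list_nccl_files` is a hypothetical helper.

```shell
# list_nccl_files is a hypothetical helper.
# It prints nccl.h and libnccl* files found directly in the given directories,
# silently skipping directories that do not exist.
list_nccl_files() {
  for dir in "$@"; do
    [ -d "$dir" ] || continue
    find "$dir" -maxdepth 1 \( -name 'nccl.h' -o -name 'libnccl*' \) 2>/dev/null
  done
}

# The directories below are assumptions; extend the list for your system.
list_nccl_files /usr/include /usr/lib/x86_64-linux-gnu \
                /usr/local/cuda/include /usr/local/cuda/lib /usr/local/cuda/lib64 |
while read -r file; do
  # dpkg -S reports the owning package, or fails if no package owns the path.
  if command -v dpkg >/dev/null 2>&1 && owner=$(dpkg -S "$file" 2>/dev/null); then
    echo "$file (package: ${owner%%:*})"
  else
    echo "$file (no owning package found)"
  fi
done
```

Files with no owning package were probably copied in by hand and are the likeliest candidates for a stale installation.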

becauseofAI commented 5 years ago

@drnikolaev I cleaned it up as you said. Now it works with CMake, but it still fails with make using Makefile.config. Details of the successful CMake build:

caffe-0.17.2$ make clean
caffe-0.17.2$ mkdir build
caffe-0.17.2$ cd build/
caffe-0.17.2/build$ cmake ..
-- The C compiler identification is GNU 5.4.0
-- The CXX compiler identification is GNU 5.4.0
-- Check for working C compiler: /usr/bin/cc
-- Check for working C compiler: /usr/bin/cc -- works
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Detecting C compile features
-- Detecting C compile features - done
-- Check for working CXX compiler: /usr/bin/c++
-- Check for working CXX compiler: /usr/bin/c++ -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Looking for pthread.h
-- Looking for pthread.h - found
-- Looking for pthread_create
-- Looking for pthread_create - not found
-- Looking for pthread_create in pthreads
-- Looking for pthread_create in pthreads - not found
-- Looking for pthread_create in pthread
-- Looking for pthread_create in pthread - found
-- Found Threads: TRUE
-- Boost version: 1.58.0
-- Found the following Boost libraries:
--   system
--   thread
--   filesystem
--   regex
--   chrono
--   date_time
--   atomic
-- Found GFlags: /usr/include
-- Found gflags  (include: /usr/include, library: /usr/lib/x86_64-linux-gnu/libgflags.so)
-- Found Glog: /usr/include
-- Found glog    (include: /usr/include, library: /usr/lib/x86_64-linux-gnu/libglog.so)
-- Found Protobuf: /usr/local/lib/libprotobuf.a
-- Found PROTOBUF Compiler: /usr/local/bin/protoc
-- Found HDF5: /usr/lib/x86_64-linux-gnu/hdf5/serial/lib/libhdf5_hl.so;/usr/lib/x86_64-linux-gnu/hdf5/serial/lib/libhdf5.so;/usr/lib/x86_64-linux-gnu/libpthread.so;/usr/lib/x86_64-linux-gnu/libsz.so;/usr/lib/x86_64-linux-gnu/libz.so;/usr/lib/x86_64-linux-gnu/libdl.so;/usr/lib/x86_64-linux-gnu/libm.so (found version "1.8.16")
-- Found LMDB: /usr/include
-- Found lmdb    (include: /usr/include, library: /usr/lib/x86_64-linux-gnu/liblmdb.so)
-- Found LevelDB: /usr/include
-- Found LevelDB (include: /usr/include, library: /usr/lib/x86_64-linux-gnu/libleveldb.so)
-- Found Snappy: /usr/include
-- Found Snappy  (include: /usr/include, library: /usr/lib/x86_64-linux-gnu/libsnappy.so)
-- Found JPEGTurbo: /usr/include
-- Found JPEGTurbo  (include: /usr/include, library: /usr/lib/x86_64-linux-gnu/libturbojpeg.so.0)
-- CUDA detected: 9.0
-- Found CUDNN: /usr/local/cuda/lib64/libcudnn.so (found version "7.2")
-- Added CUDA NVCC flags for: sm_70
-- Found OpenCV 2.x: /usr/share/OpenCV
-- Found OpenBLAS libraries: /usr/lib/libopenblas.so
-- Found OpenBLAS include: /usr/include
-- Found PythonInterp: /usr/bin/python2 (found suitable version "2.7.12", minimum required is "2")
-- Found Boost Python Library /usr/lib/x86_64-linux-gnu/libboost_python-py27.so
-- Found PythonLibs: /usr/lib/x86_64-linux-gnu/libpython2.7.so (found suitable version "2.7.12", minimum required is "2")
-- Found NumPy: /home/dev/.local/lib/python2.7/site-packages/numpy/core/include (found suitable version "1.15.1", minimum required is "1.7.1")
-- NumPy ver. 1.15.1 found (include: /home/dev/.local/lib/python2.7/site-packages/numpy/core/include)
-- Could NOT find Doxygen (missing:  DOXYGEN_EXECUTABLE)
-- Found NCCL: /usr/include
-- Found NCCL (include: /usr/include, library: /usr/lib/x86_64-linux-gnu/libnccl.so)
-- Found NVML: /usr/local/cuda/include
-- Found NVML (include: /usr/local/cuda/include, library: /usr/lib/x86_64-linux-gnu/libnvidia-ml.so)
-- Found Git: /usr/bin/git (found version "2.7.4")
--
-- ******************* Caffe Configuration Summary *******************
-- General:
--   Version           :   0.17.2
--   Git               :   unknown
--   System            :   Linux
--   C++ compiler      :   /usr/bin/c++
--   Release CXX flags :   -O3 -DNDEBUG -fPIC -Wall -std=c++11 -Wno-sign-compare -Wno-uninitialized
--   Debug CXX flags   :   -g -DDEBUG -fPIC -Wall -std=c++11 -Wno-sign-compare -Wno-uninitialized
--   Build type        :   Release
--
--   BUILD_SHARED_LIBS :   ON
--   BUILD_python      :   ON
--   BUILD_matlab      :   OFF
--   BUILD_docs        :   ON
--   USE_LEVELDB       :   ON
--   USE_LMDB          :   ON
--   TEST_FP16         :   OFF
--
-- Dependencies:
--   BLAS              :   Yes (Open)
--   Boost             :   Yes (ver. 1.58)
--   glog              :   Yes
--   gflags            :   Yes
--   protobuf          :   Yes (ver. 3.5.0)
--   lmdb              :   Yes (ver. 0.9.17)
--   LevelDB           :   Yes (ver. 1.18)
--   Snappy            :   Yes (ver. 1.1.3)
--   OpenCV            :   Yes (ver. 2.4.9.1)
--   JPEGTurbo         :   Yes
--   CUDA              :   Yes (ver. 9.0)
--
-- NVIDIA CUDA:
--   Target GPU(s)     :   Auto
--   GPU arch(s)       :   sm_70
--   cuDNN             :   Yes (ver. 7.2)
--   NCCL              :   Yes (ver. 2.3.7)
--   USE_MPI           :   OFF
--   NVML              :   /usr/lib/x86_64-linux-gnu/libnvidia-ml.so
--
-- Python:
--   Interpreter       :   /usr/bin/python2 (ver. 2.7.12)
--   Libraries         :   /usr/lib/x86_64-linux-gnu/libpython2.7.so (ver 2.7.12)
--   NumPy             :   /home/dev/.local/lib/python2.7/site-packages/numpy/core/include (ver 1.15.1)
--
-- Documentaion:
--   Doxygen           :   No
--   config_file       :
--
-- Install:
--   Install path      :   /home/dev/caffe-0.17.2/build/install
--
-- Configuring done
-- Generating done
-- Build files have been written to: /home/dev/caffe-0.17.2/build
caffe-0.17.2/build$ make all -j64
[  0%] Running C++/Python protocol buffer compiler on /home/wangyang/workspace/framework/caffe-0.17.2/src/caffe/proto/caffe.proto
Scanning dependencies of target proto
[  0%] Building CXX object src/caffe/CMakeFiles/proto.dir/__/__/include/caffe/proto/caffe.pb.cc.o
[  1%] Linking CXX static library ../../lib/libproto.a
[  1%] Built target proto
[  1%] Building NVCC (Device) object src/caffe/CMakeFiles/cuda_compile.dir/solvers/cuda_compile_generated_nesterov_solver.cu.o
[  1%] Building NVCC (Device) object src/caffe/CMakeFiles/cuda_compile.dir/solvers/cuda_compile_generated_adam_solver.cu.o
[  1%] Building NVCC (Device) object src/caffe/CMakeFiles/cuda_compile.dir/solvers/cuda_compile_generated_adadelta_solver.cu.o
[  1%] Building NVCC (Device) object src/caffe/CMakeFiles/cuda_compile.dir/util/cuda_compile_generated_gpu_amax.cu.o
[  1%] Building NVCC (Device) object src/caffe/CMakeFiles/cuda_compile.dir/util/cuda_compile_generated_gpu_asum.cu.o
[  1%] Building NVCC (Device) object src/caffe/CMakeFiles/cuda_compile.dir/solvers/cuda_compile_generated_sgd_solver.cu.o
[  1%] Building NVCC (Device) object src/caffe/CMakeFiles/cuda_compile.dir/layers/cuda_compile_generated_permute_layer.cu.o
[  1%] Building NVCC (Device) object src/caffe/CMakeFiles/cuda_compile.dir/solvers/cuda_compile_generated_sag_solver.cu.o
[  2%] Building NVCC (Device) object src/caffe/CMakeFiles/cuda_compile.dir/util/cuda_compile_generated_math_functions.cu.o
[  3%] Building NVCC (Device) object src/caffe/CMakeFiles/cuda_compile.dir/solvers/cuda_compile_generated_adagrad_solver.cu.o
... ...
[100%] Linking CXX executable cpp_classification/classification
[100%] Built target upgrade_net_proto_text
[100%] Built target upgrade_solver_proto_text
[100%] Linking CXX executable extract_features
[100%] Built target convert_annoset
[100%] Built target ssd_detect
[100%] Built target classification
[100%] Built target extract_features
[100%] Linking CXX executable caffe
[100%] Built target caffe.bin
[100%] Linking CXX shared library ../lib/_caffe.so
Creating symlink /home/dev/caffe-0.17.2/python/caffe/_caffe.so -> /home/dev/caffe-0.17.2/build/lib/_caffe.so
[100%] Built target pycaffe

Details of the failed make build using Makefile.config:

caffe-0.17.2$ make clean
caffe-0.17.2$ make all -j32
PROTOC src/caffe/proto/caffe.proto
CXX src/caffe/internal_thread.cpp
CXX src/caffe/parallel.cpp
CXX src/caffe/layer_factory.cpp
... ...
NVCC src/caffe/layers/log_layer.cu
NVCC src/caffe/layers/permute_layer.cu
CXX tools/finetune_net.cpp
CXX tools/create_label_map.cpp
CXX tools/convert_imageset.cpp
CXX tools/upgrade_solver_proto_text.cpp
CXX tools/extract_features.cpp
CXX tools/compute_image_mean.cpp
CXX tools/upgrade_net_proto_text.cpp
CXX tools/test_net.cpp
CXX tools/caffe.cpp
CXX tools/train_net.cpp
CXX tools/net_speed_benchmark.cpp
CXX tools/convert_annoset.cpp
CXX tools/get_image_size.cpp
CXX tools/device_query.cpp
CXX tools/upgrade_net_proto_binary.cpp
CXX examples/siamese/convert_mnist_siamese_data.cpp
CXX examples/mnist/convert_mnist_data.cpp
CXX examples/ssd/ssd_detect.cpp
CXX examples/cpp_classification/classification.cpp
CXX examples/cifar10/convert_cifar_data.cpp
CXX .build_release/src/caffe/proto/caffe.pb.cc
AR -o .build_release/lib/libcaffe-nv.a
LD -o .build_release/lib/libcaffe-nv.so.0.17.2
CXX/LD -o .build_release/tools/finetune_net.bin
CXX/LD -o .build_release/tools/create_label_map.bin
CXX/LD -o .build_release/tools/convert_imageset.bin
CXX/LD -o .build_release/tools/upgrade_solver_proto_text.bin
CXX/LD -o .build_release/tools/extract_features.bin
CXX/LD -o .build_release/tools/compute_image_mean.bin
CXX/LD -o .build_release/tools/upgrade_net_proto_text.bin
CXX/LD -o .build_release/tools/test_net.bin
CXX/LD -o .build_release/tools/caffe.bin
CXX/LD -o .build_release/tools/train_net.bin
CXX/LD -o .build_release/tools/convert_annoset.bin
CXX/LD -o .build_release/tools/net_speed_benchmark.bin
CXX/LD -o .build_release/tools/get_image_size.bin
CXX/LD -o .build_release/tools/device_query.bin
CXX/LD -o .build_release/tools/upgrade_net_proto_binary.bin
CXX/LD -o .build_release/examples/siamese/convert_mnist_siamese_data.bin
CXX/LD -o .build_release/examples/mnist/convert_mnist_data.bin
CXX/LD -o .build_release/examples/ssd/ssd_detect.bin
CXX/LD -o .build_release/examples/cpp_classification/classification.bin
CXX/LD -o .build_release/examples/cifar10/convert_cifar_data.bin
.build_release/lib/libcaffe-nv.so: undefined reference to `ncclGroupEnd'
.build_release/lib/libcaffe-nv.so: undefined reference to `ncclGroupStart'
collect2: error: ld returned 1 exit status
Makefile:654: recipe for target '.build_release/tools/upgrade_net_proto_text.bin' failed
make: *** [.build_release/tools/upgrade_net_proto_text.bin] Error 1
make: *** Waiting for unfinished jobs....
.build_release/lib/libcaffe-nv.so: undefined reference to `ncclGroupEnd'
.build_release/lib/libcaffe-nv.so: undefined reference to `ncclGroupStart'
collect2: error: ld returned 1 exit status
Makefile:654: recipe for target '.build_release/tools/compute_image_mean.bin' failed
make: *** [.build_release/tools/compute_image_mean.bin] Error 1
.build_release/lib/libcaffe-nv.so: undefined reference to `ncclGroupEnd'
.build_release/lib/libcaffe-nv.so: undefined reference to `ncclGroupStart'
collect2: error: ld returned 1 exit status
Makefile:654: recipe for target '.build_release/tools/upgrade_solver_proto_text.bin' failed
make: *** [.build_release/tools/upgrade_solver_proto_text.bin] Error 1
.build_release/lib/libcaffe-nv.so: undefined reference to `ncclGroupEnd'
.build_release/lib/libcaffe-nv.so: undefined reference to `ncclGroupStart'
collect2: error: ld returned 1 exit status
Makefile:654: recipe for target '.build_release/tools/create_label_map.bin' failed
make: *** [.build_release/tools/create_label_map.bin] Error 1
.build_release/lib/libcaffe-nv.so: undefined reference to `ncclGroupEnd'
.build_release/lib/libcaffe-nv.so: undefined reference to `ncclGroupStart'
collect2: error: ld returned 1 exit status
Makefile:659: recipe for target '.build_release/examples/mnist/convert_mnist_data.bin' failed
make: *** [.build_release/examples/mnist/convert_mnist_data.bin] Error 1
.build_release/lib/libcaffe-nv.so: undefined reference to `ncclGroupEnd'
.build_release/lib/libcaffe-nv.so: undefined reference to `ncclGroupStart'
collect2: error: ld returned 1 exit status
Makefile:654: recipe for target '.build_release/tools/caffe.bin' failed
make: *** [.build_release/tools/caffe.bin] Error 1
.build_release/lib/libcaffe-nv.so: undefined reference to `ncclGroupEnd'
.build_release/lib/libcaffe-nv.so: undefined reference to `ncclGroupStart'
collect2: error: ld returned 1 exit status
Makefile:654: recipe for target '.build_release/tools/upgrade_net_proto_binary.bin' failed
make: *** [.build_release/tools/upgrade_net_proto_binary.bin] Error 1
.build_release/lib/libcaffe-nv.so: undefined reference to `ncclGroupEnd'
.build_release/lib/libcaffe-nv.so: undefined reference to `ncclGroupStart'
collect2: error: ld returned 1 exit status
Makefile:654: recipe for target '.build_release/tools/convert_imageset.bin' failed
make: *** [.build_release/tools/convert_imageset.bin] Error 1
.build_release/lib/libcaffe-nv.so: undefined reference to `ncclGroupEnd'
.build_release/lib/libcaffe-nv.so: undefined reference to `ncclGroupStart'
collect2: error: ld returned 1 exit status
Makefile:654: recipe for target '.build_release/tools/extract_features.bin' failed
make: *** [.build_release/tools/extract_features.bin] Error 1
.build_release/lib/libcaffe-nv.so: undefined reference to `ncclGroupEnd'
.build_release/lib/libcaffe-nv.so: undefined reference to `ncclGroupStart'
collect2: error: ld returned 1 exit status
Makefile:659: recipe for target '.build_release/examples/siamese/convert_mnist_siamese_data.bin' failed
make: *** [.build_release/examples/siamese/convert_mnist_siamese_data.bin] Error 1
.build_release/lib/libcaffe-nv.so: undefined reference to `ncclGroupEnd'
.build_release/lib/libcaffe-nv.so: undefined reference to `ncclGroupStart'
collect2: error: ld returned 1 exit status
Makefile:654: recipe for target '.build_release/tools/get_image_size.bin' failed
make: *** [.build_release/tools/get_image_size.bin] Error 1
.build_release/lib/libcaffe-nv.so: undefined reference to `ncclGroupEnd'
.build_release/lib/libcaffe-nv.so: undefined reference to `ncclGroupStart'
collect2: error: ld returned 1 exit status
Makefile:659: recipe for target '.build_release/examples/cifar10/convert_cifar_data.bin' failed
make: *** [.build_release/examples/cifar10/convert_cifar_data.bin] Error 1
.build_release/lib/libcaffe-nv.so: undefined reference to `ncclGroupEnd'
.build_release/lib/libcaffe-nv.so: undefined reference to `ncclGroupStart'
collect2: error: ld returned 1 exit status
Makefile:654: recipe for target '.build_release/tools/convert_annoset.bin' failed
make: *** [.build_release/tools/convert_annoset.bin] Error 1
.build_release/lib/libcaffe-nv.so: undefined reference to `ncclGroupEnd'
.build_release/lib/libcaffe-nv.so: undefined reference to `ncclGroupStart'
collect2: error: ld returned 1 exit status
Makefile:659: recipe for target '.build_release/examples/ssd/ssd_detect.bin' failed
make: *** [.build_release/examples/ssd/ssd_detect.bin] Error 1
.build_release/lib/libcaffe-nv.so: undefined reference to `ncclGroupEnd'
.build_release/lib/libcaffe-nv.so: undefined reference to `ncclGroupStart'
collect2: error: ld returned 1 exit status
Makefile:659: recipe for target '.build_release/examples/cpp_classification/classification.bin' failed
make: *** [.build_release/examples/cpp_classification/classification.bin] Error 1

Maybe I need to modify Makefile.config or the Makefile? If so, how? The Makefile is the original, unmodified one. My Makefile.config:

## Refer to http://caffe.berkeleyvision.org/installation.html
# Contributions simplifying and improving our build system are welcome!

# cuDNN acceleration switch (uncomment to build with cuDNN).
# cuDNN version 6 or higher is required.
USE_CUDNN := 1

# NCCL acceleration switch (uncomment to build with NCCL)
# See https://github.com/NVIDIA/nccl
USE_NCCL := 1

# Builds tests with 16 bit float support in addition to 32 and 64 bit.
TEST_FP16 := 1

# uncomment to disable IO dependencies and corresponding data layers
# USE_OPENCV := 0
# USE_LEVELDB := 0
# USE_LMDB := 0

# Uncomment if you're using OpenCV 3
# OPENCV_VERSION := 3

# To customize your choice of compiler, uncomment and set the following.
# N.B. the default for Linux is g++ and the default for OSX is clang++
# CUSTOM_CXX := g++

# CUDA directory contains bin/ and lib/ directories that we need.
CUDA_DIR := /usr/local/cuda
# On Ubuntu 14.04, if cuda tools are installed via
# "sudo apt-get install nvidia-cuda-toolkit" then use this instead:
# CUDA_DIR := /usr

# CUDA architecture setting: going with all of them.
CUDA_ARCH :=    -gencode arch=compute_50,code=sm_50 \
        -gencode arch=compute_52,code=sm_52 \
        -gencode arch=compute_60,code=sm_60 \
        -gencode arch=compute_61,code=sm_61 \
        -gencode arch=compute_61,code=compute_61 \
        -gencode arch=compute_62,code=sm_62 \
        -gencode arch=compute_62,code=compute_62 \
        -gencode arch=compute_70,code=sm_70 \
        -gencode arch=compute_70,code=compute_70

# BLAS choice:
# atlas for ATLAS
# mkl for MKL
# open for OpenBlas - default, see https://github.com/xianyi/OpenBLAS
BLAS := open
# Custom (MKL/ATLAS/OpenBLAS) include and lib directories.
# BLAS_INCLUDE := /opt/OpenBLAS/include/
# BLAS_LIB := /opt/OpenBLAS/lib/

# Homebrew puts openblas in a directory that is not on the standard search path
# BLAS_INCLUDE := $(shell brew --prefix openblas)/include
# BLAS_LIB := $(shell brew --prefix openblas)/lib

# This is required only if you will compile the matlab interface.
# MATLAB directory should contain the mex binary in /bin.
# MATLAB_DIR := /usr/local
# MATLAB_DIR := /Applications/MATLAB_R2012b.app

# NOTE: this is required only if you will compile the python interface.
# We need to be able to find Python.h and numpy/arrayobject.h.
PYTHON_INCLUDE := /usr/include/python2.7 \
        /usr/lib/python2.7/dist-packages/numpy/core/include

# PYTHON_INCLUDE := /usr/include/python2.7 \
#       /usr/local/lib/python2.7/dist-packages/numpy/core/include

# Anaconda Python distribution is quite popular. Include path:
# Verify anaconda location, sometimes it's in root.
# ANACONDA_HOME := $(HOME)/anaconda
# PYTHON_INCLUDE := $(ANACONDA_HOME)/include \
        # $(ANACONDA_HOME)/include/python2.7 \
        # $(ANACONDA_HOME)/lib/python2.7/site-packages/numpy/core/include \

# Uncomment to use Python 3 (default is Python 2)
# PYTHON_LIBRARIES := boost_python3 python3.5m
# PYTHON_INCLUDE := /usr/include/python3.5m \
#                 /usr/lib/python3.5/dist-packages/numpy/core/include

# We need to be able to find libpythonX.X.so or .dylib.
PYTHON_LIB := /usr/lib
# PYTHON_LIB := $(ANACONDA_HOME)/lib

# Homebrew installs numpy in a non standard path (keg only)
# PYTHON_INCLUDE += $(dir $(shell python -c 'import numpy.core; print(numpy.core.__file__)'))/include
# PYTHON_LIB += $(shell brew --prefix numpy)/lib

# Uncomment to support layers written in Python (will link against Python libs)
WITH_PYTHON_LAYER := 1

# Whatever else you find you need goes here.
INCLUDE_DIRS := $(PYTHON_INCLUDE) /usr/local/include /usr/include/hdf5/serial
LIBRARY_DIRS := $(PYTHON_LIB) /usr/local/lib /usr/lib /usr/lib/x86_64-linux-gnu/hdf5/serial

# If Homebrew is installed at a non standard location (for example your home directory) and you use it for general dependencies
# INCLUDE_DIRS += $(shell brew --prefix)/include
# LIBRARY_DIRS += $(shell brew --prefix)/lib

# Uncomment to use `pkg-config` to specify OpenCV library paths.
# (Usually not necessary -- OpenCV libraries are normally installed in one of the above $LIBRARY_DIRS.)
# USE_PKG_CONFIG := 1

BUILD_DIR := build
DISTRIBUTE_DIR := distribute

# Uncomment for debugging. Does not work on OSX due to https://github.com/BVLC/caffe/issues/171
# DEBUG := 1

# The ID of the GPU that 'make runtest' will use to run unit tests.
TEST_GPUID := 0

# enable pretty build (comment to see full commands)
Q ?= @

# shared object suffix name to differentiate branches
LIBRARY_NAME_SUFFIX := -nv
drnikolaev commented 5 years ago

@becauseofAI ok, cool. Now let's take a look at this linker error:

.build_release/lib/libcaffe-nv.so: undefined reference to `ncclGroupEnd'
.build_release/lib/libcaffe-nv.so: undefined reference to `ncclGroupStart'

It means that the linker still finds an old NCCL library. Please try

$ find / -name 'libnccl*'

and remove old versions.
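
One way to tell a stale copy from a current one is to check which libraries actually export the missing symbol. A minimal sketch, assuming GNU `nm` (binutils) is installed; the function name `check_nccl_libs` and the directory list are illustrative and not part of Caffe or NCCL:

```shell
#!/bin/sh
# For every libnccl shared object under the given directories, report whether
# it exports ncclGroupStart (one of the symbols the linker could not resolve).
# check_nccl_libs is a hypothetical helper name.
check_nccl_libs() {
    for dir in "$@"; do
        # Skip directories that do not exist on this machine.
        [ -d "$dir" ] || continue
        find "$dir" -name 'libnccl*.so*' 2>/dev/null | while read -r lib; do
            # nm -D lists the dynamic symbol table of a shared object.
            if nm -D --defined-only "$lib" 2>/dev/null | grep -q 'ncclGroupStart'; then
                echo "has-group-api: $lib"
            else
                echo "no-group-api: $lib"
            fi
        done
    done
}

# Assumed search locations; adjust to match LIBRARY_DIRS and CUDA_DIR.
check_nccl_libs /usr/local/cuda/lib64 /usr/local/lib /usr/lib /usr/lib/x86_64-linux-gnu
```

Any copy reported as `no-group-api` that the linker can reach before the NCCL 2.x library is a candidate for removal.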

becauseofAI commented 5 years ago

@drnikolaev Sorry for my carelessness: I forgot to remove /usr/local/cuda/lib64/libnccl.so. Now everything works! :thumbsup: Thank you very much. :heart:

By the way, I still don't understand why NCCL was already in the CUDA directory or why it couldn't be used. Maybe someone previously built and installed it from an old version of the source code:

cd nccl-master 
sudo make CUDA_HOME=/usr/local/cuda/ test # or /usr/local/cuda-9.0/

Or was it originally bundled with CUDA 9.0, just in an old version?

I'm not sure. :confused:

drnikolaev commented 5 years ago

It looks like it was NCCL v1.0 built from source. Good luck!
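
For context, `ncclGroupStart` and `ncclGroupEnd` belong to the NCCL 2.x API, so a 1.x build cannot satisfy them. NCCL 2.x headers define `NCCL_MAJOR` and `NCCL_MINOR`, and the sketch below reads them from a given `nccl.h`. The function name `nccl_header_version` and the header path are assumptions, and a header without these macros is only presumed to predate 2.x:

```shell
#!/bin/sh
# Print the NCCL version declared by a header's NCCL_MAJOR / NCCL_MINOR macros.
# nccl_header_version is a hypothetical helper name.
nccl_header_version() {
    header=$1
    major=$(sed -n 's/^#define[[:space:]]\{1,\}NCCL_MAJOR[[:space:]]\{1,\}\([0-9]\{1,\}\).*/\1/p' "$header" | head -n 1)
    minor=$(sed -n 's/^#define[[:space:]]\{1,\}NCCL_MINOR[[:space:]]\{1,\}\([0-9]\{1,\}\).*/\1/p' "$header" | head -n 1)
    if [ -z "$major" ]; then
        # No version macro found; this is what an old header would look like.
        echo "unknown (no NCCL_MAJOR macro; possibly NCCL 1.x)"
    else
        echo "$major.${minor:-0}"
    fi
}

# Assumed header location; adjust to wherever nccl.h is installed.
if [ -f /usr/include/nccl.h ]; then
    nccl_header_version /usr/include/nccl.h
fi
```

Running it against each `nccl.h` on the include path shows whether an old header is shadowing the one shipped with libnccl2.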