Open sde123 opened 7 years ago
@sde123 The mpi version is too old. To install a newer version, please download the source from here, then
tar xf openmpi-1.10.7.tar.gz
cd openmpi-1.10.7
./configure --with-cuda=/usr/local/cuda --enable-mpi-thread-multiple
make -j8
sudo make install
cd -
By default this installs it to /usr/local/. To use it, please add the following lines to your ~/.bashrc:
export PATH=/usr/local/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/lib:$LD_LIBRARY_PATH
Restart the terminal, then run make clean and make to recompile it.
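A note on the two export lines above: when LD_LIBRARY_PATH starts out unset, appending :$LD_LIBRARY_PATH leaves an empty entry in the list (the dynamic loader typically treats an empty entry as the current directory), and sourcing ~/.bashrc repeatedly keeps adding the same directory. The following is only a sketch of a guard for a POSIX shell. prepend_path is a hypothetical helper name, not something provided by Open MPI or this repository.

```shell
# Print the colon-separated list $2 with directory $1 as its first entry.
# - empty list: print just the directory (no trailing colon)
# - directory already first: print the list unchanged
# - otherwise: put the directory in front
prepend_path() {
    case "$2" in
        "") printf '%s\n' "$1" ;;
        "$1"|"$1":*) printf '%s\n' "$2" ;;
        *) printf '%s\n' "$1:$2" ;;
    esac
}

# Same effect as the two export lines above, but safe to run more than once.
PATH=$(prepend_path /usr/local/bin "$PATH")
LD_LIBRARY_PATH=$(prepend_path /usr/local/lib "${LD_LIBRARY_PATH:-}")
export PATH LD_LIBRARY_PATH
```

The directory is only checked against the first position because that is what decides which mpirun and which libmpi are picked up first.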
mpirun has detected an attempt to run as root. Running at root is strongly discouraged as any mistake (e.g., in defining TMPDIR) or bug can result in catastrophic damage to the OS file system, leaving your system in an unusable state.
Extracting train set
E0918 21:25:28.585167 11730 extract_features.cpp:54] Using GPU
E0918 21:25:28.585453 11730 extract_features.cpp:60] Using Device_id=0
F0918 21:25:28.854768 11730 io.cpp:52] Check failed: fd != -1 (-1 vs. -1) File not found: external/exp/snapshots/individually/prid_iter_11000.caffemodel
Check failure stack trace:
@ 0x7f641f7acdaa (unknown)
@ 0x7f641f7acce4 (unknown)
@ 0x7f641f7ac6e6 (unknown)
@ 0x7f641f7af687 (unknown)
@ 0x7f641fc142f4 caffe::ReadProtoFromBinaryFile()
@ 0x7f641fc03d86 caffe::ReadNetParamsFromBinaryFileOrDie()
@ 0x7f641fc2d5a7 caffe::Net<>::CopyTrainedLayersFromBinaryProto()
@ 0x7f641fc2d616 caffe::Net<>::CopyTrainedLayersFrom()
@ 0x40946f feature_extraction_pipeline<>()
@ 0x7f641ebc8f45 (unknown)
@ 0x40394e (unknown)
@ (nil) (unknown)
scripts/routines.sh: line 93: 11730 Aborted (core dumped) ${CAFFE_DIR}/build/tools/extract_features ${trained_model} ${model} ${blob},label ${result_dir}/${subset}_features_lmdb,${result_dir}/${subset}_labels_lmdb ${num_iters} lmdb GPU 0
Extracting val set
E0918 21:25:29.740474 11767 extract_features.cpp:54] Using GPU
E0918 21:25:29.740756 11767 extract_features.cpp:60] Using Device_id=0
F0918 21:25:29.969832 11767 io.cpp:52] Check failed: fd != -1 (-1 vs. -1) File not found: external/exp/snapshots/individually/prid_iter_11000.caffemodel
Check failure stack trace:
@ 0x7fae36f21daa (unknown)
@ 0x7fae36f21ce4 (unknown)
@ 0x7fae36f216e6 (unknown)
@ 0x7fae36f24687 (unknown)
@ 0x7fae373892f4 caffe::ReadProtoFromBinaryFile()
@ 0x7fae37378d86 caffe::ReadNetParamsFromBinaryFileOrDie()
@ 0x7fae373a25a7 caffe::Net<>::CopyTrainedLayersFromBinaryProto()
@ 0x7fae373a2616 caffe::Net<>::CopyTrainedLayersFrom()
@ 0x40946f feature_extraction_pipeline<>()
@ 0x7fae3633df45 (unknown)
@ 0x40394e (unknown)
@ (nil) (unknown)
@sde123 This means the training failed. Have you compiled the caffe successfully with the new openmpi?
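The "File not found" in the log above is only a symptom: feature extraction is looking for a snapshot that training never wrote. A quick way to confirm this before digging further is to test for the snapshot file directly. This is just a sketch; require_file is a hypothetical helper name, and the path is the one printed in the log, so adjust it to your experiment.

```shell
# Return 0 if the file exists and is non-empty; otherwise report it and return 1.
require_file() {
    if [ -s "$1" ]; then
        return 0
    fi
    echo "missing or empty: $1" >&2
    return 1
}

# Path copied from the log above.
snapshot=external/exp/snapshots/individually/prid_iter_11000.caffemodel
if require_file "$snapshot"; then
    echo "snapshot found, feature extraction can proceed"
else
    echo "no snapshot yet, check the training log before extracting features"
fi
```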
@Cysu
I have built Caffe successfully.
The following is my Makefile.config:
USE_MPI := 1
MPI_INCLUDE := /usr/local/include/openmpi
MPI_LIB := /usr/local/lib/openmpi
CUDA_DIR := /usr/local/cuda
CUDA_ARCH := -gencode arch=compute_20,code=sm_20 \
  -gencode arch=compute_20,code=sm_21 \
  -gencode arch=compute_30,code=sm_30 \
  -gencode arch=compute_35,code=sm_35 \
  -gencode arch=compute_50,code=sm_50 \
  -gencode arch=compute_50,code=compute_50
BLAS := atlas
MATLAB_DIR := /usr/local/MATLAB/R2014a
PYTHON_INCLUDE := /usr/include/python2.7 \
  /usr/lib/python2.7/dist-packages/numpy/core/include
# $(ANACONDA_HOME)/include/python2.7 \
# $(ANACONDA_HOME)/lib/python2.7/site-packages/numpy/core/include \
PYTHON_LIB := /usr/local/lib
INCLUDE_DIRS := $(PYTHON_INCLUDE) /usr/local/include
LIBRARY_DIRS := $(PYTHON_LIB) /usr/local/lib /usr/lib
# Uncomment to use pkg-config to specify OpenCV library paths.
BUILD_DIR := build
DISTRIBUTE_DIR := distribute
TEST_GPUID := 0
Q ?= @
@sde123 Yes, I have seen the Makefile.config. What I mean is: are there any errors when you run make to compile Caffe?
A tip: when you copy and paste something, please wrap the content with the following syntax, for example:
``` Paste something here ```
This will make it much more readable.
Extracting train set
E0919 07:41:33.413316 6608 extract_features.cpp:54] Using GPU
E0919 07:41:33.413460 6608 extract_features.cpp:60] Using Device_id=0
F0919 07:41:33.640275 6608 io.cpp:52] Check failed: fd != -1 (-1 vs. -1) File not found: external/exp/snapshots/individually/prid_iter_11000.caffemodel
...
Could you please tell me what is wrong? Thank you
@sde123 Ok, I assume you have compiled caffe successfully. Could you please comment out this line and run it again? It will reveal the true error.
mpirun noticed that process rank 1 with PID 7062 on node dai-System-Product-Name exited on signal 6 (Aborted). ...
Could you please check the outputs of the following commands
ldd external/caffe/build/tools/caffe | grep mpi
which mpi
mpirun --version
echo $LD_LIBRARY_PATH
Thank you. I ran the commands and got the following output:
...
dai@dai-System-Product-Name:~/code/person_reidentification/3/dgd_person_reid$ ldd external/caffe/build/tools/caffe | grep mpi
libmpi.so.12 => /usr/local/lib/libmpi.so.12 (0x00007f7258ddc000)
libmpi_cxx.so.1 => /usr/local/lib/libmpi_cxx.so.1 (0x00007f7258bc2000)
dai@dai-System-Product-Name:~/code/person_reidentification/3/dgd_person_reid$ which mpi
dai@dai-System-Product-Name:~/code/person_reidentification/3/dgd_person_reid$ mpirun --version
mpirun (Open MPI) 1.10.7
Report bugs to http://www.open-mpi.org/community/help/
dai@dai-System-Product-Name:~/code/person_reidentification/3/dgd_person_reid$ echo $LD_LIBRARY_PATH
/usr/local/lib::/usr/local/lib
...
@Cysu I checked Open MPI and it looks fine, but when I run the code I get this error:
...
mpirun noticed that process rank 1 with PID 7062 on node dai-System-Product-Name exited on signal 6 (Aborted).
...
I don't know what it means. What is wrong?
@sde123 Sorry for my typo, it should be which mpirun, not which mpi. Also, what's the output of
ldd external/caffe/build/tools/caffe | grep mpi
Please use ``` (three backticks, the key to the left of the number 1 on the keyboard) instead of ... when you paste the content. Thanks.
@Cysu Thank you for your advice. I have checked my Open MPI and mpirun, and they look fine. But I still get the following error when I run scripts/exp_individually.sh prid
I0919 11:36:00.268688 7328 net.cpp:290] Network initialization done.
I0919 11:36:00.268694 7328 net.cpp:291] Memory required for data: 62002712
--------------------------------------------------------------------------
mpirun noticed that process rank 1 with PID 7329 on node dai-System-Product-Name exited on signal 6 (Aborted).
--------------------------------------------------------------------------
I do not know what is wrong in my case.
The following is my MPI setup:
dai@dai-System-Product-Name:~/code/person_reidentification/3/dgd_person_reid$ ldd external/caffe/build/tools/caffe | grep mpirun
dai@dai-System-Product-Name:~/code/person_reidentification/3/dgd_person_reid$ which mpirun
/usr/local/bin/mpirun
dai@dai-System-Product-Name:~/code/person_reidentification/3/dgd_person_reid$ mpirun --version
mpirun (Open MPI) 1.10.7
Report bugs to http://www.open-mpi.org/community/help/
dai@dai-System-Product-Name:~/code/person_reidentification/3/dgd_person_reid$ echo $LD_LIBRARY_PATH
/usr/local/lib::/usr/local/lib
Thank you
@sde123 The command should be
ldd external/caffe/build/tools/caffe | grep mpi
Note that it is grep mpi, not grep mpirun.
By the way, what's your GPU memory size?
@Cysu Hello, I ran scripts/exp_individually.sh prid and got an error:
mpirun noticed that process rank 1 with PID 20102 on node dai-System-Product-Name exited on signal 6 (Aborted).
So I rebuilt Caffe, and got this message when running make runtest -j8 in external/caffe:
[==========] 1488 tests from 222 test cases ran. (178224 ms total)
[ PASSED ] 1488 tests.
YOU HAVE 2 DISABLED TESTS
Does this mean that my external/caffe build is wrong? Thank you! My GPU is 32G
@Cysu Because MPI keeps failing as shown above, how can I run without MPI? Will that have any effect?
Could you please first check that the following commands report the same library paths?
ldd $(which mpirun) | grep mpi
ldd external/caffe/build/tools/caffe | grep mpi
@Cysu Thank you very much. When I enter "ldd $(which mpirun) | grep mpi" in the terminal, I get no output. When I enter "ldd external/caffe/build/tools/caffe | grep mpi", I get this output:
libmpi.so.12 => /usr/local/lib/libmpi.so.12 (0x00007ff59f1a7000)
libmpi_cxx.so.1 => /usr/local/lib/libmpi_cxx.so.1 (0x00007ff59ef8c000)
Why do I get no output when I enter "ldd $(which mpirun) | grep mpi" in the terminal? I have installed MPI, and I can find it in /usr/local/lib and /usr/local/include. Could you please tell me what is wrong in my case? Thank you very much!
Sorry, my fault. It should be
ldd $(which mpirun) | grep open-
ldd external/caffe/build/tools/caffe | grep open-
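The point of running both commands is to see whether mpirun and the caffe binary resolve the Open MPI runtime libraries to the same files. Comparing by eye works, but the load addresses in parentheses differ on every run and are easy to mistake for a mismatch. A rough sketch that compares only the resolved paths follows; resolved_libs and same_libs are hypothetical helper names, and the parsing assumes the usual "name => /path (0xaddr)" layout of ldd output.

```shell
# Read `ldd` output on stdin and print the resolved paths of libraries
# whose name contains the literal string $1, sorted for comparison.
resolved_libs() {
    awk -v pat="$1" 'index($1, pat) && $2 == "=>" { print $3 }' | sort
}

# Return 0 if the two ldd outputs $2 and $3 resolve the libraries matching $1
# to the same paths (and at least one such library was found).
same_libs() {
    libs_first=$(printf '%s\n' "$2" | resolved_libs "$1")
    libs_second=$(printf '%s\n' "$3" | resolved_libs "$1")
    [ -n "$libs_first" ] && [ "$libs_first" = "$libs_second" ]
}

# Usage, assuming both binaries exist on your machine:
# same_libs open- "$(ldd "$(which mpirun)")" "$(ldd external/caffe/build/tools/caffe)" \
#     && echo "same Open MPI runtime" || echo "MISMATCH"
```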
@Cysu Thank you. I ran
ldd $(which mpirun) | grep open-
and got this output:
libopen-rte.so.12 => /usr/local/lib/libopen-rte.so.12 (0x00007fafde806000)
libopen-pal.so.13 => /usr/local/lib/libopen-pal.so.13 (0x00007fafde528000)
Then I ran
ldd external/caffe/build/tools/caffe | grep open-
and got this output:
libopen-rte.so.12 => /usr/local/lib/libopen-rte.so.12 (0x00007f8b2bf05000)
libopen-pal.so.13 => /usr/local/lib/libopen-pal.so.13 (0x00007f8b2bc27000)
The two outputs are almost the same. Does that mean my MPI setup is correct? If so, why do I still get the error above? Thank you very much!
Yes, that means MPI is configured correctly. Could you please show the output of nvidia-smi?
@Cysu Thank you very much for your code on person re-identification. I have installed openmpi-1.6.5, but when I run make -j8 in external/caffe, I get an error:
CXX .build_release/src/caffe/proto/caffe.pb.cc
CXX src/caffe/layers/data_layer.cpp
CXX src/caffe/layers/multinomial_logistic_loss_layer.cpp
CXX src/caffe/layers/loss_layer.cpp
CXX src/caffe/layers/base_conv_layer.cpp
CXX src/caffe/layers/memory_data_layer.cpp
CXX src/caffe/layers/cudnn_relu_layer.cpp
CXX src/caffe/layers/reshape_layer.cpp
CXX src/caffe/layers/cudnn_softmax_layer.cpp
CXX src/caffe/layers/threshold_layer.cpp
CXX src/caffe/layers/sigmoid_layer.cpp
CXX src/caffe/layers/argmax_layer.cpp
CXX src/caffe/layers/hdf5_data_layer.cpp
In file included from src/caffe/layers/data_layer.cpp:16:0:
./include/caffe/util/mpi_templates.hpp: In function ‘int MPIAllgather(int, const void*, void*, MPI_Comm) [with Dtype = float; MPI_Comm = ompi_communicator_t*]’:
./include/caffe/util/mpi_templates.hpp:45:28: error: invalid conversion from ‘const void*’ to ‘void*’ [-fpermissive]
  comm);
  ^
In file included from ./include/caffe/util/mpi_templates.hpp:5:0, from src/caffe/layers/data_layer.cpp:16:
/usr/local/include/mpi.h:1033:20: note: initializing argument 1 of ‘int MPI_Allgather(void*, int, MPI_Datatype, void*, int, MPI_Datatype, MPI_Comm)’
OMPI_DECLSPEC int MPI_Allgather(void *sendbuf, int sendcount, MPI_Datatype sendtype,
  ^
In file included from src/caffe/layers/data_layer.cpp:16:0:
./include/caffe/util/mpi_templates.hpp: In function ‘int MPIAllgather(int, const void*, void*, MPI_Comm) [with Dtype = double; MPI_Comm = ompi_communicator_t*]’:
./include/caffe/util/mpi_templates.hpp:51:28: error: invalid conversion from ‘const void*’ to ‘void*’ [-fpermissive]
  comm);
  ^
In file included from ./include/caffe/util/mpi_templates.hpp:5:0, from src/caffe/layers/data_layer.cpp:16:
/usr/local/include/mpi.h:1033:20: note: initializing argument 1 of ‘int MPI_Allgather(void*, int, MPI_Datatype, void*, int, MPI_Datatype, MPI_Comm)’
OMPI_DECLSPEC int MPI_Allgather(void *sendbuf, int sendcount, MPI_Datatype sendtype,
  ^
In file included from src/caffe/layers/data_layer.cpp:16:0:
./include/caffe/util/mpi_templates.hpp: In function ‘int MPIScatter(int, const void*, void*, int, MPI_Comm) [with Dtype = float; MPI_Comm = ompi_communicator_t*]’:
./include/caffe/util/mpi_templates.hpp:61:17: error: invalid conversion from ‘const void*’ to ‘void*’ [-fpermissive]
  root, comm);
  ^
In file included from ./include/caffe/util/mpi_templates.hpp:5:0, from src/caffe/layers/data_layer.cpp:16:
/usr/local/include/mpi.h:1375:20: note: initializing argument 1 of ‘int MPI_Scatter(void*, int, MPI_Datatype, void*, int, MPI_Datatype, int, MPI_Comm)’
OMPI_DECLSPEC int MPI_Scatter(void *sendbuf, int sendcount, MPI_Datatype sendtype,
  ^
In file included from src/caffe/layers/data_layer.cpp:16:0:
./include/caffe/util/mpi_templates.hpp: In function ‘int MPIScatter(int, const void*, void*, int, MPI_Comm) [with Dtype = double; MPI_Comm = ompi_communicator_t*]’:
./include/caffe/util/mpi_templates.hpp:67:17: error: invalid conversion from ‘const void*’ to ‘void*’ [-fpermissive]
  root, comm);
  ^
In file included from ./include/caffe/util/mpi_templates.hpp:5:0, from src/caffe/layers/data_layer.cpp:16:
/usr/local/include/mpi.h:1375:20: note: initializing argument 1 of ‘int MPI_Scatter(void*, int, MPI_Datatype, void*, int, MPI_Datatype, int, MPI_Comm)’
OMPI_DECLSPEC int MPI_Scatter(void *sendbuf, int sendcount, MPI_Datatype sendtype,
  ^
make: *** [.build_release/src/caffe/layers/data_layer.o] Error 1
make: *** Waiting for unfinished jobs....
The following is my Makefile.config:
# Refer to http://caffe.berkeleyvision.org/installation.html
# Contributions simplifying and improving our build system are welcome!
# cuDNN acceleration switch (uncomment to build with cuDNN).
USE_CUDNN := 1
# MPI data parallelization switch (uncomment to build with MPI).
USE_MPI := 1
MPI_INCLUDE := /usr/local/include
MPI_LIB := /usr/local/lib
# CPU-only switch (uncomment to build without GPU support).
CPU_ONLY := 1
# To customize your choice of compiler, uncomment and set the following.
# N.B. the default for Linux is g++ and the default for OSX is clang++
CUSTOM_CXX := g++
# CUDA directory contains bin/ and lib/ directories that we need.
CUDA_DIR := /usr/local/cuda
# On Ubuntu 14.04, if cuda tools are installed via
# "sudo apt-get install nvidia-cuda-toolkit" then use this instead:
CUDA_DIR := /usr
# CUDA architecture setting: going with all of them.
# For CUDA < 6.0, comment the *_50 lines for compatibility.
CUDA_ARCH := -gencode arch=compute_20,code=sm_20 \
  -gencode arch=compute_20,code=sm_21 \
  -gencode arch=compute_30,code=sm_30 \
  -gencode arch=compute_35,code=sm_35 \
  -gencode arch=compute_50,code=sm_50 \
  -gencode arch=compute_50,code=compute_50
# BLAS choice:
# atlas for ATLAS (default)
# mkl for MKL
# open for OpenBlas
BLAS := atlas
# Custom (MKL/ATLAS/OpenBLAS) include and lib directories.
# Leave commented to accept the defaults for your choice of BLAS
# (which should work)!
BLAS_INCLUDE := /path/to/your/blas
BLAS_LIB := /path/to/your/blas
# Homebrew puts openblas in a directory that is not on the standard search path
BLAS_INCLUDE := $(shell brew --prefix openblas)/include
BLAS_LIB := $(shell brew --prefix openblas)/lib
# This is required only if you will compile the matlab interface.
# MATLAB directory should contain the mex binary in /bin.
MATLAB_DIR := /usr/local/MATLAB/R2014a
MATLAB_DIR := /Applications/MATLAB_R2012b.app
# NOTE: this is required only if you will compile the python interface.
# We need to be able to find Python.h and numpy/arrayobject.h.
PYTHON_INCLUDE := /usr/include/python2.7 \
  /usr/lib/python2.7/dist-packages/numpy/core/include
# Anaconda Python distribution is quite popular. Include path:
# Verify anaconda location, sometimes it's in root.
ANACONDA_HOME := $(HOME)/anaconda
PYTHON_INCLUDE := $(ANACONDA_HOME)/include \
# We need to be able to find libpythonX.X.so or .dylib.
PYTHON_LIB := /usr/local/lib
PYTHON_LIB := $(ANACONDA_HOME)/lib
# Homebrew installs numpy in a non standard path (keg only)
PYTHON_INCLUDE += $(dir $(shell python -c 'import numpy.core; print(numpy.core.__file__)'))/include
PYTHON_LIB += $(shell brew --prefix numpy)/lib
# Uncomment to support layers written in Python (will link against Python libs)
WITH_PYTHON_LAYER := 1
# Whatever else you find you need goes here.
INCLUDE_DIRS := $(PYTHON_INCLUDE) /usr/local/include
LIBRARY_DIRS := $(PYTHON_LIB) /usr/local/lib /usr/lib
# If Homebrew is installed at a non standard location (for example your home directory) and you use it for general dependencies
INCLUDE_DIRS += $(shell brew --prefix)/include
LIBRARY_DIRS += $(shell brew --prefix)/lib
# Uncomment to use pkg-config to specify OpenCV library paths.
# (Usually not necessary -- OpenCV libraries are normally installed in one of the above $LIBRARY_DIRS.)
USE_PKG_CONFIG := 1
BUILD_DIR := build
DISTRIBUTE_DIR := distribute
# Uncomment for debugging. Does not work on OSX due to https://github.com/BVLC/caffe/issues/171
DEBUG := 1
# The ID of the GPU that 'make runtest' will use to run unit tests.
TEST_GPUID := 0
# enable pretty build (comment to see full commands)
Q ?= @
Could you please tell me how to solve it? Thank you very much!
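The compiler errors above all have the same shape: mpi_templates.hpp passes a const send buffer, while the installed mpi.h declares MPI_Allgather and MPI_Scatter with a non-const sendbuf. That fits the reply in this thread that the installed MPI is too old. Before rebuilding, it may help to check which version mpirun actually reports. The following is a sketch only: ompi_version and version_ge are hypothetical helper names, and 1.10.7 is used as the minimum simply because it is the version suggested in this thread; the exact minimum is not stated here.

```shell
# Read `mpirun --version` output on stdin and print the dotted version
# from its first line, e.g. "mpirun (Open MPI) 1.10.7" -> "1.10.7".
ompi_version() {
    sed -n '1s/.*(Open MPI) \([0-9][0-9.]*\).*/\1/p'
}

# Return 0 if dotted version $1 is >= dotted version $2,
# comparing field by field as numbers (so 1.10.7 > 1.6.5).
version_ge() {
    awk -v a="$1" -v b="$2" 'BEGIN {
        n = split(a, x, "."); m = split(b, y, ".")
        for (i = 1; i <= (n > m ? n : m); i++) {
            if (x[i] + 0 > y[i] + 0) exit 0
            if (x[i] + 0 < y[i] + 0) exit 1
        }
        exit 0
    }'
}

# Usage, assuming mpirun is on PATH:
# version_ge "$(mpirun --version | ompi_version)" 1.10.7 \
#     && echo "Open MPI is new enough" || echo "Open MPI is older than 1.10.7"
```

The comparison is numeric per field on purpose: a plain string comparison would rank 1.6.5 above 1.10.7.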