Jcdidi closed this issue 7 years ago.
Sorry, but we modified the official Caffe for our project, so it relies on OpenMPI if you want to use multiple GPUs (we only use multi-GPU for testing, not training).
We also highly recommend installing cuDNN v5. After downloading and extracting it, replace /path/to/cudnn in the cmake command with your own directory path. For example, if you copy the cuDNN files to /usr/local/cuda, the cmake command should be
cmake .. -DUSE_MPI=ON -DCUDNN_INCLUDE=/usr/local/cuda/include -DCUDNN_LIBRARY=/usr/local/cuda/lib64/libcudnn.so
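If your cuDNN lives somewhere else, both flags follow from a single root directory; a minimal sketch, assuming CUDNN_ROOT is a placeholder you point at your own extraction path:

```shell
# CUDNN_ROOT is a placeholder; point it at wherever you extracted cuDNN v5.
CUDNN_ROOT=/usr/local/cuda

# Both cmake flags derive from that single root path.
CUDNN_INCLUDE="${CUDNN_ROOT}/include"
CUDNN_LIBRARY="${CUDNN_ROOT}/lib64/libcudnn.so"

# Print the resulting cmake invocation instead of running it.
echo cmake .. -DUSE_MPI=ON "-DCUDNN_INCLUDE=${CUDNN_INCLUDE}" "-DCUDNN_LIBRARY=${CUDNN_LIBRARY}"
```
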
Thanks! My environment has only one server with 4 GPUs; can I still use OpenMPI?
Sure. You can change these two lines to
mpirun -n 4 python2 tools/eval_test.py \
--gpu 0,1,2,3 \
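For intuition, each MPI process typically picks one device from the --gpu list by its rank. A minimal Python sketch of that mapping (the function name gpu_for_rank is hypothetical, not this repo's actual code):

```python
def gpu_for_rank(gpu_arg, rank):
    """Map an MPI rank to one device id from a comma-separated --gpu list."""
    gpu_ids = [int(g) for g in gpu_arg.split(",")]
    return gpu_ids[rank % len(gpu_ids)]

# With mpirun -n 4 and --gpu 0,1,2,3, ranks 0..3 each get their own GPU.
print([gpu_for_rank("0,1,2,3", r) for r in range(4)])  # [0, 1, 2, 3]
```
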
Um, thanks. Regarding "boost >= 1.55 (A tip for Ubuntu 14.04: sudo apt-get autoremove libboost1.54* then sudo apt-get install libboost1.55-all-dev)" — must it really be >= 1.55?
Yes. It should be >= 1.55.
xd@amax-1080:~/person_search-master$ experiments/scripts/eval_test.sh resnet50 50000 resnet50
[amax-1080:00334] mca: base: component_find: unable to open /usr/lib/openmpi/lib/openmpi/mca_paffinity_hwloc: perhaps a missing symbol, or compiled for a different version of Open MPI? (ignored)
[amax-1080:00334] mca: base: component_find: unable to open /usr/lib/openmpi/lib/openmpi/mca_carto_auto_detect: perhaps a missing symbol, or compiled for a different version of Open MPI? (ignored)
[amax-1080:00334] mca: base: component_find: unable to open /usr/lib/openmpi/lib/openmpi/mca_carto_file: perhaps a missing symbol, or compiled for a different version of Open MPI? (ignored)
[amax-1080:00334] mca: base: component_find: unable to open /usr/lib/openmpi/lib/openmpi/mca_shmem_posix: perhaps a missing symbol, or compiled for a different version of Open MPI? (ignored)
[amax-1080:00334] mca: base: component_find: unable to open /usr/lib/openmpi/lib/openmpi/mca_crs_none: perhaps a missing symbol, or compiled for a different version of Open MPI? (ignored)
A requested component was not found, or was unable to be opened. This means that this component is either not installed or is unable to be used on your system (e.g., sometimes this means that shared libraries that the component requires are unable to be found/loaded). Note that Open MPI stopped checking at the first component that it did not find.
[amax-1080:00334] Process received signal
[amax-1080:00334] Signal: Segmentation fault (11)
[amax-1080:00334] Signal code: Address not mapped (1)
[amax-1080:00334] Failing at address: 0x28
[amax-1080:00334] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x10330) [0x7f65b2b0a330]
[amax-1080:00334] [ 1] /usr/lib/libmpi.so.1(mca_base_select+0x11e) [0x7f652bf16f1e]
[amax-1080:00334] [ 2] /usr/lib/libmpi.so.1(opal_crs_base_select+0x7e) [0x7f652beff28e]
[amax-1080:00334] [ 3] /usr/lib/libmpi.so.1(opal_cr_init+0x3fc) [0x7f652bf1ff1c]
[amax-1080:00334] [ 4] /usr/lib/libmpi.so.1(opal_init+0x1d0) [0x7f652bf28810]
[amax-1080:00334] [ 5] /usr/lib/libmpi.so.1(orte_init+0x37) [0x7f652beb86e7]
[amax-1080:00334] [ 6] /usr/lib/libmpi.so.1(ompi_mpi_init+0x174) [0x7f652be78024]
[amax-1080:00334] [ 7] /usr/lib/libmpi.so.1(PMPI_Init_thread+0xd4) [0x7f652be8f7f4]
[amax-1080:00334] [ 8] /usr/local/lib/python2.7/dist-packages/mpi4py/MPI.so(initMPI+0x4716) [0x7f652c27d0a6]
[amax-1080:00334] [ 9] python2(_PyImport_LoadDynamicModule+0x9b) [0x427992]
[amax-1080:00334] [10] python2() [0x55642f]
[amax-1080:00334] [11] python2() [0x4e2dec]
[amax-1080:00334] [12] python2() [0x556cf1]
[amax-1080:00334] [13] python2() [0x569c08]
[amax-1080:00334] [14] python2(PyEval_CallObjectWithKeywords+0x6b) [0x4c8c8b]
[amax-1080:00334] [15] python2(PyEval_EvalFrameEx+0x2958) [0x5264a8]
[amax-1080:00334] [16] python2() [0x567d14]
[amax-1080:00334] [17] python2(PyRun_FileExFlags+0x92) [0x465bf4]
[amax-1080:00334] [18] python2(PyRun_SimpleFileExFlags+0x2ee) [0x46612d]
[amax-1080:00334] [19] python2(Py_Main+0xb5e) [0x466d92]
[amax-1080:00334] [20] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf5) [0x7f65b2756f45]
[amax-1080:00334] [21] python2() [0x577c2e]
[amax-1080:00334] End of error message
jxd@amax-1080:~/person_search-master$
I did not do the pretraining; I directly use the trained model. As you said, I can test without MPI, so I did not use MPI: following "use only one GPU, remove the mpirun -n 8 in L14 and change L16 to --gpu 0", but it shows the error above. How can I solve it? Thanks. In addition, when I use MPI as you advised, it also shows errors like this.
It seems that you have different versions of OpenMPI. Say you compiled OpenMPI yourself and installed it into a local directory like /home/jxd/openmpi. Then please add the following lines to your ~/.bashrc:
export PATH=/home/jxd/openmpi/bin:$PATH
export LD_LIBRARY_PATH=/home/jxd/openmpi/lib:$LD_LIBRARY_PATH
Restart the terminal, run rm -rf build, and compile Caffe again.
Hello, I have successfully installed OpenMPI and tested that it works. Then I ran cmake for Caffe successfully, but the problems above still exist. When I try training, it hits the same errors. Thanks!
jxd@amax-1080:~/person_search-master$ experiments/scripts/train.sh 0 --set EXP_DIR resnet50
+ set -e
+ export PYTHONUNBUFFERED=True
+ PYTHONUNBUFFERED=True
+ GPU_ID=0
+ NET=resnet50
+ DATASET=psdb
+ array=($@)
+ len=4
+ EXTRA_ARGS='--set EXP_DIR resnet50'
+ EXTRA_ARGS_SLUG=--set_EXP_DIR_resnet50
+ case $DATASET in
+ TRAIN_IMDB=psdb_train
+ TEST_IMDB=psdb_test
+ PT_DIR=psdb
+ ITERS=50000
++ date +%Y-%m-%d_%H-%M-%S
+ LOG=experiments/logs/psdb_trainresnet50--set_EXP_DIR_resnet50.txt.2017-03-08_08-49-53
+ exec
++ tee -a experiments/logs/psdb_trainresnet50--set_EXP_DIR_resnet50.txt.2017-03-08_08-49-53
+ echo Logging output to experiments/logs/psdb_trainresnet50--set_EXP_DIR_resnet50.txt.2017-03-08_08-49-53
Logging output to experiments/logs/psdb_trainresnet50--set_EXP_DIR_resnet50.txt.2017-03-08_08-49-53
python2 tools/train_net.py --gpu 0 --solver models/psdb/resnet50/solver.prototxt --weights data/imagenet_models/resnet50.caffemodel --imdb psdb_train --iters 50000 --cfg experiments/cfgs/resnet50.yml --rand --set EXP_DIR resnet50
[amax-1080:22914] mca: base: component_find: unable to open /usr/lib/openmpi/lib/openmpi/mca_paffinity_hwloc: perhaps a missing symbol, or compiled for a different version of Open MPI? (ignored)
[amax-1080:22914] mca: base: component_find: unable to open /usr/lib/openmpi/lib/openmpi/mca_carto_auto_detect: perhaps a missing symbol, or compiled for a different version of Open MPI? (ignored)
[amax-1080:22914] mca: base: component_find: unable to open /usr/lib/openmpi/lib/openmpi/mca_carto_file: perhaps a missing symbol, or compiled for a different version of Open MPI? (ignored)
[amax-1080:22914] mca: base: component_find: unable to open /usr/lib/openmpi/lib/openmpi/mca_shmem_posix: perhaps a missing symbol, or compiled for a different version of Open MPI? (ignored)
[amax-1080:22914] mca: base: component_find: unable to open /usr/lib/openmpi/lib/openmpi/mca_crs_none: perhaps a missing symbol, or compiled for a different version of Open MPI? (ignored)
A requested component was not found, or was unable to be opened. This means that this component is either not installed or is unable to be used on your system (e.g., sometimes this means that shared libraries that the component requires are unable to be found/loaded). Note that Open MPI stopped checking at the first component that it did not find.
Could you please check the output of the following commands:
which mpirun
ldd $(which mpirun) | grep mpi
ldd caffe/build/install/bin/caffe | grep mpi
Yeah, maybe I did not cmake Caffe successfully, since there is no information about it?
ldd: caffe/build/install/bin/caffe: No such file or directory
jxd@amax-1080:~$ which mpirun
/usr/local/openmpi/bin/mpirun
jxd@amax-1080:~$ ldd $(which mpirun) | grep mpi
    libopen-rte.so.12 => /usr/local/openmpi/lib/libopen-rte.so.12 (0x00007f75c7edc000)
    libopen-pal.so.13 => /usr/local/openmpi/lib/libopen-pal.so.13 (0x00007f75c7bfe000)
jxd@amax-1080:~$ ldd caffe/build/install/bin/caffe | grep mpi
ldd: caffe/build/install/bin/caffe: No such file or directory
OK. You have another self-compiled OpenMPI installed at /usr/local/openmpi, so you need to add these lines to your ~/.bashrc:
export PATH=/usr/local/openmpi/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/openmpi/lib:$LD_LIBRARY_PATH
Restart the terminal, remove the build directory under caffe, and recompile it following the steps in the README file.
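The two exports above can be written against a single prefix variable, with a quick sanity check that the self-compiled bin directory now leads PATH. A sketch, where MPI_PREFIX is just a placeholder for your own install location:

```shell
# MPI_PREFIX is a placeholder; use your own OpenMPI install location.
MPI_PREFIX=/usr/local/openmpi

# Put the self-compiled OpenMPI ahead of the system-installed one.
export PATH="${MPI_PREFIX}/bin:${PATH}"
export LD_LIBRARY_PATH="${MPI_PREFIX}/lib${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}"

# The first PATH entry should now be the self-compiled bin directory.
echo "${PATH%%:*}"
```
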
Yes, I added these lines to ~/.bashrc and recompiled it yesterday. Are there two OpenMPI installations on the system? Now I will remove the build directory again and recompile. Thanks!
Right. In your previous log, it complains:
mca: base: component_find: unable to open /usr/lib/openmpi/lib/openmpi/mca_paffinity_hwloc: perhaps a missing symbol, or compiled for a different version of Open MPI? (ignored)
So you have a system-installed OpenMPI at /usr/lib, and a self-installed one at /usr/local/openmpi.
Thanks a lot! I found the issue. I added this line to ~/.bashrc:
export LD_PRELOAD=/usr/local/openmpi/lib/libmpi.so
all detection:
    recall = 79.37%
    ap = 74.82%
labeled only detection:
    recall = 97.76%
search ranking:
    mAP = 75.41%
    top- 1 = 78.48%
    top- 5 = 90.07%
    top-10 = 92.34%
Good to hear that! Will close the issue for now, and please feel free to reopen it if there are further problems.
Hello, when I run make -j8 && make install, it shows the following error: