ShuangLI59 / person_search

Joint Detection and Identification Feature Learning for Person Search
https://arxiv.org/abs/1604.01850

compile the caffe #19

Closed Jcdidi closed 7 years ago

Jcdidi commented 7 years ago

Hello, when I run make -j8 && make install, it shows the following error:

[ 87%] Building CXX object src/caffe/CMakeFiles/caffe.dir/data_transformer.cpp.o
[ 88%] Building CXX object src/caffe/CMakeFiles/caffe.dir/syncedmem.cpp.o
make[2]: *** No rule to make target `/path/to/cudnn/lib64/libcudnn.so', needed by `lib/libcaffe.so'.  Stop.
make[2]: *** Waiting for unfinished jobs....
make[1]: *** [src/caffe/CMakeFiles/caffe.dir/all] Error 2
make: *** [all] Error 2

I wonder if it is a path error? Another question: can I build without cuDNN and OpenMPI? I only have one server, but it has 4 GPUs, and I wonder if I can use "-gpu all" instead of OpenMPI. Thanks!

Cysu commented 7 years ago

Sorry, but we modified the official Caffe for our project, so it relies on OpenMPI if you want to use multiple GPUs (we only use multi-GPU for testing, not for training).

We also highly recommend installing cuDNN v5. After downloading and extracting it, replace the /path/to/cudnn in the cmake command with your own directory path. For example, if you copy the cuDNN files to /usr/local/cuda, then the cmake command should be

cmake .. -DUSE_MPI=ON -DCUDNN_INCLUDE=/usr/local/cuda/include -DCUDNN_LIBRARY=/usr/local/cuda/lib64/libcudnn.so
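Before running cmake, it can save a rebuild to confirm the cuDNN files are actually where the flags say they are. The sketch below is only an illustration: the `check_path` helper is hypothetical, and /usr/local/cuda is just the example prefix from this thread.

```shell
#!/bin/sh
# Sanity-check the cuDNN paths before invoking cmake, to avoid the
# "No rule to make target .../libcudnn.so" failure seen above.
# CUDNN_ROOT defaults to the example prefix; override it for your setup.
CUDNN_ROOT="${CUDNN_ROOT:-/usr/local/cuda}"

check_path() {
    # Report whether a required cuDNN file is present.
    if [ -e "$1" ]; then
        echo "found:   $1"
    else
        echo "missing: $1"
    fi
}

check_path "$CUDNN_ROOT/include/cudnn.h"
check_path "$CUDNN_ROOT/lib64/libcudnn.so"
```

If either line prints "missing", the cmake flags point at the wrong prefix and the build will fail the same way as in the log above.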
Jcdidi commented 7 years ago

Thanks! Given my environment, which has only one server with 4 GPUs, can I still use OpenMPI?

Cysu commented 7 years ago

Sure. You can change these two lines to

mpirun -n 4 python2 tools/eval_test.py \
  --gpu 0,1,2,3 \
Jcdidi commented 7 years ago

Um, thanks. Regarding "boost >= 1.55 (A tip for Ubuntu 14.04: sudo apt-get autoremove libboost1.54* then sudo apt-get install libboost1.55-all-dev)" — must it be >= 1.55?

Cysu commented 7 years ago

Yes. It should be >= 1.55.

Jcdidi commented 7 years ago

xd@amax-1080:~/person_search-master$ experiments/scripts/eval_test.sh resnet50 50000 resnet50
[amax-1080:00334] mca: base: component_find: unable to open /usr/lib/openmpi/lib/openmpi/mca_paffinity_hwloc: perhaps a missing symbol, or compiled for a different version of Open MPI? (ignored)
[amax-1080:00334] mca: base: component_find: unable to open /usr/lib/openmpi/lib/openmpi/mca_carto_auto_detect: perhaps a missing symbol, or compiled for a different version of Open MPI? (ignored)
[amax-1080:00334] mca: base: component_find: unable to open /usr/lib/openmpi/lib/openmpi/mca_carto_file: perhaps a missing symbol, or compiled for a different version of Open MPI? (ignored)
[amax-1080:00334] mca: base: component_find: unable to open /usr/lib/openmpi/lib/openmpi/mca_shmem_posix: perhaps a missing symbol, or compiled for a different version of Open MPI? (ignored)
[amax-1080:00334] mca: base: component_find: unable to open /usr/lib/openmpi/lib/openmpi/mca_crs_none: perhaps a missing symbol, or compiled for a different version of Open MPI? (ignored)

A requested component was not found, or was unable to be opened. This means that this component is either not installed or is unable to be used on your system (e.g., sometimes this means that shared libraries that the component requires are unable to be found/loaded). Note that Open MPI stopped checking at the first component that it did not find.

Host:      amax-1080
Framework: crs
Component: none

[amax-1080:00334] *** Process received signal ***
[amax-1080:00334] Signal: Segmentation fault (11)
[amax-1080:00334] Signal code: Address not mapped (1)
[amax-1080:00334] Failing at address: 0x28
[amax-1080:00334] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x10330) [0x7f65b2b0a330]
[amax-1080:00334] [ 1] /usr/lib/libmpi.so.1(mca_base_select+0x11e) [0x7f652bf16f1e]
[amax-1080:00334] [ 2] /usr/lib/libmpi.so.1(opal_crs_base_select+0x7e) [0x7f652beff28e]
[amax-1080:00334] [ 3] /usr/lib/libmpi.so.1(opal_cr_init+0x3fc) [0x7f652bf1ff1c]
[amax-1080:00334] [ 4] /usr/lib/libmpi.so.1(opal_init+0x1d0) [0x7f652bf28810]
[amax-1080:00334] [ 5] /usr/lib/libmpi.so.1(orte_init+0x37) [0x7f652beb86e7]
[amax-1080:00334] [ 6] /usr/lib/libmpi.so.1(ompi_mpi_init+0x174) [0x7f652be78024]
[amax-1080:00334] [ 7] /usr/lib/libmpi.so.1(PMPI_Init_thread+0xd4) [0x7f652be8f7f4]
[amax-1080:00334] [ 8] /usr/local/lib/python2.7/dist-packages/mpi4py/MPI.so(initMPI+0x4716) [0x7f652c27d0a6]
[amax-1080:00334] [ 9] python2(_PyImport_LoadDynamicModule+0x9b) [0x427992]
[amax-1080:00334] [10] python2() [0x55642f]
[amax-1080:00334] [11] python2() [0x4e2dec]
[amax-1080:00334] [12] python2() [0x556cf1]
[amax-1080:00334] [13] python2() [0x569c08]
[amax-1080:00334] [14] python2(PyEval_CallObjectWithKeywords+0x6b) [0x4c8c8b]
[amax-1080:00334] [15] python2(PyEval_EvalFrameEx+0x2958) [0x5264a8]
[amax-1080:00334] [16] python2() [0x567d14]
[amax-1080:00334] [17] python2(PyRun_FileExFlags+0x92) [0x465bf4]
[amax-1080:00334] [18] python2(PyRun_SimpleFileExFlags+0x2ee) [0x46612d]
[amax-1080:00334] [19] python2(Py_Main+0xb5e) [0x466d92]
[amax-1080:00334] [20] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf5) [0x7f65b2756f45]
[amax-1080:00334] [21] python2() [0x577c2e]
[amax-1080:00334] *** End of error message ***
jxd@amax-1080:~/person_search-master$

I did not do the pretraining and directly use the trained model. As you said, the test can run without MPI, so I followed "use only one GPU, remove the mpirun -n 8 in L14 and change L16 to --gpu 0", but it shows the error above. How can I solve it? Thanks. In addition, when I use MPI as you advised, it also shows errors like this.

Cysu commented 7 years ago

It seems that you have different versions of OpenMPI. Say you compiled OpenMPI and installed it into a local directory like /home/jxd/openmpi. Then please add the following lines to your ~/.bashrc:

export PATH=/home/jxd/openmpi/bin:$PATH
export LD_LIBRARY_PATH=/home/jxd/openmpi/lib:$LD_LIBRARY_PATH

Restart the terminal, run rm -rf build, and compile Caffe again.

Jcdidi commented 7 years ago

Hello, I have successfully installed OpenMPI and tested that it works. Then I ran cmake on Caffe successfully, but the problems above still exist. When I try training, it hits the same problems. Thanks!

jxd@amax-1080:~/person_search-master$ experiments/scripts/train.sh 0 --set EXP_DIR resnet50

+ set -e
+ export PYTHONUNBUFFERED=True
+ PYTHONUNBUFFERED=True
+ GPU_ID=0
+ NET=resnet50
+ DATASET=psdb
+ array=($@)
+ len=4
+ EXTRA_ARGS='--set EXP_DIR resnet50'
+ EXTRA_ARGS_SLUG=--set_EXP_DIR_resnet50
+ case $DATASET in
+ TRAIN_IMDB=psdb_train
+ TEST_IMDB=psdb_test
+ PT_DIR=psdb
+ ITERS=50000
++ date +%Y-%m-%d_%H-%M-%S
+ LOG=experiments/logs/psdb_trainresnet50--set_EXP_DIR_resnet50.txt.2017-03-08_08-49-53
+ exec
++ tee -a experiments/logs/psdb_trainresnet50--set_EXP_DIR_resnet50.txt.2017-03-08_08-49-53
+ echo Logging output to experiments/logs/psdb_trainresnet50--set_EXP_DIR_resnet50.txt.2017-03-08_08-49-53
Logging output to experiments/logs/psdb_trainresnet50--set_EXP_DIR_resnet50.txt.2017-03-08_08-49-53
+ python2 tools/train_net.py --gpu 0 --solver models/psdb/resnet50/solver.prototxt --weights data/imagenet_models/resnet50.caffemodel --imdb psdb_train --iters 50000 --cfg experiments/cfgs/resnet50.yml --rand --set EXP_DIR resnet50
[amax-1080:22914] mca: base: component_find: unable to open /usr/lib/openmpi/lib/openmpi/mca_paffinity_hwloc: perhaps a missing symbol, or compiled for a different version of Open MPI? (ignored)
[amax-1080:22914] mca: base: component_find: unable to open /usr/lib/openmpi/lib/openmpi/mca_carto_auto_detect: perhaps a missing symbol, or compiled for a different version of Open MPI? (ignored)
[amax-1080:22914] mca: base: component_find: unable to open /usr/lib/openmpi/lib/openmpi/mca_carto_file: perhaps a missing symbol, or compiled for a different version of Open MPI? (ignored)
[amax-1080:22914] mca: base: component_find: unable to open /usr/lib/openmpi/lib/openmpi/mca_shmem_posix: perhaps a missing symbol, or compiled for a different version of Open MPI? (ignored)
[amax-1080:22914] mca: base: component_find: unable to open /usr/lib/openmpi/lib/openmpi/mca_crs_none: perhaps a missing symbol, or compiled for a different version of Open MPI? (ignored)

    A requested component was not found, or was unable to be opened. This means that this component is either not installed or is unable to be used on your system (e.g., sometimes this means that shared libraries that the component requires are unable to be found/loaded). Note that Open MPI stopped checking at the first component that it did not find.

Host:      amax-1080
Framework: crs
Component: none

[amax-1080:22914] *** Process received signal ***
[amax-1080:22914] Signal: Segmentation fault (11)
[amax-1080:22914] Signal code: Address not mapped (1)
[amax-1080:22914] Failing at address: 0x28
[amax-1080:22914] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x10330) [0x7f0a35507330]
[amax-1080:22914] [ 1] /usr/lib/libmpi.so.1(mca_base_select+0x11e) [0x7f09acb8bf1e]
[amax-1080:22914] [ 2] /usr/lib/libmpi.so.1(opal_crs_base_select+0x7e) [0x7f09acb7428e]
[amax-1080:22914] [ 3] /usr/lib/libmpi.so.1(opal_cr_init+0x3fc) [0x7f09acb94f1c]
[amax-1080:22914] [ 4] /usr/lib/libmpi.so.1(opal_init+0x1d0) [0x7f09acb9d810]
[amax-1080:22914] [ 5] /usr/lib/libmpi.so.1(orte_init+0x37) [0x7f09acb2d6e7]
[amax-1080:22914] [ 6] /usr/lib/libmpi.so.1(ompi_mpi_init+0x174) [0x7f09acaed024]
[amax-1080:22914] [ 7] /usr/lib/libmpi.so.1(PMPI_Init_thread+0xd4) [0x7f09acb047f4]
[amax-1080:22914] [ 8] /usr/local/lib/python2.7/dist-packages/mpi4py/MPI.so(initMPI+0x4716) [0x7f09acef20a6]
[amax-1080:22914] [ 9] python2(_PyImport_LoadDynamicModule+0x9b) [0x427992]
[amax-1080:22914] [10] python2() [0x55642f]
[amax-1080:22914] [11] python2() [0x4e2dec]
[amax-1080:22914] [12] python2() [0x556cf1]
[amax-1080:22914] [13] python2() [0x569c08]
[amax-1080:22914] [14] python2(PyEval_CallObjectWithKeywords+0x6b) [0x4c8c8b]
[amax-1080:22914] [15] python2(PyEval_EvalFrameEx+0x2958) [0x5264a8]
[amax-1080:22914] [16] python2() [0x567d14]
[amax-1080:22914] [17] python2(PyRun_FileExFlags+0x92) [0x465bf4]
[amax-1080:22914] [18] python2(PyRun_SimpleFileExFlags+0x2ee) [0x46612d]
[amax-1080:22914] [19] python2(Py_Main+0xb5e) [0x466d92]
[amax-1080:22914] [20] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf5) [0x7f0a35153f45]
[amax-1080:22914] [21] python2() [0x577c2e]
[amax-1080:22914] *** End of error message ***
experiments/scripts/train.sh: line 47: 22914 Segmentation fault (core dumped) python2 tools/train_net.py --gpu ${GPU_ID} --solver models/${PT_DIR}/${NET}/solver.prototxt --weights data/imagenet_models/${NET}.caffemodel --imdb ${TRAIN_IMDB} --iters ${ITERS} --cfg experiments/cfgs/${NET}.yml --rand ${EXTRA_ARGS}

Cysu commented 7 years ago

Could you please check the output of the following commands:

which mpirun
ldd $(which mpirun) | grep mpi
ldd caffe/build/install/bin/caffe | grep mpi
Jcdidi commented 7 years ago

Yeah, maybe I did not build Caffe successfully, since there is no information about it?

ldd: caffe/build/install/bin/caffe: No such file or directory

jxd@amax-1080:~$ which mpirun
/usr/local/openmpi/bin/mpirun
jxd@amax-1080:~$ ldd $(which mpirun) | grep mpi
        libopen-rte.so.12 => /usr/local/openmpi/lib/libopen-rte.so.12 (0x00007f75c7edc000)
        libopen-pal.so.13 => /usr/local/openmpi/lib/libopen-pal.so.13 (0x00007f75c7bfe000)
jxd@amax-1080:~$ ldd caffe/build/install/bin/caffe | grep mpi
ldd: caffe/build/install/bin/caffe: No such file or directory

Cysu commented 7 years ago

OK. You have another self-compiled openmpi installed at /usr/local/openmpi. So you need to add these lines to ~/.bashrc:

export PATH=/usr/local/openmpi/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/openmpi/lib:$LD_LIBRARY_PATH

Restart the terminal, remove the build directory under caffe, and recompile it following the steps in the README file.
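Before recompiling, it may help to check that the new PATH entry actually shadows the system OpenMPI. This is only a sketch: the `first_on_path` helper is hypothetical, and /usr/local/openmpi is the prefix from this thread.

```shell
#!/bin/sh
# Verify that the ~/.bashrc changes take effect, i.e. the self-compiled
# Open MPI is found before the system one. Adjust the prefix as needed.
export PATH="/usr/local/openmpi/bin:$PATH"
export LD_LIBRARY_PATH="/usr/local/openmpi/lib:${LD_LIBRARY_PATH:-}"

first_on_path() {
    # Echo the first match of the command on PATH, or "(none)" if absent.
    command -v "$1" || echo "(none)"
}

echo "mpirun resolves to: $(first_on_path mpirun)"
```

If mpirun still resolves to a path under /usr, the shell has not picked up the new PATH yet (e.g. ~/.bashrc was not re-sourced), and the rebuild would link against the wrong copy again.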

Jcdidi commented 7 years ago

Yes, I added those lines to ~/.bashrc and recompiled yesterday. Are there two OpenMPI installations on the system? Now I will remove the build directory again and recompile. Thanks.

Cysu commented 7 years ago

Right. In your previous log, it complains

mca: base: component_find: unable to open /usr/lib/openmpi/lib/openmpi/mca_paffinity_hwloc: perhaps a missing symbol, or compiled for a different version of Open MPI? (ignored)

So you have a system-installed OpenMPI under /usr/lib, and a self-compiled one at /usr/local/openmpi.
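A clash like this can be surveyed with a short diagnostic. The sketch below is an illustration, not part of the project: the `list_mpi` helper is hypothetical, and /usr/local/openmpi is the self-compiled prefix from this thread.

```shell
#!/bin/sh
# List every Open MPI copy visible on this machine, to spot a
# system-vs-self-compiled clash. Adjust the prefix to your setup.
list_mpi() {
    echo "mpirun on PATH: $(command -v mpirun || echo '(none)')"
    echo "loader-registered libmpi copies:"
    ldconfig -p 2>/dev/null | grep libmpi || echo "  (none registered)"
    echo "self-compiled copy, if any:"
    ls /usr/local/openmpi/lib/libmpi* 2>/dev/null || echo "  (not present)"
}

list_mpi
```

Two different libmpi locations in this output are exactly the situation that produces the "compiled for a different version of Open MPI" warnings in the logs above.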

Jcdidi commented 7 years ago

Thanks a lot! I found the issue. I added this line to ~/.bashrc:

export LD_PRELOAD=/usr/local/openmpi/lib/libmpi.so
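For anyone hitting the same clash, a small check can confirm which libmpi the mpi4py extension actually resolves to. This is only a sketch: the `report_mpi_link` helper is hypothetical, and the MPI.so path is copied from the backtrace earlier in this thread.

```shell
#!/bin/sh
# Show the libmpi dependency of mpi4py's compiled extension; after the
# LD_PRELOAD fix it should point at the self-compiled Open MPI copy.
MPI_SO="${MPI_SO:-/usr/local/lib/python2.7/dist-packages/mpi4py/MPI.so}"

report_mpi_link() {
    if [ -e "$1" ]; then
        ldd "$1" | grep libmpi || echo "no libmpi dependency reported"
    else
        echo "mpi4py extension not found at $1"
    fi
}

report_mpi_link "$MPI_SO"
```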

all detection:
  recall = 79.37%
  ap = 74.82%
labeled only detection:
  recall = 97.76%
search ranking:
  mAP = 75.41%
  top- 1 = 78.48%
  top- 5 = 90.07%
  top-10 = 92.34%

Cysu commented 7 years ago

Good to hear that! Will close the issue for now, and please feel free to reopen it if there are further problems.