Cysu / dgd_person_reid

Domain Guided Dropout for Person Re-identification
http://arxiv.org/abs/1604.07528

about exp_individually.sh prid #17

Closed montego7878 closed 7 years ago

montego7878 commented 7 years ago

montego7878@montego7878-All-Series:~/下載/dgd_person_reid/scripts$ ./exp_individually.sh prid

[libprotobuf ERROR google/protobuf/text_format.cc:245] Error parsing text-format caffe.SolverParameter: 16:7: Message type "caffe.SolverParameter" has no field named "min_lr". [libprotobuf ERROR google/protobuf/text_format.cc:245] Error parsing text-format caffe.SolverParameter: 16:7: Message type "caffe.SolverParameter" has no field named "min_lr". F1208 19:46:38.284919 17113 upgrade_proto.cpp:1095] Check failed: ReadProtoFromTextFile(param_file, param) Failed to parse SolverParameter file: models/individually/prid_solver.prototxt F1208 19:46:38.284919 17114 upgrade_proto.cpp:1095] Check failed: ReadProtoFromTextFile(param_file, param) Failed to parse SolverParameter file: models/individually/prid_solver.prototxt Check failure stack trace: Check failure stack trace: @ 0x7fbc2053edaa (unknown) @ 0x7fc90f42ddaa (unknown) @ 0x7fbc2053ece4 (unknown) @ 0x7fc90f42dce4 (unknown) @ 0x7fbc2053e6e6 (unknown) @ 0x7fc90f42d6e6 (unknown) @ 0x7fbc20541687 (unknown) @ 0x7fc90f430687 (unknown) @ 0x7fbc20cbcf3e caffe::ReadSolverParamsFromTextFileOrDie() @ 0x407f14 train() @ 0x405b3c main @ 0x7fc90fbabf3e caffe::ReadSolverParamsFromTextFileOrDie() @ 0x407f14 train() @ 0x405b3c main @ 0x7fbc1f54af45 (unknown) @ 0x4063ab (unknown) @ 0x7fc90e439f45 (unknown) @ 0x4063ab (unknown) @ (nil) (unknown) @ (nil) (unknown)

mpirun noticed that process rank 0 with PID 17113 on node montego7878-All-Series exited on signal 6 (Aborted).

2 total processes killed (some possibly by mpirun during cleanup) Extracting train set E1208 19:46:38.383947 17138 extract_features.cpp:52] Using GPU E1208 19:46:38.384215 17138 extract_features.cpp:58] Using Device_id=0 [libprotobuf ERROR google/protobuf/text_format.cc:245] Error parsing text-format caffe.NetParameter: 7:46: Message type "caffe.TransformationParameter" has no field named "crop_height". F1208 19:46:38.587419 17138 upgrade_proto.cpp:88] Check failed: ReadProtoFromTextFile(param_file, param) Failed to parse NetParameter file: /tmp/tmp.S4fBxURPtd Check failure stack trace: @ 0x7fa7e766cdaa (unknown) @ 0x7fa7e766cce4 (unknown) @ 0x7fa7e766c6e6 (unknown) @ 0x7fa7e766f687 (unknown) @ 0x7fa7e7b8698e caffe::ReadNetParamsFromTextFileOrDie() @ 0x7fa7e7b6e61f caffe::Net<>::Net() @ 0x408cf3 feature_extraction_pipeline<>() @ 0x7fa7e6898f45 (unknown) @ 0x4048be (unknown) @ (nil) (unknown) scripts/routines.sh: line 93: 17138 已經終止 (core dumped) ${CAFFE_DIR}/build/tools/extract_features ${trained_model} ${model} ${blob},label ${result_dir}/${subset}_features_lmdb,${result_dir}/${subset}_labels_lmdb ${num_iters} lmdb GPU 0 Traceback (most recent call last): File "tools/convert_lmdb_to_numpy.py", line 2, in import lmdb ImportError: No module named lmdb Traceback (most recent call last): File "tools/convert_lmdb_to_numpy.py", line 2, in import lmdb ImportError: No module named lmdb Extracting val set E1208 19:46:38.741158 17152 extract_features.cpp:52] Using GPU E1208 19:46:38.741426 17152 extract_features.cpp:58] Using Device_id=0 [libprotobuf ERROR google/protobuf/text_format.cc:245] Error parsing text-format caffe.NetParameter: 7:46: Message type "caffe.TransformationParameter" has no field named "crop_height". 
F1208 19:46:38.930703 17152 upgrade_proto.cpp:88] Check failed: ReadProtoFromTextFile(param_file, param) Failed to parse NetParameter file: /tmp/tmp.KOZuJPVKch Check failure stack trace: @ 0x7fdb7dc05daa (unknown) @ 0x7fdb7dc05ce4 (unknown) @ 0x7fdb7dc056e6 (unknown) @ 0x7fdb7dc08687 (unknown) @ 0x7fdb7e11f98e caffe::ReadNetParamsFromTextFileOrDie() @ 0x7fdb7e10761f caffe::Net<>::Net() @ 0x408cf3 feature_extraction_pipeline<>() @ 0x7fdb7ce31f45 (unknown) @ 0x4048be (unknown) @ (nil) (unknown) scripts/routines.sh: line 93: 17152 已經終止 (core dumped) ${CAFFE_DIR}/build/tools/extract_features ${trained_model} ${model} ${blob},label ${result_dir}/${subset}_features_lmdb,${result_dir}/${subset}_labels_lmdb ${num_iters} lmdb GPU 0 Traceback (most recent call last): File "tools/convert_lmdb_to_numpy.py", line 2, in import lmdb ImportError: No module named lmdb Traceback (most recent call last): File "tools/convert_lmdb_to_numpy.py", line 2, in import lmdb ImportError: No module named lmdb Extracting test_probe set E1208 19:46:39.085443 17166 extract_features.cpp:52] Using GPU E1208 19:46:39.085688 17166 extract_features.cpp:58] Using Device_id=0 [libprotobuf ERROR google/protobuf/text_format.cc:245] Error parsing text-format caffe.NetParameter: 7:46: Message type "caffe.TransformationParameter" has no field named "crop_height". 
F1208 19:46:39.277931 17166 upgrade_proto.cpp:88] Check failed: ReadProtoFromTextFile(param_file, param) Failed to parse NetParameter file: /tmp/tmp.MnmAclhImz Check failure stack trace: @ 0x7f5029da6daa (unknown) @ 0x7f5029da6ce4 (unknown) @ 0x7f5029da66e6 (unknown) @ 0x7f5029da9687 (unknown) @ 0x7f502a2c098e caffe::ReadNetParamsFromTextFileOrDie() @ 0x7f502a2a861f caffe::Net<>::Net() @ 0x408cf3 feature_extraction_pipeline<>() @ 0x7f5028fd2f45 (unknown) @ 0x4048be (unknown) @ (nil) (unknown) scripts/routines.sh: line 93: 17166 已經終止 (core dumped) ${CAFFE_DIR}/build/tools/extract_features ${trained_model} ${model} ${blob},label ${result_dir}/${subset}_features_lmdb,${result_dir}/${subset}_labels_lmdb ${num_iters} lmdb GPU 0 Traceback (most recent call last): File "tools/convert_lmdb_to_numpy.py", line 2, in import lmdb ImportError: No module named lmdb Traceback (most recent call last): File "tools/convert_lmdb_to_numpy.py", line 2, in import lmdb ImportError: No module named lmdb Extracting test_gallery set E1208 19:46:39.435348 17181 extract_features.cpp:52] Using GPU E1208 19:46:39.435585 17181 extract_features.cpp:58] Using Device_id=0 [libprotobuf ERROR google/protobuf/text_format.cc:245] Error parsing text-format caffe.NetParameter: 7:46: Message type "caffe.TransformationParameter" has no field named "crop_height". 
F1208 19:46:39.623297 17181 upgrade_proto.cpp:88] Check failed: ReadProtoFromTextFile(param_file, param) Failed to parse NetParameter file: /tmp/tmp.oZi1WOYwfm Check failure stack trace: @ 0x7fac5ef6edaa (unknown) @ 0x7fac5ef6ece4 (unknown) @ 0x7fac5ef6e6e6 (unknown) @ 0x7fac5ef71687 (unknown) @ 0x7fac5f48898e caffe::ReadNetParamsFromTextFileOrDie() @ 0x7fac5f47061f caffe::Net<>::Net() @ 0x408cf3 feature_extraction_pipeline<>() @ 0x7fac5e19af45 (unknown) @ 0x4048be (unknown) @ (nil) (unknown) scripts/routines.sh: line 93: 17181 已經終止 (core dumped) ${CAFFE_DIR}/build/tools/extract_features ${trained_model} ${model} ${blob},label ${result_dir}/${subset}_features_lmdb,${result_dir}/${subset}_labels_lmdb ${num_iters} lmdb GPU 0 Traceback (most recent call last): File "tools/convert_lmdb_to_numpy.py", line 2, in import lmdb ImportError: No module named lmdb Traceback (most recent call last): File "tools/convert_lmdb_to_numpy.py", line 2, in import lmdb ImportError: No module named lmdb Traceback (most recent call last): File "eval/metric_learning.py", line 104, in main(args) File "eval/metric_learning.py", line 73, in main X, Y = _get_train_data(args.result_dir) File "eval/metric_learning.py", line 11, in _get_traindata features = np.r[np.load(osp.join(result_dir, 'train_features.npy')), File "/home/montego7878/anaconda2/lib/python2.7/site-packages/numpy/lib/npyio.py", line 362, in load fid = open(file, "rb") IOError: [Errno 2] No such file or directory: 'external/exp/results/individually/prid_prid_iter_11000_fc7_bn/train_features.npy' Extracting train set E1208 19:46:39.992875 17211 extract_features.cpp:52] Using GPU E1208 19:46:39.993129 17211 extract_features.cpp:58] Using Device_id=0 [libprotobuf ERROR google/protobuf/text_format.cc:245] Error parsing text-format caffe.NetParameter: 7:46: Message type "caffe.TransformationParameter" has no field named "crop_height". 
F1208 19:46:40.181699 17211 upgrade_proto.cpp:88] Check failed: ReadProtoFromTextFile(param_file, param) Failed to parse NetParameter file: /tmp/tmp.6rvyXNylk5 Check failure stack trace: @ 0x7f2c5fdb0daa (unknown) @ 0x7f2c5fdb0ce4 (unknown) @ 0x7f2c5fdb06e6 (unknown) @ 0x7f2c5fdb3687 (unknown) @ 0x7f2c602ca98e caffe::ReadNetParamsFromTextFileOrDie() @ 0x7f2c602b261f caffe::Net<>::Net() @ 0x408cf3 feature_extraction_pipeline<>() @ 0x7f2c5efdcf45 (unknown) @ 0x4048be (unknown) @ (nil) (unknown) scripts/routines.sh: line 93: 17211 已經終止 (core dumped) ${CAFFE_DIR}/build/tools/extract_features ${trained_model} ${model} ${blob},label ${result_dir}/${subset}_features_lmdb,${result_dir}/${subset}_labels_lmdb ${num_iters} lmdb GPU 0 Traceback (most recent call last): File "tools/convert_lmdb_to_numpy.py", line 2, in import lmdb ImportError: No module named lmdb Traceback (most recent call last): File "tools/convert_lmdb_to_numpy.py", line 2, in import lmdb ImportError: No module named lmdb Extracting val set E1208 19:46:40.338579 17225 extract_features.cpp:52] Using GPU E1208 19:46:40.338822 17225 extract_features.cpp:58] Using Device_id=0 [libprotobuf ERROR google/protobuf/text_format.cc:245] Error parsing text-format caffe.NetParameter: 7:46: Message type "caffe.TransformationParameter" has no field named "crop_height". 
F1208 19:46:40.527302 17225 upgrade_proto.cpp:88] Check failed: ReadProtoFromTextFile(param_file, param) Failed to parse NetParameter file: /tmp/tmp.2fKuobB8uG Check failure stack trace: @ 0x7f2b7aa3cdaa (unknown) @ 0x7f2b7aa3cce4 (unknown) @ 0x7f2b7aa3c6e6 (unknown) @ 0x7f2b7aa3f687 (unknown) @ 0x7f2b7af5698e caffe::ReadNetParamsFromTextFileOrDie() @ 0x7f2b7af3e61f caffe::Net<>::Net() @ 0x408cf3 feature_extraction_pipeline<>() @ 0x7f2b79c68f45 (unknown) @ 0x4048be (unknown) @ (nil) (unknown) scripts/routines.sh: line 93: 17225 已經終止 (core dumped) ${CAFFE_DIR}/build/tools/extract_features ${trained_model} ${model} ${blob},label ${result_dir}/${subset}_features_lmdb,${result_dir}/${subset}_labels_lmdb ${num_iters} lmdb GPU 0 Traceback (most recent call last): File "tools/convert_lmdb_to_numpy.py", line 2, in import lmdb ImportError: No module named lmdb Traceback (most recent call last): File "tools/convert_lmdb_to_numpy.py", line 2, in import lmdb ImportError: No module named lmdb Extracting test_probe set E1208 19:46:40.683516 17239 extract_features.cpp:52] Using GPU E1208 19:46:40.683779 17239 extract_features.cpp:58] Using Device_id=0 [libprotobuf ERROR google/protobuf/text_format.cc:245] Error parsing text-format caffe.NetParameter: 7:46: Message type "caffe.TransformationParameter" has no field named "crop_height". 
F1208 19:46:40.879796 17239 upgrade_proto.cpp:88] Check failed: ReadProtoFromTextFile(param_file, param) Failed to parse NetParameter file: /tmp/tmp.7Yctcxn0J1 Check failure stack trace: @ 0x7f19b94e7daa (unknown) @ 0x7f19b94e7ce4 (unknown) @ 0x7f19b94e76e6 (unknown) @ 0x7f19b94ea687 (unknown) @ 0x7f19b9a0198e caffe::ReadNetParamsFromTextFileOrDie() @ 0x7f19b99e961f caffe::Net<>::Net() @ 0x408cf3 feature_extraction_pipeline<>() @ 0x7f19b8713f45 (unknown) @ 0x4048be (unknown) @ (nil) (unknown) scripts/routines.sh: line 93: 17239 已經終止 (core dumped) ${CAFFE_DIR}/build/tools/extract_features ${trained_model} ${model} ${blob},label ${result_dir}/${subset}_features_lmdb,${result_dir}/${subset}_labels_lmdb ${num_iters} lmdb GPU 0 Traceback (most recent call last): File "tools/convert_lmdb_to_numpy.py", line 2, in import lmdb ImportError: No module named lmdb Traceback (most recent call last): File "tools/convert_lmdb_to_numpy.py", line 2, in import lmdb ImportError: No module named lmdb Extracting test_gallery set E1208 19:46:41.036201 17253 extract_features.cpp:52] Using GPU E1208 19:46:41.036490 17253 extract_features.cpp:58] Using Device_id=0 [libprotobuf ERROR google/protobuf/text_format.cc:245] Error parsing text-format caffe.NetParameter: 7:46: Message type "caffe.TransformationParameter" has no field named "crop_height". 
F1208 19:46:41.226066 17253 upgrade_proto.cpp:88] Check failed: ReadProtoFromTextFile(param_file, param) Failed to parse NetParameter file: /tmp/tmp.yAlDCC7flD Check failure stack trace: @ 0x7f4ea9623daa (unknown) @ 0x7f4ea9623ce4 (unknown) @ 0x7f4ea96236e6 (unknown) @ 0x7f4ea9626687 (unknown) @ 0x7f4ea9b3d98e caffe::ReadNetParamsFromTextFileOrDie() @ 0x7f4ea9b2561f caffe::Net<>::Net() @ 0x408cf3 feature_extraction_pipeline<>() @ 0x7f4ea884ff45 (unknown) @ 0x4048be (unknown) @ (nil) (unknown) scripts/routines.sh: line 93: 17253 已經終止 (core dumped) ${CAFFE_DIR}/build/tools/extract_features ${trained_model} ${model} ${blob},label ${result_dir}/${subset}_features_lmdb,${result_dir}/${subset}_labels_lmdb ${num_iters} lmdb GPU 0 Traceback (most recent call last): File "tools/convert_lmdb_to_numpy.py", line 2, in import lmdb ImportError: No module named lmdb Traceback (most recent call last): File "tools/convert_lmdb_to_numpy.py", line 2, in import lmdb ImportError: No module named lmdb Traceback (most recent call last): File "eval/metric_learning.py", line 104, in main(args) File "eval/metric_learning.py", line 73, in main X, Y = _get_train_data(args.result_dir) File "eval/metric_learning.py", line 11, in _get_traindata features = np.r[np.load(osp.join(result_dir, 'train_features.npy')), File "/home/montego7878/anaconda2/lib/python2.7/site-packages/numpy/lib/npyio.py", line 362, in load fid = open(file, "rb") IOError: [Errno 2] No such file or directory: 'external/exp/results/individually/prid_prid_iter_11000_fc7_bn/train_features.npy'
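The output above repeats the same few failures many times. As a reading aid, here is a minimal sketch (not part of the repository) that pulls the distinct root causes out of such a log: protobuf fields the installed Caffe build does not know, and Python modules that fail to import. The function name `summarize_failures` and the two regular expressions are illustrative assumptions based only on the message formats visible in this log. The later `IOError` about `train_features.npy` is most likely a consequence of these earlier failures, since feature extraction never ran.

```python
import re

# Patterns modeled on the messages in the log above (assumed formats).
UNKNOWN_FIELD = re.compile(r'Message type "([\w.]+)" has no field named "(\w+)"')
MISSING_MODULE = re.compile(r"ImportError: No module named (\w+)")


def summarize_failures(log_text):
    """Return (unknown protobuf fields, missing Python modules), de-duplicated."""
    fields = sorted(set(UNKNOWN_FIELD.findall(log_text)))
    modules = sorted(set(MISSING_MODULE.findall(log_text)))
    return fields, modules


if __name__ == "__main__":
    # A shortened excerpt of the log; paste the full output to analyze it.
    sample = (
        'Message type "caffe.SolverParameter" has no field named "min_lr". '
        'Message type "caffe.TransformationParameter" has no field named "crop_height". '
        "ImportError: No module named lmdb ImportError: No module named lmdb"
    )
    fields, modules = summarize_failures(sample)
    for message_type, field in fields:
        print("unknown field %s in %s" % (field, message_type))
    for module in modules:
        print("missing module %s" % module)
```

Unknown fields such as `min_lr` and `crop_height` usually mean the prototxt files are being parsed by a different Caffe build than the one they were written for.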

pingjun18-li commented 7 years ago

I encountered the same problem. Have you solved it? If so, please tell me how.

Cysu commented 7 years ago

@montego7878 @pingjunLi Please clone this repo recursively with

git clone --recursive https://github.com/Cysu/dgd_person_reid.git

and then compile the external/caffe.

Also please install the lmdb python package with

pip install lmdb
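
Before rerunning the experiment, it may help to confirm that both steps took effect. The sketch below is an assumption-laden sanity check, not a script from the repository: `submodule_populated` and `module_available` are hypothetical helper names, and the check only verifies that `external/caffe` is a non-empty directory and that `lmdb` imports under the interpreter running the script. It is written to run under both Python 2.7 and Python 3. For an existing non-recursive clone, `git submodule update --init --recursive` should fetch the missing submodule without cloning again.

```python
import os


def submodule_populated(path):
    """True if `path` is an existing, non-empty directory."""
    return os.path.isdir(path) and len(os.listdir(path)) > 0


def module_available(name):
    """True if `import name` succeeds under the current interpreter."""
    try:
        __import__(name)
        return True
    except ImportError:
        return False


if __name__ == "__main__":
    # Run from the repository root with the same Python used by the scripts.
    print("external/caffe populated: %s" % submodule_populated("external/caffe"))
    print("lmdb importable: %s" % module_available("lmdb"))
```

A populated directory does not prove that Caffe was compiled, so the build step is still required.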

pingjun18-li commented 7 years ago

Thank you for your help. I have done as you said, but there is still a problem, and I don't know why. Here is part of the output:

1228 15:59:34.480654 20776 layer_factory.hpp:74] Creating layer fc7_bn I1228 15:59:34.480659 20776 layer_factory.cpp:191] Layer fc7_bn is using CAFFE engine. I1228 15:59:34.480666 20776 net.cpp:133] Creating Layer fc7_bn I1228 15:59:34.480671 20776 net.cpp:453] fc7_bn <- fc7 I1228 15:59:34.480681 20776 net.cpp:411] fc7_bn -> fc7_bn I1228 15:59:34.480690 20776 net.cpp:163] Setting up fc7_bn I1228 15:59:34.480710 20776 net.cpp:170] Top shape: 20 256 (5120) I1228 15:59:34.480718 20776 layer_factory.hpp:74] Creating layer relu7 I1228 15:59:34.480726 20776 net.cpp:133] Creating Layer relu7 I1228 15:59:34.480729 20776 net.cpp:453] relu7 <- fc7_bn I1228 15:59:34.480741 20776 net.cpp:400] relu7 -> fc7_bn (in-place) I1228 15:59:34.480746 20776 net.cpp:163] Setting up relu7 I1228 15:59:34.480753 20776 net.cpp:170] Top shape: 20 256 (5120) I1228 15:59:34.480775 20776 layer_factory.hpp:74] Creating layer drop7 I1228 15:59:34.480787 20776 net.cpp:133] Creating Layer drop7 I1228 15:59:34.480790 20776 net.cpp:453] drop7 <- fc7_bn I1228 15:59:34.480798 20776 net.cpp:400] drop7 -> fc7_bn (in-place) I1228 15:59:34.480804 20776 net.cpp:163] Setting up drop7 I1228 15:59:34.480813 20776 net.cpp:170] Top shape: 20 256 (5120) I1228 15:59:34.480818 20776 layer_factory.hpp:74] Creating layer fc8_prid I1228 15:59:34.480828 20776 net.cpp:133] Creating Layer fc8_prid I1228 15:59:34.480832 20776 net.cpp:453] fc8_prid <- fc7_bn I1228 15:59:34.480839 20776 net.cpp:411] fc8_prid -> fc8_prid I1228 15:59:34.480849 20776 net.cpp:163] Setting up fc8_prid I1228 15:59:34.485260 20776 net.cpp:170] Top shape: 20 385 (7700) I1228 15:59:34.485273 20776 layer_factory.hpp:74] Creating layer fc8_prid_fc8_prid_0_split I1228 15:59:34.485286 20776 net.cpp:133] Creating Layer fc8_prid_fc8_prid_0_split I1228 15:59:34.485292 20776 net.cpp:453] fc8_prid_fc8_prid_0_split <- fc8_prid
I1228 15:59:34.485301 20776 net.cpp:411] fc8_prid_fc8_prid_0_split -> fc8_prid_fc8_prid_0_split_0 I1228 15:59:34.485311 20776 net.cpp:411] fc8_prid_fc8_prid_0_split -> fc8_prid_fc8_prid_0_split_1 I1228 15:59:34.485317 20776 net.cpp:163] Setting up fc8_prid_fc8_prid_0_split I1228 15:59:34.485326 20776 net.cpp:170] Top shape: 20 385 (7700) I1228 15:59:34.485332 20776 net.cpp:170] Top shape: 20 385 (7700) I1228 15:59:34.485337 20776 layer_factory.hpp:74] Creating layer loss I1228 15:59:34.485343 20776 net.cpp:133] Creating Layer loss I1228 15:59:34.485352 20776 net.cpp:453] loss <- fc8_prid_fc8_prid_0_split_0 I1228 15:59:34.485358 20776 net.cpp:453] loss <- label_data_1_split_0 I1228 15:59:34.485365 20776 net.cpp:411] loss -> loss I1228 15:59:34.485373 20776 net.cpp:163] Setting up loss I1228 15:59:34.485380 20776 layer_factory.hpp:74] Creating layer loss I1228 15:59:34.485415 20776 net.cpp:170] Top shape: (1) I1228 15:59:34.485420 20776 net.cpp:172] with loss weight 1 I1228 15:59:34.485436 20776 layer_factory.hpp:74] Creating layer accuracy I1228 15:59:34.485452 20776 net.cpp:133] Creating Layer accuracy I1228 15:59:34.485457 20776 net.cpp:453] accuracy <- fc8_prid_fc8_prid_0_split_1 I1228 15:59:34.485463 20776 net.cpp:453] accuracy <- label_data_1_split_1 I1228 15:59:34.485471 20776 net.cpp:411] accuracy -> accuracy I1228 15:59:34.485478 20776 net.cpp:163] Setting up accuracy I1228 15:59:34.485486 20776 net.cpp:170] Top shape: (1) I1228 15:59:34.485491 20776 net.cpp:237] accuracy does not need backward computation. I1228 15:59:34.485496 20776 net.cpp:235] loss needs backward computation. I1228 15:59:34.485502 20776 net.cpp:235] fc8_prid_fc8_prid_0_split needs backward computation. I1228 15:59:34.485507 20776 net.cpp:235] fc8_prid needs backward computation. I1228 15:59:34.485510 20776 net.cpp:235] drop7 needs backward computation. I1228 15:59:34.485515 20776 net.cpp:235] relu7 needs backward computation. 
I1228 15:59:34.485519 20776 net.cpp:235] fc7_bn needs backward computation. I1228 15:59:34.485523 20776 net.cpp:235] fc7 needs backward computation. I1228 15:59:34.485528 20776 net.cpp:235] global_pool needs backward computation. I1228 15:59:34.485532 20776 net.cpp:235] inception_3b/output needs backward computation. I1228 15:59:34.485538 20776 net.cpp:235] inception_3b/pool needs backward computation. I1228 15:59:34.485546 20776 net.cpp:235] inception_3b/relu_double_3x3_2 needs backward computation. I1228 15:59:34.485551 20776 net.cpp:235] inception_3b/double_3x3_2_bn needs backward computation. I1228 15:59:34.485556 20776 net.cpp:235] inception_3b/double_3x3_2 needs backward computation. I1228 15:59:34.485561 20776 net.cpp:235] inception_3b/relu_double_3x3_1 needs backward computation. I1228 15:59:34.485565 20776 net.cpp:235] inception_3b/double_3x3_1_bn needs backward computation. I1228 15:59:34.485570 20776 net.cpp:235] inception_3b/double_3x3_1 needs backward computation. I1228 15:59:34.485575 20776 net.cpp:235] inception_3b/relu_double_3x3_reduce needs backward computation. I1228 15:59:34.485579 20776 net.cpp:235] inception_3b/double_3x3_reduce_bn needs backward computation. I1228 15:59:34.485584 20776 net.cpp:235] inception_3b/double_3x3_reduce needs backward computation. I1228 15:59:34.485589 20776 net.cpp:235] inception_3b/relu_3x3 needs backward computation. I1228 15:59:34.485594 20776 net.cpp:235] inception_3b/3x3_bn needs backward computation. I1228 15:59:34.485599 20776 net.cpp:235] inception_3b/3x3 needs backward computation. I1228 15:59:34.485604 20776 net.cpp:235] inception_3b/relu_3x3_reduce needs backward computation. I1228 15:59:34.485608 20776 net.cpp:235] inception_3b/3x3_reduce_bn needs backward computation. I1228 15:59:34.485615 20776 net.cpp:235] inception_3b/3x3_reduce needs backward computation. I1228 15:59:34.485620 20776 net.cpp:235] inception_3a/output_inception_3a/output_0_split needs backward computation. 
I1228 15:59:34.485625 20776 net.cpp:235] inception_3a/output needs backward computation. I1228 15:59:34.485631 20776 net.cpp:235] inception_3a/relu_pool_proj needs backward computation. I1228 15:59:34.485635 20776 net.cpp:235] inception_3a/pool_proj_bn needs backward computation. I1228 15:59:34.485641 20776 net.cpp:235] inception_3a/pool_proj needs backward computation. I1228 15:59:34.485646 20776 net.cpp:235] inception_3a/pool needs backward computation. I1228 15:59:34.485651 20776 net.cpp:235] inception_3a/relu_double_3x3_2 needs backward computation. I1228 15:59:34.485656 20776 net.cpp:235] inception_3a/double_3x3_2_bn needs backward computation. I1228 15:59:34.485661 20776 net.cpp:235] inception_3a/double_3x3_2 needs backward computation. I1228 15:59:34.485666 20776 net.cpp:235] inception_3a/relu_double_3x3_1 needs backward computation. I1228 15:59:34.485669 20776 net.cpp:235] inception_3a/double_3x3_1_bn needs backward computation. I1228 15:59:34.485674 20776 net.cpp:235] inception_3a/double_3x3_1 needs backward computation. I1228 15:59:34.485679 20776 net.cpp:235] inception_3a/relu_double_3x3_reduce needs backward computation. I1228 15:59:34.485683 20776 net.cpp:235] inception_3a/double_3x3_reduce_bn needs backward computation. I1228 15:59:34.485688 20776 net.cpp:235] inception_3a/double_3x3_reduce needs backward computation. I1228 15:59:34.485694 20776 net.cpp:235] inception_3a/relu_3x3 needs backward computation. I1228 15:59:34.485698 20776 net.cpp:235] inception_3a/3x3_bn needs backward computation. I1228 15:59:34.485704 20776 net.cpp:235] inception_3a/3x3 needs backward computation. I1228 15:59:34.485709 20776 net.cpp:235] inception_3a/relu_3x3_reduce needs backward computation. I1228 15:59:34.485713 20776 net.cpp:235] inception_3a/3x3_reduce_bn needs backward computation. I1228 15:59:34.485719 20776 net.cpp:235] inception_3a/3x3_reduce needs backward computation. I1228 15:59:34.485724 20776 net.cpp:235] inception_3a/relu_1x1 needs backward computation. 
I1228 15:59:34.485728 20776 net.cpp:235] inception_3a/1x1_bn needs backward computation. I1228 15:59:34.485733 20776 net.cpp:235] inception_3a/1x1 needs backward computation. I1228 15:59:34.485738 20776 net.cpp:235] inception_2b/output_inception_2b/output_0_split needs backward computation. I1228 15:59:34.485743 20776 net.cpp:235] inception_2b/output needs backward computation. I1228 15:59:34.485750 20776 net.cpp:235] inception_2b/pool needs backward computation. I1228 15:59:34.485755 20776 net.cpp:235] inception_2b/relu_double_3x3_2 needs backward computation. I1228 15:59:34.485760 20776 net.cpp:235] inception_2b/double_3x3_2_bn needs backward computation. I1228 15:59:34.485765 20776 net.cpp:235] inception_2b/double_3x3_2 needs backward computation. I1228 15:59:34.485770 20776 net.cpp:235] inception_2b/relu_double_3x3_1 needs backward computation. I1228 15:59:34.485775 20776 net.cpp:235] inception_2b/double_3x3_1_bn needs backward computation. I1228 15:59:34.485780 20776 net.cpp:235] inception_2b/double_3x3_1 needs backward computation. I1228 15:59:34.485785 20776 net.cpp:235] inception_2b/relu_double_3x3_reduce needs backward computation. I1228 15:59:34.485790 20776 net.cpp:235] inception_2b/double_3x3_reduce_bn needs backward computation. I1228 15:59:34.485795 20776 net.cpp:235] inception_2b/double_3x3_reduce needs backward computation. I1228 15:59:34.485800 20776 net.cpp:235] inception_2b/relu_3x3 needs backward computation. I1228 15:59:34.485805 20776 net.cpp:235] inception_2b/3x3_bn needs backward computation. I1228 15:59:34.485810 20776 net.cpp:235] inception_2b/3x3 needs backward computation. I1228 15:59:34.485815 20776 net.cpp:235] inception_2b/relu_3x3_reduce needs backward computation. I1228 15:59:34.485819 20776 net.cpp:235] inception_2b/3x3_reduce_bn needs backward computation. I1228 15:59:34.485824 20776 net.cpp:235] inception_2b/3x3_reduce needs backward computation. 
I1228 15:59:34.485829 20776 net.cpp:235] inception_2a/output_inception_2a/output_0_split needs backward computation. I1228 15:59:34.485836 20776 net.cpp:235] inception_2a/output needs backward computation. I1228 15:59:34.485842 20776 net.cpp:235] inception_2a/relu_pool_proj needs backward computation. I1228 15:59:34.485847 20776 net.cpp:235] inception_2a/pool_proj_bn needs backward computation. I1228 15:59:34.485852 20776 net.cpp:235] inception_2a/pool_proj needs backward computation. I1228 15:59:34.485857 20776 net.cpp:235] inception_2a/pool needs backward computation. I1228 15:59:34.485864 20776 net.cpp:235] inception_2a/relu_double_3x3_2 needs backward computation. I1228 15:59:34.485869 20776 net.cpp:235] inception_2a/double_3x3_2_bn needs backward computation. I1228 15:59:34.485874 20776 net.cpp:235] inception_2a/double_3x3_2 needs backward computation. I1228 15:59:34.485879 20776 net.cpp:235] inception_2a/relu_double_3x3_1 needs backward computation. I1228 15:59:34.485884 20776 net.cpp:235] inception_2a/double_3x3_1_bn needs backward computation. I1228 15:59:34.485889 20776 net.cpp:235] inception_2a/double_3x3_1 needs backward computation. I1228 15:59:34.485894 20776 net.cpp:235] inception_2a/relu_double_3x3_reduce needs backward computation. I1228 15:59:34.485898 20776 net.cpp:235] inception_2a/double_3x3_reduce_bn needs backward computation. I1228 15:59:34.485903 20776 net.cpp:235] inception_2a/double_3x3_reduce needs backward computation. I1228 15:59:34.485908 20776 net.cpp:235] inception_2a/relu_3x3 needs backward computation. I1228 15:59:34.485913 20776 net.cpp:235] inception_2a/3x3_bn needs backward computation. I1228 15:59:34.485918 20776 net.cpp:235] inception_2a/3x3 needs backward computation. I1228 15:59:34.485924 20776 net.cpp:235] inception_2a/relu_3x3_reduce needs backward computation. I1228 15:59:34.485929 20776 net.cpp:235] inception_2a/3x3_reduce_bn needs backward computation. 
I1228 15:59:34.485934 20776 net.cpp:235] inception_2a/3x3_reduce needs backward computation. I1228 15:59:34.485939 20776 net.cpp:235] inception_2a/relu_1x1 needs backward computation. I1228 15:59:34.485944 20776 net.cpp:235] inception_2a/1x1_bn needs backward computation. I1228 15:59:34.485949 20776 net.cpp:235] inception_2a/1x1 needs backward computation. I1228 15:59:34.485954 20776 net.cpp:235] inception_1b/output_inception_1b/output_0_split needs backward computation. I1228 15:59:34.485958 20776 net.cpp:235] inception_1b/output needs backward computation. I1228 15:59:34.485966 20776 net.cpp:235] inception_1b/pool needs backward computation. I1228 15:59:34.485971 20776 net.cpp:235] inception_1b/relu_double_3x3_2 needs backward computation. I1228 15:59:34.485975 20776 net.cpp:235] inception_1b/double_3x3_2_bn needs backward computation. I1228 15:59:34.485981 20776 net.cpp:235] inception_1b/double_3x3_2 needs backward computation. I1228 15:59:34.485986 20776 net.cpp:235] inception_1b/relu_double_3x3_1 needs backward computation. I1228 15:59:34.485991 20776 net.cpp:235] inception_1b/double_3x3_1_bn needs backward computation. I1228 15:59:34.485996 20776 net.cpp:235] inception_1b/double_3x3_1 needs backward computation. I1228 15:59:34.486001 20776 net.cpp:235] inception_1b/relu_double_3x3_reduce needs backward computation. I1228 15:59:34.486006 20776 net.cpp:235] inception_1b/double_3x3_reduce_bn needs backward computation. I1228 15:59:34.486011 20776 net.cpp:235] inception_1b/double_3x3_reduce needs backward computation. I1228 15:59:34.486016 20776 net.cpp:235] inception_1b/relu_3x3 needs backward computation. I1228 15:59:34.486021 20776 net.cpp:235] inception_1b/3x3_bn needs backward computation. I1228 15:59:34.486026 20776 net.cpp:235] inception_1b/3x3 needs backward computation. I1228 15:59:34.486030 20776 net.cpp:235] inception_1b/relu_3x3_reduce needs backward computation. 
I1228 15:59:34.486035 20776 net.cpp:235] inception_1b/3x3_reduce_bn needs backward computation. I1228 15:59:34.486040 20776 net.cpp:235] inception_1b/3x3_reduce needs backward computation. I1228 15:59:34.486047 20776 net.cpp:235] inception_1a/output_inception_1a/output_0_split needs backward computation. I1228 15:59:34.486052 20776 net.cpp:235] inception_1a/output needs backward computation. I1228 15:59:34.486058 20776 net.cpp:235] inception_1a/relu_pool_proj needs backward computation. I1228 15:59:34.486063 20776 net.cpp:235] inception_1a/pool_proj_bn needs backward computation. I1228 15:59:34.486068 20776 net.cpp:235] inception_1a/pool_proj needs backward computation. I1228 15:59:34.486073 20776 net.cpp:235] inception_1a/pool needs backward computation. I1228 15:59:34.486081 20776 net.cpp:235] inception_1a/relu_double_3x3_2 needs backward computation. I1228 15:59:34.486086 20776 net.cpp:235] inception_1a/double_3x3_2_bn needs backward computation. I1228 15:59:34.486091 20776 net.cpp:235] inception_1a/double_3x3_2 needs backward computation. I1228 15:59:34.486098 20776 net.cpp:235] inception_1a/relu_double_3x3_1 needs backward computation. I1228 15:59:34.486101 20776 net.cpp:235] inception_1a/double_3x3_1_bn needs backward computation. I1228 15:59:34.486107 20776 net.cpp:235] inception_1a/double_3x3_1 needs backward computation. I1228 15:59:34.486112 20776 net.cpp:235] inception_1a/relu_double_3x3_reduce needs backward computation. I1228 15:59:34.486117 20776 net.cpp:235] inception_1a/double_3x3_reduce_bn needs backward computation. I1228 15:59:34.486122 20776 net.cpp:235] inception_1a/double_3x3_reduce needs backward computation. I1228 15:59:34.486127 20776 net.cpp:235] inception_1a/relu_3x3 needs backward computation. I1228 15:59:34.486132 20776 net.cpp:235] inception_1a/3x3_bn needs backward computation. I1228 15:59:34.486137 20776 net.cpp:235] inception_1a/3x3 needs backward computation. 
I1228 15:59:34.486143 20776 net.cpp:235] inception_1a/relu_3x3_reduce needs backward computation.
I1228 15:59:34.486147 20776 net.cpp:235] inception_1a/3x3_reduce_bn needs backward computation.
I1228 15:59:34.486153 20776 net.cpp:235] inception_1a/3x3_reduce needs backward computation.
I1228 15:59:34.486158 20776 net.cpp:235] inception_1a/relu_1x1 needs backward computation.
I1228 15:59:34.486163 20776 net.cpp:235] inception_1a/1x1_bn needs backward computation.
I1228 15:59:34.486168 20776 net.cpp:235] inception_1a/1x1 needs backward computation.
I1228 15:59:34.486173 20776 net.cpp:235] pool1_pool1_0_split needs backward computation.
I1228 15:59:34.486178 20776 net.cpp:235] pool1 needs backward computation.
I1228 15:59:34.486183 20776 net.cpp:235] relu3 needs backward computation.
I1228 15:59:34.486188 20776 net.cpp:235] conv3_bn needs backward computation.
I1228 15:59:34.486193 20776 net.cpp:235] conv3 needs backward computation.
I1228 15:59:34.486198 20776 net.cpp:235] relu2 needs backward computation.
I1228 15:59:34.486203 20776 net.cpp:235] conv2_bn needs backward computation.
I1228 15:59:34.486207 20776 net.cpp:235] conv2 needs backward computation.
I1228 15:59:34.486212 20776 net.cpp:235] relu1 needs backward computation.
I1228 15:59:34.486217 20776 net.cpp:235] conv1_bn needs backward computation.
I1228 15:59:34.486222 20776 net.cpp:235] conv1 needs backward computation.
I1228 15:59:34.486228 20776 net.cpp:237] label_data_1_split does not need backward computation.
I1228 15:59:34.486233 20776 net.cpp:237] data does not need backward computation.
I1228 15:59:34.486238 20776 net.cpp:278] This network produces output accuracy
I1228 15:59:34.486243 20776 net.cpp:278] This network produces output loss
I1228 15:59:34.486368 20776 net.cpp:290] Network initialization done.
I1228 15:59:34.486373 20776 net.cpp:291] Memory required for data: 1239757288
I1228 15:59:34.487087 20776 solver.cpp:51] Solver scaffolding done.
I1228 15:59:34.487517 20776 solver.cpp:257] Solving PRID
I1228 15:59:34.487527 20776 solver.cpp:258] Learning Rate Policy: stepdecr
I1228 15:59:34.515348 20776 solver.cpp:316] Iteration 0, Testing net (#0)
I1228 15:59:43.190045 20775 solver.cpp:373] Test net output #0: accuracy = 0.00394737
I1228 15:59:43.190104 20775 solver.cpp:373] Test net output #1: loss = 5.95544 (* 1 = 5.95544 loss)
I1228 15:59:43.510315 20776 solver.cpp:373] Test net output #0: accuracy = 0.00526316
I1228 15:59:43.510428 20776 solver.cpp:373] Test net output #1: loss = 5.95289 (* 1 = 5.95289 loss)
F1228 15:59:44.622270 20776 syncedmem.cpp:51] Check failed: error == cudaSuccess (2 vs. 0) out of memory
Check failure stack trace:
    @ 0x7f6b26b1ddaa (unknown)
    @ 0x7f6b26b1dce4 (unknown)
    @ 0x7f6b26b1d6e6 (unknown)
    @ 0x7f6b26b20687 (unknown)
    @ 0x7f6b270d39eb caffe::SyncedMemory::mutable_gpu_data()
    @ 0x7f6b270f6f23 caffe::Blob<>::mutable_gpu_diff()
    @ 0x7f6b2721431d caffe::ConvolutionLayer<>::Backward_gpu()
    @ 0x7f6b271e7abc caffe::Net<>::BackwardFromTo()
    @ 0x7f6b271e7d03 caffe::Net<>::Backward()
    @ 0x7f6b270f19db caffe::Solver<>::Step()
    @ 0x7f6b270f230f caffe::Solver<>::Solve()
    @ 0x4070e5 train()
    @ 0x4053b6 main
    @ 0x7f6b2602ff45 (unknown)
    @ 0x4059b1 (unknown)
    @ (nil) (unknown)
Aborted at 1482911984 (unix time) try "date -d @1482911984" if you are using GNU date
PC: @ 0x7ffeb8fc4ae3 (unknown)
SIGTERM (@0x5125) received by PID 20775 (TID 0x7f3d143e5a40) from PID 20773; stack trace:
    @ 0x7f3d12b77cb0 (unknown)
    @ 0x7ffeb8fc4ae3 (unknown)
    @ 0x7f3d12c4985d (unknown)
    @ 0x7f3cf7f9059e (unknown)
    @ 0x7f3cf794651b (unknown)
    @ 0x7f3cf7923ca3 (unknown)
    @ 0x7f3cf791bec0 (unknown)
    @ 0x7f3cf791c7b3 (unknown)
    @ 0x7f3cf788a322 (unknown)
    @ 0x7f3cf788a47a (unknown)
    @ 0x7f3cf786df45 (unknown)
    @ 0x7f3d138a4e92 (unknown)
    @ 0x7f3d13889306 (unknown)
    @ 0x7f3d138ab328 (unknown)
    @ 0x7f3d13d2dd9e caffe::caffe_gpu_memcpy()
    @ 0x7f3d13c06b4e caffe::SyncedMemory::gpu_data()
    @ 0x7f3d13c29d72 caffe::Blob<>::gpu_data()
    @ 0x7f3d13d3212d caffe::BNLayer<>::Forward_gpu()
    @ 0x7f3d13d1a199 caffe::Net<>::ForwardFromTo()
    @ 0x7f3d13d1a5c7 caffe::Net<>::ForwardPrefilled()
    @ 0x7f3d13c249d0 caffe::Solver<>::Step()
    @ 0x7f3d13c2530f caffe::Solver<>::Solve()
    @ 0x4070e5 train()
    @ 0x4053b6 main
    @ 0x7f3d12b62f45 (unknown)
    @ 0x4059b1 (unknown)
    @ 0x0 (unknown)

mpirun noticed that process rank 1 with PID 20776 on node DL-PC exited on signal 6 (Aborted).

Extracting train set
E1228 15:59:45.808471 20895 extract_features.cpp:54] Using GPU
E1228 15:59:45.808827 20895 extract_features.cpp:60] Using Device_id=1
F1228 15:59:47.935324 20895 io.cpp:52] Check failed: fd != -1 (-1 vs. -1) File not found: external/exp/snapshots/individually/prid_iter_11000.caffemodel
Check failure stack trace:
    @ 0x7fbdf8cd5daa (unknown)
    @ 0x7fbdf8cd5ce4 (unknown)
    @ 0x7fbdf8cd56e6 (unknown)
    @ 0x7fbdf8cd8687 (unknown)
    @ 0x7fbdf9073fb3 caffe::ReadProtoFromBinaryFile()
    @ 0x7fbdf906e214 caffe::ReadNetParamsFromBinaryFileOrDie()
    @ 0x7fbdf914b9c7 caffe::Net<>::CopyTrainedLayersFromBinaryProto()
    @ 0x7fbdf914ba36 caffe::Net<>::CopyTrainedLayersFrom()
    @ 0x4080b8 feature_extraction_pipeline<>()
    @ 0x7fbdf8105f45 (unknown)
    @ 0x403bce (unknown)
    @ (nil) (unknown)
scripts/routines.sh: line 93: 20895 Aborted (core dumped) ${CAFFE_DIR}/build/tools/extract_features ${trained_model} ${model} ${blob},label ${result_dir}/${subset}_features_lmdb,${result_dir}/${subset}_labels_lmdb ${num_iters} lmdb GPU 1
Extracting val set
E1228 15:59:50.017843 20935 extract_features.cpp:54] Using GPU
E1228 15:59:50.018328 20935 extract_features.cpp:60] Using Device_id=1
F1228 15:59:52.139801 20935 io.cpp:52] Check failed: fd != -1 (-1 vs. -1) File not found: external/exp/snapshots/individually/prid_iter_11000.caffemodel
Check failure stack trace:
    @ 0x7f28511bbdaa (unknown)
    @ 0x7f28511bbce4 (unknown)
    @ 0x7f28511bb6e6 (unknown)
    @ 0x7f28511be687 (unknown)
    @ 0x7f2851559fb3 caffe::ReadProtoFromBinaryFile()
    @ 0x7f2851554214 caffe::ReadNetParamsFromBinaryFileOrDie()
    @ 0x7f28516319c7 caffe::Net<>::CopyTrainedLayersFromBinaryProto()
    @ 0x7f2851631a36 caffe::Net<>::CopyTrainedLayersFrom()
    @ 0x4080b8 feature_extraction_pipeline<>()
    @ 0x7f28505ebf45 (unknown)
    @ 0x403bce (unknown)
    @ (nil) (unknown)
scripts/routines.sh: line 93: 20935 Aborted (core dumped) ${CAFFE_DIR}/build/tools/extract_features ${trained_model} ${model} ${blob},label ${result_dir}/${subset}_features_lmdb,${result_dir}/${subset}_labels_lmdb ${num_iters} lmdb GPU 1
Extracting test_probe set
E1228 15:59:54.131114 20955 extract_features.cpp:54] Using GPU
E1228 15:59:54.131534 20955 extract_features.cpp:60] Using Device_id=1
F1228 15:59:56.317889 20955 io.cpp:52] Check failed: fd != -1 (-1 vs. -1) File not found: external/exp/snapshots/individually/prid_iter_11000.caffemodel
Check failure stack trace:
    @ 0x7f5501ed8daa (unknown)
    @ 0x7f5501ed8ce4 (unknown)
    @ 0x7f5501ed86e6 (unknown)
    @ 0x7f5501edb687 (unknown)
    @ 0x7f5502276fb3 caffe::ReadProtoFromBinaryFile()
    @ 0x7f5502271214 caffe::ReadNetParamsFromBinaryFileOrDie()
    @ 0x7f550234e9c7 caffe::Net<>::CopyTrainedLayersFromBinaryProto()
    @ 0x7f550234ea36 caffe::Net<>::CopyTrainedLayersFrom()
    @ 0x4080b8 feature_extraction_pipeline<>()
    @ 0x7f5501308f45 (unknown)
    @ 0x403bce (unknown)
    @ (nil) (unknown)
scripts/routines.sh: line 93: 20955 Aborted (core dumped) ${CAFFE_DIR}/build/tools/extract_features ${trained_model} ${model} ${blob},label ${result_dir}/${subset}_features_lmdb,${result_dir}/${subset}_labels_lmdb ${num_iters} lmdb GPU 1
Extracting test_gallery set
E1228 15:59:58.303810 20973 extract_features.cpp:54] Using GPU
E1228 15:59:58.304404 20973 extract_features.cpp:60] Using Device_id=1
F1228 16:00:00.515791 20973 io.cpp:52] Check failed: fd != -1 (-1 vs. -1) File not found: external/exp/snapshots/individually/prid_iter_11000.caffemodel
Check failure stack trace:
    @ 0x7fa3d0d80daa (unknown)
    @ 0x7fa3d0d80ce4 (unknown)
    @ 0x7fa3d0d806e6 (unknown)
    @ 0x7fa3d0d83687 (unknown)
    @ 0x7fa3d111efb3 caffe::ReadProtoFromBinaryFile()
    @ 0x7fa3d1119214 caffe::ReadNetParamsFromBinaryFileOrDie()
    @ 0x7fa3d11f69c7 caffe::Net<>::CopyTrainedLayersFromBinaryProto()
    @ 0x7fa3d11f6a36 caffe::Net<>::CopyTrainedLayersFrom()
    @ 0x4080b8 feature_extraction_pipeline<>()
    @ 0x7fa3d01b0f45 (unknown)
    @ 0x403bce (unknown)
    @ (nil) (unknown)
scripts/routines.sh: line 93: 20973 Aborted (core dumped) ${CAFFE_DIR}/build/tools/extract_features ${trained_model} ${model} ${blob},label ${result_dir}/${subset}_features_lmdb,${result_dir}/${subset}_labels_lmdb ${num_iters} lmdb GPU 1
Traceback (most recent call last):
  File "eval/metric_learning.py", line 104, in <module>
    main(args)
  File "eval/metric_learning.py", line 85, in main
    M = _learn_metric(X, Y, args.method)
  File "eval/metric_learning.py", line 43, in _learn_metric
    M = np.eye(X.shape[1])
IndexError: tuple index out of range

Cysu commented 7 years ago

The error seems to be Check failed: error == cudaSuccess (2 vs. 0) out of memory. Could you share your GPU configuration as reported by nvidia-smi?
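For reference, here is a minimal sketch of one way to capture the GPU memory figures as text instead of a screenshot. It assumes nvidia-smi is on the PATH; the function name show_gpu_memory is made up for illustration and is not part of this repository.

```shell
# Hypothetical helper: print per-GPU memory totals and usage as CSV.
# Falls back to a message when nvidia-smi is not installed.
show_gpu_memory() {
  if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi --query-gpu=index,name,memory.total,memory.used --format=csv
  else
    echo "nvidia-smi not found on PATH"
  fi
}

show_gpu_memory
```

The CSV output can be pasted directly into a comment, which makes it easier to compare free memory against what the network needs.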

Cysu commented 7 years ago

It seems that something is wrong with your OpenMPI installation. Could you please post the output of the following commands?

which mpirun
mpirun --version
ldd external/caffe/build/tools/caffe | grep mpi

Also, please format your output with proper Markdown syntax. Thank you.

montego7878 commented 7 years ago

I switched to using a single GPU, but it still fails:

I0110 18:01:07.894806 19573 solver.cpp:373] Test net output #0: accuracy = 0.00526316
I0110 18:01:07.894840 19573 solver.cpp:373] Test net output #1: loss = 5.95449 (* 1 = 5.95449 loss)
F0110 18:01:08.005344 19573 syncedmem.cpp:51] Check failed: error == cudaSuccess (2 vs. 0) out of memory
Check failure stack trace:
    @ 0x7f1d188d5daa (unknown)
    @ 0x7f1d188d5ce4 (unknown)
    @ 0x7f1d188d56e6 (unknown)
    @ 0x7f1d188d8687 (unknown)
    @ 0x7f1d18fa809b caffe::SyncedMemory::mutable_gpu_data()
    @ 0x7f1d18fb5352 caffe::Blob<>::mutable_gpu_data()
    @ 0x7f1d1901ae40 caffe::CuDNNConvolutionLayer<>::Forward_gpu()
    @ 0x7f1d18f950ba caffe::Net<>::ForwardFromTo()
    @ 0x7f1d18f95437 caffe::Net<>::ForwardPrefilled()
    @ 0x7f1d18ff6ab0 caffe::Solver<>::Step()
    @ 0x7f1d18ff7254 caffe::Solver<>::Solve()
    @ 0x409fc0 train()
    @ 0x408196 main
    @ 0x7f1d17be3f45 (unknown)
    @ 0x40882f (unknown)
    @ (nil) (unknown)
Extracting train set
E0110 18:01:08.168467 19640 extract_features.cpp:54] Using GPU
E0110 18:01:08.168717 19640 extract_features.cpp:60] Using Device_id=0
F0110 18:01:18.770076 19640 io.cpp:52] Check failed: fd != -1 (-1 vs. -1) File not found: external/exp/snapshots/individually/prid_iter_11000.caffemodel
Check failure stack trace:
    @ 0x7f34e718cdaa (unknown)
    @ 0x7f34e718cce4 (unknown)
    @ 0x7f34e718c6e6 (unknown)
    @ 0x7f34e718f687 (unknown)
    @ 0x7f34e762ca03 caffe::ReadProtoFromBinaryFile()
    @ 0x7f34e76433a4 caffe::ReadNetParamsFromBinaryFileOrDie()
    @ 0x7f34e75f6647 caffe::Net<>::CopyTrainedLayersFromBinaryProto()
    @ 0x7f34e75f66b6 caffe::Net<>::CopyTrainedLayersFrom()
    @ 0x4080b8 feature_extraction_pipeline<>()
    @ 0x7f34e65bcf45 (unknown)
    @ 0x403bce (unknown)
    @ (nil) (unknown)
scripts/routines.sh: line 93: 19640 Aborted (core dumped) ${CAFFE_DIR}/build/tools/extract_features ${trained_model} ${model} ${blob},label ${result_dir}/${subset}_features_lmdb,${result_dir}/${subset}_labels_lmdb ${num_iters} lmdb GPU 0
Extracting val set
E0110 18:01:19.980247 19655 extract_features.cpp:54] Using GPU
E0110 18:01:19.980475 19655 extract_features.cpp:60] Using Device_id=0
F0110 18:01:30.597614 19655 io.cpp:52] Check failed: fd != -1 (-1 vs. -1) File not found: external/exp/snapshots/individually/prid_iter_11000.caffemodel
Check failure stack trace:
    @ 0x7f18657e8daa (unknown)
    @ 0x7f18657e8ce4 (unknown)
    @ 0x7f18657e86e6 (unknown)
    @ 0x7f18657eb687 (unknown)
    @ 0x7f1865c88a03 caffe::ReadProtoFromBinaryFile()
    @ 0x7f1865c9f3a4 caffe::ReadNetParamsFromBinaryFileOrDie()
    @ 0x7f1865c52647 caffe::Net<>::CopyTrainedLayersFromBinaryProto()
    @ 0x7f1865c526b6 caffe::Net<>::CopyTrainedLayersFrom()
    @ 0x4080b8 feature_extraction_pipeline<>()
    @ 0x7f1864c18f45 (unknown)
    @ 0x403bce (unknown)
    @ (nil) (unknown)
scripts/routines.sh: line 93: 19655 Aborted (core dumped) ${CAFFE_DIR}/build/tools/extract_features ${trained_model} ${model} ${blob},label ${result_dir}/${subset}_features_lmdb,${result_dir}/${subset}_labels_lmdb ${num_iters} lmdb GPU 0
Extracting test_probe set
E0110 18:01:31.850750 19672 extract_features.cpp:54] Using GPU
E0110 18:01:31.850977 19672 extract_features.cpp:60] Using Device_id=0
F0110 18:01:42.372356 19672 io.cpp:52] Check failed: fd != -1 (-1 vs. -1) File not found: external/exp/snapshots/individually/prid_iter_11000.caffemodel
Check failure stack trace:
    @ 0x7f29ac09ddaa (unknown)
    @ 0x7f29ac09dce4 (unknown)
    @ 0x7f29ac09d6e6 (unknown)
    @ 0x7f29ac0a0687 (unknown)
    @ 0x7f29ac53da03 caffe::ReadProtoFromBinaryFile()
    @ 0x7f29ac5543a4 caffe::ReadNetParamsFromBinaryFileOrDie()
    @ 0x7f29ac507647 caffe::Net<>::CopyTrainedLayersFromBinaryProto()
    @ 0x7f29ac5076b6 caffe::Net<>::CopyTrainedLayersFrom()
    @ 0x4080b8 feature_extraction_pipeline<>()
    @ 0x7f29ab4cdf45 (unknown)
    @ 0x403bce (unknown)
    @ (nil) (unknown)
scripts/routines.sh: line 93: 19672 Aborted (core dumped) ${CAFFE_DIR}/build/tools/extract_features ${trained_model} ${model} ${blob},label ${result_dir}/${subset}_features_lmdb,${result_dir}/${subset}_labels_lmdb ${num_iters} lmdb GPU 0
Extracting test_gallery set
E0110 18:01:43.553402 19710 extract_features.cpp:54] Using GPU
E0110 18:01:43.553637 19710 extract_features.cpp:60] Using Device_id=0
F0110 18:01:54.306596 19710 io.cpp:52] Check failed: fd != -1 (-1 vs. -1) File not found: external/exp/snapshots/individually/prid_iter_11000.caffemodel
Check failure stack trace:
    @ 0x7f7e6d9a4daa (unknown)
    @ 0x7f7e6d9a4ce4 (unknown)
    @ 0x7f7e6d9a46e6 (unknown)
    @ 0x7f7e6d9a7687 (unknown)
    @ 0x7f7e6de44a03 caffe::ReadProtoFromBinaryFile()
    @ 0x7f7e6de5b3a4 caffe::ReadNetParamsFromBinaryFileOrDie()
    @ 0x7f7e6de0e647 caffe::Net<>::CopyTrainedLayersFromBinaryProto()
    @ 0x7f7e6de0e6b6 caffe::Net<>::CopyTrainedLayersFrom()
    @ 0x4080b8 feature_extraction_pipeline<>()
    @ 0x7f7e6cdd4f45 (unknown)
    @ 0x403bce (unknown)
    @ (nil) (unknown)
scripts/routines.sh: line 93: 19710 Aborted (core dumped) ${CAFFE_DIR}/build/tools/extract_features ${trained_model} ${model} ${blob},label ${result_dir}/${subset}_features_lmdb,${result_dir}/${subset}_labels_lmdb ${num_iters} lmdb GPU 0
Traceback (most recent call last):
  File "eval/metric_learning.py", line 104, in <module>
    main(args)
  File "eval/metric_learning.py", line 85, in main
    M = _learn_metric(X, Y, args.method)
  File "eval/metric_learning.py", line 43, in _learn_metric
    M = np.eye(X.shape[1])
IndexError: tuple index out of range

montego7878 commented 7 years ago

My nvidia-smi output: [screenshot: 2017-01-10 18 05 58]

Cysu commented 7 years ago

The log says Check failed: error == cudaSuccess (2 vs. 0) out of memory. You can reduce the batch_size and increase the iter_size in the solvers to compensate.
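As a rough sketch of the arithmetic behind this advice: gradients are accumulated over iter_size passes before each update, so the effective batch size is batch_size times iter_size. The numbers below are hypothetical and are not taken from this repository's prototxt files.

```shell
# Hypothetical values for illustration only; read the real ones from your
# own net and solver prototxt files.
orig_batch_size=64
orig_iter_size=1
new_batch_size=16   # smaller per-pass batch that fits in GPU memory

# Effective batch size = batch_size * iter_size.
effective=$((orig_batch_size * orig_iter_size))

# Smallest iter_size that keeps the effective batch at least as large.
new_iter_size=$(( (effective + new_batch_size - 1) / new_batch_size ))

echo "iter_size=${new_iter_size} effective_batch=$((new_batch_size * new_iter_size))"
```

With these assumed numbers, cutting batch_size from 64 to 16 calls for an iter_size of 4, which keeps the effective batch at 64 while each forward/backward pass holds only a quarter of the activations.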

montego7878 commented 7 years ago

I reduced batch_size and increased iter_size in the solvers, and training now succeeds. It logs "Snapshotting to binary proto file external/exp/snapshots/individually/prid_iter_100.caffemodel" and "Snapshotting solver state to binary proto file external/exp/snapshots/individually/prid_iter_100.solverstate". However, the step that extracts the train, val, test probe, and test gallery features now fails:

Extracting train set
E0111 16:45:52.847748 30130 extract_features.cpp:54] Using GPU
E0111 16:45:52.847967 30130 extract_features.cpp:60] Using Device_id=0
E0111 16:46:03.843616 30130 extract_features.cpp:135] Extacting Features
F0111 16:46:04.056460 30130 syncedmem.cpp:51] Check failed: error == cudaSuccess (2 vs. 0) out of memory
Check failure stack trace:
    @ 0x7f2250083daa (unknown)
    @ 0x7f2250083ce4 (unknown)
    @ 0x7f22500836e6 (unknown)
    @ 0x7f2250086687 (unknown)
    @ 0x7f22504de25b caffe::SyncedMemory::mutable_gpu_data()
    @ 0x7f22504eb512 caffe::Blob<>::mutable_gpu_data()
    @ 0x7f225054a32a caffe::BNLayer<>::Forward_gpu()
    @ 0x7f22504cacb9 caffe::Net<>::ForwardFromTo()
    @ 0x7f22504cb105 caffe::Net<>::ForwardPrefilled()
    @ 0x408a63 feature_extraction_pipeline<>()
    @ 0x7f224f4b3f45 (unknown)
    @ 0x403bce (unknown)
    @ (nil) (unknown)
scripts/routines.sh: line 93: 30130 Aborted (core dumped) ${CAFFE_DIR}/build/tools/extract_features ${trained_model} ${model} ${blob},label ${result_dir}/${subset}_features_lmdb,${result_dir}/${subset}_labels_lmdb ${num_iters} lmdb GPU 0
Extracting val set
E0111 16:46:09.042475 30146 extract_features.cpp:54] Using GPU
E0111 16:46:09.042714 30146 extract_features.cpp:60] Using Device_id=0
E0111 16:46:20.222183 30146 extract_features.cpp:135] Extacting Features
F0111 16:46:20.436499 30146 syncedmem.cpp:51] Check failed: error == cudaSuccess (2 vs. 0) out of memory
Check failure stack trace:
    @ 0x7f0bdd2e7daa (unknown)
    @ 0x7f0bdd2e7ce4 (unknown)
    @ 0x7f0bdd2e76e6 (unknown)
    @ 0x7f0bdd2ea687 (unknown)
    @ 0x7f0bdd74225b caffe::SyncedMemory::mutable_gpu_data()
    @ 0x7f0bdd74f512 caffe::Blob<>::mutable_gpu_data()
    @ 0x7f0bdd7ae32a caffe::BNLayer<>::Forward_gpu()
    @ 0x7f0bdd72ecb9 caffe::Net<>::ForwardFromTo()
    @ 0x7f0bdd72f105 caffe::Net<>::ForwardPrefilled()
    @ 0x408a63 feature_extraction_pipeline<>()
    @ 0x7f0bdc717f45 (unknown)
    @ 0x403bce (unknown)
    @ (nil) (unknown)
scripts/routines.sh: line 93: 30146 Aborted (core dumped) ${CAFFE_DIR}/build/tools/extract_features ${trained_model} ${model} ${blob},label ${result_dir}/${subset}_features_lmdb,${result_dir}/${subset}_labels_lmdb ${num_iters} lmdb GPU 0
Extracting test_probe set
E0111 16:46:21.580566 30161 extract_features.cpp:54] Using GPU
E0111 16:46:21.580788 30161 extract_features.cpp:60] Using Device_id=0
E0111 16:46:32.858496 30161 extract_features.cpp:135] Extacting Features
F0111 16:46:33.071115 30161 syncedmem.cpp:51] Check failed: error == cudaSuccess (2 vs. 0) out of memory
Check failure stack trace:
    @ 0x7efc54f17daa (unknown)
    @ 0x7efc54f17ce4 (unknown)
    @ 0x7efc54f176e6 (unknown)
    @ 0x7efc54f1a687 (unknown)
    @ 0x7efc5537225b caffe::SyncedMemory::mutable_gpu_data()
    @ 0x7efc5537f512 caffe::Blob<>::mutable_gpu_data()
    @ 0x7efc553de32a caffe::BNLayer<>::Forward_gpu()
    @ 0x7efc5535ecb9 caffe::Net<>::ForwardFromTo()
    @ 0x7efc5535f105 caffe::Net<>::ForwardPrefilled()
    @ 0x408a63 feature_extraction_pipeline<>()
    @ 0x7efc54347f45 (unknown)
    @ 0x403bce (unknown)
    @ (nil) (unknown)
scripts/routines.sh: line 93: 30161 Aborted (core dumped) ${CAFFE_DIR}/build/tools/extract_features ${trained_model} ${model} ${blob},label ${result_dir}/${subset}_features_lmdb,${result_dir}/${subset}_labels_lmdb ${num_iters} lmdb GPU 0
Extracting test_gallery set
E0111 16:46:34.218818 30176 extract_features.cpp:54] Using GPU
E0111 16:46:34.219038 30176 extract_features.cpp:60] Using Device_id=0
E0111 16:46:45.814304 30176 extract_features.cpp:135] Extacting Features
F0111 16:46:46.028051 30176 syncedmem.cpp:51] Check failed: error == cudaSuccess (2 vs. 0) out of memory
Check failure stack trace:
    @ 0x7faa9a149daa (unknown)
    @ 0x7faa9a149ce4 (unknown)
    @ 0x7faa9a1496e6 (unknown)
    @ 0x7faa9a14c687 (unknown)
    @ 0x7faa9a5a425b caffe::SyncedMemory::mutable_gpu_data()
    @ 0x7faa9a5b1512 caffe::Blob<>::mutable_gpu_data()
    @ 0x7faa9a61032a caffe::BNLayer<>::Forward_gpu()
    @ 0x7faa9a590cb9 caffe::Net<>::ForwardFromTo()
    @ 0x7faa9a591105 caffe::Net<>::ForwardPrefilled()
    @ 0x408a63 feature_extraction_pipeline<>()
    @ 0x7faa99579f45 (unknown)
    @ 0x403bce (unknown)
    @ (nil) (unknown)
scripts/routines.sh: line 93: 30176 Aborted (core dumped) ${CAFFE_DIR}/build/tools/extract_features ${trained_model} ${model} ${blob},label ${result_dir}/${subset}_features_lmdb,${result_dir}/${subset}_labels_lmdb ${num_iters} lmdb GPU 0
Traceback (most recent call last):
  File "eval/metric_learning.py", line 104, in <module>
    main(args)
  File "eval/metric_learning.py", line 85, in main
    M = _learn_metric(X, Y, args.method)
  File "eval/metric_learning.py", line 43, in _learn_metric
    M = np.eye(X.shape[1])
IndexError: tuple index out of range

Cysu commented 7 years ago

Please reduce the batch_size here (e.g., to 20) and change this line accordingly, so that the number of extraction iterations still covers every sample:

local num_iters=$(((num_samples + 19) / 20))
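
The expression above is an integer ceiling division: adding batch_size - 1 before dividing rounds up, so a final partial batch still gets its own iteration. A minimal sketch follows; the helper name ceil_div and the sample counts are made up for illustration.

```shell
# Hypothetical helper: number of iterations needed to cover num_samples
# when each iteration processes batch_size samples.
ceil_div() {
  local num_samples=$1
  local batch_size=$2
  echo $(( (num_samples + batch_size - 1) / batch_size ))
}

ceil_div 100 20   # prints 5: five full batches
ceil_div 101 20   # prints 6: the one extra sample needs another iteration
```

If batch_size is changed to a value other than 20, the constant 19 in the line above has to change with it (to batch_size - 1).
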

montego7878 commented 7 years ago

@Cysu Thanks very much.