eric612 / MobileNet-YOLO

A caffe implementation of MobileNet-YOLO detection network

Multi-GPU training gets stuck; it only works on a single GPU #28

Open Royzon opened 6 years ago

Royzon commented 6 years ago

Multi-GPU training gets stuck; it can only be used on a single GPU.

eric612 commented 6 years ago

I guess it is a prefetch problem. Try the following:

  1. Set PREFETCH_COUNT to 3 in include/base_data_layer.hpp and rebuild the project.
  2. In the train prototxt, keep only the 416 resize param and delete the other resize params (as in the test prototxt).

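
Before editing, it can help to confirm where PREFETCH_COUNT is actually declared in your checkout, since the header path may differ between versions. A minimal sketch; the helper name find_prefetch_count is hypothetical and not part of the repository:

```shell
# List every line that mentions PREFETCH_COUNT under a directory tree.
# $1 = directory to search (for example "include" from the repository root).
# Prints matches as path:line-number:content; prints nothing if there is no match.
find_prefetch_count() {
  grep -rn "PREFETCH_COUNT" "$1"
}

# Example, run from the repository root:
#   find_prefetch_count include
```

If nothing is printed, the constant does not exist in that tree and the first step above does not apply to that version as written.
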
Royzon commented 6 years ago

After I made the changes you suggested, it still does not work; training gets stuck at the line below:

I1105 18:19:40.103994 17897 solver.cpp:208] Creating test net (#0) specified by test_net file: models/yolov3_mobilenetv1/yolov3_mobilenetv1_test.prototxt

eric612 commented 6 years ago

Sorry, I don't have an environment to test multi-GPU training right now.

I will keep this as the first issue to look into.

wzjiang commented 6 years ago

You can install NCCL, for example:

$ git clone https://github.com/NVIDIA/nccl.git
$ cd nccl
$ sudo make install -j8

Then edit CMakeLists.txt so that the option reads caffe_option(USE_NCCL "Build Caffe with NCCL library support" ON), and rebuild the project with cmake.
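
One way to confirm that the option took effect after rebuilding is to read it back from the CMake cache. A minimal sketch, assuming the build directory is build/ so the cache file is build/CMakeCache.txt; the helper name cmake_cache_value is hypothetical:

```shell
# Print the cached value of a CMake option such as USE_NCCL or CPU_ONLY.
# $1 = path to CMakeCache.txt (assumed here: build/CMakeCache.txt)
# $2 = option name
# Cache entries have the form NAME:TYPE=VALUE, so the value follows the first "=".
cmake_cache_value() {
  grep -E "^$2:[A-Z]+=" "$1" | cut -d= -f2-
}

# Example, run from the repository root:
#   cmake_cache_value build/CMakeCache.txt USE_NCCL
```

An empty result means the option is not in the cache at all, which usually indicates that cmake was run in a different build directory.
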

Royzon commented 6 years ago

@wzjiang, I have already changed the CMakeLists file and installed NCCL, and the build succeeded with the info "NCCL-ON". What puzzles me is that Caffe from BVLC can run on multiple GPUs, but this repository cannot.

wzjiang commented 6 years ago

I have no idea about that. I now hit the same problem: it compiles successfully and runs without any error, but always stalls at a certain step.

macqueen09 commented 5 years ago

How can I run the system without a GPU? I want to run it on CPU only. What command should I use?

eric612 commented 5 years ago

@macqueen09 The simplest way is to set CPU_ONLY to ON; remember to delete the CMake cache and rebuild.

And another way is

linquanxu commented 5 years ago

@eric612 I have the same problem. How can I fix it? I set PREFETCH_COUNT to 3 and built with NCCL without any error, but it gets stuck as follows: [image] My solver is: [image]

eric612 commented 5 years ago

I found the same issue in Caffe SSD; maybe it is a pre-processing problem. Unfortunately, I don't have an environment to test it.

linquanxu commented 5 years ago

Thank you for your reply, I will try.

eric612 commented 5 years ago

I have pushed a new version to solve the prefetch problems; please try again.

linquanxu commented 5 years ago

@eric612 Thanks very much, but I still hit the same problem as before.

train_yolov3_lite.sh:

#!/bin/bash
# Log to a timestamped file under log/ while still printing to the terminal.
LOG="log/train-$(date +%Y-%m-%d-%H-%M-%S).log"
../build/tools/caffe train --solver ./mobilenet_yolov3_lite_solver.prototxt --gpu=0,1 2>&1 | tee "$LOG"

mobilenet_yolov3_lite_solver.prototxt:

train_net: "mobilenet_yolov3_lite_train.prototxt"
test_net: "mobilenet_yolov3_lite_test.prototxt"
test_iter: 4952
test_interval: 1000
base_lr: 0.001
display: 10
max_iter: 50000
lr_policy: "multistep"
gamma: 0.5
weight_decay: 0.00005
snapshot: 1000
snapshot_prefix: "models/"
solver_mode: GPU
debug_info: false
snapshot_after_train: true
test_initialization: false
average_loss: 10
stepvalue: 10000
stepvalue: 20000
stepvalue: 30000
stepvalue: 40000
iter_size: 9
type: "RMSProp"
eval_type: "detection"
ap_version: "11point"
show_per_class_result: true

If I comment out test_net: "mobilenet_yolov3_lite_test.prototxt", test_iter: 4952, and test_interval: 1000, it runs well.
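
The workaround above can be applied without editing the original solver by writing a train-only copy. A minimal sketch, assuming the three fields each sit on their own line as in the solver shown; the helper name make_train_only_solver and the output file name are hypothetical:

```shell
# Write a copy of a solver prototxt with the test-related fields commented out.
# $1 = input solver file, $2 = output solver file.
# Only test_net, test_iter and test_interval are prefixed with "#";
# other fields, including test_initialization, are copied unchanged.
make_train_only_solver() {
  sed -E 's/^[[:space:]]*(test_net|test_iter|test_interval)[[:space:]]*:/#&/' "$1" > "$2"
}

# Example:
#   make_train_only_solver mobilenet_yolov3_lite_solver.prototxt train_only_solver.prototxt
#   ../build/tools/caffe train --solver ./train_only_solver.prototxt --gpu=0,1
```

This only sidesteps the hang by skipping evaluation during training; it does not address why creating the test net blocks when more than one GPU is used.
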

lqian commented 5 years ago

I have the same issue. However, I cannot find PREFETCH_COUNT in the current version of the code. Can you tell me which earlier version has it? @linquanxu
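
If the repository history is available locally, git itself can show which commits introduced or removed the constant. A minimal sketch using git's pickaxe search (the -S option of git log); the helper name commits_touching_string is hypothetical:

```shell
# List commits in which the number of occurrences of a string changed,
# i.e. commits that added or removed it.
# $1 = path to a git working tree, $2 = string to search for.
commits_touching_string() {
  git -C "$1" log --oneline -S"$2"
}

# Example, run from the repository root:
#   commits_touching_string . PREFETCH_COUNT
```

The output is newest first, so the top entry would be the most recent commit that changed how often the string appears; checking out its parent should give a tree that still contains it.
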