eric612 / MobileNet-YOLO

A caffe implementation of MobileNet-YOLO detection network

Other

865 stars, 442 forks

multi gpu still stuck #137

Open Amanda-Barbara opened 5 years ago

Amanda-Barbara commented 5 years ago

Hi, I tried training with multiple GPUs on your newest version of MobileNet-YOLO, but the GPUs still hang and training stops at this step:

I0724 04:24:32.355298 8003 solver.cpp:203] Creating test net (#0) specified by test_net file: models/yolov3/head_mobilenet_yolov3_lite_test.prototxt

Do you have any idea what is wrong? Thanks @eric612

eric612 commented 5 years ago

Try changing the batch size in the test prototxt.

solomon-ma commented 5 years ago

I have the same problem. Could you tell me what size you changed it to?

eric612 commented 5 years ago

I think a batch size of 1 can't be split across multiple GPUs in the test phase, so you can disable the test phase and start training.

solomon-ma commented 5 years ago

I tried batch size = 1, but it still got stuck.

I noticed that your project is forked from caffe-ssd, which also gets stuck with multiple GPUs. The Caffe source code from BVLC, however, runs on multiple GPUs with NCCL. I also tried the Caffe fork written by yjxiong, which uses OpenMPI for multi-GPU training.

I'll try to use the BVLC code to rewrite caffe-mobilenet-yolo. Could you help me if I run into problems?

eric612 commented 5 years ago

Unfortunately, I don't have a multi-GPU computer or environment :(

So it is really hard for me to debug this. Maybe you can look at this issue: https://github.com/eric612/MobileNet-YOLO/issues/28

TccccD commented 5 years ago

@solomon-ma @Amanda-Barbara, I ran into the same problem. I changed batch_size to 4 (the same as my number of GPUs), but training still stopped at "Creating test net (#0) specified by test_net file". Have you solved this problem? If so, could you tell me how?

jerryho-quanta commented 5 years ago

Hi guys, I have the same problem, even when I run the MNIST example:

./build/tools/caffe train --solver=examples/mnist/lenet_solver.prototxt --gpu 0,1
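For anyone who wants to reproduce this from a script, here is a minimal sketch that assembles the same command line for an arbitrary list of GPU ids. The helper name `build_train_cmd` and the default binary path are only illustrative assumptions; the `train` action and the `--solver` and `--gpu` flags are the ones used in the command above.

```python
import shlex


def build_train_cmd(solver, gpus, caffe_bin="./build/tools/caffe"):
    """Build the argv list for `caffe train` on the given GPU ids.

    `build_train_cmd` and the default `caffe_bin` path are hypothetical
    conveniences for this sketch, not part of Caffe itself.
    """
    gpu_arg = ",".join(str(g) for g in gpus)  # e.g. [0, 1] -> "0,1"
    return [caffe_bin, "train", f"--solver={solver}", "--gpu", gpu_arg]


cmd = build_train_cmd("examples/mnist/lenet_solver.prototxt", [0, 1])
print(shlex.join(cmd))
```

The resulting list can be passed directly to `subprocess.run(cmd)` from the Caffe root directory.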

RamatovInomjon commented 4 years ago

> I think a batch size of 1 can't be split across multiple GPUs in the test phase, so you can disable the test phase and start training.

Thanks for your great work! Yes, you are right: multi-GPU training works after disabling the test phase. But I am still confused about why it stops in the test phase, even when I set the test batch size to 4 (I am using 2 GPUs).

eric612 commented 4 years ago

Refer to this issue: https://github.com/eric612/MobileNet-YOLO/issues/198

guagua11 commented 4 years ago

> > I think a batch size of 1 can't be split across multiple GPUs in the test phase, so you can disable the test phase and start training.
>
> Thanks for your great work! Yes, you are right: multi-GPU training works after disabling the test phase. But I am still confused about why it stops in the test phase, even when I set the test batch size to 4 (I am using 2 GPUs).

Hi, could you tell me how to disable the test phase? Thanks.

eric612 commented 4 years ago

@guagua11 Remove the lines highlighted here: https://github.com/eric612/MobileNet-YOLO/blob/master/models/mobilenetv2_voc/yolo_lite/solver.prototxt#L2-L4
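The thread does not show what those three lines contain. Assuming they are the solver's test-related settings (`test_net`, `test_iter`, `test_interval`, which are standard Caffe solver fields), a rough sketch of stripping them from a solver file could look like this. The function name and the example solver text, including its paths and values, are made up for illustration.

```python
import re

# Solver fields that only matter when a test net is configured.
# Assumption: these are the kind of fields found on the linked lines.
TEST_FIELDS = ("test_net", "test_iter", "test_interval", "test_initialization")

_TEST_LINE = re.compile(r"^\s*(?:%s)\s*:" % "|".join(TEST_FIELDS))


def strip_test_fields(solver_text):
    """Return solver_text without its single-line test_* settings."""
    kept = [line for line in solver_text.splitlines()
            if not _TEST_LINE.match(line)]
    return "\n".join(kept) + "\n"


# Hypothetical solver contents; paths and values are placeholders.
EXAMPLE_SOLVER = """\
train_net: "models/example/train.prototxt"
test_net: "models/example/test.prototxt"
test_iter: 100
test_interval: 1000
base_lr: 0.001
max_iter: 120000
"""

if __name__ == "__main__":
    print(strip_test_fields(EXAMPLE_SOLVER), end="")
```

Removing `test_iter` together with `test_net` matters because Caffe checks that the number of `test_iter` entries matches the number of test nets. Keep a copy of the original solver so the test phase can be restored for single-GPU runs.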