PaddlePaddle / models

Officially maintained, supported by PaddlePaddle, including CV, NLP, Speech, Rec, TS, big models and so on.
Apache License 2.0

Face detection only for one GPU? #1513

Open abcdvzz opened 5 years ago

abcdvzz commented 5 years ago

When I train the face detection model with more than one GPU visible, training crashes with this error:

Aborted at 1544374351 (unix time) try "date -d @1544374351" if you are using GNU date
PC: @ 0x0 (unknown)
SIGSEGV (@0x0) received by PID 32735 (TID 0x7f5aa0848700) from PID 0; stack trace:
@ 0x7f5aa0440390 (unknown)
@ 0x0 (unknown)

qingqing01 commented 5 years ago

It can run on multiple GPUs:

export CUDA_VISIBLE_DEVICES=0,1,2,3
python -u train.py --batch_size=16 --pretrained_model=vgg_ilsvrc_16_fc_reduced

abcdvzz commented 5 years ago

Yes, I tried it, and it does not work. I saw this problem in another issue too. You didn't solve the problem; you just CLOSED it.

qingqing01 commented 5 years ago

@abcdvzz Sorry, I closed this issue because I found in https://github.com/PaddlePaddle/models/issues/1512 that your environment is not set up correctly. We can solve it in #1512, or I can reopen this issue. I'll try my best to help you solve these problems.

abcdvzz commented 5 years ago

OK, that's fine. Please reopen this issue. I've already solved the CUDA issue, so I only have two problems now: this multi-GPU issue and the validation issue, #1514.

qingqing01 commented 5 years ago

Multi-GPU training works without problems on our machine. You need to give more detailed error info. Also, please close the issue that has been solved. Thanks!

abcdvzz commented 5 years ago

When I use two GPUs, this is all I get:

Aborted at 1544374351 (unix time) try "date -d @1544374351" if you are using GNU date
PC: @ 0x0 (unknown)
SIGSEGV (@0x0) received by PID 32735 (TID 0x7f5aa0848700) from PID 0; stack trace:
@ 0x7f5aa0440390 (unknown)
@ 0x0 (unknown)

When I use one GPU, it works fine.

qingqing01 commented 5 years ago

Is there any more detailed info? Is your command as follows? Note that you need to set export CUDA_VISIBLE_DEVICES=0,1, because train.py relies on it; see the code here: https://github.com/PaddlePaddle/models/blob/develop/fluid/PaddleCV/face_detection/train.py#L113

export CUDA_VISIBLE_DEVICES=0,1
python -u train.py --batch_size=8  # depending on GPU memory, maybe 6

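To illustrate why the variable matters, here is a minimal sketch of how a training script can derive the number of devices from CUDA_VISIBLE_DEVICES and split the batch across them. This shows the general pattern only and is not a copy of the code at train.py#L113. The helper names visible_gpu_count and per_device_batch_size are hypothetical, and the even-split rule is an assumption.

```python
import os


def visible_gpu_count(env=None):
    """Count the GPU ids listed in CUDA_VISIBLE_DEVICES (hypothetical helper).

    Returns 0 when the variable is unset or empty.
    """
    env = os.environ if env is None else env
    raw = env.get("CUDA_VISIBLE_DEVICES", "")
    # Split on commas and ignore empty entries such as a trailing comma.
    ids = [part.strip() for part in raw.split(",") if part.strip()]
    return len(ids)


def per_device_batch_size(batch_size, env=None):
    """Split a global batch size evenly across the visible GPUs.

    The even-split requirement is an assumption made for this sketch.
    """
    count = visible_gpu_count(env)
    if count == 0:
        raise ValueError("CUDA_VISIBLE_DEVICES is not set or lists no devices")
    if batch_size % count != 0:
        raise ValueError("batch_size must be divisible by the number of GPUs")
    return batch_size // count
```

If a script counts devices this way and the variable is unset, the device count it computes will not match the GPUs the framework actually sees, which is one reason to export it explicitly before launching.
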
abcdvzz commented 5 years ago

That's all I got. I've already set the devices.

qingqing01 commented 5 years ago

@abcdvzz Has your problem been solved? Could you paste more info? I can't determine the problem from the error log above. It would also help to describe your environment. You can also test other examples on multiple GPUs, and you can set export GLOG_v=3 and export GLOG_logtostderr=1 to get more detailed logs and paste them here.
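
One way to capture that verbose output is to launch training from a small Python wrapper that sets both variables and saves stderr to a file. This is a minimal sketch: GLOG_v and GLOG_logtostderr are the variables named above, while the helper name run_with_glog and the log file name are hypothetical.

```python
import os
import subprocess
import sys


def run_with_glog(cmd, log_path, verbosity=3):
    """Run cmd with verbose glog settings and save its stderr to log_path.

    Hypothetical helper; returns the exit code of the child process.
    """
    env = dict(os.environ)            # keep CUDA_VISIBLE_DEVICES and the rest
    env["GLOG_v"] = str(verbosity)    # verbose logging level
    env["GLOG_logtostderr"] = "1"     # send glog output to stderr
    with open(log_path, "wb") as log_file:
        completed = subprocess.run(cmd, env=env, stderr=log_file)
    return completed.returncode


# Example call, assuming train.py is in the current directory:
# code = run_with_glog(
#     [sys.executable, "-u", "train.py", "--batch_size=8"],
#     "train_glog.log",
# )
```

On POSIX systems, subprocess reports a negative return code when the child is killed by a signal, so a segmentation fault would typically show up as -11, and the saved log file can then be attached to the issue.
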

abcdvzz commented 5 years ago

Sorry, no. I'll try your suggestion next week, because I'm busy with my paper these days. Thank you for your patience.