PaddlePaddle / models

Officially maintained, supported by PaddlePaddle, including CV, NLP, Speech, Rec, TS, big models and so on.
Apache License 2.0

Face and head loss are NaN when training PyramidBox. Urgent! #1282

Open YueXiNPU opened 6 years ago

YueXiNPU commented 6 years ago

Hello everyone and @qingqing01. I am training PyramidBox (ECCV 2018, by Baidu) on 3 GTX 1080Ti GPUs with 11 GB of memory each. The training command is `python -u train.py --batchsize=3`. I also tried batch_size=4/5/6, but it does not work for me. The NaN output is shown below: (screenshot) The GPU information is shown below: (screenshot) Can anyone help me solve this problem?

YueXiNPU commented 6 years ago

@qingqing01, we would very much appreciate any feedback or update.

YueXiNPU commented 6 years ago

There are many things I have seen make a model diverge:

1. Too high a learning rate (LR). The default LR is 0.001; I tried LR = 0.5e-3, 0.1e-3, 0.5e-4, and 0.1e-4, but none of them worked for us.
2. Bad input data or labels. I will check the input data and labels: make sure you are not introducing NaNs, that all target values are valid, and that the data is properly normalized. You probably want the pixels in the range [-1, 1] and not [0, 255].
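The checks in step 2 can be sketched as a small helper that runs on each batch before it reaches the network. This is a minimal sketch, not PyramidBox code: the names `check_batch` and `normalize`, and the batch layout (a NumPy image array plus a per-image list of `(x1, y1, x2, y2)` boxes), are assumptions to adapt to your own data reader.

```python
import numpy as np

def check_batch(images, boxes):
    """Sanity-check one training batch before feeding it to the network.

    `images`: float array of pixel data; `boxes`: per-image list of
    (x1, y1, x2, y2) ground-truth boxes. Both are hypothetical names --
    adapt to whatever your data reader yields.
    """
    # NaN/Inf in the inputs will propagate straight into the loss.
    assert np.isfinite(images).all(), "NaN/Inf in input images"
    # Catch unnormalized [0, 255] pixels.
    assert images.min() >= -1.0 and images.max() <= 1.0, \
        "images not normalized to [-1, 1]"
    # Degenerate boxes (zero or negative width/height) can also blow up
    # a box-regression loss.
    for img_boxes in boxes:
        for x1, y1, x2, y2 in img_boxes:
            assert x2 > x1 and y2 > y1, "degenerate ground-truth box"

def normalize(img_uint8, mean=127.5, scale=1.0 / 127.5):
    # Map [0, 255] uint8 pixels to roughly [-1, 1].
    return (img_uint8.astype(np.float32) - mean) * scale
```

Running `check_batch` on the first batch that produces a NaN loss usually tells you quickly whether the data pipeline or the optimization is at fault.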

Best Yue

qingqing01 commented 6 years ago

Do you use the pre-trained VGG model?

YueXiNPU commented 6 years ago

@qingqing01 Yes. For the command `python -u train.py --batchsize=3`, I use the default parameter `--pretrained_model=./vgg_ilsvrc_16_fc_reduced/`.

neverland0621 commented 6 years ago

@qingqing01 Same question, +1. The command is `export CUDA_VISIBLE_DEVICES=0,1,2` followed by `python -u train.py --batch_size=3 --pretrained_model=vgg_ilsvrc_16_fc_reduced`. The hardware is three GTX 1080Ti cards (11 GB each). In pass 0, at batch 750, both face_loss and head_loss become NaN; the step time is 0.61 s. Any help is appreciated, thanks ^^

YueXiNPU commented 6 years ago

We would very much appreciate any suggestions, since we have dedicated ourselves to the training stage. @qingqing01

qingqing01 commented 6 years ago

Generally speaking, you can reduce the learning rate, but I need to run some experiments. We never trained this model with a batch size of 1 per GPU.

takecareofbigboss commented 6 years ago

Hi, it may be due to the lack of ground-truth faces in the image '0_Parade_Parade_0_275.jpg'; you can try deleting this entry from the gt file. Alternatively, you can increase the batch size per GPU.
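Rather than deleting entries by hand, the zero-face entries can be stripped from the annotation file in one pass. This sketch assumes the common WIDER FACE gt layout (a filename line, a face-count line, then one bounding-box line per face); `filter_gt_file` is a hypothetical helper, and you should verify the layout against your own gt file before using it.

```python
def filter_gt_file(src_path, dst_path):
    """Drop annotation entries that contain zero ground-truth faces.

    Assumes the common WIDER FACE layout: a filename line, a face-count
    line, then one bounding-box line per face. Some gt files keep a single
    placeholder box line for zero-face entries; that case is handled too.
    """
    with open(src_path) as f:
        lines = [ln.rstrip("\n") for ln in f]
    out, i = [], 0
    while i + 1 < len(lines):
        name, count = lines[i], int(lines[i + 1])
        boxes = lines[i + 2 : i + 2 + count]
        i += 2 + count
        # Skip the placeholder box line some files include for empty entries.
        if count == 0 and i < len(lines) and not lines[i].endswith(".jpg"):
            i += 1
        if count > 0:  # keep only images that actually contain faces
            out.extend([name, str(count)] + boxes)
    with open(dst_path, "w") as f:
        f.write("\n".join(out) + "\n")
```

Pointing the data reader at the filtered file avoids sampling face-free crops, which is one common way a face/head loss ends up as NaN.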

takecareofbigboss commented 6 years ago

@YueXiNPU @neverland0621

YueXiNPU commented 6 years ago

@qingqing01 @takecareofbigboss Many thanks for your comments.

neverland0621 commented 6 years ago

Thank you very much for your help. As you mentioned, I did not find the ground truth for the image "0_Parade_Parade_0_275.jpg", so I deleted the image from "WIDER_val". Unfortunately, it did not fix the NaN loss problem. I suspect the training data has other similar issues. What's worse, while training the model my machine often shuts down by itself. Could you give me advice on how to solve this problem?