Open YueXiNPU opened 6 years ago
@qingqing01, we would very much appreciate any feedback or update.
There are lots of things I have seen make a model diverge.
1. Too high a learning rate (LR). The default LR is 0.001. I tried LR = 0.5e-3/0.1e-3/0.5e-4/0.1e-4, but none of them worked for us.
2. Bad input data or labels, which I will check. Make sure you are not introducing NaN. Also make sure all of the target values are valid. Make sure the data is properly normalized: you probably want the pixels in the range [-1, 1] and not [0, 255].
Best, Yue
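For the data checks in point 2, here is a minimal sketch in plain Python. It is not taken from the PyramidBox code: the function names and the (x, y, w, h) box layout are assumptions for illustration only, and the real data reader may store boxes differently.

```python
import math


def scale_pixels(pixels):
    """Map 8-bit pixel values from [0, 255] to [-1, 1]."""
    return [p / 127.5 - 1.0 for p in pixels]


def invalid_boxes(boxes):
    """Return the indices of boxes that could poison the loss.

    Each box is assumed to be (x, y, w, h). A box is flagged when any
    value is NaN or infinite, or when its width or height is not positive.
    """
    bad = []
    for i, box in enumerate(boxes):
        x, y, w, h = box
        if not all(math.isfinite(v) for v in box) or w <= 0 or h <= 0:
            bad.append(i)
    return bad
```

Running invalid_boxes over every image's labels before training is a cheap way to rule out the label side; if it flags nothing, the learning rate or batch size is the more likely cause.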
Do you use the pre-trained VGG model?
@qingqing01 Sure. With the command python -u train.py --batch_size=3, I use the default parameter --pretrained_model=./vgg_ilsvrc_16_fc_reduced/.
@qingqing01 Same question, +1. The command is export CUDA_VISIBLE_DEVICES=0,1,2 python -u train.py --batch_size=3 --pretrained_model=vgg_ilsvrc_16_fc_reduced. The GPU setup is three 1080 Ti cards (11 GB each). At pass 0, batch_size = 750, both face_loss and headloss become nan. The processing time is 0.61. Could you please help? Thanks ^^
We would very much appreciate any suggestions, because we have been devoting ourselves to the training stage. @qingqing01
Generally speaking, you can reduce the LR, but I need to run some experiments. We never trained this model with a batch size of 1 per GPU.
Hi, maybe it is due to the lack of gt faces in the image '0_Parade_Parade_0_275.jpg'. You can try deleting this entry from the gt file. Otherwise, you can enlarge the batch size for each GPU.
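To see whether other images have the same problem, a rough sketch like the one below can list every entry with zero faces. It assumes the gt file uses the usual WIDER FACE text layout (image name, then the face count, then one row per box) and that a zero count may be followed by a single placeholder row; the function name is made up for this example, so please check the layout against your own gt file first.

```python
def images_without_faces(lines):
    """Return image names whose annotation lists zero face boxes."""
    rows = [ln.strip() for ln in lines if ln.strip()]
    empty, i = [], 0
    while i < len(rows):
        name = rows[i]
        count = int(rows[i + 1])
        i += 2
        if count == 0:
            empty.append(name)
            # Assumption: some files keep one all-zero placeholder row
            # after a 0 count; skip it if the next row is not an image name.
            if i < len(rows) and not rows[i].lower().endswith(".jpg"):
                i += 1
        else:
            # Skip the box rows that belong to this image.
            i += count
    return empty
```

It can be called with an open file handle on the gt file, for example images_without_faces(open(path)) where path points at your annotation file. Any name it returns is a candidate to remove from the training list.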
@YueXiNPU @neverland0621
@qingqing01 @takecareofbigboss Many thanks for your comments.
Thank you very much for your help. As you mentioned, I could not find the ground truth for the image "0_Parade_Parade_0_275.jpg", so I deleted the image from "WIDER_val". Unfortunately, that did not fix the NaN loss. I suspect that the training data has other, similar problems. What's worse, my machine often shuts down by itself while I am training the model. Could you give me some advice on how to solve this?
Hello everyone and @qingqing01. I am training PyramidBox (ECCV 2018) by Baidu on 3 GTX 1080 Ti GPUs with 11 GB of memory each. The training command is python -u train.py --batch_size=3. I also tried batch_size=4/5/6, but that does not work either. The NaN output is shown below: GPU information is shown below: Can anyone help me solve this problem?