ZiweiWangTHU / BiDet

This is the official PyTorch implementation of the paper BiDet: An Efficient Binarized Object Detector, accepted by CVPR 2020.

Loss goes to NaN at 150K Iterations #8

Closed killawhale2 closed 4 years ago

killawhale2 commented 4 years ago

Through this issue, I've fixed the problem with the prior/reg loss weights as per the author's response (adding 1e-6 to avoid division by zero). However, I noticed that my loc_loss and reg_loss became NaN. I retried with gradient clipping by setting the --clip_grad option to True, but my loc_loss and reg_loss still became NaN at the 150K iteration and training failed. The exact command I ran was:

```
python ssd/train_bidet_ssd.py --dataset VOC --data_root ./data/VOCdevkit/ --basenet ./ssd/pretrain/vgg16.pth --clip_grad true
```

Any help would be appreciated.
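For context, a minimal sketch of the divide-by-zero guard mentioned above (the function name and call site are hypothetical, not the repo's actual code): the idea is simply to add a small epsilon to the denominator when normalizing the prior/reg loss weights.

```python
EPS = 1e-6  # small constant suggested by the author to avoid division by zero

def normalized_loss_weight(loss_value, normalizer):
    """Hypothetical helper: divide by the normalizer with an epsilon guard."""
    return loss_value / (normalizer + EPS)
```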

Wuziyi616 commented 4 years ago

Hmmm, that's quite strange, I will look into it. BTW, you could try evaluating the model weights from just before the loss goes to NaN and report the mAP here; it would help me determine what the problem is.

killawhale2 commented 4 years ago

Thank you again for your quick replies! I ran the eval code using the model weights from iteration 145K (just before the loss goes to NaN) and the mAP I got was 56.06. The APs for each category are as follows:

| Category | AP | Category | AP |
| --- | --- | --- | --- |
| aeroplane | 0.6938 | diningtable | 0.5509 |
| bicycle | 0.6756 | dog | 0.5649 |
| bird | 0.4631 | horse | 0.7091 |
| boat | 0.4548 | motorbike | 0.6937 |
| bottle | 0.3000 | person | 0.5767 |
| bus | 0.6628 | pottedplant | 0.2871 |
| car | 0.7311 | sheep | 0.4872 |
| cat | 0.6751 | sofa | 0.6301 |
| chair | 0.3491 | train | 0.7178 |
| cow | 0.4354 | tvmonitor | 0.5533 |

Hope this information will be useful!

Wuziyi616 commented 4 years ago

According to my learning rate decay schedule, the lr at iteration 150K should be 1e-5. That's a small value, and I don't think training should break at that point. Also, in my experience training BiDet, the network should have converged by 150K iterations, so I'd guess the mAP is around 66.0 before the loss goes to NaN.

BTW, I have to say that the training of binary neural networks, and binary detectors in particular, is very unstable. In my experiments, I have to watch the loss curve and sometimes manually adjust the learning rate when training "breaks". The training of binary SSD often breaks, while binary Faster R-CNN is much more stable. One indicator that binary SSD's training has broken is that the cls loss (termed 'conf' in the saved weight files) suddenly drops sharply within a few iterations (e.g. 3.55 --> 3.54 --> 3.52 --> 3.40); at that point you should kill the program, manually decay the lr by 0.1x, and then continue training.
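As an illustration of this manual monitoring recipe, here is a minimal sketch (not code from the repo; the window size and drop threshold are assumed values) that flags a sudden drop in the conf loss over the last few logged values, so you know when to stop and decay the lr by 0.1x:

```python
from collections import deque

def make_break_detector(window=5, drop_threshold=0.1):
    """Return a callable that flags a sharp conf-loss drop within `window` readings."""
    history = deque(maxlen=window)

    def check(conf_loss):
        history.append(conf_loss)
        # flag a "break": the loss fell by more than drop_threshold within the window,
        # e.g. 3.55 --> 3.40 over a few logged iterations
        if len(history) == history.maxlen and history[0] - history[-1] > drop_threshold:
            return True
        return False

    return check

detect_break = make_break_detector()
# inside the training loop (pseudocode):
# if detect_break(conf_loss_value):
#     save a checkpoint, stop, and restart with the lr decayed by 0.1x
```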

Besides, the lr decay schedule in config.py is just an empirical one; I ran the code multiple times, and sometimes you need to decay earlier to prevent training from breaking. Also, with different PyTorch versions you may get different results. For example, I set up several conda virtual environments on one Ubuntu server and ran the code: for BiDet-SSD on Pascal VOC, I got 66.6% mAP with PyTorch 1.5 (2020.5), 65.4% mAP with PyTorch 1.2 (2020.3), while the 66.0% mAP reported in the paper was obtained with PyTorch 1.0 (2019.11). I really don't know why; maybe it's just because the training of binary neural networks is too unstable and full of uncertainty. Different lr schedules, or even different weight initializations, can lead to different results.
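If you want to experiment with decaying earlier without babysitting the run, one option is a standard PyTorch milestone scheduler. This is only a sketch under assumed values (toy model, base lr 1e-3, milestones at 80k/120k iterations), not the schedule actually used in config.py:

```python
import torch

# toy model/optimizer just to keep the sketch self-contained
model = torch.nn.Linear(10, 4)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

# decay the lr by 0.1x at the (assumed) milestone iterations; move the milestones
# earlier if training tends to break before they are reached
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[80000, 120000], gamma=0.1
)

# inside the training loop, call scheduler.step() once per iteration
```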

So my suggestion is: maybe try training again and monitor the loss of BiDet-SSD. Manually decaying the learning rate should give you more stable results, I think.

Wuziyi616 commented 4 years ago

Ah, I'm sorry, I didn't see the response you posted before my last comment. 56 is much lower than 66, so something must be wrong in the training procedure. Perhaps training "breaks" earlier, before the 145K iteration? Does the conf loss decrease abnormally as I described in my last comment (a large drop within ~5k iterations)? If so, the weights at 145K iterations would certainly be affected and perform badly.

killawhale2 commented 4 years ago

I checked, and the conf_loss is relatively stable but jumps from 0.6536 to 2.2935 between iterations 145K and 150K. I'll try different learning rate schedules as suggested.

Wuziyi616 commented 4 years ago

Indeed, to get good performance, I'd recommend monitoring the loss and decaying the lr only when training breaks at the current lr (i.e., the conf loss decreases rapidly). The best approach is to kill the program when training breaks and restart with a decayed lr from the weights saved before the break. I guess this is because binary neural networks easily get stuck in local minima, so the more iterations you train with a large lr, the less likely you are to get stuck in a local minimum and the better performance you will get (at least in the case of binary detectors).
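A small sketch of that restart recipe (the helper, checkpoint path, and argument names are hypothetical; the repo may expose its own resume flags): reload the last checkpoint saved before the break, then decay every param group's lr before resuming the loop.

```python
import torch

def resume_with_decayed_lr(net, optimizer, checkpoint_path, decay=0.1):
    """Load the weights saved before the break and decay every param group's lr."""
    net.load_state_dict(torch.load(checkpoint_path, map_location="cpu"))
    for param_group in optimizer.param_groups:
        param_group["lr"] *= decay
    return net, optimizer

# e.g. resume_with_decayed_lr(net, optimizer, "weights/bidet_ssd_145000.pth")
```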

matthai commented 3 years ago

> I checked, and the conf_loss is relatively stable but jumps from 0.6536 to 2.2935 between iterations 145K and 150K. I'll try different learning rate schedules as suggested.

@killawhale2 were you able to get to ~65% accuracy in the end? It would be great to hear from folks who have managed to replicate it (so we can try to make a robust recipe).