argusswift / YOLOv4-pytorch

This is a pytorch repository of YOLOv4, attentive YOLOv4 and mobilenet YOLOv4 with PASCAL VOC and COCO

Subdivision seems buggy #56

Closed. jingtianyilong closed this issue 3 years ago.

jingtianyilong commented 3 years ago

Great work by @argusswift. I forked this repo and did some modifications to suit my case better (you can also check my repo :-) ), but the training part remains mostly the same. I came across a problem when playing with subdivision. I assume the accumulate option in the training loop works the same way as subdivision in the original darknet YOLOv4: basically you use this trick to enlarge the effective batch_size. Here's how it works: https://github.com/jingtianyilong/YOLOv4-pytorch/blob/5c3020fb25bff83c0922e61c64ebf80ef5e96be7/train.py#L141-L144
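For reference, here is a minimal, hypothetical sketch of this kind of gradient accumulation (the names and values below are stand-ins, not the repo's actual code): the optimizer only steps once every accumulate mini-batches, so the effective batch size becomes batch_size * accumulate, which is the same idea as darknet's subdivision.

```python
import torch

# Hypothetical stand-ins for the model/optimizer built in train.py
model = torch.nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
accumulate = 4            # effective batch size = batch_size * accumulate

optimizer.zero_grad()
for i in range(100):      # stands in for the dataloader loop, batch_size = 16
    x, y = torch.randn(16, 10), torch.randn(16, 1)
    loss = torch.nn.functional.mse_loss(model(x), y)
    (loss / accumulate).backward()      # accumulate scaled gradients

    if (i + 1) % accumulate == 0:       # update weights every `accumulate` batches
        optimizer.step()
        optimizer.zero_grad()
```

Note that if a loop sums the per-batch losses without dividing by accumulate, the accumulated gradient grows roughly linearly with accumulate, which would be consistent with instability appearing at accumulate=4 and 8 but not at 1 and 2; that is only a hypothesis, not something verified against the repo.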

The problem shows up when accumulate is large. The ciou_loss and conf_loss go down as normal at the beginning and then jump to nan at random points. The training soon breaks and finally ends in a failed run. A smaller batch_size did help in some way, but it is still very tricky, and training also takes much longer if the batch_size is too small.

I tried with batch_size=16 and accumulate = 1, 2, 4, 8:

  • accumulate=8 results in instantly broken training in fewer than 100 iters.
  • accumulate=4 breaks after quite a long time, around 2k iters.
  • accumulate=1,2 don't have this problem at all.

I also notice that the mAP goes up faster with accumulate=1 (haven't tested with 2 yet). I guess there's a problem with the loss accumulation, but I don't know where the actual problem lies. Hope someone can help.

jingtianyilong commented 3 years ago

So my current workaround is using accumulate=1, or in other words not using accumulation at all. But I assume there is still some way to fix this so we can use a large effective batch size.

jingtianyilong commented 3 years ago

Seems that the case is more complicated than I thought it would be.

I also tested more settings with accumulate=1, i.e. with different batch_size and lr_start values. I still got nan even when using a smaller lr and batch_size, while the runs with a larger lr, larger batch size and longer training all succeeded. So this confuses me even more.

I trained 58/350 epochs successfully. Below is the val result from epoch 58 and the training log from epoch 59. Epoch 59 had only a few nan losses, but the val result after epoch 59 was a total disaster. It then took a few more epochs for all the losses to become nan, which finally broke the whole training.

Here is part of the log:

2020-10-28 12:45:47,767 train.py[line:173] INFO: Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.2234
2020-10-28 12:45:47,767 train.py[line:174] INFO: Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=100 ] = 0.3992
2020-10-28 12:45:47,767 train.py[line:175] INFO: Average Precision (AP) @[ IoU=0.75 | area= all | maxDets=100 ] = 0.2240
2020-10-28 12:45:47,767 train.py[line:176] INFO: Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.0397
2020-10-28 12:45:47,767 train.py[line:177] INFO: Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.3441
2020-10-28 12:45:47,767 train.py[line:178] INFO: Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.4997
2020-10-28 12:45:47,767 train.py[line:179] INFO: Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 1 ] = 0.1649
2020-10-28 12:45:47,767 train.py[line:180] INFO: Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 10 ] = 0.2506
2020-10-28 12:45:47,767 train.py[line:181] INFO: Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.2508
2020-10-28 12:45:47,767 train.py[line:182] INFO: Average Recall (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.0527
2020-10-28 12:45:47,768 train.py[line:183] INFO: Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.3964
2020-10-28 12:45:47,768 train.py[line:184] INFO: Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.5336
2020-10-28 12:45:49,260 train.py[line:199] INFO: save weights done
2020-10-28 12:45:49,262 train.py[line:202] INFO: cost time:191.4015s
2020-10-28 12:45:51,766 train.py[line:149] INFO: 608: total_loss:30.4443 | loss_ciou:12.4035 | loss_conf:7.8357 | loss_cls:10.2051 | lr:0.003835
2020-10-28 12:45:53,564 train.py[line:149] INFO: 352: total_loss:28.2708 | loss_ciou:14.4318 | loss_conf:6.4564 | loss_cls:7.3827 | lr:0.003835
2020-10-28 12:45:55,686 train.py[line:149] INFO: 608: total_loss:33.1484 | loss_ciou:14.5905 | loss_conf:8.2250 | loss_cls:10.3329 | lr:0.003835
2020-10-28 12:45:57,614 train.py[line:149] INFO: 608: total_loss:22.9591 | loss_ciou:11.9179 | loss_conf:4.6685 | loss_cls:6.3727 | lr:0.003835
2020-10-28 12:45:59,531 train.py[line:149] INFO: 384: total_loss:22.0550 | loss_ciou:11.3789 | loss_conf:4.3689 | loss_cls:6.3072 | lr:0.003835
2020-10-28 12:46:01,500 train.py[line:149] INFO: 448: total_loss:28.3447 | loss_ciou:13.0998 | loss_conf:6.1485 | loss_cls:9.0965 | lr:0.003835
2020-10-28 12:46:03,362 train.py[line:149] INFO: 512: total_loss:nan | loss_ciou:nan | loss_conf:11.0406 | loss_cls:13.9988 | lr:0.003834
2020-10-28 12:46:05,287 train.py[line:149] INFO: 576: total_loss:42.7775 | loss_ciou:18.1268 | loss_conf:11.8337 | loss_cls:12.8171 | lr:0.003834
2020-10-28 12:46:07,247 train.py[line:149] INFO: 320: total_loss:48.0193 | loss_ciou:20.1435 | loss_conf:11.7303 | loss_cls:16.1454 | lr:0.003834
2020-10-28 12:46:09,186 train.py[line:149] INFO: 384: total_loss:36.0427 | loss_ciou:15.6349 | loss_conf:7.2014 | loss_cls:13.2064 | lr:0.003834
2020-10-28 12:46:11,308 train.py[line:149] INFO: 448: total_loss:29.9325 | loss_ciou:12.4251 | loss_conf:8.3712 | loss_cls:9.1362 | lr:0.003834
2020-10-28 12:46:13,333 train.py[line:149] INFO: 576: total_loss:nan | loss_ciou:nan | loss_conf:13.4491 | loss_cls:15.7087 | lr:0.003834
2020-10-28 12:46:15,182 train.py[line:149] INFO: 448: total_loss:26.4346 | loss_ciou:10.5886 | loss_conf:6.8389 | loss_cls:9.0072 | lr:0.003834
2020-10-28 12:46:17,218 train.py[line:149] INFO: 320: total_loss:32.4575 | loss_ciou:12.9395 | loss_conf:6.5627 | loss_cls:12.9553 | lr:0.003834

jingtianyilong commented 3 years ago

BTW, the training process finally crashed after a few more epochs with an error:

Traceback (most recent call last):
  File "train.py", line 261, in <module>
    Trainer(log_dir, resume=args.resume).train()
  File "train.py", line 137, in train
    scaled_loss.backward()
  File "/opt/conda/lib/python3.7/contextlib.py", line 119, in __exit__
    next(self.gen)
  File "/opt/conda/lib/python3.7/site-packages/apex/amp/handle.py", line 123, in scale_loss
    optimizer._post_amp_backward(loss_scaler)
  File "/opt/conda/lib/python3.7/site-packages/apex/amp/_process_optimizer.py", line 249, in post_backward_no_master_weights
    post_backward_models_are_masters(scaler, params, stashed_grads)
  File "/opt/conda/lib/python3.7/site-packages/apex/amp/_process_optimizer.py", line 135, in post_backward_models_are_masters
    scale_override=(grads_have_scale, stashed_have_scale, out_scale))
  File "/opt/conda/lib/python3.7/site-packages/apex/amp/scaler.py", line 176, in unscale_with_stashed
    out_scale/grads_have_scale,  # 1./scale,
ZeroDivisionError: float division by zero

I will definitely test with FP16 off and see if it is a problem with mixed precision.
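For context, the ZeroDivisionError above most likely comes from apex's dynamic loss scaler: after repeated inf/nan gradients it keeps shrinking the loss scale, and once the scale reaches zero the unscale step divides by zero. Below is a minimal sketch of how an apex FP16 path is typically wired, with a flag to turn it off; the names and the flag are hypothetical, not the repo's actual config.

```python
import torch

fp16 = False   # hypothetical switch; False reproduces the planned "FP16 off" run

model = torch.nn.Linear(10, 1).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

if fp16:
    from apex import amp                      # NVIDIA apex mixed precision
    model, optimizer = amp.initialize(model, optimizer, opt_level="O1")

x, y = torch.randn(16, 10).cuda(), torch.randn(16, 1).cuda()
loss = torch.nn.functional.mse_loss(model(x), y)

if fp16:
    # amp rescales the loss before backward; this is the scale_loss context
    # manager that appears in the traceback above
    with amp.scale_loss(loss, optimizer) as scaled_loss:
        scaled_loss.backward()
else:
    loss.backward()                           # plain FP32 backward
optimizer.step()
```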

jingtianyilong commented 3 years ago

Update: with FP16 off I still got nan at the same epoch.

jingtianyilong commented 3 years ago

Tried two extra trainings with batch_size=16. Both worked well. So the assumption now is that a small effective batch size leads to broken training. Still needs more testing.

argusswift commented 3 years ago

(Quoting @jingtianyilong's original issue report above.)

I'm sorry for the late reply. Great work by @jingtianyilong; you have made your fork a unique project, and thank you again for your efforts on this repo. You also show some new training techniques in the forked repo, which is a great project: https://github.com/jingtianyilong/YOLOv4-pytorch

To your question, I can offer two explanations. First, subdivision, the training acceleration technique, is not my original idea; I adapted it from Ultralytics, and I want to thank them for their great project: https://github.com/ultralytics/yolov5 Second, regarding the CIoU loss and conf loss randomly becoming nan, I think the gradient accumulation causes a gradient explosion. I would ask you to check whether the gradient of every batch is correct, or whether its value is already nan. With accumulate=1 the gradient update happens more frequently than with accumulate=2, 4, 8 or other values. Thank you again for your hard work.
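A minimal sketch of the per-batch gradient check suggested above (the helper name is made up; it only assumes the usual model and optimizer objects of a PyTorch training loop):

```python
import torch

def grads_are_finite(model: torch.nn.Module) -> bool:
    """Return False if any parameter gradient contains nan or inf."""
    for name, p in model.named_parameters():
        if p.grad is not None and not torch.isfinite(p.grad).all():
            print(f"non-finite gradient in {name}")
            return False
    return True

# Possible use inside the training loop, right after loss.backward():
#   if not grads_are_finite(model):
#       optimizer.zero_grad()   # skip this update instead of corrupting the weights
```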

jingtianyilong commented 3 years ago

(Quoting the original issue report and @argusswift's reply above.)

Thank you for the reply, and for all the nice words. I will definitely dig further to see if there is anything we can fix to prevent this random nan problem, hopefully through some more experiments. If there is any update, I will post it here first.

argusswift commented 3 years ago

(Quoting the preceding discussion above.)

Yeah, all pull requests and comments are welcome here.

jingtianyilong commented 3 years ago

Some new updates. I checked the loss code and I don't see any bug in ciou_loss.

Tested with more settings. A large accumulate combined with a large batch_size still causes nan in the early stage. Reducing the LR accordingly helps to some extent. It seems the effective batch size (batch_size * accumulate) is what really matters. In my experience an adequate combination of batch_size, accumulate and LR is crucial, so play carefully with them.
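One common heuristic that matches this observation is to scale the learning rate linearly with the effective batch size. A hedged sketch, where the reference values are assumptions rather than values from the repo's config:

```python
# Hypothetical linear-scaling rule: keep lr proportional to the effective
# batch size (batch_size * accumulate) so each optimizer step stays comparable.
base_lr, base_effective_bs = 0.0025, 16   # assumed reference point
batch_size, accumulate = 16, 4

lr_start = base_lr * (batch_size * accumulate) / base_effective_bs
print(lr_start)   # 0.01 for an effective batch size of 64
```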

But I don't think the problem I mentioned earlier was caused by a large LR. That run had trained for 60 consecutive epochs with no problem at all and the network had clearly converged, with the LR already a little below its peak. The broken training was definitely caused by something else. Could it be a problem with my dataset? Funny though, the loss would also jump back to normal after the nan.

jingtianyilong commented 3 years ago

I would like to close this since it never happened afterwards. Best practice so far is to set a small accumulate.