Closed: @jingtianyilong closed this issue 3 years ago
So my current solution is using accumulate=1, or in other words, not using accumulation at all. But I assume there's still some way to fix this so we can use a large effective batch size.
It seems the case is more complicated than I thought it would be. I also tested more settings with accumulate=1, i.e. with different batch_size and lr_start values, and still got nan even with a smaller lr and batch_size. Runs with a larger lr and batch size and longer training all succeeded, which confuses me even more.

I trained 58/350 epochs successfully. Below are the val results from epoch 58 and the training log from epoch 59. Epoch 59 had only a few nan losses, but the val results after epoch 59 were a total disaster. It then took a few more epochs for all the losses to become nan, which finally broke the whole training.
Here are some logs:
2020-10-28 12:45:47,767 train.py[line:173] INFO: Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.2234
2020-10-28 12:45:47,767 train.py[line:174] INFO: Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=100 ] = 0.3992
2020-10-28 12:45:47,767 train.py[line:175] INFO: Average Precision (AP) @[ IoU=0.75 | area= all | maxDets=100 ] = 0.2240
2020-10-28 12:45:47,767 train.py[line:176] INFO: Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.0397
2020-10-28 12:45:47,767 train.py[line:177] INFO: Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.3441
2020-10-28 12:45:47,767 train.py[line:178] INFO: Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.4997
2020-10-28 12:45:47,767 train.py[line:179] INFO: Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 1 ] = 0.1649
2020-10-28 12:45:47,767 train.py[line:180] INFO: Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 10 ] = 0.2506
2020-10-28 12:45:47,767 train.py[line:181] INFO: Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.2508
2020-10-28 12:45:47,767 train.py[line:182] INFO: Average Recall (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.0527
2020-10-28 12:45:47,768 train.py[line:183] INFO: Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.3964
2020-10-28 12:45:47,768 train.py[line:184] INFO: Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.5336
2020-10-28 12:45:49,260 train.py[line:199] INFO: save weights done
2020-10-28 12:45:49,262 train.py[line:202] INFO: cost time:191.4015s
2020-10-28 12:45:51,766 train.py[line:149] INFO: 608: total_loss:30.4443 | loss_ciou:12.4035 | loss_conf:7.8357 | loss_cls:10.2051 | lr:0.003835
2020-10-28 12:45:53,564 train.py[line:149] INFO: 352: total_loss:28.2708 | loss_ciou:14.4318 | loss_conf:6.4564 | loss_cls:7.3827 | lr:0.003835
2020-10-28 12:45:55,686 train.py[line:149] INFO: 608: total_loss:33.1484 | loss_ciou:14.5905 | loss_conf:8.2250 | loss_cls:10.3329 | lr:0.003835
2020-10-28 12:45:57,614 train.py[line:149] INFO: 608: total_loss:22.9591 | loss_ciou:11.9179 | loss_conf:4.6685 | loss_cls:6.3727 | lr:0.003835
2020-10-28 12:45:59,531 train.py[line:149] INFO: 384: total_loss:22.0550 | loss_ciou:11.3789 | loss_conf:4.3689 | loss_cls:6.3072 | lr:0.003835
2020-10-28 12:46:01,500 train.py[line:149] INFO: 448: total_loss:28.3447 | loss_ciou:13.0998 | loss_conf:6.1485 | loss_cls:9.0965 | lr:0.003835
2020-10-28 12:46:03,362 train.py[line:149] INFO: 512: total_loss:nan | loss_ciou:nan | loss_conf:11.0406 | loss_cls:13.9988 | lr:0.003834
2020-10-28 12:46:05,287 train.py[line:149] INFO: 576: total_loss:42.7775 | loss_ciou:18.1268 | loss_conf:11.8337 | loss_cls:12.8171 | lr:0.003834
2020-10-28 12:46:07,247 train.py[line:149] INFO: 320: total_loss:48.0193 | loss_ciou:20.1435 | loss_conf:11.7303 | loss_cls:16.1454 | lr:0.003834
2020-10-28 12:46:09,186 train.py[line:149] INFO: 384: total_loss:36.0427 | loss_ciou:15.6349 | loss_conf:7.2014 | loss_cls:13.2064 | lr:0.003834
2020-10-28 12:46:11,308 train.py[line:149] INFO: 448: total_loss:29.9325 | loss_ciou:12.4251 | loss_conf:8.3712 | loss_cls:9.1362 | lr:0.003834
2020-10-28 12:46:13,333 train.py[line:149] INFO: 576: total_loss:nan | loss_ciou:nan | loss_conf:13.4491 | loss_cls:15.7087 | lr:0.003834
2020-10-28 12:46:15,182 train.py[line:149] INFO: 448: total_loss:26.4346 | loss_ciou:10.5886 | loss_conf:6.8389 | loss_cls:9.0072 | lr:0.003834
2020-10-28 12:46:17,218 train.py[line:149] INFO: 320: total_loss:32.4575 | loss_ciou:12.9395 | loss_conf:6.5627 | loss_cls:12.9553 | lr:0.003834
BTW, the training process broke completely a few epochs later with an error:
Traceback (most recent call last):
  File "train.py", line 261, in <module>
    Trainer(log_dir, resume=args.resume).train()
  File "train.py", line 137, in train
    scaled_loss.backward()
  File "/opt/conda/lib/python3.7/contextlib.py", line 119, in __exit__
    next(self.gen)
  File "/opt/conda/lib/python3.7/site-packages/apex/amp/handle.py", line 123, in scale_loss
    optimizer._post_amp_backward(loss_scaler)
  File "/opt/conda/lib/python3.7/site-packages/apex/amp/_process_optimizer.py", line 249, in post_backward_no_master_weights
    post_backward_models_are_masters(scaler, params, stashed_grads)
  File "/opt/conda/lib/python3.7/site-packages/apex/amp/_process_optimizer.py", line 135, in post_backward_models_are_masters
    scale_override=(grads_have_scale, stashed_have_scale, out_scale))
  File "/opt/conda/lib/python3.7/site-packages/apex/amp/scaler.py", line 176, in unscale_with_stashed
    out_scale/grads_have_scale,  # 1./scale
ZeroDivisionError: float division by zero
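(Editorial note on the error above: apex's dynamic loss scaler halves the scale every time it detects an inf/nan gradient. If every iteration overflows, the scale decays geometrically until the float underflows to exactly 0.0, and the later `1./scale` divides by zero. A minimal pure-Python sketch of that dynamic-scaling policy; this is illustrative only, not apex's actual code, and the parameter names are assumptions:)

```python
def update_loss_scale(scale, overflow, good_steps,
                      backoff=0.5, growth=2.0, growth_interval=2000):
    """One update of a dynamic loss scale, mimicking the usual AMP policy:
    back off on overflow, grow again after a long run of clean steps."""
    if overflow:
        return scale * backoff, 0          # halve the scale, reset the streak
    good_steps += 1
    if good_steps >= growth_interval:
        return scale * growth, 0           # double after enough clean steps
    return scale, good_steps

# If the gradients are nan on *every* step, the scale only ever backs off.
# Starting from 2**16, repeated halving underflows float64 to exactly 0.0,
# at which point a subsequent 1.0 / scale raises ZeroDivisionError,
# matching the traceback above.
scale, streak = 2.0 ** 16, 0
for _ in range(1200):
    scale, streak = update_loss_scale(scale, overflow=True, good_steps=streak)
print(scale)  # 0.0
```

This is why the crash appears only after many consecutive nan iterations: the division by zero is a symptom of the scaler having given up, not the root cause.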
I will definitely test with FP16 off to see if it's a mixed-precision problem.
Update: FP16 off, but still got nan at the same epoch.
Tried two extra training runs with batch_size=16. Both worked well. So the assumption now is that a small effective batch size leads to broken training. Still needs more testing.
Great work by @argusswift! I forked this repo and made some modifications to suit my case better (you can also check my repo :-) ), but the training part remains mostly the same. I came across a problem when playing with subdivision. I assume accumulate in this training code works the same way as subdivision does in the original darknet-yolov4; basically, you use this trick to enlarge the effective batch_size. Here's how it works: https://github.com/jingtianyilong/YOLOv4-pytorch/blob/5c3020fb25bff83c0922e61c64ebf80ef5e96be7/train.py#L141-L144

The problem appears when accumulate is large. The ciou_loss and conf_loss go down as normal at the beginning, then randomly jump straight to nan. The training soon becomes unstable and finally fails completely. A smaller batch_size helps somewhat, but it's still very tricky, and training takes much longer when batch_size is too small.

I tried batch_size=16 with accumulate=1,2,4,8:
- accumulate=8 breaks training almost instantly, in fewer than 100 iterations.
- accumulate=4 breaks after quite a long time, around 2k iterations.
- accumulate=1,2 don't have this problem at all.

I also noticed that mAP goes up faster with accumulate=1 (haven't tested 2 yet). My guess is that there's a problem with the loss accumulation, but I don't know where the actual problem lies. Hope someone can help.
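(Editorial note: the accumulation trick being discussed can be illustrated without any framework. The sketch below is hand-rolled, not the repo's code; it shows why dividing each micro-batch loss by accumulate before backprop makes k accumulated micro-steps numerically equal to one large-batch step, here on a toy 1-D least-squares model:)

```python
def grad(w, x, y):
    """Gradient of the squared error 0.5*(w*x - y)**2 with respect to w."""
    return (w * x - y) * x

def sgd_big_batch(w, batch, lr):
    """One SGD step on the full batch (mean gradient)."""
    g = sum(grad(w, x, y) for x, y in batch) / len(batch)
    return w - lr * g

def sgd_accumulated(w, batch, lr, accumulate):
    """Same step, with the batch split into `accumulate` micro-batches.
    Each micro-batch gradient is divided by `accumulate` before summing,
    so the accumulated gradient equals the full-batch mean gradient."""
    size = len(batch) // accumulate
    g_accum = 0.0
    for i in range(accumulate):
        micro = batch[i * size:(i + 1) * size]
        micro_mean = sum(grad(w, x, y) for x, y in micro) / len(micro)
        g_accum += micro_mean / accumulate   # loss /= accumulate, then backward
    return w - lr * g_accum                  # optimizer.step() once per cycle

batch = [(1.0, 2.0), (2.0, 1.0), (3.0, 5.0), (4.0, 3.0)]
print(sgd_big_batch(0.5, batch, lr=0.01))
print(sgd_accumulated(0.5, batch, lr=0.01, accumulate=2))  # same value
```

If the per-micro-batch division is missing (or the loss is summed instead of averaged), the effective gradient is k times too large, which is one plausible route to the explosion described above.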
I'm sorry for the late reply. Great work by @jingtianyilong; your fork makes this a unique project, and thank you again for your efforts on this repo. You also demonstrate some new training techniques in the forked repo. Here is the project: https://github.com/jingtianyilong/YOLOv4-pytorch

To your question, I can offer two explanations. First, subdivision, the training acceleration technique, is not originally mine; I adapted it from Ultralytics, and I want to thank them for their great project. Here's how it works: https://github.com/ultralytics/yolov5

Second, regarding the CIOU loss and conf loss randomly becoming nan: I think the gradient accumulation causes a gradient explosion. I would ask you to check whether the gradient of every batch is correct, i.e. whether its value becomes nan. With accumulate=1, the gradient updates happen more frequently than with accumulate=2, 4, or 8. Thank you again for your hard work.
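(Editorial note: the per-batch gradient check suggested here can be sketched generically. In PyTorch you would loop over model.parameters() and test torch.isfinite(p.grad).all(); the framework-free version below uses plain lists in place of gradient tensors, and all names are illustrative. It applies the optimizer step only when every accumulated gradient is finite:)

```python
import math

def grads_are_finite(grads):
    """True only if every gradient value is a finite number (no nan/inf)."""
    return all(math.isfinite(g) for g in grads)

def guarded_sgd_step(weights, grads, lr):
    """Apply a plain SGD update only when the gradients are clean;
    otherwise skip the step and keep the old weights, the same idea
    AMP implementations use to ride out an occasional bad batch."""
    if not grads_are_finite(grads):
        return list(weights), False            # step skipped, weights untouched
    return [w - lr * g for w, g in zip(weights, grads)], True

w, stepped = guarded_sgd_step([1.0, 2.0], [0.1, float("nan")], lr=0.1)
print(stepped)  # False
w, stepped = guarded_sgd_step([1.0, 2.0], [0.1, -0.2], lr=0.1)
print(stepped)  # True
```

Logging which micro-batch first produces a non-finite gradient would also narrow down whether the nan originates in the data, the loss, or the accumulation itself.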
Thank you for the reply, and for all the nice words. I will definitely investigate further to see if there's anything we can fix to prevent this random nan problem, hopefully through some more experiments. If there's any update, I'll post it here first.
Yeah, all pull requests and comments are welcome here.
Some new updates: I checked the loss code and I don't see any bug in ciou_loss.
I tested more settings. Still, a large accumulate combined with a large batch_size causes nan in the early stage. Reducing the LR accordingly helps to some extent. The effective batch size (batch_size * accumulate) seems to matter most; in my experience, an adequate combination of batch_size, accumulate, and LR is crucial. Tune them carefully.
But I don't think the problem I mentioned earlier was caused by a large LR. That run trained for 60 consecutive epochs with no problem at all, the network had clearly converged, and the LR had already dropped a little from its peak. The broken training was definitely caused by something else. Could it be a problem with my dataset?
Funny enough, the loss would sometimes jump back to normal after a nan.
I'll close this since it never happened again. Best practice so far is to set a small accumulate.