AlexeyAB / darknet

YOLOv4 / Scaled-YOLOv4 / YOLO - Neural Networks for Object Detection (Windows and Linux version of Darknet )
http://pjreddie.com/darknet/

CSP training loss explodes #5108

Closed IgorDavidyuk closed 3 years ago

IgorDavidyuk commented 4 years ago

Hello! At some point during training I get a negative GIOU, a huge iou_loss, and then a NaN average loss. This happens with a modified version of the .cfg; the default one trains well on the same dataset. I removed the last yolo layer and assigned five anchors to each of the remaining ones. Am I doing something wrong?

IgorDavidyuk commented 4 years ago

My [net] section looks like this

```
[net]
batch=66
subdivisions=11
width=640
height=384
channels=3
momentum=0.949
decay=0.0005
# angle=0
saturation = 1.6
exposure = 1.6
hue=.15

learning_rate=0.00161
burn_in=250
max_batches = 4000
policy=steps
steps=1700,3500
scales=.2,.2
```

The YOLO layer looks like this:

```
[convolutional]
size=1
stride=1
pad=1
filters=30
activation=linear

[yolo]
mask = 0,1,2,3,4
anchors = 14, 18, 21, 31, 26, 45, 40, 53, 31, 71, 50, 74, 43,110, 82, 67, 65, 99, 75,132
classes=1
num=10
jitter=.0
ignore_thresh = .7
truth_thresh = 1
# random=1
scale_x_y = 1.1
iou_thresh=0.213
cls_normalizer=1.0
iou_normalizer=0.07
uc_normalizer=0.07
iou_loss=ciou
nms_kind=greedynms
beta_nms=0.6
# focal_loss=1
```
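(For reference, filters=30 in the conv layer above is consistent with the usual darknet rule for the convolutional layer placed before each [yolo] layer, filters = (classes + 5) × number of masks:)

```
filters = (classes + 5) * masks
        = (1 + 5) * 5
        = 30
```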

AlexeyAB commented 4 years ago

> At some point of training I get negative GIOU, huge iou_loss and then NAN average loss.

IgorDavidyuk commented 4 years ago

Hi @AlexeyAB, thanks for your response. I attached the console output as training_logs.txt and my_cfg.txt as well. I use the following command for training on my self-labeled dataset from surveillance cameras: `./darknet detector train obj_1class.data my.cfg ./weights/csresnext50-panet-spp-original-optimal_final.conv.112 -map -clear`. I checked the annotations with the Yolo annotation tool and they are fine. I tried to find out whether some specific pictures break training (it is possible that some are not labeled completely), but -show_imgs doesn't work for me. Also, my chart.png is empty, as I didn't manage to increase the maximum value of the vertical axis :)

Anyway, the CSP model from your repo trains fine and gives nice results on the same dataset; I just tried to provide more relevant anchors for the detector and keep only one object class: person. my-cfg.txt training_logs.txt

AlexeyAB commented 4 years ago

> Tried to find out if some specific pictures break training (it is possible that some are not labeled completely) but -show_imgs doesn't work for me.

What does it mean?

IgorDavidyuk commented 4 years ago

> > Tried to find out if some specific pictures break training (it is possible that some are not labeled completely) but -show_imgs doesn't work for me.
>
> What does it mean?

I got the following:

```
Xlib: sequence lost (0x100f5 > 0x11b) in reply type 0x1c!
[xcb] Unknown request in queue while dequeuing
[xcb] Most likely this is a multi-threaded client and XInitThreads has not been called
[xcb] Aborting, sorry about that.
darknet: ../../src/xcb_io.c:165: dequeue_pending_request: Assertion `!xcb_xlib_unknown_req_in_deq' failed.
Aborted (core dumped)
```

With

```
CUDA-version: 10010 (10020), cuDNN: 7.6.5, CUDNN_HALF=1, GPU count: 2
CUDNN_HALF=1
OpenCV version: 4.1.2
1
Prepare additional network for mAP calculation...
compute_capability = 750, cudnn_half = 1
net.optimized_memory = 0
```

IgorDavidyuk commented 4 years ago
> why did you set burn_in=50 instead of 1000? Train burn_in=1000

But the training should work without burn_in at all, right? I just have a pretty small dataset and wanted to just slightly polish the weights (=

AlexeyAB commented 4 years ago

> But the training should work without burn_in at all, right?

No. Since some of the layers are initialized randomly, they will generate very high loss and deltas, which will degrade the backbone weights.
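(For intuition, assuming darknet's usual polynomial warm-up, during burn_in the effective learning rate is scaled roughly as learning_rate * (iteration / burn_in)^power, with power defaulting to 4, so the first iterations barely move the weights while the randomly initialized layers settle. A rough worked example:)

```
# assumed warm-up with learning_rate=0.00161, burn_in=1000, power=4
iteration  250:  0.00161 * (250/1000)^4  ≈ 0.0000063
iteration  500:  0.00161 * (500/1000)^4  ≈ 0.0001
iteration 1000:  0.00161 * (1000/1000)^4 = 0.00161   # full rate reached
```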

> I just have a pretty small dataset and wanted to just slightly polish the weights (=

Since you use csresnext50-panet-spp-original-optimal_final.conv.112, which contains weights for only the first 112 layers, some conv layers have no pre-trained weights at all, so you can't "just slightly polish weights".

You can try to "polish" the first 112 layers by setting stopbackward=1700 in layer 112, so that layers 0-112 are trained only after iteration 1700.
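(A minimal sketch of what that could look like in the cfg; the convolutional parameters here are only illustrative placeholders for whatever layer 112 actually is:)

```
# ... layer 112 of the backbone (parameters are illustrative) ...
[convolutional]
batch_normalize=1
filters=1024
size=1
stride=1
pad=1
activation=leaky
# no gradients are propagated into this and all previous layers
# for the first 1700 iterations
stopbackward=1700
```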

IgorDavidyuk commented 4 years ago

> You can try to "polish" the first 112 layers by setting stopbackward=1700 in layer 112, so that layers 0-112 are trained only after iteration 1700.

It worked, thank you! But does it mean stopbackward=1 only stops gradients for one iteration?

AlexeyAB commented 4 years ago

stopbackward=0 - doesn't stop gradients
stopbackward=1 - stops gradients for the whole duration of training (all iterations)
stopbackward=1700 - stops gradients for the first 1700 iterations

AlexeyAB commented 4 years ago

@WongKinYiu I think we should use stopbackward=2000 for the backbone when training detectors, as a best practice like burn_in=1000.

WongKinYiu commented 4 years ago

@AlexeyAB

In mmdetection, they only train the batch-norm layers of the backbone, and it can reduce the training schedule from 10x to 1x~6x. Maybe it will be a better solution than using stopbackward.

AlexeyAB commented 4 years ago

@WongKinYiu

Do they calculate only rolling_mean and rolling_variance, or do they also train the biases and scales? And do they get the same accuracy?

WongKinYiu commented 4 years ago

@AlexeyAB

I think they train all of the parameters of the batch-norm layers. https://github.com/open-mmlab/mmdetection/blob/master/mmdet/models/backbones/resnet.py#L508-L515

Almost all models in their repository use this trick, and the performance may be similar to or even higher than in the original paper. Using this trick, an object detector can achieve ~40% AP on MS COCO by training for ~1 day on a machine with 8 2080 Ti GPUs.

AlexeyAB commented 4 years ago

@WongKinYiu

I added support for the parameter train_only_bn=1: https://github.com/AlexeyAB/darknet/commit/d4b2ed9d22210d33e438d9557e976f8053f1cf9b So for the layer with train_only_bn=1 and all previous layers, only the batch-normalization parameters will be trained.

I think we should use both stopbackward=2000 and train_only_bn=1 for the last backbone layer, because the randomly initialized weights in the neck & head will generate random deltas for the first 2000 iterations, which would degrade both the conv weights and the BN params.
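(A sketch of how the last backbone layer might then look in a cfg; the convolutional parameters are only illustrative:)

```
# last layer of the backbone (parameters are illustrative)
[convolutional]
batch_normalize=1
filters=2048
size=1
stride=1
pad=1
activation=leaky
# no gradients into the backbone for the first 2000 iterations
stopbackward=2000
# afterwards, train only the batch-norm parameters
# of this layer and all previous layers
train_only_bn=1
```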

WongKinYiu commented 4 years ago

@AlexeyAB

OK, I will design new experiments after next Monday.

Luux commented 4 years ago

Training with adam=1 and a lower learning rate (0.0001 works for me, but I'm on a custom dataset) seems to work fine as well without any further adjustments due to its adaptive learning rates. May be worth a try.
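(For anyone who wants to try this, a sketch of the relevant [net] lines; the 0.0001 value is just the one reported above and may need tuning per dataset:)

```
[net]
# ... other [net] parameters unchanged ...
adam=1
learning_rate=0.0001
```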