```
[net]
batch=66
subdivisions=11
width=640
height=384
channels=3
momentum=0.949
decay=0.0005
// angle=0
saturation = 1.6
exposure = 1.6
hue=.15

learning_rate=0.00161
burn_in=250
max_batches = 4000
policy=steps
steps=1700,3500
scales=.2,.2
```
```
[convolutional]
size=1
stride=1
pad=1
filters=30
activation=linear

[yolo]
mask = 0,1,2,3,4
anchors = 14, 18, 21, 31, 26, 45, 40, 53, 31, 71, 50, 74, 43,110, 82, 67, 65, 99, 75,132
classes=1
num=10
jitter=.0
ignore_thresh = .7
truth_thresh = 1
// random=1
scale_x_y = 1.1
iou_thresh=0.213
cls_normalizer=1.0
iou_normalizer=0.07
uc_normalizer=0.07
iou_loss=ciou
nms_kind=greedynms
beta_nms=0.6
// focal_loss=1
```
Hello! At some point of training I get negative GIoU, a huge iou_loss, and then NaN average loss. I get them with a modified version of the .cfg; the default one trains fine on the same dataset. I removed the last yolo layer and gave five anchors to each of the remaining ones. Am I doing something wrong?
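For reference on how these keys interact: darknet expects num to equal the number of anchor width,height pairs, and each [yolo] layer's mask selects which of those pairs that layer predicts (so two layers with masks 0-4 and 5-9 split the ten anchors above). A minimal standalone sketch of those invariants, written as a hypothetical checker rather than darknet code:

```c
#include <assert.h>
#include <stdio.h>

int main(void) {
    /* Values copied from the [yolo] section above. */
    const int num = 10;
    const int anchors[] = {14,18, 21,31, 26,45, 40,53, 31,71,
                           50,74, 43,110, 82,67, 65,99, 75,132};
    const int mask[] = {0, 1, 2, 3, 4};  /* this layer's five anchor pairs */

    /* num must equal the number of (w,h) pairs ... */
    assert(sizeof(anchors) / sizeof(anchors[0]) == 2 * (size_t)num);
    /* ... and every mask index must point into that list. */
    for (size_t i = 0; i < sizeof(mask) / sizeof(mask[0]); i++)
        assert(mask[i] < num);

    printf("anchor/mask invariants hold\n");
    return 0;
}
```

A mismatch here, e.g. changing the anchor list without updating num, is one easy way to end up with invalid boxes and loss values.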
Show console output at this point or near it
Show chart.png with loss and mAP
Attach your cfg-file, renamed to my_cfg.txt
What command do you use for training?
What dataset do you use for training?
Hi, @AlexeyAB, thanks for your response. I attached the console output as training_logs.txt and my_cfg.txt as well. I use the following command './darknet detector train obj_1class.data my.cfg ./weights/csresnext50-panet-spp-original-optimal_final.conv.112 -map -clear' for training on my self-labeled dataset from surveillance cameras. I checked the markup with the Yolo annotation tool and it is fine. I tried to find out whether some specific pictures break training (it is possible that some are not labeled completely), but -show_imgs doesn't work for me. Also, my chart.png is empty, as I didn't manage to increase the maximum value of the vertical axis :)
Anyway, the csp model from your repo trains fine and gives nice results on the same dataset; I just tried to provide more relevant anchors for the detector and keep just one object class: person. my-cfg.txt training_logs.txt
Why did you set burn_in=50 instead of 1000? Train with burn_in=1000.
If it doesn't help, try to use the default anchors.
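For context on why burn_in matters: during the first burn_in iterations darknet ramps the learning rate up from zero before the steps policy applies. A minimal sketch of that schedule, assuming the formula used by darknet's get_current_rate() with its default power=4:

```c
#include <math.h>
#include <stdio.h>

/* Effective learning rate during warm-up, assuming
 * lr * (iter / burn_in)^power while iter < burn_in. */
float burn_in_rate(int iter, float lr, int burn_in, float power) {
    if (iter < burn_in)
        return lr * powf((float)iter / burn_in, power);
    return lr;  /* the steps/scales policy applies after this point */
}

int main(void) {
    /* With burn_in=50 the ramp is over almost immediately, so the full rate
     * hits the randomly initialized head from the start; burn_in=1000 keeps
     * it tiny for the first several hundred iterations. */
    for (int iter = 0; iter <= 1000; iter += 200)
        printf("iter %4d: lr(burn_in=50)=%.8f  lr(burn_in=1000)=%.8f\n",
               iter, burn_in_rate(iter, 0.00161f, 50, 4),
               burn_in_rate(iter, 0.00161f, 1000, 4));
    return 0;
}
```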
I tried to find out whether some specific pictures break training (it is possible that some are not labeled completely), but -show_imgs doesn't work for me.
What does it mean?
```
Xlib: sequence lost (0x100f5 > 0x11b) in reply type 0x1c!
[xcb] Unknown request in queue while dequeuing
[xcb] Most likely this is a multi-threaded client and XInitThreads has not been called
[xcb] Aborting, sorry about that.
darknet: ../../src/xcb_io.c:165: dequeue_pending_request: Assertion `!xcb_xlib_unknown_req_in_deq' failed.
Aborted (core dumped)
```
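The log itself names the likely cause: GUI windows being driven from multiple threads without Xlib's thread support enabled. A hedged sketch of the usual standalone workaround, calling XInitThreads() before any other Xlib use (whether darknet's build exposes a convenient place for this call is left open here):

```c
#include <X11/Xlib.h>   /* link with -lX11 */

int main(void) {
    /* Must be the first Xlib call in the process; the assertion failure in
     * the log above is the classic symptom of a multi-threaded client that
     * skipped it. */
    if (!XInitThreads())
        return 1;       /* Xlib thread support unavailable */

    /* ... rest of the program (e.g. the windows shown by -show_imgs) ... */
    return 0;
}
```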
```
CUDA-version: 10010 (10020), cuDNN: 7.6.5, CUDNN_HALF=1, GPU count: 2
CUDNN_HALF=1
OpenCV version: 4.1.2
1
Prepare additional network for mAP calculation...
compute_capability = 750, cudnn_half = 1
net.optimized_memory = 0
```
Why did you set burn_in=50 instead of 1000? Train with burn_in=1000.
But the training should work without burn_in at all, right? I just have a pretty small dataset; I wanted to just slightly polish the weights (=
But the training should work without burn_in at all, right?
No. Since some of the layers are initialized randomly, they will generate too high a loss and deltas, and will degrade the backbone weights.
I just have a pretty small dataset; I wanted to just slightly polish the weights (=
Since you use csresnext50-panet-spp-original-optimal_final.conv.112, with weights for only 112 layers, some conv-layers have no pre-trained weights, so you can't "just slightly polish weights". You can try to "polish" the first 112 layers by using stopbackward=1700 in the 112th layer, so layers 0-112 will be trained only from iteration 1700 onward.
It worked, thank you!
But does it mean stopbackward=1 only stops gradients for one iteration?
stopbackward=0 - doesn't stop gradients
stopbackward=1 - stops gradients for the whole duration of training (for all iterations)
stopbackward=1700 - stops gradients for the first 1700 iterations
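In other words, for the layers before the stopbackward-tagged layer, the rule can be sketched like this (a simplified reading, not the actual darknet source):

```c
/* Returns 1 if gradients are blocked from flowing into the layers before
 * the stopbackward-tagged layer at the given training iteration. */
int stop_gradients(int stopbackward, int iteration) {
    if (stopbackward == 0) return 0;     /* never stop gradients         */
    if (stopbackward == 1) return 1;     /* stop for all iterations      */
    return iteration < stopbackward;     /* stop for the first N iters   */
}
```

So stop_gradients(1700, 1699) is 1 and stop_gradients(1700, 1700) is 0, matching the behavior described above.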
@WongKinYiu I think we should use stopbackward=2000 for the backbone when training detectors, as a best practice, like burn_in=1000.
@AlexeyAB
In mmdetection, they train only the batch-norm layers of the backbone, and it can reduce the training epochs from 10x to 1x~6x. Maybe it will be a better solution than using stopbackward.
@WongKinYiu
Do they calculate only rolling_mean and rolling_variance, or do they also train biases and scales? And do they get the same accuracy?
@AlexeyAB
I think they train all of the parameters of the batch-norm layers. https://github.com/open-mmlab/mmdetection/blob/master/mmdet/models/backbones/resnet.py#L508-L515
Almost all models in their repository use this trick, and the performance can be similar to or even higher than the original paper's. Using this trick, an object detector can achieve ~40% AP on MS COCO by training for ~1 day on an 8x 2080 Ti machine.
@WongKinYiu
I added support for the parameter: train_only_bn=1
https://github.com/AlexeyAB/darknet/commit/d4b2ed9d22210d33e438d9557e976f8053f1cf9b
So for a layer with train_only_bn=1, that layer and all previous layers will train only their batch-normalization parameters.
I think we should use both stopbackward=2000 and train_only_bn=1 for the last backbone layer, because the randomly initialized weights in the neck & head will generate random deltas for the first 2000 iterations, which would degrade both the conv-weights and the bn-params.
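A sketch of the intended combined effect on the backbone's parameters, as one reading of the two options rather than actual darknet code:

```c
/* Which backbone parameters receive updates at a given iteration under
 * stopbackward=2000 + train_only_bn=1 (one reading, not darknet source). */
typedef enum { FROZEN, BN_ONLY, ALL } backbone_mode;

backbone_mode backbone_update(int iteration, int stopbackward, int train_only_bn) {
    if (iteration < stopbackward)
        return FROZEN;    /* stopbackward: no gradients reach the backbone
                             while the random neck & head settle down      */
    if (train_only_bn)
        return BN_ONLY;   /* afterwards only batch-norm scales and biases
                             are trained in the backbone                   */
    return ALL;           /* default: conv weights are trained as well     */
}
```

So with backbone_update(iter, 2000, 1), the backbone is fully frozen for the first 2000 iterations and then trains only its batch-norm parameters.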
@AlexeyAB
OK, I will design new experiments after next Monday.
Training with adam=1 and a lower learning rate (0.0001 works for me, but I'm on a custom dataset) seems to work fine as well without any further adjustments due to its adaptive learning rates. May be worth a try.