@AlexeyAB Thanks,
I will get free GPUs after finishing training the local_avgpool models.
@WongKinYiu Hi,
It seems that (Mosaic/Smooth/Mish) improves CSPResNeXt-50 more than CSPResNet-50 and CSPDarknet-53 for the 256x256 Classifier, because CSPResNeXt-50 has more outputs in each layer: https://github.com/WongKinYiu/CrossStagePartialNetworks/blob/master/imagenet/results.md
But CSPDarkNet53-PANet-SPP is better than CSPResNeXt50-PANet-SPP and CSPResNet50-PANet-SPP for the 512x512 Detector, because CSPDarkNet53-PANet-SPP has more parameters and layers: https://github.com/AlexeyAB/darknet/issues/4406#issuecomment-570742118
So you should combine the two networks: more layers and more parameters (CSPDarknet-53) + more outputs (CSPResNeXt-50).
It seems that CSPResNeXt50 has higher Top1/Top5 because it has more outputs for each layer, out_w * out_h * out_c (i.e. it has higher filters= in [conv] layers): 258,291 (1.1x) for CSPResNeXt50 vs 233,348 (1.0x) for CSPDarknet53.
It seems that although a small number of parameters, ~20-21M (CSPResNet50/CSPResNeXt50, 84 layers), is sufficient for the 256x256 Classifier, a much larger number of parameters, ~27M (CSPDarkNet53, [conv] groups=1, 108 layers), is needed for the 512x512 Detector.
Suggestion:
Model | Num of layers | Groups | Parameters (M) | Average outputs (out_w x out_h x out_c) | RTX 2070 FPS | BFLOPS | Top1 / Top5 | Top1 / Top5 (mosaic + label smooth + mish) | AP (Detector) 512x512 |
---|---|---|---|---|---|---|---|---|---|
CSPDarkNet53 256x256 | 108 | 1 | 27 | 233,348 (1.0x) | 125 | 13 | 77.2% / 93.6% | 78.7% / 94.8% | 38.7% |
CSPResNeXt50 256x256 | 84 | 32 | 20 | 258,291 (1.1x) | 72 | 8 | 77.9% / 94.0% | 79.8% / 95.2% | 38.0% |
CSPResNet50 256x256 | 84 | 1 | 21 | 203,665 (0.87x) | 168 | 9 | 76.6% / 93.3% | 78.1% / 94.2% | 38.0% |
Did you try to train with DropBlock, does it work well?
@AlexeyAB
Yes, if I change the output channels of CSPResNet50 and CSPDarknet53 to 2048, I think they can achieve better results, but with a large amount of computation.
Do you need an ImageNet pre-trained model which has more layers + more parameters + more outputs? If yes, I can train a model. Or if you have a cfg file, I will get 2 free GPUs tomorrow for training it.
The DropBlock models are still training. Currently, they get slightly lower accuracy than the models without DropBlock at the same epoch, but it may be because DropBlock needs more epochs to converge.
@AlexeyAB Hello,
The model with DropBlock gets lower accuracy than without it (79.8 without vs 79.1 with). I think we need to follow what EfficientNet does - reduce the drop probability during training.
@WongKinYiu Hi,
This is already done: https://github.com/AlexeyAB/darknet/blob/d51d89053afc4b7f50a30ace7b2fcf1b2ddd7598/src/dropout_layer_kernels.cu#L28-L31
Maybe we should increase the drop probability over the whole training run instead of only the first half of it (see the sketch below).
Or maybe DropBlock requires more parameters in the model, since DropBlock/DropOut/DropConnect divides the model into an ensemble of many sub-models, each of which turns out to be too small.
So we should try:
more layers + more parameters + more outputs
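As a rough illustration of the drop-probability scheduling discussed above, here is a minimal Python sketch (the linear shape, ramp length, and target probability are assumptions for illustration, not the Darknet implementation linked earlier):

```python
def scheduled_drop_prob(iteration, max_iterations, target_prob=0.1, ramp_fraction=1.0):
    """Linearly ramp the DropBlock drop probability from 0 up to target_prob.

    ramp_fraction=0.5 corresponds to ramping over the first half of training;
    ramp_fraction=1.0 ramps over the whole run (the alternative discussed above).
    """
    ramp_iters = max_iterations * ramp_fraction
    progress = min(iteration / ramp_iters, 1.0)
    return target_prob * progress

# Example: probability used halfway through a 500,500-iteration run
p = scheduled_drop_prob(250_000, 500_500, target_prob=0.1, ramp_fraction=1.0)
```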
@WongKinYiu Hi,
Do you need an ImageNet pre-trained model which has more layers + more parameters + more outputs? If yes, I can train a model. Or if you have a cfg file, I will get 2 free GPUs tomorrow for training it.
Please try to train these 2 models - both use Mish + mosaic=1 cutmix=1 label_smooth_eps=0.1 + reduced groups= for faster inference:
csresnext50morelayers.cfg.txt - added more layers between the 1st and 2nd subsampling
csresnext50sub.cfg.txt - added more layers between the 1st and 2nd subsampling, and concatenated 2 subsamplings: [conv] stride=2 and [maxpool] stride=2
Also, did you try to train CSPResNeXt-50+Elastic with Mish + mosaic=1 cutmix=1 label_smooth_eps=0.1?
Also, did you try to train spinenet49.cfg.txt?
OK, will train these two models.
No, the inference speed of CSPResNeXt-50 with Elastic is too slow; I think it cannot run real-time object detection.
@WongKinYiu
All with MISH-activation and 608x608 network resolution on GeForce RTX 2070:
csresnext50.cfg - 51.2 FPS
csdarknet53.cfg - 53.6 FPS
csresnext50morelayers.cfg.txt - 44.0 FPS
csresnext50sub.cfg.txt - 43.3 FPS
spinenet49.cfg - 43.2 FPS (640x640 network resolution)
elastic-csresnext50.cfg - 34.7 FPS (576x576 network resolution)
So it may make sense to train the model spinenet49.cfg with Mish + mosaic=1 cutmix=1 label_smooth_eps=0.1: spinenet49.cfg.txt
@AlexeyAB
I have only two free GPUs currently, so I will train csresnext50morelayers and spinenet49 first.
@AlexeyAB
Model | CutMix | Mosaic | Label Smoothing | Mish | Top-1 | Top-5 |
---|---|---|---|---|---|---|
SpineNet-49 | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | 78.3% | 94.6% |
@WongKinYiu Thanks!
So SpineNet-49 is worse than csdarknet53 and csresnext50, at least on ImageNet.
Also, I fixed label_smoothing for the Detector (not for the Classifier) https://github.com/AlexeyAB/darknet/commit/81290b07376c5abb4988a492dda70913bb90133d in the same way as here: https://github.com/david8862/keras-YOLOv3-model-set/blob/6cc297434e0604e2f6c34a8a2557b342468f083a/yolo3/loss.py#L225-L227 It uses this probability transformation: http://fooplot.com/#W3sidHlwZSI6MCwiZXEiOiJ4KjAuOSswLjA1IiwiY29sb3IiOiIjMDAwMDAwIn0seyJ0eXBlIjoxMDAwLCJ3aW5kb3ciOlsiMCIsIjEiLCIwIiwiMSJdfV0-
So you can try to train the Detector with the new label_smoothing.
Usage: add
[yolo]
label_smooth_eps=0.1
to each [yolo] layer.
The old label_smoothing worked well for the Classifier but worked badly for the Detector.
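For reference, a minimal Python sketch of the probability transformation shown in the fooplot link above (x*0.9 + 0.05), assuming eps = 0.1 and hard 0/1 sigmoid class targets as in the linked Keras loss; this is an illustration, not the Darknet code itself:

```python
def smooth_labels(y_true, eps=0.1):
    """Map hard 0/1 class targets to eps/2 and 1 - eps/2.

    With eps = 0.1 this is y * 0.9 + 0.05, i.e. targets become 0.05 and 0.95.
    """
    return y_true * (1.0 - eps) + 0.5 * eps

print(smooth_labels(0.0), smooth_labels(1.0))  # 0.05 0.95
```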
The results in the original paper:
CSPDarkNet-53 has more parameters and FLOPs.
Yes, SpineNet-49 has fewer params and FLOPs, but CSPDarkNet-53 is faster and more accurate for the Classifier. But maybe SpineNet-49 is more accurate for the Detector.
@AlexeyAB
Model | CutMix | Mosaic | Label Smoothing | Mish | Top-1 | Top-5 |
---|---|---|---|---|---|---|
CSPResNeXt-50-morelayers | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | 79.4% | 95.2% |
@WongKinYiu Thanks! Do you mean csresnext50morelayers.cfg or CSPDarkNet-53-morelayers? https://github.com/WongKinYiu/CrossStagePartialNetworks/issues/6#issuecomment-584406057
@AlexeyAB Oh, sorry, it is csresnext50morelayers.cfg.
@WongKinYiu
So csresnext50morelayers.cfg is worse than csresnext50.cfg (Top1 79.4% vs 79.8%) on ImageNet: https://github.com/WongKinYiu/CrossStagePartialNetworks/blob/master/imagenet/results.md
But I think csresnext50morelayers.cfg will be better as a backbone for the Detector.
@AlexeyAB
Yes, csresnext50 performs better on ImageNet.
I will get a free GPU in about 4 days.
However, I currently do not have results for a backbone with Mish activation on MSCOCO. Could you help design the cfg for a detector with the csresnext50morelayers backbone?
Thanks.
@WongKinYiu
OK, I can make 2 cfg-files, with [net] mosaic=1 dynamic_minibatch=1 and Mish activation:
Will we try to test the new label_smoothing for the Detector? Approximately when will the CBN, DropBlock, ASFF and BiFPN model training end?
@AlexeyAB
For the classifier, CBN will finish in one week; CBN+DropBlock is still very slow, I think it needs more than one month to finish training.
For the detector, RFB+BN needs about two weeks, CBN needs about two weeks, and BiFPN needs about three to four weeks (though the training may stop for several days or weeks); ASFF has not yet started.
I will also do an ablation study for dynamic_minibatch and the new label_smoothing.
@AlexeyAB @WongKinYiu have you had any success with label smoothing? I just learned about it recently, but was confused about a few things:
@glenn-jocher
But unfortunately, mixup, cosine LR, and label smoothing all gave worse results in my experiments.
@WongKinYiu ah thanks, that's super informative!
That solves a big mystery for me then. I tried to apply it to both the obj loss and class loss at the same time, and it destroyed my NMS because every single anchor was above the threshold (of 0.001).
I implemented cosine lr scheduler a couple weeks ago, it worked well (+0.3 mAP) though I noticed it worked better if I raised the initial LR. Before with the traditional step scheduler I was using about lr0=0.006, now with the cosine scheduler I use lr0=0.010 to get that +0.3 increase on COCO.
Name | mAP@0.5 | mAP@0.5-0.95 | Comments |
---|---|---|---|
(288-640)-608 to 273 bs16a4 yolov3-spp.cfg | 61.6 | 41.6 | step lr |
(288-640)-608 to 273 bs16a4 yolov3-spp.cfg | 61.8 | 41.9 | cos lr0=0.01 |
@WongKinYiu see https://github.com/ultralytics/yolov3/issues/238#issuecomment-593611986 for the cosine scheduler implementation. These are the training plots for the two runs (step and cos lr). Interestingly the val losses are better at the end with step, and you can see cos obj loss is starting to overtrain at the end, but the cos final mAP is still slightly higher. I'm not quite sure what that means.
@WongKinYiu do you know what the value of epsilon should be in eqn3 of the BoF paper? If I assume epsilon=0.1 the classification target values (after a sigmoid) would be
Does that seem right??
In their case they seem to be using epsilon as smooth_weight, with a constraint to keep it from getting too large if the class count is low. OK, I'll start from there.
smooth_weight = min(1. / self._num_class, 1. / 40)
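A hedged sketch of how that capped smooth_weight might be applied to one-hot class targets (the apply step is an assumption about the referenced implementation, not a verified copy of it): positives are pulled down to 1 - w and negatives pushed up to w.

```python
import numpy as np

def capped_smooth_weight(num_class):
    # Cap the smoothing weight so it does not grow too large when the class count is small.
    return min(1.0 / num_class, 1.0 / 40)

def smooth_one_hot(one_hot, num_class):
    # Assumed application: positive targets -> 1 - w, negative targets -> w.
    w = capped_smooth_weight(num_class)
    return np.where(one_hot > 0.5, 1.0 - w, w)

targets = smooth_one_hot(np.eye(80)[3], num_class=80)  # hypothetical 80-class (COCO-like) example
```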
It seems only YOLOv3 applies label smoothing. None of SSD, CenterNet, FasterRCNN, or MaskRCNN have a label smoothing function.
@glenn-jocher
I implemented cosine lr scheduler a couple weeks ago, it worked well (+0.3 mAP) though I noticed it worked better if I raised the initial LR. Before with the traditional step scheduler I was using about lr0=0.006, now with the cosine scheduler I use lr0=0.010 to get that +0.3 increase on COCO.
So in terms of Darknet, instead of
learning_rate=0.00261
burn_in=1000
max_batches = 500500
policy=steps
steps=400000,450000
scales=.1,.1
now you use
learning_rate=0.01
burn_in=1000
max_batches = 500500
policy=sgdr
@WongKinYiu So we can try it too.
@glenn-jocher
Why does mAP grow sharply at the very end?
@WongKinYiu
For the classifier, CBN will finish in one week; CBN+DropBlock is still very slow, I think it needs more than one month to finish training.
For the detector, RFB+BN needs about two weeks, CBN needs about two weeks, and BiFPN needs about three to four weeks (though the training may stop for several days or weeks); ASFF has not yet started.
I will also do an ablation study for dynamic_minibatch and the new label_smoothing.
Thanks!
BiFPN needs about three to four weeks, but the training may stop for several days or weeks; ASFF has not yet started.
Is it due to a memory leak bug on some platforms?
Also, did you start training csresnext50-ws-mi2.cfg.txt (weighted-shortcut)? https://github.com/AlexeyAB/darknet/issues/4498#issuecomment-592191368
Isn't training these models very slow? weighted-shortcut (csresnext50-ws-mi2.cfg.txt, csresnext50-ws.cfg.txt and csdarknet53-ws.cfg.txt) and BiFPN (csdarknet53-bifpn-optimal.cfg.txt and csresnext50-bifpn-optimal.cfg.txt)
@AlexeyAB
Is it due to a memory leak bug on some platforms?
No, it is because I returned the GPUs to my friend. The current training is on cloud GPUs.
Also, did you start training csresnext50-ws-mi2.cfg.txt (weighted-shortcut)?
Yes, csresnext50-ws-mi2, csresnext50-ws, and csdarknet53-ws are under training. These models need about two weeks to finish training.
@AlexeyAB the mAP spikes at the end because of a decision I made to compute mAP at only 0.1 conf threshold for all of training, except for the last epoch, which I compute at the usual 0.001 conf threshold. I did this to speed up mAP computation during training, but this is confusing the hell out of everyone (naturally), so a couple weeks ago I finally did away with the practice, and now I compute all mAPs at 0.001 conf. So basically, mAP does not spike at the end, rather it is underrepresented up until the end.
I probably owe an apology to anyone who's ever had to look at one of my plots and wonder the same question (sorry!).
@glenn-jocher
I implemented cosine lr scheduler a couple weeks ago, it worked well (+0.3 mAP) though I noticed it worked better if I raised the initial LR. Before with the traditional step scheduler I was using about lr0=0.006, now with the cosine scheduler I use lr0=0.010 to get that +0.3 increase on COCO.
So in terms of Darknet, instead of
learning_rate=0.00261 burn_in=1000 max_batches = 500500 policy=steps steps=400000,450000 scales=.1,.1
now you use
learning_rate=0.01 burn_in=1000 max_batches = 500500 policy=sgdr
@WongKinYiu So we can try it too.
@AlexeyAB yes exactly, along with momentum 0.937. The LR multiple should look like this over the training (not including the burnin):
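The plot itself is not included here; a minimal Python sketch of such a cosine LR multiple, assuming a plain cosine decay from 1 toward a final multiple after burn-in (the linear warm-up below is just a placeholder, not Darknet's burn_in formula):

```python
import math

def cosine_lr_multiple(iteration, max_iterations, burn_in=1000, final_mult=0.0):
    """Cosine decay of the LR multiple from 1.0 down to final_mult after burn-in."""
    if iteration < burn_in:
        return iteration / burn_in  # placeholder warm-up ramp up to lr0
    progress = (iteration - burn_in) / (max_iterations - burn_in)
    return final_mult + 0.5 * (1.0 - final_mult) * (1.0 + math.cos(math.pi * progress))

# e.g. with lr0 = 0.01 and max_batches = 500500, the LR halfway through training:
lr = 0.01 * cosine_lr_multiple(250_000, 500_500)
```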
@glenn-jocher Thanks! Did you try ASFF?
@WongKinYiu Thanks! It seems that the cosine (sgdr) learning policy requires a higher initial lr = 0.01 for the Detector.
@AlexeyAB OK, I will do an ablation study for:
dynamic_minibatch
label_smoothing
sgdr
@AlexeyAB that's right, I still need to try ASFF. Sometimes it's a bit complicated importing PyTorch modules from other repos. I've opened an issue https://github.com/ruinmessi/ASFF/issues/72 there to see if they have an available cfg file, but if not I will try to do it the hard way this coming week.
It looks like, stripping away everything else, they show a +1.8 mAP bump from ASFF (38.8 to 40.6): https://github.com/ruinmessi/ASFF#coco
@AlexeyAB OK, I've read the ASFF paper, which confused me a bit, but then I found this comment, which I think tried to explain it simply https://github.com/ruinmessi/ASFF/issues/51#issuecomment-577445717 but was not quite right.
I think the correct implementation of ASFF is this. For an example 256x416 image, the 3 yolo outputs (80-class, 3-anchor) with rectangular inference are:
Layer 89/114 yolo0 - [1, 255, 8, 13]
Layer 101/114 yolo1 - [1, 255, 16, 26]
Layer 113/114 yolo2 - [1, 255, 32, 52]
The ASFF weight vectors are w0(1,255,1,1), w1(1,255,1,1), w2(1,255,1,1), with the constraints that the weights sum to 1 (across the yolo layers) and range between 0-1. This is very similar to BiFPN then, which simply applies a scalar weight instead for feature fusion: w0(1,), w1(1,), w2(1,).
The ASFF output at, for example, yolo0 would be: yolo0 * w0 + yolo1 * rescale(w1) + yolo2 * rescale(w2)
ASFF proposes a softmax activation on the weights while BiFPN proposes a (faster/worse) softmax approximation. I think I can implement this this week on yolov3-spp.cfg.
@glenn-jocher The main difference, for 3 branches:
@AlexeyAB ah ok I totally misunderstood then. How about this? Or where exactly would the ASFF-related weights be output?
@AlexeyAB ah, or instead 19x19x3, the exact dimensions you said. Then yolo1 and yolo2 would be e.g. 38x38x3 and 76x76x3. OK, yes, I think I see now.
One change to make this happen is that the entire network needs to run first before any of the output layers. In PyTorch right now the layers run in sequential order, so yolo0 output is calculated before the rest of the convolutions for yolo1 etc.
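A hedged PyTorch sketch of the per-pixel fusion described above (an illustration under assumptions, not code from any of the repos mentioned; the module structure and the use of nearest-neighbor resizing for both up- and down-scaling are simplifications): each level predicts a 1-channel weight logit map, the logits are softmaxed across the 3 levels at every pixel so the weights sum to 1, and the resized feature maps are blended with those weights.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ASFFFusionSketch(nn.Module):
    """Blend 3 pyramid levels at one target level with per-pixel softmax weights."""

    def __init__(self, channels):
        super().__init__()
        # One 1x1 conv per level produces a single-channel weight logit map.
        self.weight_convs = nn.ModuleList([nn.Conv2d(channels, 1, 1) for _ in range(3)])

    def forward(self, feats, target_index):
        # feats: list of 3 tensors [B, C, Hi, Wi]; resize all to the target level's size.
        h, w = feats[target_index].shape[2:]
        resized = [F.interpolate(f, size=(h, w), mode="nearest") for f in feats]
        # Per-pixel logits -> [B, 3, H, W], softmax over the level dimension.
        logits = torch.cat([conv(f) for conv, f in zip(self.weight_convs, resized)], dim=1)
        weights = logits.softmax(dim=1)
        # Weighted sum of the three resized feature maps.
        return sum(weights[:, i:i + 1] * resized[i] for i in range(3))

# Hypothetical shapes matching the 256x416 example above; C=256 is arbitrary here,
# since in practice the fusion would sit before the final 1x1 detection convs.
feats = [torch.randn(1, 256, 8, 13), torch.randn(1, 256, 16, 26), torch.randn(1, 256, 32, 52)]
fused0 = ASFFFusionSketch(256)(feats, target_index=0)  # [1, 256, 8, 13]
```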
@glenn-jocher Look at this cfg-file with fixed-BiFPN + ASFF + RFB + DropBlock + csresnext50morelayers backbone: csresnext50morelayers-spp-asff-bifpn-rfb-db.cfg.txt
It tries to use label_smoothing, dynamic_minibatch and sgdr, and avoids iou_thresh.
@AlexeyAB wow, that cfg sounds like it's packing all the latest goodies. Have you gotten DropBlock to work well? I've seen RFB before but haven't looked into it, though the ASFF paper did mention they used RFB + DropBlock for their best results.
@AlexeyAB ah ok I totally misunderstood then. How about this? Or where exactly would the ASFF-related weights be output?
What's this software? Can it visualise arbitrary darknet .cfgs?
It's Netron: https://lutzroeder.github.io/netron/
@glenn-jocher
Have you gotten dropblock to work well?
I have fixed DropBlock, but haven't tested it on large datasets.
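For context, a minimal NumPy sketch of the DropBlock idea under discussion: contiguous square blocks of a feature map are zeroed rather than independent activations (the block size, drop probability, and mask details below are illustrative, not Darknet's implementation):

```python
import numpy as np

def dropblock(x, drop_prob=0.1, block_size=7, rng=None):
    """Zero out square blocks of a feature map x with shape [C, H, W] (training only)."""
    if drop_prob == 0.0:
        return x
    rng = rng or np.random.default_rng()
    _, h, w = x.shape
    # Seed probability chosen so the expected dropped area roughly matches drop_prob.
    gamma = drop_prob * (h * w) / (block_size ** 2) / ((h - block_size + 1) * (w - block_size + 1))
    seeds = rng.random((h, w)) < gamma
    mask = np.ones((h, w))
    for i, j in zip(*np.nonzero(seeds)):
        i0, j0 = max(i - block_size // 2, 0), max(j - block_size // 2, 0)
        mask[i0:i0 + block_size, j0:j0 + block_size] = 0.0
    # Rescale kept activations so the expected activation sum is preserved.
    keep = mask.mean()
    return x * mask / max(keep, 1e-6)

out = dropblock(np.random.randn(32, 19, 19), drop_prob=0.1, block_size=5)
```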
@WongKinYiu
CBN+DropBlock is still very slow, I think it needs more than one month to finish training.
I slightly improved the speed of DropBlock in the last commit, so you can continue training DropBlock with the new code, for faster training.
ASFF has not yet started.
I think it's better to wait for csresnext50sub, and train this model https://github.com/WongKinYiu/CrossStagePartialNetworks/issues/6#issuecomment-596849834 with the csresnext50sub backbone.
@AlexeyAB Thanks.
@AlexeyAB
Model | CutMix | Mosaic | Label Smoothing | Mish | Top-1 | Top-5 |
---|---|---|---|---|---|---|
CSPResNeXt-50-sub weights | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | 79.5% | 95.3% |
@AlexeyAB
csresnext50sub-spp-asff-bifpn-rfb-db needs too much memory.
With the default settings it can only be trained with batch size 64/64, so I train it at size 416 with 64/32.
Update: it always gets NaN within 1k iterations.
@WongKinYiu Hi,
Since CSPDarkNet53 is better than CSPResNeXt50 for the Detector, please try to train these 4 models:
csdarknet19-fast.cfg contains DropBlock, so use the latest version of Darknet, which uses fast random functions for DropBlock.