Try to train fast (grouped-conv) versions of csdarknet53 and csdarknet19

AlexeyAB commented 4 years ago

@WongKinYiu Hi,

Since CSPDarkNet53 is better than CSPResNeXt50 for Detector, try to train these 4 models:

Model	GPU	256x256	512x512	608x608
darknet53.cfg (original)	RTX 2070	113	56	38
csdarknet53.cfg (original)	RTX 2070	101	57	41
csdarknet53g.cfg.txt	RTX 2070	122	64	46
csdarknet53ghr.cfg.txt	RTX 2070	100	75	57
spinenet49.cfg.txt low priority	RTX 2070	49	44	43
csdarknet19-fast.cfg.txt	RTX 2070	213	149	116

csdarknet19-fast.cfg contains DropBlock, so use the latest version of Darknet that uses fast random-functions for DropBlock.

WongKinYiu commented 4 years ago

@AlexeyAB Thanks,

I will get free gpus after finish training local_avgpool models.

AlexeyAB commented 4 years ago

@WongKinYiu Hi,

It seems that the (Mosaic/Smooth/Mish) affects the model CSPResNeXt-50 better than the models CSPResNet-50 and CSPDarknet-53 for Classifier 256x256, because CSPResNeXt-50 has more outputs of each layer https://github.com/WongKinYiu/CrossStagePartialNetworks/blob/master/imagenet/results.md
But CSPDarkNet53-PANet-SPP is better than CSPResNeXt50-PANet-SPP and CSPResNet50-PANet-SPP for Detector 512x512, because CSPDarkNet53-PANet-SPP has more parameters and layers https://github.com/AlexeyAB/darknet/issues/4406#issuecomment-570742118

So you should combine two networks with more layers and more parameters (CSPDarknet-53) + more outputs (CSPResNeXt-50)

It seems that the model CSPResNeXt50 has higher Top1/Top5 because it has more outputs for each layer out_w * out_h * out_c (i.e. it has higher filters= in [conv] layers).
258 291 (1.1x) CSPresnext50 / 233 348 (1.0x) CSPDarknet53.
It seems that although a small number of parameters ~21M (CSPResNet50/CSPResNext50) = [conv] groups = 32 and 84 layers are sufficient for the 256x256 Classifier, but a much more number of parameters ~27M (CSPDarkNet53) [conv] groups = 1 and 108 layers is needed for the 512x512 Detector.

Suggestion:

Either increase number of filters in CSPDarknet53 (and also increase groups= from 1 to 2 - 8 for layers with high value of filters)
or increase number of layers in CSPresnext50 (and also decrease groups= from 32 to 1 - 8)

Model	Num of layers	G r o u p s	Para meters	average outputs = `out_w X out_h X out_c`	RTX 2070 FPS	b f l o p s	Top1 / Top5	Top1 / Top5 (mosaic + label smooth + mish)	AP (Detector) 512x512
CSP DarkNet53 256x256	108	1	27	233 348 (1.0x)	125	13	77.2% / 93.6%	78.7% / 94.8%	38.7 %
CSP ResNeXt50 256x256	84	32	20	258 291 (1.1x)	72	8	77.9% / 94.0%	79.8% / 95.2%	38.0 %
CSP ResNet50 256x256	84	1	21	203 665 (0.87x)	168	9	76.6% / 93.3%	78.1% / 94.2%	38.0 %

Did you try to train with DropBlock, does it work well?

WongKinYiu commented 4 years ago

@AlexeyAB

Yes, If I change output channel of CSPResNet50 and CSPDarknet53 to 2048, I think it can achieve better results but with large amount of computation.

Do you need an ImageNet pre-trained model which has more layers + more parameters + more outputs ? If yes, I can train a model. Or if you have a cfg file, I will get 2 free gpus tomorrow for training it.

The DropBlock models are still training. Currently, the models get a little bit lower accuracy then the models without DropBlock at same epoch. But it may because DropBlock need more epochs to get converge.

WongKinYiu commented 4 years ago

@AlexeyAB Hello,

The model with DropBlock gets lower accuracy than without it (79.8 vs 79.1). I think we need follow what EfficientNet do - reduce drop probability during training.

AlexeyAB commented 4 years ago

@WongKinYiu Hi,

This is already done: https://github.com/AlexeyAB/darknet/blob/d51d89053afc4b7f50a30ace7b2fcf1b2ddd7598/src/dropout_layer_kernels.cu#L28-L31

May be we should increase drop probability during whole training instead of half the training process
Or may be drop-block requires more parameters in the model, since drop-block/out/connect divides the model into an ensemble of many models, each of which turns out to be too small

So we should try:

new models more layers + more parameters + more outputs
fast cuda-implementation of drop-block

AlexeyAB commented 4 years ago

@WongKinYiu Hi,

Do you need an ImageNet pre-trained model which has more layers + more parameters + more outputs ? If yes, I can train a model. Or if you have a cfg file, I will get 2 free gpus tomorrow for training it.

Try to train please these 2 models - both use: MISH + mosaic=1 cutmix=1 label_smooth_eps=0.1 + reduced groups= for faster inference

csresnext50morelayers.cfg.txt - added more layers between 1st and 2nd subsamling
csresnext50sub.cfg.txt - added more layers between 1st and 2nd subsamling, and concatenated 2 subsamlings [conv] stride=2 and [maxpool] stride=2

Also did you try to train CSPResNeXt-50+Elastic with MISH + mosaic=1 cutmix=1 label_smooth_eps=0.1 ?

Also did you try to train spinenet49.cfg.txt ?

WongKinYiu commented 4 years ago

OK, will train these two models.

No, the inference speed of CSPResNeXt-50 with Elastic is too slow. I think it can not run real time object detection.

AlexeyAB commented 4 years ago

@WongKinYiu

All with MISH-activation and 608x608 network resolution on GeForce RTX 2070:

csresnext50.cfg - 51.2 FPS
csdarknet53.cfg - 53.6 FPS
csresnext50morelayers.cfg.txt - 44.0 FPS
csresnext50sub.cfg.txt - 43.3 FPS
spinenet49.cfg - 43.2 FPS (640x640 network resolution)
elastic-csresnext50.cfg - 34.7 FPS (576x576 network resolution)

So it may make sense to train the model spinenet49.cfg with MISH + mosaic=1 cutmix=1 label_smooth_eps=0.1 spinenet49.cfg.txt

WongKinYiu commented 4 years ago

@AlexeyAB

I have only two free gpus currently, will train csresnext50morelayers and spinenet49 first.

WongKinYiu commented 4 years ago

@AlexeyAB

Model	CutMix	Mosaic	Label Smoothing	Mish	Top-1	Top-5
SpineNet-49	:heavy_check_mark:	:heavy_check_mark:	:heavy_check_mark:	:heavy_check_mark:	78.3%	94.6%

AlexeyAB commented 4 years ago

@WongKinYiu Thanks!

So SpineNet-49 is worse than csdarknet53 and csresnext50 at least for ImageNet

spinenet49.cfg.txt - 43.2 FPS - 78.3% | 94.6%
csdarknet53.cfg - 53.6 FPS - 78.7% | 94.8%
csresnext50.cfg - 51.2 FPS - 79.8% | 95.2%

Also I fixed label_smoothing for Detector (not for Classifier) https://github.com/AlexeyAB/darknet/commit/81290b07376c5abb4988a492dda70913bb90133d in such a way as there: https://github.com/david8862/keras-YOLOv3-model-set/blob/6cc297434e0604e2f6c34a8a2557b342468f083a/yolo3/loss.py#L225-L227 For such a probability transformation http://fooplot.com/#W3sidHlwZSI6MCwiZXEiOiJ4KjAuOSswLjA1IiwiY29sb3IiOiIjMDAwMDAwIn0seyJ0eXBlIjoxMDAwLCJ3aW5kb3ciOlsiMCIsIjEiLCIwIiwiMSJdfV0-

So you can try to train Detector with new label_smoothing.

Usage

[yolo]
label_smooth_eps=0.1

for each [yolo] layer

Since old label_smoothing worked well for the Classifier, but worked bad for the Detector.

WongKinYiu commented 4 years ago

The results in original paper.

CSPDarkNet-53 has more parameters and FLOPs.

AlexeyAB commented 4 years ago

Yes, SpineNet-49 has fewer params and flops, but CSPDarkNet-53 faster and more accurate for Classifier. But may be SpineNet-49 more accurate for Detector.

WongKinYiu commented 4 years ago

@AlexeyAB

Model	CutMix	Mosaic	Label Smoothing	Mish	Top-1	Top-5
CSPResNeXt-50-morelayers	:heavy_check_mark:	:heavy_check_mark:	:heavy_check_mark:	:heavy_check_mark:	79.4%	95.2%

AlexeyAB commented 4 years ago

@WongKinYiu Thanks! Do you mean csresnext50morelayers.cfg or CSPDarkNet-53-morelayers? https://github.com/WongKinYiu/CrossStagePartialNetworks/issues/6#issuecomment-584406057

WongKinYiu commented 4 years ago

@AlexeyAB Oh, sorry, it is csresnext50morelayers.cfg.

AlexeyAB commented 4 years ago

@WongKinYiu So csresnext50morelayers.cfg is worse than csresnext50.cfg (Top1 79.4% vs 79.8%) on ImageNet. https://github.com/WongKinYiu/CrossStagePartialNetworks/blob/master/imagenet/results.md

But I think csresnext50morelayers.cfg will be better as backbone for Detector.

WongKinYiu commented 4 years ago

@AlexeyAB

Yes, csresnext50 performs better on ImageNet.

I will get a free gpu after about 4 days. However, currently I do not have results of backbone with mish activation on MSCOCO, could you help for designing the cfg for detector with csresnext50morelayers backbone?

Thanks.

AlexeyAB commented 4 years ago

@WongKinYiu Ok, I can make 2 cfg-files, with [net] mosaic=1 dynamic_minibatch=1 and mish-activation:

csresnext50morelayers + SPP_PAN
csresnext50morelayers + SPP+ASFF+BiFPN

Will we try to test new label_smoothing for Detector? When will the CBN, DropBlock, ASFF and BiFPN model training end approximately?

WongKinYiu commented 4 years ago

@AlexeyAB

for classifier, cbn will finish in one week, cbn+dropblock still very slow, i think it need more than one month to finish training.

for detector, rfb+bn need about two weeks, cbn need about two weeks, bifpn need about three to four weeks, but the training may stop several days or weeks, asff not yet start.

i will also do ablation study for dynamic_minibatch and new label_smoothing.

glenn-jocher commented 4 years ago

@AlexeyAB @WongKinYiu have you had any success with label smoothing? I just learned about it recently, but was confused about a few things:

Can it be applied in both classification models and object detection models?
Is it always applied to both negative samples (i.e. 0.1) and positive samples (i.e. 0.9) , or could it be applied only to negatives etc?
For object detection, should it be applied to both objectness loss and classification loss?
Can it be applied to either CEloss and BCEloss criteria?

WongKinYiu commented 4 years ago

@glenn-jocher

Yes, it can be applied on classification head of detectors - BoF.
I think both okay, because it can be applied on both of YOLOv3 and FasterRCNN.
In BoF paper, it seems only applied on classification head.
I think yes, the paper mentioned "In the case of sigmoid outputs of range 0 to 1.0 as in YOLOv3 [16], label smoothing is even simpler by correcting the upper and lower limit of the range of targets as in Eq. 3."

But unfortunately, all of mixup, cosine lr, and label smoothing get worse results in my experiments.

glenn-jocher commented 4 years ago

@WongKinYiu ah thanks, that's super informative!

That solves a big mystery for me then. I tried to apply it to both obj loss and class loss at the same time, and it destroyed my NMS because every anchor single was above threshold (of 0.001).

I implemented cosine lr scheduler a couple weeks ago, it worked well (+0.3 mAP) though I noticed it worked better if I raised the initial LR. Before with the traditional step scheduler I was using about lr0=0.006, now with the cosine scheduler I use lr0=0.010 to get that +0.3 increase on COCO.

Name	mAP@0.5	mAP@0.5-0.95	Comments
(288-640)-608 to 273 bs16a4 yolov3-spp.cfg	61.6	41.6	step lr
(288-640)-608 to 273 bs16a4 yolov3-spp.cfg	61.8	41.9	cos lr0=0.01

glenn-jocher commented 4 years ago

@WongKinYiu see https://github.com/ultralytics/yolov3/issues/238#issuecomment-593611986 for the cosine scheduler implementation. These are the training plots for the two runs (step and cos lr). Interestingly the val losses are better at the end with step, and you can see cos obj loss is starting to overtrain at the end, but the cos final mAP is still slightly higher. I'm not quite sure what that means.

results

glenn-jocher commented 4 years ago

@WongKinYiu do you know what the value of epsilon should be in eqn3 of the BoF paper? If I assume epsilon=0.1 the classification target values (after a sigmoid) would be

positive: (1 - 0.1) = 0.9
negative: 0.1/(80-1) = 0.0013

Does that seem right??

WongKinYiu commented 4 years ago

@glenn-jocher https://github.com/dmlc/gluon-cv/blob/master/gluoncv/model_zoo/yolo/yolo_target.py#L268-L273

glenn-jocher commented 4 years ago

In their case they seem to be using epsilon as smooth_weight, with a constraint to keep it getting too large if the class count is low. Ok, I'll start from there. smooth_weight = min(1. / self._num_class, 1. / 40)

WongKinYiu commented 4 years ago

It seems only YOLOv3 can apply label smooth. All of SSD, CenterNet, FasterRCNN, MaskRCNN do not have label smooth function.

AlexeyAB commented 4 years ago

@glenn-jocher

I implemented cosine lr scheduler a couple weeks ago, it worked well (+0.3 mAP) though I noticed it worked better if I raised the initial LR. Before with the traditional step scheduler I was using about lr0=0.006, now with the cosine scheduler I use lr0=0.010 to get that +0.3 increase on COCO.

So in terms of Darknet, instead of

learning_rate=0.00261
burn_in=1000
max_batches = 500500
policy=steps
steps=400000,450000
scales=.1,.1

now you use

learning_rate=0.01
burn_in=1000
max_batches = 500500
policy=sgdr

@WongKinYiu So we can try it too.

AlexeyAB commented 4 years ago

@glenn-jocher

Why does MAP grow sharply at the very end?

AlexeyAB commented 4 years ago

@WongKinYiu

for classifier, cbn will finish in one week, cbn+dropblock still very slow, i think it need more than one month to finish training.

for detector, rfb+bn need about two weeks, cbn need about two weeks, bifpn need about three to four weeks, but the training may stop several days or weeks, asff not yet start.

i will also do ablation study for dynamic_minibatch and new label_smoothing.

Thanks!

bifpn need about three to four weeks, but the training may stop several days or weeks, asff not yet start.

Is it due to memory leak bug in some platforms?

Also did you start training csresnext50-ws-mi2.cfg.txt weighted-shortcut? https://github.com/AlexeyAB/darknet/issues/4498#issuecomment-592191368 Training these models is not very slow? weighted-shortcut (csresnext50-ws-mi2.cfg.txt and csresnext50-ws.cfg.txt and csdarknet53-ws.cfg.txt) and BiFPN (csdarknet53-bifpn-optimal.cfg.txt and csresnext50-bifpn-optimal.cfg.txt)

WongKinYiu commented 4 years ago

@AlexeyAB

Is it due to memory leak bug in some platforms?

no, it is because i turn back gpus to my friend. the current training is on cloud gpus.

Also did you start training csresnext50-ws-mi2.cfg.txt weighted-shortcut?

yes, csresnext50-ws-mi2, csresnext50-ws, and csdarknet53-ws are under training. these models need about two weeks to finish training.

glenn-jocher commented 4 years ago

@AlexeyAB the mAP spikes at the end because of a decision I made to compute mAP at only 0.1 conf threshold for all of training, except for the last epoch, which I compute at the usual 0.001 conf threshold. I did this to speed up mAP computation during training, but this is confusing the hell out of everyone (naturally), so a couple weeks ago I finally did away with the practice, and now I compute all mAPs at 0.001 conf. So basically, mAP does not spike at the end, rather it is underrepresented up until the end.

I probably owe an apology to anyone who's ever had to look at one of my plots and wonder the same question (sorry!).

glenn-jocher commented 4 years ago

@glenn-jocher

I implemented cosine lr scheduler a couple weeks ago, it worked well (+0.3 mAP) though I noticed it worked better if I raised the initial LR. Before with the traditional step scheduler I was using about lr0=0.006, now with the cosine scheduler I use lr0=0.010 to get that +0.3 increase on COCO.

So in terms of Darknet, instead of
learning_rate=0.00261
burn_in=1000
max_batches = 500500
policy=steps
steps=400000,450000
scales=.1,.1
now you use
learning_rate=0.01
burn_in=1000
max_batches = 500500
policy=sgdr
@WongKinYiu So we can try it too.

@AlexeyAB yes exactly, along with momentum 0.937. The LR multiple should look like this over the training (not including the burnin):

AlexeyAB commented 4 years ago

@glenn-jocher Thanks! Did you try ASFF?

@WongKinYiu Thanks! It seems that cosine(sgdr)-learning policy requires higher initial lr = 0.01 for Detector.

WongKinYiu commented 4 years ago

@AlexeyAB OK, i will do ablation study for:

dynamic_minibatch
label_smoothing
sgdr

glenn-jocher commented 4 years ago

@AlexeyAB that's right I need to try ASFF still. Sometimes its a bit complicated importing PyTorch modules from other repos. I've opened an issue https://github.com/ruinmessi/ASFF/issues/72 there to see if they have an available cfg file, but if not I will try and do it the hard way this coming week.

It looks like stripping away everything else they show a +1.8 mAP bump from ASFF https://github.com/ruinmessi/ASFF#coco, 38.8 to 40.6.

glenn-jocher commented 4 years ago

@AlexeyAB ok I've read the ASFF paper, which confused me a bit, but then I found this comment, which I think tried to explain simply https://github.com/ruinmessi/ASFF/issues/51#issuecomment-577445717 but was not quite right.

I think the correct implementation of ASFF is this. An example 256x416 image, the 3 yolo outputs (80-class, 3-anchor) with rectangular inference are: Layer 89/114 yolo0 - [1, 255, 8, 13] Layer 101/114 yolo1 - [1, 255, 16, 26] Layer 113/114 yolo2 - [1, 255, 32, 52]

The ASFF weight vectors are w0(1,255,1,1), w1(1,255,1,1), w2(1,255,1,1), with the constraints that the weights sum to 1 (across the yolo layers) and range between 0-1. This is very similar to BiFPN then, which simply applies a scalar weight instead for feature fusion: w0(1,), w1(1,), w2(1,).

The ASFF output for example at yolo0 would = yolo0 * w0 + yolo1 * rescale(w1) + yolo2 * rescale(w3)

ASFF proposes a softmax activation on the weights while BiFPN proposes a (faster/worse) softmax approximation. I think I can implement this this week on yolov3-spp.cfg.

AlexeyAB commented 4 years ago

@glenn-jocher The main difference, for 3 branches:

ASFF - coefficients [w, h, 3] are calculated during inference and uses Softmax
BiFPN - coefficients [1, 1, 3] are calculated during training and uses ReLU

glenn-jocher commented 4 years ago

@AlexeyAB ah ok I totally misunderstood then. How about this? Or where exactly would the ASFF-related weights be output?

glenn-jocher commented 4 years ago

@AlexeyAB ah, or instead 19x19x3 like the exact dimensions you said. Then yolov1 and yolov2 would be i.e. 38x38x3 and 76x76x3. Ok yes I think I see now.

One change to make this happen is that the entire network needs to run first before any of the output layers. In PyTorch right now the layers run in sequential order, so yolo0 output is calculated before the rest of the convolutions for yolo1 etc.

AlexeyAB commented 4 years ago

@glenn-jocher Look at this cfg-file with: fixed-BiFPN+ASFF+RFB+DropBlock+csresnext50morelayers_backbone: csresnext50morelayers-spp-asff-bifpn-rfb-db.cfg.txt

that tries to use label_smoothing, dynamic_minibatch, sgdr, and avoids iou_thresh.

glenn-jocher commented 4 years ago

@AlexeyAB wow, that cfg sounds like its packing all the latest goodies. Have you gotten dropblock to work well? RFB I've seen before but haven't looked into it, though the ASFF paper did mention they used RFB + DB for their best results.

LukeAI commented 4 years ago

@AlexeyAB ah ok I totally misunderstood then. How about this? Or where exactly would the ASFF-related weights be output?

what's this software? it can visualise arbitrary darknet .cfgs?

kossolax commented 4 years ago

it's netron: https://lutzroeder.github.io/netron/

AlexeyAB commented 4 years ago

@glenn-jocher

Have you gotten dropblock to work well?

I have fixed DropBlock, but haven't tested it on large datasets.

AlexeyAB commented 4 years ago

@WongKinYiu

cbn+dropblock still very slow, i think it need more than one month to finish training.

I sligtly improved speed of DropBlock in the last commit, so you can continue training of DropBlock with new code, for faster training.

asff not yet start.

I think its better to wait for csresnext50sub, and train this model https://github.com/WongKinYiu/CrossStagePartialNetworks/issues/6#issuecomment-596849834 with csresnext50sub-backbonne

WongKinYiu commented 4 years ago

@AlexeyAB Thanks.

WongKinYiu commented 4 years ago

@AlexeyAB

Model	CutMix	Mosaic	Label Smoothing	Mish	Top-1	Top-5
CSPResNeXt-50-sub weights	:heavy_check_mark:	:heavy_check_mark:	:heavy_check_mark:	:heavy_check_mark:	79.5%	95.3%

WongKinYiu commented 4 years ago

@AlexeyAB csresnext50sub-spp-asff-bifpn-rfb-db need too much memory. it can only be trained with batch size equal to 64/64 if use default setting, i train it with size 416 with 64/32.

update: always gets nan in 1k iterations.

WongKinYiu / CrossStagePartialNetworks

Try to train fast (grouped-conv) versions of csdarknet53 and csdarknet19 #6