WongKinYiu / CrossStagePartialNetworks

Cross Stage Partial Networks
https://github.com/WongKinYiu/CrossStagePartialNetworks

Try to train fast (grouped-conv) versions of csdarknet53 and csdarknet19 #6

Open AlexeyAB opened 4 years ago

AlexeyAB commented 4 years ago

@WongKinYiu Hi,

Since CSPDarkNet53 is better than CSPResNeXt50 for Detector, try to train these 4 models:

Model | GPU | 256x256 (FPS) | 512x512 (FPS) | 608x608 (FPS)
--- | --- | --- | --- | ---
darknet53.cfg (original) | RTX 2070 | 113 | 56 | 38
csdarknet53.cfg (original) | RTX 2070 | 101 | 57 | 41
csdarknet53g.cfg.txt | RTX 2070 | 122 | 64 | 46
csdarknet53ghr.cfg.txt | RTX 2070 | 100 | 75 | 57
spinenet49.cfg.txt (low priority) | RTX 2070 | 49 | 44 | 43
csdarknet19-fast.cfg.txt | RTX 2070 | 213 | 149 | 116

csdarknet19-fast.cfg contains DropBlock, so use the latest version of Darknet, which uses fast random functions for DropBlock.
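For context, DropBlock zeroes out contiguous square regions of a feature map rather than individual activations. Below is a minimal PyTorch sketch of the idea, for illustration only; it is not the Darknet CUDA implementation, and drop_prob/block_size are example values.

```python
import torch
import torch.nn.functional as F

def dropblock(x, drop_prob=0.1, block_size=7):
    """Simplified DropBlock for a feature map x of shape (N, C, H, W).

    Illustration of the technique only; the Darknet implementation
    referenced above differs in details (including its random-number code).
    """
    if drop_prob == 0.0:
        return x
    # seed probability so the expected dropped area is roughly drop_prob
    gamma = drop_prob / (block_size ** 2)
    seeds = (torch.rand_like(x) < gamma).float()
    # grow each seed into a block_size x block_size square of dropped pixels
    drop_mask = F.max_pool2d(seeds, kernel_size=block_size,
                             stride=1, padding=block_size // 2)
    keep_mask = 1.0 - drop_mask
    # rescale so the expected activation magnitude stays the same
    return x * keep_mask * keep_mask.numel() / keep_mask.sum().clamp(min=1.0)

x = torch.randn(2, 64, 32, 32)
print(dropblock(x).shape)  # torch.Size([2, 64, 32, 32])
```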

WongKinYiu commented 4 years ago

@AlexeyAB Thanks,

I will get free GPUs after finishing training the local_avgpool models.

AlexeyAB commented 4 years ago

@WongKinYiu Hi,

So you should combine the two networks: more layers and more parameters (CSPDarknet-53) + more outputs (CSPResNeXt-50).


  1. It seems that CSPResNeXt50 has higher Top1/Top5 because it has more outputs per layer (out_w * out_h * out_c), i.e. higher filters= in its [conv] layers: 258 291 (1.1x) for CSPResNeXt50 vs 233 348 (1.0x) for CSPDarknet53.

  2. It seems that while a small number of parameters, ~21M, with 84 layers (CSPResNet50 with [conv] groups=1 / CSPResNeXt50 with groups=32), is sufficient for the 256x256 Classifier, a much larger number of parameters, ~27M, with [conv] groups=1 and 108 layers (CSPDarkNet53), is needed for the 512x512 Detector.

Suggestion:

Model | Num of layers | Groups | Parameters | Average outputs (out_w x out_h x out_c) | RTX 2070 FPS | BFLOPS | Top1 / Top5 | Top1 / Top5 (mosaic + label smooth + mish) | AP (Detector) 512x512
--- | --- | --- | --- | --- | --- | --- | --- | --- | ---
CSPDarkNet53 256x256 | 108 | 1 | 27M | 233 348 (1.0x) | 125 | 13 | 77.2% / 93.6% | 78.7% / 94.8% | 38.7%
CSPResNeXt50 256x256 | 84 | 32 | 20M | 258 291 (1.1x) | 72 | 8 | 77.9% / 94.0% | 79.8% / 95.2% | 38.0%
CSPResNet50 256x256 | 84 | 1 | 21M | 203 665 (0.87x) | 168 | 9 | 76.6% / 93.3% | 78.1% / 94.2% | 38.0%

Did you try to train with DropBlock, does it work well?

WongKinYiu commented 4 years ago

@AlexeyAB

Yes. If I change the output channels of CSPResNet50 and CSPDarknet53 to 2048, I think they can achieve better results, but with a large amount of computation.

Do you need an ImageNet pre-trained model with more layers + more parameters + more outputs? If yes, I can train one. Or if you have a cfg file, I will get 2 free GPUs tomorrow to train it.

The DropBlock models are still training. Currently, they get slightly lower accuracy than the models without DropBlock at the same epoch, but that may be because DropBlock needs more epochs to converge.

WongKinYiu commented 4 years ago

@AlexeyAB Hello,

The model with DropBlock gets lower accuracy than the one without it (79.1 vs 79.8). I think we need to follow what EfficientNet does: reduce the drop probability during training.

AlexeyAB commented 4 years ago

@WongKinYiu Hi,

This is already done: https://github.com/AlexeyAB/darknet/blob/d51d89053afc4b7f50a30ace7b2fcf1b2ddd7598/src/dropout_layer_kernels.cu#L28-L31
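For readers following along, the linked lines scale the DropBlock drop probability with the training iteration, so dropping starts gently and only reaches the configured value later in training. A rough Python sketch of that scheduling idea (the ramp shape and length here are placeholders, not the actual Darknet constants):

```python
def scheduled_drop_prob(target_prob, cur_iteration, ramp_iterations):
    """Ramp the drop probability linearly from 0 up to target_prob.

    Sketch of the scheduling idea behind the linked Darknet lines;
    the real code may use a different ramp length or shape.
    """
    progress = min(cur_iteration / max(ramp_iterations, 1), 1.0)
    return target_prob * progress

# example: with target_prob=0.1 and a 10000-iteration ramp,
# iteration 2500 trains with drop probability 0.025
print(scheduled_drop_prob(0.1, 2500, 10000))
```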

So we should try:

AlexeyAB commented 4 years ago

@WongKinYiu Hi,

> Do you need an ImageNet pre-trained model with more layers + more parameters + more outputs? If yes, I can train one. Or if you have a cfg file, I will get 2 free GPUs tomorrow to train it.

Please try to train these 2 models - both use MISH + mosaic=1 cutmix=1 label_smooth_eps=0.1 + reduced groups= for faster inference:

  1. csresnext50morelayers.cfg.txt - added more layers between the 1st and 2nd subsampling

  2. csresnext50sub.cfg.txt - added more layers between the 1st and 2nd subsampling, and concatenated 2 subsamplings: [conv] stride=2 and [maxpool] stride=2


Also did you try to train CSPResNeXt-50+Elastic with MISH + mosaic=1 cutmix=1 label_smooth_eps=0.1 ?

Also did you try to train spinenet49.cfg.txt ?

WongKinYiu commented 4 years ago

OK, will train these two models.

No, the inference speed of CSPResNeXt-50 with Elastic is too slow. I think it cannot run real-time object detection.

AlexeyAB commented 4 years ago

@WongKinYiu

All with MISH-activation and 608x608 network resolution on GeForce RTX 2070:

So it may make sense to train spinenet49.cfg with MISH + mosaic=1 cutmix=1 label_smooth_eps=0.1: spinenet49.cfg.txt

WongKinYiu commented 4 years ago

@AlexeyAB

I currently have only two free GPUs, so I will train csresnext50morelayers and spinenet49 first.

WongKinYiu commented 4 years ago

@AlexeyAB

Model | CutMix | Mosaic | Label Smoothing | Mish | Top-1 | Top-5
--- | --- | --- | --- | --- | --- | ---
SpineNet-49 | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | 78.3% | 94.6%
AlexeyAB commented 4 years ago

@WongKinYiu Thanks!

So SpineNet-49 is worse than csdarknet53 and csresnext50, at least on ImageNet.


Also, I fixed label_smoothing for the Detector (not for the Classifier): https://github.com/AlexeyAB/darknet/commit/81290b07376c5abb4988a492dda70913bb90133d in the same way as here: https://github.com/david8862/keras-YOLOv3-model-set/blob/6cc297434e0604e2f6c34a8a2557b342468f083a/yolo3/loss.py#L225-L227 i.e. the probability transformation x*0.9 + 0.05: http://fooplot.com/#W3sidHlwZSI6MCwiZXEiOiJ4KjAuOSswLjA1IiwiY29sb3IiOiIjMDAwMDAwIn0seyJ0eXBlIjoxMDAwLCJ3aW5kb3ciOlsiMCIsIjEiLCIwIiwiMSJdfV0-

So you can try to train Detector with new label_smoothing.

Usage

[yolo]
label_smooth_eps=0.1

for each [yolo] layer

The old label_smoothing worked well for the Classifier but badly for the Detector.
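For reference, the transformation in the fooplot link (x*0.9 + 0.05) is binary-target smoothing with eps=0.1, matching the linked keras-YOLOv3 loss lines. A minimal NumPy sketch, for illustration only (not the Darknet code):

```python
import numpy as np

def smooth_binary_targets(y_true, label_smooth_eps=0.1):
    """Smooth 0/1 targets toward 0.5: y -> y * (1 - eps) + eps / 2.

    With eps=0.1 this is the x*0.9 + 0.05 mapping from the fooplot link,
    so hard targets 0 and 1 become 0.05 and 0.95.
    """
    y_true = np.asarray(y_true, dtype=np.float32)
    return y_true * (1.0 - label_smooth_eps) + 0.5 * label_smooth_eps

print(smooth_binary_targets([0.0, 1.0]))  # [0.05 0.95]
```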

WongKinYiu commented 4 years ago

The results in the original paper:

[image: results table from the original paper]

CSPDarkNet-53 has more parameters and FLOPs.

AlexeyAB commented 4 years ago

Yes, SpineNet-49 has fewer params and FLOPs, but CSPDarkNet-53 is faster and more accurate as a Classifier. But maybe SpineNet-49 is more accurate for the Detector.

WongKinYiu commented 4 years ago

@AlexeyAB

Model | CutMix | Mosaic | Label Smoothing | Mish | Top-1 | Top-5
--- | --- | --- | --- | --- | --- | ---
CSPResNeXt-50-morelayers | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | 79.4% | 95.2%
AlexeyAB commented 4 years ago

@WongKinYiu Thanks! Do you mean csresnext50morelayers.cfg or CSPDarkNet-53-morelayers? https://github.com/WongKinYiu/CrossStagePartialNetworks/issues/6#issuecomment-584406057

WongKinYiu commented 4 years ago

@AlexeyAB Oh, sorry, it is csresnext50morelayers.cfg.

AlexeyAB commented 4 years ago

@WongKinYiu So csresnext50morelayers.cfg is worse than csresnext50.cfg (Top1 79.4% vs 79.8%) on ImageNet. https://github.com/WongKinYiu/CrossStagePartialNetworks/blob/master/imagenet/results.md

But I think csresnext50morelayers.cfg will be better as backbone for Detector.

WongKinYiu commented 4 years ago

@AlexeyAB

Yes, csresnext50 performs better on ImageNet.

I will get a free GPU in about 4 days. However, I currently do not have results for a backbone with Mish activation on MS COCO; could you help design the cfg for a detector with the csresnext50morelayers backbone?

Thanks.

AlexeyAB commented 4 years ago

@WongKinYiu Ok, I can make 2 cfg-files, with [net] mosaic=1 dynamic_minibatch=1 and mish-activation:

  1. csresnext50morelayers + SPP_PAN
  2. csresnext50morelayers + SPP+ASFF+BiFPN

Will we try to test new label_smoothing for Detector? When will the CBN, DropBlock, ASFF and BiFPN model training end approximately?

WongKinYiu commented 4 years ago

@AlexeyAB

For the classifier, CBN will finish in one week. CBN+DropBlock is still very slow; I think it needs more than one month to finish training.

For the detector, RFB+BN needs about two weeks, CBN needs about two weeks, and BiFPN needs about three to four weeks (but the training may stop for several days or weeks). ASFF has not started yet.

I will also do an ablation study for dynamic_minibatch and new label_smoothing.

glenn-jocher commented 4 years ago

@AlexeyAB @WongKinYiu have you had any success with label smoothing? I just learned about it recently, but was confused about a few things:

WongKinYiu commented 4 years ago

@glenn-jocher

[image: ablation results for mixup, cosine lr, and label smoothing]

But unfortunately, all of mixup, cosine lr, and label smoothing get worse results in my experiments.

glenn-jocher commented 4 years ago

@WongKinYiu ah thanks, that's super informative!

That solves a big mystery for me then. I tried to apply it to both the obj loss and the class loss at the same time, and it destroyed my NMS because every single anchor was above the threshold (0.001).

I implemented a cosine LR scheduler a couple of weeks ago; it worked well (+0.3 mAP), though I noticed it worked better if I raised the initial LR. Before, with the traditional step scheduler, I was using about lr0=0.006; now, with the cosine scheduler, I use lr0=0.010 to get that +0.3 increase on COCO.

Name | mAP@0.5 | mAP@0.5-0.95 | Comments
--- | --- | --- | ---
(288-640)-608 to 273 bs16a4 yolov3-spp.cfg | 61.6 | 41.6 | step lr
(288-640)-608 to 273 bs16a4 yolov3-spp.cfg | 61.8 | 41.9 | cos lr0=0.01
glenn-jocher commented 4 years ago

@WongKinYiu see https://github.com/ultralytics/yolov3/issues/238#issuecomment-593611986 for the cosine scheduler implementation. These are the training plots for the two runs (step and cos lr). Interestingly, the val losses are better at the end with step, and you can see the cos obj loss is starting to overtrain at the end, but the cos final mAP is still slightly higher. I'm not quite sure what that means.

[image: training plots for the step-lr and cos-lr runs]

glenn-jocher commented 4 years ago

@WongKinYiu do you know what the value of epsilon should be in eqn 3 of the BoF paper? If I assume epsilon=0.1, the classification target values (after a sigmoid) would be:

Does that seem right??

[screenshot: classification target values]
WongKinYiu commented 4 years ago

@glenn-jocher https://github.com/dmlc/gluon-cv/blob/master/gluoncv/model_zoo/yolo/yolo_target.py#L268-L273

glenn-jocher commented 4 years ago

In their case they seem to be using epsilon as smooth_weight, with a constraint to keep it from getting too large if the class count is low. OK, I'll start from there: smooth_weight = min(1. / self._num_class, 1. / 40)
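As a hypothetical sketch of that idea (not the exact gluon-cv target assignment), one way to apply the capped smooth_weight to one-hot class targets:

```python
import numpy as np

def smooth_class_targets(one_hot, num_classes):
    """Soften one-hot class targets with a smoothing weight capped for small
    class counts. Hypothetical illustration of the smooth_weight formula
    quoted above; gluon-cv's actual code differs in the details.
    """
    smooth_weight = min(1.0 / num_classes, 1.0 / 40)
    one_hot = np.asarray(one_hot, dtype=np.float32)
    # positives move down to 1 - smooth_weight, negatives move up to smooth_weight
    return np.where(one_hot > 0.5, 1.0 - smooth_weight, smooth_weight)

# COCO (80 classes): smooth_weight = 1/80 = 0.0125 -> targets 0.9875 / 0.0125
print(smooth_class_targets([0, 0, 1, 0], num_classes=80))
```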

WongKinYiu commented 4 years ago

It seems only YOLOv3 applies label smoothing; SSD, CenterNet, Faster R-CNN, and Mask R-CNN do not have a label smoothing function.

AlexeyAB commented 4 years ago

@glenn-jocher

> I implemented a cosine LR scheduler a couple of weeks ago; it worked well (+0.3 mAP), though I noticed it worked better if I raised the initial LR. Before, with the traditional step scheduler, I was using about lr0=0.006; now, with the cosine scheduler, I use lr0=0.010 to get that +0.3 increase on COCO.

So in terms of Darknet, instead of

learning_rate=0.00261
burn_in=1000
max_batches = 500500
policy=steps
steps=400000,450000
scales=.1,.1

now you use

learning_rate=0.01
burn_in=1000
max_batches = 500500
policy=sgdr

@WongKinYiu So we can try it too.

AlexeyAB commented 4 years ago

@glenn-jocher

[image: training plots]

Why does mAP grow sharply at the very end?

AlexeyAB commented 4 years ago

@WongKinYiu

> For the classifier, CBN will finish in one week. CBN+DropBlock is still very slow; I think it needs more than one month to finish training.
>
> For the detector, RFB+BN needs about two weeks, CBN needs about two weeks, and BiFPN needs about three to four weeks (but the training may stop for several days or weeks). ASFF has not started yet.
>
> I will also do an ablation study for dynamic_minibatch and new label_smoothing.

Thanks!

> BiFPN needs about three to four weeks, but the training may stop for several days or weeks. ASFF has not started yet.

Is it due to the memory leak bug on some platforms?

Also, did you start training csresnext50-ws-mi2.cfg.txt (weighted shortcut)? https://github.com/AlexeyAB/darknet/issues/4498#issuecomment-592191368 Isn't training these models very slow? Weighted shortcut (csresnext50-ws-mi2.cfg.txt, csresnext50-ws.cfg.txt, csdarknet53-ws.cfg.txt) and BiFPN (csdarknet53-bifpn-optimal.cfg.txt, csresnext50-bifpn-optimal.cfg.txt).

WongKinYiu commented 4 years ago

@AlexeyAB

> Is it due to the memory leak bug on some platforms?

No, it is because I returned the GPUs to my friend. The current training is on cloud GPUs.

> Also, did you start training csresnext50-ws-mi2.cfg.txt (weighted shortcut)?

Yes, csresnext50-ws-mi2, csresnext50-ws, and csdarknet53-ws are under training. These models need about two weeks to finish training.

glenn-jocher commented 4 years ago

@AlexeyAB the mAP spikes at the end because of a decision I made to compute mAP at only 0.1 conf threshold for all of training, except for the last epoch, which I compute at the usual 0.001 conf threshold. I did this to speed up mAP computation during training, but this is confusing the hell out of everyone (naturally), so a couple weeks ago I finally did away with the practice, and now I compute all mAPs at 0.001 conf. So basically, mAP does not spike at the end, rather it is underrepresented up until the end.

I probably owe an apology to anyone who's ever had to look at one of my plots and wonder the same question (sorry!).

glenn-jocher commented 4 years ago

> @glenn-jocher
>
> > I implemented a cosine LR scheduler a couple of weeks ago; it worked well (+0.3 mAP), though I noticed it worked better if I raised the initial LR. Before, with the traditional step scheduler, I was using about lr0=0.006; now, with the cosine scheduler, I use lr0=0.010 to get that +0.3 increase on COCO.
>
> So in terms of Darknet, instead of
>
> learning_rate=0.00261
> burn_in=1000
> max_batches = 500500
> policy=steps
> steps=400000,450000
> scales=.1,.1
>
> now you use
>
> learning_rate=0.01
> burn_in=1000
> max_batches = 500500
> policy=sgdr
>
> @WongKinYiu So we can try it too.

@AlexeyAB yes, exactly, along with momentum 0.937. The LR multiple should look like this over the training (not including the burn-in): [image: LR schedule plot]
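For reference, a minimal PyTorch sketch of a cosine LR multiple with that shape (1.0 at the start, decaying along a half-cosine toward a small floor over the epochs). The 0.05 floor is an assumption for illustration; the exact constants and burn-in handling in the ultralytics code may differ.

```python
import math
import torch
from torch.optim.lr_scheduler import LambdaLR

epochs = 273                       # as in the yolov3-spp runs above
model = torch.nn.Linear(10, 10)    # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.937)

# LR multiple: 1.0 at epoch 0, ~0.05 at the final epoch (floor is an assumption)
lf = lambda e: ((1 + math.cos(e * math.pi / epochs)) / 2) * 0.95 + 0.05
scheduler = LambdaLR(optimizer, lr_lambda=lf)

for epoch in range(epochs):
    # ... train one epoch, then step the scheduler ...
    scheduler.step()
```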

AlexeyAB commented 4 years ago

@glenn-jocher Thanks! Did you try ASFF?

@WongKinYiu Thanks! It seems that the cosine (sgdr) learning policy requires a higher initial lr = 0.01 for the Detector.

WongKinYiu commented 4 years ago

@AlexeyAB OK, I will do an ablation study for:

glenn-jocher commented 4 years ago

@AlexeyAB that's right, I still need to try ASFF. Sometimes it's a bit complicated importing PyTorch modules from other repos. I've opened an issue there (https://github.com/ruinmessi/ASFF/issues/72) to see if they have an available cfg file, but if not I will try to do it the hard way this coming week.

It looks like, stripping away everything else, they show a +1.8 mAP bump from ASFF (https://github.com/ruinmessi/ASFF#coco), 38.8 to 40.6.

glenn-jocher commented 4 years ago

@AlexeyAB OK, I've read the ASFF paper, which confused me a bit, but then I found this comment, which I think tried to explain it simply (https://github.com/ruinmessi/ASFF/issues/51#issuecomment-577445717) but was not quite right.

I think the correct implementation of ASFF is this. For an example 256x416 image, the 3 yolo outputs (80-class, 3-anchor) with rectangular inference are: Layer 89/114 yolo0 - [1, 255, 8, 13]; Layer 101/114 yolo1 - [1, 255, 16, 26]; Layer 113/114 yolo2 - [1, 255, 32, 52].

The ASFF weight vectors are w0(1,255,1,1), w1(1,255,1,1), w2(1,255,1,1), with the constraints that the weights sum to 1 (across the yolo layers) and range between 0-1. This is very similar to BiFPN then, which simply applies a scalar weight instead for feature fusion: w0(1,), w1(1,), w2(1,).

The ASFF output at yolo0, for example, would be yolo0 * w0 + yolo1 * rescale(w1) + yolo2 * rescale(w2).

ASFF proposes a softmax activation on the weights, while BiFPN proposes a (faster/worse) softmax approximation. I think I can implement this on yolov3-spp.cfg this week.
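To make the two normalizations concrete, here is a small sketch contrasting softmax fusion weights with BiFPN's fast normalized fusion (ReLU weights divided by their sum plus a small epsilon). It uses per-level scalar weights purely for illustration and is independent of the exact weight shapes discussed above.

```python
import torch
import torch.nn.functional as F

def softmax_fusion(feats, logits):
    """ASFF-style: weights are a softmax over the levels, so they sum to 1."""
    w = torch.softmax(logits, dim=0)
    return sum(wi * fi for wi, fi in zip(w, feats))

def fast_normalized_fusion(feats, raw_w, eps=1e-4):
    """BiFPN-style approximation: ReLU weights normalized by their sum."""
    w = F.relu(raw_w)
    w = w / (w.sum() + eps)
    return sum(wi * fi for wi, fi in zip(w, feats))

# toy example: three same-shape feature maps fused with learnable per-level scalars
feats = [torch.randn(1, 255, 16, 26) for _ in range(3)]
logits = torch.nn.Parameter(torch.zeros(3))
raw_w = torch.nn.Parameter(torch.ones(3))
print(softmax_fusion(feats, logits).shape)
print(fast_normalized_fusion(feats, raw_w).shape)
```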

AlexeyAB commented 4 years ago

@glenn-jocher The main difference, for 3 branches:

glenn-jocher commented 4 years ago

@AlexeyAB ah ok I totally misunderstood then. How about this? Or where exactly would the ASFF-related weights be output?

[screenshot: network graph visualization]
glenn-jocher commented 4 years ago

@AlexeyAB ah, or instead 19x19x3, the exact dimensions you said. Then yolo1 and yolo2 would be 38x38x3 and 76x76x3. OK, yes, I think I see now.

One change to make this happen is that the entire network needs to run first before any of the output layers. In PyTorch right now the layers run in sequential order, so yolo0 output is calculated before the rest of the convolutions for yolo1 etc.
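A rough PyTorch sketch of ASFF fusion at a single output level, following the per-pixel weight description above: resize the other levels to this level's resolution, predict a weight map per level, softmax across the three levels, and fuse. The channel count, the 1x1/3x3 conv layout, and nearest-neighbor resizing are simplifying assumptions, not the exact ASFF code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ASFFLevel(nn.Module):
    """Fuse three feature maps into the resolution of one target level."""
    def __init__(self, channels=256):
        super().__init__()
        # one 1x1 conv per level to predict a single-channel weight map
        self.weight_convs = nn.ModuleList([nn.Conv2d(channels, 1, 1) for _ in range(3)])
        self.out_conv = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, feats, target_idx):
        # feats: list of 3 tensors (N, C, Hi, Wi) at different strides
        h, w = feats[target_idx].shape[-2:]
        resized = [F.interpolate(f, size=(h, w), mode="nearest") for f in feats]
        # per-pixel weight logits (N, 3, H, W), softmaxed across the 3 levels
        logits = torch.cat([conv(f) for conv, f in zip(self.weight_convs, resized)], dim=1)
        weights = torch.softmax(logits, dim=1)
        fused = sum(weights[:, i:i + 1] * resized[i] for i in range(3))
        return self.out_conv(fused)

# example: 19x19, 38x38, 76x76 maps fused into the 38x38 level
feats = [torch.randn(1, 256, s, s) for s in (19, 38, 76)]
print(ASFFLevel()(feats, target_idx=1).shape)  # torch.Size([1, 256, 38, 38])
```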

AlexeyAB commented 4 years ago

@glenn-jocher Look at this cfg-file with: fixed-BiFPN+ASFF+RFB+DropBlock+csresnext50morelayers_backbone: csresnext50morelayers-spp-asff-bifpn-rfb-db.cfg.txt

It uses label_smoothing, dynamic_minibatch, and sgdr, and avoids iou_thresh.

glenn-jocher commented 4 years ago

@AlexeyAB wow, that cfg sounds like it's packing all the latest goodies. Have you gotten DropBlock to work well? I've seen RFB before but haven't looked into it, though the ASFF paper did mention they used RFB + DropBlock for their best results.
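For anyone skimming, RFB is essentially a multi-branch block whose branches use different dilation rates to enlarge the receptive field, concatenated and projected back with a shortcut. A condensed sketch of that idea (a simplification of the RFBNet block, not the layers in the cfg above):

```python
import torch
import torch.nn as nn

class SimpleRFB(nn.Module):
    """Simplified Receptive Field Block: parallel dilated convs + shortcut."""
    def __init__(self, channels=256, dilations=(1, 3, 5)):
        super().__init__()
        branch_ch = channels // len(dilations)
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(channels, branch_ch, 1),
                nn.Conv2d(branch_ch, branch_ch, 3, padding=d, dilation=d),
            )
            for d in dilations
        ])
        self.project = nn.Conv2d(branch_ch * len(dilations), channels, 1)

    def forward(self, x):
        out = torch.cat([b(x) for b in self.branches], dim=1)
        return torch.relu(self.project(out) + x)  # residual shortcut

x = torch.randn(1, 256, 19, 19)
print(SimpleRFB()(x).shape)  # torch.Size([1, 256, 19, 19])
```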

LukeAI commented 4 years ago

> @AlexeyAB ah ok I totally misunderstood then. How about this? Or where exactly would the ASFF-related weights be output?
>
> [screenshot: network graph visualization]

What's this software? Can it visualise arbitrary Darknet .cfgs?

kossolax commented 4 years ago

It's Netron: https://lutzroeder.github.io/netron/

AlexeyAB commented 4 years ago

@glenn-jocher

> Have you gotten DropBlock to work well?

I have fixed DropBlock, but haven't tested it on large datasets.

AlexeyAB commented 4 years ago

@WongKinYiu

> CBN+DropBlock is still very slow; I think it needs more than one month to finish training.

I slightly improved the speed of DropBlock in the last commit, so you can continue training the DropBlock models with the new code for faster training.

> ASFF has not started yet.

I think it's better to wait for csresnext50sub and train this model https://github.com/WongKinYiu/CrossStagePartialNetworks/issues/6#issuecomment-596849834 with the csresnext50sub backbone.

WongKinYiu commented 4 years ago

@AlexeyAB Thanks.

WongKinYiu commented 4 years ago

@AlexeyAB

Model | CutMix | Mosaic | Label Smoothing | Mish | Top-1 | Top-5
--- | --- | --- | --- | --- | --- | ---
CSPResNeXt-50-sub (weights) | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | 79.5% | 95.3%
WongKinYiu commented 4 years ago

@AlexeyAB csresnext50sub-spp-asff-bifpn-rfb-db needs too much memory. It can only be trained with batch size 64/64 with the default settings, so I train it at size 416 with 64/32.

Update: it always gets NaN within 1k iterations.