Try to train fast (grouped-conv) versions of csdarknet53 and csdarknet19

AlexeyAB commented 4 years ago

@WongKinYiu Hi,

Since CSPDarkNet53 is better than CSPResNeXt50 for Detector, try to train these 4 models:

Model	GPU	256x256	512x512	608x608
darknet53.cfg (original)	RTX 2070	113	56	38
csdarknet53.cfg (original)	RTX 2070	101	57	41
csdarknet53g.cfg.txt	RTX 2070	122	64	46
csdarknet53ghr.cfg.txt	RTX 2070	100	75	57
spinenet49.cfg.txt low priority	RTX 2070	49	44	43
csdarknet19-fast.cfg.txt	RTX 2070	213	149	116

csdarknet19-fast.cfg contains DropBlock, so use the latest version of Darknet that uses fast random-functions for DropBlock.

AlexeyAB commented 4 years ago

@WongKinYiu

Please, attach csresnext50sub-spp-asff-bifpn-rfb-db.cfg file, I will investigate why Nan occurs.
Can you show AP95 (not only AP | AP50 | AP75) for model CSPResNeXt-50 + Scale ?
Also did you test AP for yolov3-spp.cfg with scale, mosaic, IoU threshold, Genetic, Anchors for comparison with this model https://github.com/WongKinYiu/CrossStagePartialNetworks/blob/master/coco/results.md#mscoco ?

WongKinYiu commented 4 years ago

@AlexeyAB

csresnext50sub-spp-asff-bifpn-rfb-db.cfg.txt
AP95: w scale 0.77%;w\o scale 0.89%.
i think i can not afford these training, could you help for training them? (current waiting queue https://github.com/WongKinYiu/CrossStagePartialNetworks/issues/6#issuecomment-596270800, asff, ...)

AlexeyAB commented 4 years ago

@WongKinYiu Ok,

Try to use new cfg-file: csresnext50sub-spp-asff-bifpn-rfb-db.cfg.txt

with the latest Darknet code.

I fixed BiFPN unstable training: https://github.com/AlexeyAB/darknet/commit/9ac401fa67132b127e585bd09f105ee3a5668261
I fixed cfg-file:
- fixed some [route] layers for ASFF
- activation=normalize_channels_softmax_maxval instead of activation=normalize_channels_softmax
- added batch_normalize=1 clip=3 for conv-layers with normalize_channels_softmax_maxval, and fix https://github.com/AlexeyAB/darknet/commit/54aa887462daef3499ae1e57d42a6b560b254899
- added max_delta=3 for each [yolo] layer
- burn_in=5000 instead of 1000
- learning_rate=0.003 instead of 0.01

Now training is much more stable.

If Nan will occur, then try to decrease learning_rate=0.001 burn_in=10000

Most increase training instability:

old BiFPN backward (fixed)
unconstrained ASFF weights (fixed clip=3)
ciou/giou-loss with unconstrained-delta (fixed max_delta=3)
high learning_rate=0.01 (decreased learning_rate=0.003)
mosaic=1

WongKinYiu commented 4 years ago

@AlexeyAB

does it mean if gpu compute capacity is > 700, it will force use cudnn_half? https://github.com/AlexeyAB/darknet/commit/1a56ef588a37f047bbac938d677c42b91a2b80bd

AlexeyAB commented 4 years ago

@WongKinYiu I added minor fix for cudnn_half: https://github.com/AlexeyAB/darknet/commit/002177de00a8f29d234eb761e277a988461f7710

Tomorrow I will provide fixed csresnext50sub-spp-asff-bifpn-rfb-db.cfg.txt model. Now it can be trained successfully. I just should test it with BN vs CBN on a small dataset.

chart

WongKinYiu commented 4 years ago

CSPResNeXt50 on imagenet: w cbn 78.1 top 1 acc; w\o cbn 78.5 top 1 acc.

AlexeyAB commented 4 years ago

@WongKinYiu

Ok, so (intra-batch-cbn) collecting statistics between minibachas inside the batch does not help, or is implemented incorrectly.

I hope dynamic_minibatch=1 works much better for Detector.

I will test ASFF without clip-paramter tonight and will provide new code and new cfg for csresnext50sub-spp-asff-bifpn-rfb-db.cfg.txt tomorow.

Did you train csresnext50-panet-spp-original-optimal.cfg on V100 with high mini-batch size, but was the classifier csresnext50.cfg trained with a standard minibatch size or with high mini-batch size?

Also did you train csresnext50-panet-spp-original-optimal.cfg with mish-activation by using CSPResNeXt-50 with CutMix + Mosaic + Label Smoothing + Mish from https://github.com/WongKinYiu/CrossStagePartialNetworks/blob/master/imagenet/results.md ?

WongKinYiu commented 4 years ago

@AlexeyAB

Did you train csresnext50-panet-spp-original-optimal.cfg on V100 with high mini-batch size, but was the classifier csresnext50.cfg trained with a standard minibatch size or with high mini-batch size?

All of classification models are trained on a single 1080 ti or 2080 ti, because all of them can use same setting 128/4. For detector, I found that after CIoU, Scale Sensitivity, IoU Threshold, Greedy NMS, Mosaic Augmentation, ... are applied, mini-batch size seems has no affect on mAP. (CSPDarkNet53-PANet-SPP get 41.6/64.1/45.0 with 64/16 and 41.7/64.2/45.2 with 64/8.) So currently only models for ablation studies are trained by 16-GB-RAM V100, other models are trained by any gpus which can be set 64/16. (csresnext50sub-spp-asff-bifpn-rfb-db.cfg need 10 GB RAM with 64/32.)

Did you train csresnext50-panet-spp-original-optimal.cfg with mish-activation?

I not yet trained csresnext50-panet-spp-original-optimal.cfg with Mish activation, but I can design the cfg following your new csresnext50sub-spp-asff-bifpn-rfb-db.cfg and train it.

AlexeyAB commented 4 years ago

@WongKinYiu

For detector, I found that after CIoU, Scale Sensitivity, IoU Threshold, Greedy NMS, Mosaic Augmentation, ... are applied, mini-batch size seems has no affect on mAP. (CSPDarkNet53-PANet-SPP get 41.6/64.1/45.0 with 64/16 and 41.7/64.2/45.2 with 64/8.)

I think mosaic=1 allows to calculate mean/variance batch-norm-statistic across 4x more images, it looks like 4x higher mini_batch.

Is this accuracy for 416x416?
What AP/AP50/AP75 for 608x608?
Why is it 41.6/64.1/45.0 lower than 42.4 | 64.4 | 45.9 in the csresnext50-panet-spp-original-optimal.cfg ?
Did you try to train CSPDarkNet53-PANet-SPP with Scale + Mosaic + Genetic + Anchor like csresnext50-panet-spp-original-optimal.cfg that achieves 42.4 | 64.4 | 45.9 ? https://github.com/WongKinYiu/CrossStagePartialNetworks/blob/master/coco/results.md#cspresnext-50-optimal

but I can design the cfg following your new csresnext50sub-spp-asff-bifpn-rfb-db.cfg and train it.

Tomorow I will share cfg-file with stable training on new code.

WongKinYiu commented 4 years ago

@AlexeyAB

Is this accuracy for 416x416?

train with width=416 and height=416, test with 512x512. https://github.com/ultralytics/yolov3/issues/698#issuecomment-587378576

What AP/AP50/AP75 for 608x608?

i do not test with 608x608.

Why is it 41.6/64.1/45.0 lower than 42.4 | 64.4 | 45.9 in the csresnext50-panet-spp-original-optimal.cfg ?

42.4/64.4/45.9 trained with optimal anchors for 512x512, and test with 512x512.

Did you try to train CSPDarkNet53-PANet-SPP with Scale + Mosaic + Genetic + Anchor like csresnext50-panet-spp-original-optimal.cfg that achieves 42.4 | 64.4 | 45.9 ?

need about 1~2 weeks to finish training.

AlexeyAB commented 4 years ago

@WongKinYiu Try to train this cfg-file with new Darknet code and with the highest possible width= and height=: csresnext50sub-spp-asff-bifpn-rfb-db.cfg.txt

AlexeyAB commented 4 years ago

@WongKinYiu Does it mean that CBN increases AP for Detector: https://github.com/WongKinYiu/CrossStagePartialNetworks/blob/master/coco/results.md#mscoco but decreases Top for Classifier? https://github.com/WongKinYiu/CrossStagePartialNetworks/issues/6#issuecomment-599061433

WongKinYiu commented 4 years ago

yes

AlexeyAB commented 4 years ago

So you can try to train csresnext50sub-spp-asff-bifpn-rfb-db.cfg.txt with CBN

WongKinYiu commented 4 years ago

still can not converge, i will disable label smooth first.

AlexeyAB commented 4 years ago

@WongKinYiu Does it go to Nan, or what avg loss value? Do you use the latest code of Darknet? Also try to use default lr-policy but with burn_in=5000 https://github.com/AlexeyAB/darknet/blob/master/cfg/csresnext50-panet-spp-original-optimal.cfg#L18-L23

WongKinYiu commented 4 years ago

@AlexeyAB

no nan, but iou of all layers become zero. yes, i use latest code. i disable label smooth first is because i think you test the code on a single class dataset.

AlexeyAB commented 4 years ago

@WongKinYiu

no nan, but iou of all layers become zero.

About after how many iterations? And what is the loss value?

Just to know Training BiFPN will be disable before 2x burin_in iterations reached due to param burnin_update=2 Training ASFF will be disable before 3x burin_in iterations reached due to param burnin_update=3

WongKinYiu commented 4 years ago

@AlexeyAB

i test several times, all occurs in 200~1000 iterations. the loss value of the first time is around 200. after disable label smooth, currently 500 iterations, and performs normal.

AlexeyAB commented 4 years ago

@WongKinYiu What resolution do you use for csresnext50sub-spp-asff-bifpn-rfb-db.cfg.txt ? can you train with 448x448 or 512x512 batch=64 subidivisions=32 random=1 ?

WongKinYiu commented 4 years ago

still get all iou zero, now training with 448x448 instead 416x416.

update: 448x448, performs normal at 3k iterations. update: 448x448, become all zero at 3.5k iterations.

AlexeyAB commented 4 years ago

@WongKinYiu I successfully trained changed model and Get mAP50 = 7.49%: csresnext50sub-spp-asff-bifpn-rfb-db.cfg.txt there are set:

./darknet detector train cfg/coco_my.data cfg/csresnext50sub-spp-asff-bifpn-rfb-db.cfg.txt new_cfg/csresnext50sub.conv.94 -map

So it seen 160 less images than should (16 less batch size X 10 less max_batches). This is similar to trained 3000 iterations with a batch = 64. But it detects very poorly, screenshot below.

[net]
batch=4
subdivisions=2
width=320
height=320

dynamic_minibatch=0

learning_rate=0.003
burn_in=2000
max_batches = 50050
policy=sgdr

[yolo]
label_smooth_eps=0
random=1

1	2	3	bdd1	bdd2

AlexeyAB commented 4 years ago

This is similar to trained 6000 iterations with a batch = 64.

[net]
batch=8
subdivisions=4
width=416
height=416

dynamic_minibatch=0

learning_rate=0.003
burn_in=2000
max_batches = 50050
policy=sgdr

[yolo]
label_smooth_eps=0
random=0

1	2	3	bdd1	bdd2
mAP50 = 13.4% (6000 iterations b=64 s=32)

WongKinYiu commented 4 years ago

start training with

[net]
dynamic_minibatch=0

[yolo]
label_smooth_eps=0

update: failed at 200 iterations

start training with

[net]
mosaic=0

update: failed at 200 iterations

AlexeyAB commented 4 years ago

@WongKinYiu

update: failed at 200 iterations

What avg loss do you get?
Do you use the latest commit at Commits on Mar 15, 2020 ?
Try to train with dynamic_minibatch=1 label_smooth_eps=0 and policy=steps ... as for yolov3.cfg
Attach the latest version cfg-file that you tried to train.

WongKinYiu commented 4 years ago

@AlexeyAB

2xx avg loss, sometimes 6x avg loss. depend on which iteration become all zero iou.
yes, i download the code yesterday.
all of dynamic_minibatch=1, label_smooth_eps=1, policy=sgdr can work with cspdarknet53-panet-spp.
~will provide cfg about 10 hours latter.~ csresnext50sub-spp-asff-bifpn-rfb-db.cfg.txt

AlexeyAB commented 4 years ago

@WongKinYiu

Try to train:
csresnext50sub-spp-asff-bifpn-rfb-db.cfg.txt

I added stopbackward=3000 at [conv] layer with filters=2048 (last layer of backbone), so for the 1st 3000 iterations the Backbone will not be trained. And there will be no negative impact of random deltas on the Backbone (random deltas coming from random initial weights from the BiFPN/ASFF blocks) - this increases the stability of training.

If there will be instablity, then increase burn_in=5000 and stopbackward=10000 and decrease learning_rate=0.001

I also disabled policy=sgdr, because it isn't proved that it is better tha policy=steps when LR is lower than 0.01.

scale_x_y = 1.05 also theoretically should increases stability.

Diff:

...

AlexeyAB commented 4 years ago

@WongKinYiu

It seems that RFB without batch-norm is better than RFB-batchnorm. https://github.com/WongKinYiu/CrossStagePartialNetworks/blob/master/coco/results.md
Also it seems additional RFB can increase accuracy only for high resolution PAN-networks like 800x800 or higher, because RFB increases receptieve field. But for 608x608 the receptieve field of final activation of ResNext50+Pannet is already enough even without RFB. Apparently the RBF creates an imbalance for ResNext50+PANet in the ratio Depth/Resolution.
Does new csresnext50sub-spp-asff-bifpn-rfb-db.cfg.txt is stable?
Also it seems CBN batch_normalize=2 is also unstable.

WongKinYiu commented 4 years ago

will start training new csresnext50sub-spp-asff-bifpn-rfb-db.cfg.txt after finish my breakfast.

yes, i think control the receptive field of one-stage detector is very important.

AlexeyAB commented 4 years ago

@WongKinYiu So I think we should train csresnext50sub-spp-asff-bifpn-rfb-db.cfg.txt with 608x608 to get an increase in accuracy from blocks SPP+(BiFPN, ASFF, RFB, ...)

WongKinYiu commented 4 years ago

OK, i ll set the batch size to 64/64 and train it.

update: still gets all zero iou.

AlexeyAB commented 4 years ago

@WongKinYiu

update: still gets all zero iou.

After how many iterations?

Try to train without CBN. I noticed that CBN worse accuracy on most of my models.

I train this cfg-file for 2300 iterations on MS COCO and don't get iou=0 or Nan loss: csresnext50sub-spp-asff-bifpn-rfb-db.cfg.txt (just to know there is max_batches = 50050 steps=40000,45000 instead of max_batches = 500500 steps=400000,450000)

label_smooth_eps=0.1, dynamic_minibatch=1, mosaic=1, BiFPN, ASFF, RFB, DropBlock - do not cause problems.

WongKinYiu commented 4 years ago

about 40 iterations, now i change 608/64/64 back to 416/64/32 and still performs normal at 1500 iterations.

update: becomes all zero at 3xxx iterations.

AlexeyAB commented 4 years ago

@WongKinYiu Nice! Do you currently train CSResNext50-PANet and CSDarknet53-PANet with Mosaic,Genetic,Mish... which are based on the best of these models https://github.com/WongKinYiu/CrossStagePartialNetworks/blob/master/imagenet/results.md ?

WongKinYiu commented 4 years ago

yes, the training of CSPResNext50-PANet-SPP with csresnext50-gamma.cfg pretrained model will finish in 1~2 weeks.

AlexeyAB commented 4 years ago

@WongKinYiu Try to train with 608/64/64 + mosaic=1 dynamic_minibatch=1 label_smooth_eps=0.1 but without CBN, i.e. with batch_normalize=1

I successfully trained such model without Nan or zero-IoU: csresnext50sub-spp-asff-bifpn-rfb-db.cfg.txt

1	2

WongKinYiu commented 4 years ago

@AlexeyAB start training.

AlexeyAB commented 4 years ago

@WongKinYiu

Does BiFPN+ASFF+RFB+DB training go well without Nan/IoU=0?

WongKinYiu commented 4 years ago

@AlexeyAB

i resume training from 2k iterations several times when Nan/IoU=0 occurs, now training go to 7k iterations without Nan/IoU=0.

AlexeyAB commented 4 years ago

@WongKinYiu This is strange, since I didn't get Nan/IoU=0 at all.

WongKinYiu commented 4 years ago

@AlexeyAB hmm... i get IoU=0 3 times of this cfg, i already test previous cfg on cuda 9.0/10.0/10.1/10.2 before, all of training meet same situation.

AlexeyAB commented 4 years ago

@WongKinYiu Maybe this is a temporary phenomenon, which itself will be corrected, and which should not be paid attention to reaching ~10,000 iterations?

syjeon121 commented 4 years ago

@AlexeyAB @WongKinYiu hi i want to use cspdarknet53-panet-spp in this repo readme for custom object training

Screenshot from 2020-04-10 14-19-51

how many layers should i extract from weights file using partial?

WongKinYiu commented 4 years ago

partial to here. https://github.com/AlexeyAB/darknet/blob/master/cfg/cd53paspp-gamma.cfg#L948

sctrueew commented 4 years ago

@WongKinYiu Hi,

What pre-trained should I use for CSPDarknet53-PANet-SPP model?

Thanks

WongKinYiu commented 4 years ago

Hello,

which cfg do you want to use? and is your dataset lager than mscoco?

sctrueew commented 4 years ago

@WongKinYiu Hi,

which cfg do you want to use?

I already used CSPResNeXt50-PANet-SPP and I got a good result but the training time is high and I am going to use CSPDarknet53-PANet-SPP.

and is your dataset lager than mscoco? Yes, I have a big dataset, it's' about 300 classes and 1m images. My dataset includes traffic signs.

Which cfg is it good for this case? the accuracy is importance for me.

Thanks

WongKinYiu commented 4 years ago

for this case, you can use:

CSPDarknet53-PANet-SPP: 512x512 input/42.4 AP/64.5 AP50 [imagenet pretrained] [coco pretrained]
CSPDarknet53-PANet-SPP(Mish): 512x512 input/43.0 AP/64.9 AP50 [imagenet pretrained] [coco pretrained]

If your dataset is larger than mscoco, you can considerate using imagenet pretrained model (partial 104). If you hope the model converge quickly, you can use mscoco pretrained model (partial 135).

sctrueew commented 4 years ago

@WongKinYiu Hi,

My dataset is larger than mscoco. Can I use a 608 network size to get higher accuracy?

Thanks

WongKinYiu commented 4 years ago

does your dataset contains many small object? If yes, training with 608 network size can get higher accuracy definitely.

WongKinYiu / CrossStagePartialNetworks

Try to train fast (grouped-conv) versions of csdarknet53 and csdarknet19 #6