AlexeyAB / darknet

YOLOv4 / Scaled-YOLOv4 / YOLO - Neural Networks for Object Detection (Windows and Linux version of Darknet)
http://pjreddie.com/darknet/

Thoughts on YOLOX #7928

Open AdamCuellar opened 3 years ago

AdamCuellar commented 3 years ago

@AlexeyAB would love to hear your thoughts on YOLOX. It looks like some of the higher-performing variants available via this repo aren't mentioned.

pdf: https://arxiv.org/pdf/2107.08430.pdf

AlexeyAB commented 3 years ago

Yes, YOLOX is interesting work.

(attached figure)
AdamCuellar commented 3 years ago

@AlexeyAB

I've tested yolov4-csp-swish.cfg using darknet and got the following:

I'm assuming the AP difference is due to running in darknet rather than PyTorch. I'd also like to test FPS using tkDNN, but it currently doesn't support swish. If I have some time, I'll try to add it and see what kind of FPS it gets.

AlexeyAB commented 3 years ago

@AdamCuellar

You can also try adding use_cuda_graph = 1 to the [net] section in the yolov4-csp-swish.cfg file here: https://github.com/AlexeyAB/darknet/blob/d669680879f72e58a5bc4d8de98c2e3c0aab0b62/cfg/yolov4-csp-swish.cfg#L17 and measure FPS with and without the -benchmark flag; it should be about +7% faster.

Note that it doesn't work for training, so comment it out when training.
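For reference, a minimal sketch of the relevant [net] lines and a typical benchmark invocation (the width/height values and file names below are illustrative, not taken from this thread):

```
[net]
# enable CUDA Graphs for faster inference; comment out when training
use_cuda_graph = 1
width = 640
height = 640
```

```
./darknet detector demo cfg/coco.data cfg/yolov4-csp-swish.cfg yolov4-csp-swish.weights test.mp4 -benchmark
```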

AdamCuellar commented 3 years ago

@AlexeyAB

As you suggested, adding use_cuda_graph increased AVG FPS:
- without -benchmark flag: 75.6 -> 76.3
- with -benchmark flag: 76.3 -> 81.5

The numbers fluctuate a bit more with this setting but after a couple of trials I saw these averages more frequently.

toplinuxsir commented 3 years ago

yolox-x (51.2%) is about the same as yolov4-csp-x-swish (51.5%), so is it worth implementing in darknet?

AlexeyAB commented 3 years ago

@toplinuxsir Maybe we can just think about adopting some features from YOLOX.

AlexeyAB commented 3 years ago

@AdamCuellar

(attached figure)

toplinuxsir commented 3 years ago

@AlexeyAB Great! Is YOLOR https://github.com/AlexeyAB/darknet/blob/master/cfg/yolov4-csp-x-swish.cfg? Is it already implemented in darknet? Thanks!

WongKinYiu commented 3 years ago


@AlexeyAB Hello, I have finished implementing 1) decoupled head, 2) anchor-free, and 3) multi positives in the PyTorch version. SimOTA needs a code refactor. I will share the information after training finishes.

@AdamCuellar Hello, could you help examine the speed of yolov4-csp-x-swish with the same settings as https://github.com/AlexeyAB/darknet/issues/7928#issuecomment-883574378 ? Thanks in advance.

AdamCuellar commented 3 years ago

@WongKinYiu Yes, I will run yolov4-csp-x-swish as well. If you'd like me to run the PyTorch code you just implemented, I can do that too if you provide the code.

WongKinYiu commented 3 years ago

@AdamCuellar

Thanks.

I just started training my implementation today; currently it does not converge as fast as the original YOLOR. I will implement these features in a Darknet-compatible version if their performance is stable enough.

AdamCuellar commented 3 years ago

@AlexeyAB @WongKinYiu

yolov4-csp-x-swish with use_cuda_graph=1:
- 54.6 AVG FPS without -benchmark flag
- 56.1 AVG FPS with -benchmark flag

@WongKinYiu Okay, let us know how it goes, I'm very interested.

WongKinYiu commented 3 years ago

(A) YOLOv4-CSP: baseline
(B) A + anchor free: -1.159%
(C) B + multi positive: -1.171%
(D) A + decoupled head: at epoch 258/300, maybe +0.5%

I feel the ground-truth assignment problem of anchor-free methods is more complex than that of anchor-based methods.

AlexeyAB commented 3 years ago

@WongKinYiu Hi, does it mean that these approaches are not suitable for YOLOR, or are there some implementation issues?

Also, the benefits of anchor-free approaches are not obvious. I understand why people don't want to re-calculate anchors and re-assign them to different [yolo] layers. But I don't see an issue with hardcoded anchors like 1x1, 0.5x1, 1x0.5: they are just there, they increase accuracy, and you shouldn't have to change them or even know about them.
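For illustration, fixed-ratio anchors like these could be hardcoded in a [yolo] section as width,height pairs (the pixel values below are hypothetical, chosen for a 640x640 input; they are not from any shipped cfg):

```
[yolo]
mask = 0,1,2
# three fixed aspect ratios at one scale: 1x1, 0.5x1, 1x0.5
anchors = 64,64, 32,64, 64,32
num = 3
```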

WongKinYiu commented 3 years ago

They are suitable for YOLOR, but its implicit representations would be added in a different position, so I integrated them into YOLOv4-CSP first for a fair comparison.

I have some ideas to improve the integration of YOLOv4 and the anchor-free approach, but it will be different from what YOLOX did. I will keep working on that, though not at high priority, and I will implement the decoupled head in a Darknet-compatible version first.

liminghu commented 3 years ago

@WongKinYiu What do you think about YOLOX-tiny (YOLOX-nano) compared with YOLOv4-tiny?

WongKinYiu commented 3 years ago

Hello, I think YOLOv4-tiny (3l) is far faster and better than YOLOX-tiny, nanodet, PP-YOLO-tiny...

YOLOv5s: 36.7% AP, 115 fps (model inference only)
YOLOXs: 39.6% AP, 102 fps (model inference only)
YOLOv4s: 38.4% AP, 143 fps (end-to-end inference)
YOLOv4-tiny (3l): 38.6% AP, 182 fps (end-to-end inference)

liminghu commented 3 years ago

> Hello, I think YOLOv4-tiny (3l) is far faster and better than YOLOX-tiny, nanodet, PP-YOLO-tiny...
>
> YOLOv5s: 36.7% AP, 115 fps (model inference only)
> YOLOXs: 39.6% AP, 102 fps (model inference only)
> YOLOv4s: 38.4% AP, 143 fps (end-to-end inference)
> YOLOv4-tiny (3l): 38.6% AP, 182 fps (end-to-end inference)

@WongKinYiu Thanks for the quick reply. This is what they put in the paper: (figure from the YOLOX paper)

WongKinYiu commented 3 years ago

EfficientDet: few parameters, slow speed on GPU (3.9M, 36.4% AP)
EfficientDet-Lite: more parameters, faster speed on GPU (4.3M, 26.4% AP)

You could add the EfficientDet-D0 to D3 models to this figure, and you will see what happens: https://github.com/google/automl/tree/master/efficientdet

liminghu commented 3 years ago

> EfficientDet: few parameters, slow speed on GPU (3.9M, 36.4% AP)
> EfficientDet-Lite: more parameters, faster speed on GPU (4.3M, 26.4% AP)
>
> You could add the EfficientDet-D0 to D3 models to this figure, and you will see what happens: https://github.com/google/automl/tree/master/efficientdet

Got it, thanks a lot.

AdamCuellar commented 3 years ago

> (A) YOLOv4-CSP: baseline
> (B) A + anchor free: -1.159%
> (C) B + multi positive: -1.171%
> (D) A + decoupled head: at epoch 258/300, maybe +0.5%
>
> I feel the ground-truth assignment problem of anchor-free methods is more complex than that of anchor-based methods.

@WongKinYiu Which version of your PyTorch implementation do you use for implementing new ideas? I'd like to experiment and compare fairly.

@AlexeyAB I agree hard-coded anchors should work just fine. I rarely see a major improvement over the COCO anchors even with different datasets. Have you tried those values specifically (1x1, 0.5x1, 1x0.5)?

WongKinYiu commented 3 years ago

@AdamCuellar Hello, I use https://github.com/WongKinYiu/yolor/tree/paper

(A) YOLOv4-CSP: baseline
(B) A + anchor free: -1.159%
(C) B + multi positive: -1.171%
(D) A + decoupled head: +0.509%

And I have tried similar values (1x1, 0.7x1.4, 1.4x0.7); as I remember, it got 0.x% lower AP.

AdamCuellar commented 3 years ago

> @AdamCuellar Hello, I use https://github.com/WongKinYiu/yolor/tree/paper
>
> (A) YOLOv4-CSP: baseline
> (B) A + anchor free: -1.159%
> (C) B + multi positive: -1.171%
> (D) A + decoupled head: +0.509%
>
> And I have tried similar values (1x1, 0.7x1.4, 1.4x0.7); as I remember, it got 0.x% lower AP.

@WongKinYiu

Thank you. It looks like the decoupled head doesn't help too much. When you implement it in Darknet, I'd like to see how it affects the FPS of yolov4-csp.

AdamCuellar commented 3 years ago

@AlexeyAB @WongKinYiu

I've tested yolov4-csp-swish and yolov4-csp-x-swish on a V100 with tkDNN:

FP32: (results table)

FP16: (results table)

AlexeyAB commented 3 years ago

@AdamCuellar Thanks! Darknet uses mixed FP16/FP32 precision when built with CUDNN_HALF=1, so tkDNN is faster, as I expected: https://github.com/AlexeyAB/darknet#geforce-rtx-2080-ti
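For reference, CUDNN_HALF is a build-time option; a typical darknet build enabling mixed precision looks like this (the other flags are the usual GPU build options from the README, shown here as an assumed setup):

```
make GPU=1 CUDNN=1 CUDNN_HALF=1 OPENCV=1
```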


yolov4-csp-swish.cfg: (results table)

yolov4-csp-x-swish.cfg: (results table)

menggui1993 commented 3 years ago

> Hello, I think YOLOv4-tiny (3l) is far faster and better than YOLOX-tiny, nanodet, PP-YOLO-tiny...
>
> YOLOv5s: 36.7% AP, 115 fps (model inference only)
> YOLOXs: 39.6% AP, 102 fps (model inference only)
> YOLOv4s: 38.4% AP, 143 fps (end-to-end inference)
> YOLOv4-tiny (3l): 38.6% AP, 182 fps (end-to-end inference)

@WongKinYiu Hmm, I'm afraid you made a mistake. I think yolov4-tiny (3l)'s 38.6 is mAP@0.5, while all the other results are mAP@0.5:0.95. In your paper, I found the mAP for yolov4-tiny (3l) is 28.7%. So yolov4-tiny should not be compared with these models; instead, it should be compared with yolox-tiny, ppyolo-tiny, and nanodet.

zhiliu6 commented 3 years ago

> @AdamCuellar
>
> Thanks.
>
> I just started training my implementation today; currently it does not converge as fast as the original YOLOR. I will implement these features in a Darknet-compatible version if their performance is stable enough.

YOLOX's batch size matters a lot; with small batches, it might not converge at all.

WongKinYiu commented 3 years ago

@AdamCuellar Hello,

[cfg] [weights] Training is not yet finished, but you could test the speed of the yolov4-csp-decouple model using this cfg and these weights.

AdamCuellar commented 3 years ago

@WongKinYiu

Which code base should I use? I tried to run test.py with your YOLOR repo and got the following error on the main branch:

line 341, in non_max_suppression
    i = torch.ops.torchvision.nms(boxes, scores, iou_thres)
RuntimeError: Trying to create tensor with negative dimension -1674348992: [-1674348992]

For the paper branch I get:

line 762, in _legacy_load
    magic_number = pickle_module.load(f, **pickle_load_args)
pickle.UnpicklingError: invalid load key, '\x00'.

Is there something I may be doing wrong? This is the command I use:

python test.py --weights yolov4-csp-decouple-epoch153.weights --img-size 640 --task test --device 1 --save-json --batch-size 32

I also add --cfg for the main branch. I've also tried a 0.05 conf threshold and a batch size of 1.

Darknet is running, but it's a bit slow.

WongKinYiu commented 3 years ago

@AdamCuellar Hello, just use darknet and run it as you did in https://github.com/AlexeyAB/darknet/issues/7928#issuecomment-887503022 .

AdamCuellar commented 3 years ago

> @AdamCuellar Hello, just use darknet and run it as you did in #7928 (comment).

Okay, I will test tomorrow. I don't have access to a V100 at the moment.

AdamCuellar commented 3 years ago

@WongKinYiu @AlexeyAB

yolov4-csp-decouple: (results table)

I think +0.5% AP is not worth the decrease in speed. What do you both think?

WongKinYiu commented 3 years ago

Yes, it is mainly because the darknet decoder needs [x y w h o c] * anchors as input. If we modify the decoder to accept [x y w h o] * anchors and [c] * anchors as separate inputs, the inference speed may be okay.
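To illustrate the layout difference, here is a hypothetical numpy sketch of the two decoder input formats (not darknet's actual decoder code; the shapes and names are made up for illustration):

```python
import numpy as np

n_anchors, n_classes, h, w = 3, 80, 20, 20

# Coupled head: one tensor, anchors * [x, y, w, h, obj, class scores...]
coupled = np.random.rand(n_anchors * (5 + n_classes), h, w)
per_anchor = coupled.reshape(n_anchors, 5 + n_classes, h, w)
boxes_obj, cls_scores = per_anchor[:, :5], per_anchor[:, 5:]

# Decoupled head: boxes/objectness and class scores come from separate branches,
# so the decoder would have to accept [x y w h o] * anchors and [c] * anchors.
box_branch = np.random.rand(n_anchors * 5, h, w).reshape(n_anchors, 5, h, w)
cls_branch = np.random.rand(n_anchors * n_classes, h, w).reshape(n_anchors, n_classes, h, w)
```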

AdamCuellar commented 3 years ago

> Yes, it is mainly because the darknet decoder needs [x y w h o c] * anchors as input. If we modify the decoder to accept [x y w h o] * anchors and [c] * anchors as separate inputs, the inference speed may be okay.

@WongKinYiu Okay, I see. Will you be implementing this?

WongKinYiu commented 3 years ago

If anchor-free can work, this issue will not exist. I will keep looking for a way to make the anchor-free model work.

liminghu commented 3 years ago

> Hello, I think YOLOv4-tiny (3l) is far faster and better than YOLOX-tiny, nanodet, PP-YOLO-tiny...
>
> YOLOv5s: 36.7% AP, 115 fps (model inference only)
> YOLOXs: 39.6% AP, 102 fps (model inference only)
> YOLOv4s: 38.4% AP, 143 fps (end-to-end inference)
> YOLOv4-tiny (3l): 38.6% AP, 182 fps (end-to-end inference)

> @WongKinYiu Hmm, I'm afraid you made a mistake. I think yolov4-tiny (3l)'s 38.6 is mAP@0.5, while all the other results are mAP@0.5:0.95. In your paper, I found the mAP for yolov4-tiny (3l) is 28.7%. So yolov4-tiny should not be compared with these models; instead, it should be compared with yolox-tiny, ppyolo-tiny, and nanodet.

I tested YOLOX-tiny: (results screenshot)

On the validation data, YOLOX-tiny gets AP@0.5:0.95: 0.3227 and AP@0.5: 0.493.

WongKinYiu commented 3 years ago

input resolution 320x320: 28.7% AP
input resolution 640x640: 38.6% AP

liminghu commented 3 years ago

For YOLOX-tiny, the test resolution is 416x416, according to: https://github.com/Megvii-BaseDetection/YOLOX/blob/a3f1c644aa5a2617a205c43d4b2e72e180ab6eff/exps/default/yolox_tiny.py

liminghu commented 3 years ago

For YOLOX-tiny training, the default resolution is 640x640, but it also uses random sizes of (10..20)*32, so the training resolution ranges from 320 to 640.
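As a quick sanity check of that range, here is a hypothetical one-liner mirroring a random_size = (10, 20) setting:

```python
# multiples of 32 drawn from 10..20 give square inputs from 320 to 640
sizes = [32 * k for k in range(10, 21)]
print(min(sizes), max(sizes))  # 320 640
```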

liminghu commented 3 years ago

I also tested YOLOX-Nano: (results screenshot)

The inference resolution is 416x416. On the validation data, I got mAP@0.5:0.95: 0.2387 and AP@0.5: 0.39.

liminghu commented 3 years ago

YOLOX-nano uses depthwise = True, which is the main difference.
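For context, a minimal PyTorch sketch of why a depthwise-separable convolution (what the depthwise option enables) cuts parameters versus a standard convolution; the channel counts here are arbitrary examples, not YOLOX-nano's actual widths:

```python
import torch.nn as nn

# Standard 3x3 conv: in_ch * out_ch * 3 * 3 weights.
standard = nn.Conv2d(128, 128, kernel_size=3, padding=1, bias=False)

# Depthwise separable: per-channel 3x3 conv, then a 1x1 pointwise conv.
separable = nn.Sequential(
    nn.Conv2d(128, 128, kernel_size=3, padding=1, groups=128, bias=False),
    nn.Conv2d(128, 128, kernel_size=1, bias=False),
)

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(standard), count(separable))  # 147456 vs 17536
```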