AdamCuellar opened this issue 3 years ago
Yes, YOLOX is interesting work.
@AlexeyAB
I've tested yolov4-csp-swish.cfg using darknet and got the following:
I'm assuming the AP difference is due to running in darknet rather than PyTorch. I'd also like to test FPS using tkDNN, but it currently doesn't support swish. If I have some time, I'll try to add it and see what kind of FPS it gets.
@AdamCuellar
Also, you can try adding `use_cuda_graph = 1` to the `[net]` section in the `yolov4-csp-swish.cfg` file here: https://github.com/AlexeyAB/darknet/blob/d669680879f72e58a5bc4d8de98c2e3c0aab0b62/cfg/yolov4-csp-swish.cfg#L17 and measure FPS with and without the `-benchmark` flag; it should be about +7% faster. Note that it doesn't work for training, so comment it out when training.
@AlexeyAB
As you suggested, adding `use_cuda_graph` increased AVG FPS:
- without `-benchmark` flag: 75.6 -> 76.3
- with `-benchmark` flag: 76.3 -> 81.5
The numbers fluctuate a bit more with this setting but after a couple of trials I saw these averages more frequently.
YOLOX-X (51.2%) is about the same as yolov4-csp-x-swish (51.5%), so is it worth implementing in darknet?
@toplinuxsir Maybe we can just think about adopting some features from YOLOX.
@AdamCuellar
@AlexeyAB Great! Is YOLOR the same as https://github.com/AlexeyAB/darknet/blob/master/cfg/yolov4-csp-x-swish.cfg? Is it already implemented in darknet? Thanks!
@AlexeyAB Hello, I have finished implementing 1) decoupled head, 2) anchor-free, and 3) multi positives in the PyTorch version. SimOTA needs a code refactor. I will share the results after training finishes.
@AdamCuellar Hello, could you help examine the speed of yolov4-csp-x-swish with the same settings as https://github.com/AlexeyAB/darknet/issues/7928#issuecomment-883574378? Thanks in advance.
@WongKinYiu Yes, I will run yolov4-csp-x-swish as well. If you would like me to run the PyTorch code you just implemented I can do that as well if you provide the code.
@AdamCuellar
Thanks.
I just started training my implementation today; currently it does not converge as fast as the original YOLOR. I will implement these features in a Darknet-compatible version if their performance is stable enough.
@AlexeyAB @WongKinYiu
yolov4-csp-x-swish with `use_cuda_graph=1`:
- 54.6 AVG FPS without `-benchmark` flag
- 56.1 AVG FPS with `-benchmark` flag
@WongKinYiu Okay, let us know how it goes, I'm very interested.
(A) YOLOv4-CSP: baseline
(B) A + anchor free: -1.159%
(C) B + multi positive: -1.171%
(D) A + decoupled head: training at 258/300 epochs, maybe +0.5%

I feel the ground-truth assignment problem of anchor-free methods is more complex than that of anchor-based methods.
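For illustration, a rough sketch (not the actual YOLOR code) of what "multi positives" means here, assuming a YOLOX-style center-radius assignment; the helper name and radius are illustrative:

```python
# Sketch of YOLOX-style "multi positives": besides the cell containing the
# ground-truth center, nearby cells within a radius also count as positives,
# which is part of what makes anchor-free assignment more complex.
def positive_cells(cx, cy, stride, grid, radius=2.5):
    """Grid cells treated as positive for a GT box centered at (cx, cy) px."""
    gx, gy = cx / stride, cy / stride  # GT center in grid units
    cells = []
    for j in range(max(0, int(gy - radius)), min(grid, int(gy + radius) + 1)):
        for i in range(max(0, int(gx - radius)), min(grid, int(gx + radius) + 1)):
            # keep cells whose center lies within `radius` of the GT center
            if abs(i + 0.5 - gx) <= radius and abs(j + 0.5 - gy) <= radius:
                cells.append((i, j))
    return cells

print(positive_cells(cx=100.0, cy=100.0, stride=8, grid=80))  # ~5x5 block of cells
```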
@WongKinYiu Hi, does this mean that these approaches are not suitable for YOLOR, or are there some implementation issues?
Also, the benefits of anchor-free approaches are not obvious. I understand why people don't want to re-calculate anchors and re-assign them to different [yolo]-layers. But I don't see an issue with hardcoded anchors like 1x1, 0.5x1, 1x0.5: they are there, they increase accuracy, and you shouldn't need to change them or even know about them.
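As a concrete sketch of that idea (illustrative only, not darknet code): three fixed aspect ratios scaled by a per-[yolo]-layer base size, so users never need to retune anchors:

```python
# Hardcoded anchors from fixed aspect ratios (1x1, 0.5x1, 1x0.5), scaled by a
# hypothetical base size per [yolo] layer -- no k-means re-calculation needed.
RATIOS = [(1.0, 1.0), (0.5, 1.0), (1.0, 0.5)]

def anchors_for_layer(base_px):
    return [(base_px * rw, base_px * rh) for rw, rh in RATIOS]

for base in (64, 128, 256):  # illustrative sizes for small/medium/large layers
    print(anchors_for_layer(base))
```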
They are suitable for YOLOR, but its implicit representations would be added in a different position; I integrated them into YOLOv4-CSP first for a fair comparison.
I have some ideas to improve the integration of YOLOv4 and the anchor-free approach, but it will be different from what YOLOX did. I will keep working on that, though not as a high priority, and I will implement the decoupled head in a Darknet-compatible version first.
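For reference, a rough PyTorch sketch of a decoupled head in the YOLOX sense (illustrative only, not the actual implementation): classification and box/objectness predictions go through separate branches instead of one shared convolution:

```python
import torch.nn as nn

# Sketch of a decoupled detection head: separate branches for classification
# and for box regression + objectness. Channel widths and anchor count are
# illustrative, not taken from any released config.
class DecoupledHead(nn.Module):
    def __init__(self, c_in=256, num_classes=80, num_anchors=3):
        super().__init__()
        self.cls_branch = nn.Sequential(
            nn.Conv2d(c_in, c_in, 3, padding=1), nn.SiLU(),
            nn.Conv2d(c_in, num_anchors * num_classes, 1),  # class scores
        )
        self.reg_branch = nn.Sequential(
            nn.Conv2d(c_in, c_in, 3, padding=1), nn.SiLU(),
            nn.Conv2d(c_in, num_anchors * 5, 1),  # x, y, w, h, objectness
        )

    def forward(self, x):
        return self.cls_branch(x), self.reg_branch(x)
```

The extra convolutions per scale are where the FPS cost discussed later in this thread comes from.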
@WongKinYiu What do you think about YOLOX-tiny (YOLOX-nano) compared with YOLOv4-tiny?
Hello, I think YOLOv4-tiny (3l) is far faster and better than YOLOX-tiny, nanodet, PP-YOLO-tiny...
- YOLOv5s: 36.7% AP, 115 fps (model inference only)
- YOLOXs: 39.6% AP, 102 fps (model inference only)
- YOLOv4s: 38.4% AP, 143 fps (end-to-end inference)
- YOLOv4-tiny (3l): 38.6% AP, 182 fps (end-to-end inference)
> Hello, I think YOLOv4-tiny (3l) is far faster and better than YOLOX-tiny, nanodet, PP-YOLO-tiny...
> - YOLOv5s: 36.7% AP, 115 fps (model inference only)
> - YOLOXs: 39.6% AP, 102 fps (model inference only)
> - YOLOv4s: 38.4% AP, 143 fps (end-to-end inference)
> - YOLOv4-tiny (3l): 38.6% AP, 182 fps (end-to-end inference)

@WongKinYiu Thanks for the quick reply. This is what they put in the paper:
- EfficientDet: few parameters, slow speed on GPU (3.9M, 36.4% AP)
- EfficientDet-Lite: more parameters, faster speed on GPU (4.3M, 26.4% AP)

You could add the EfficientDet-D0 to D3 models to this figure, and you will see what happens: https://github.com/google/automl/tree/master/efficientdet
Got it, thanks a lot.
> (A) YOLOv4-CSP: baseline
> (B) A + anchor free: -1.159%
> (C) B + multi positive: -1.171%
> (D) A + decoupled head: training at 258/300 epochs, maybe +0.5%
>
> I feel the ground-truth assignment problem of anchor-free methods is more complex than that of anchor-based methods.
@WongKinYiu What version of your pytorch implementations do you use for implementing new ideas? I'd like to experiment and compare fairly.
@AlexeyAB I agree hard-coded anchors should work just fine. I rarely see a major improvement over the COCO anchors even with different datasets. Have you tried those values specifically (1x1, 0.5x1, 1x0.5)?
@AdamCuellar Hello, I use https://github.com/WongKinYiu/yolor/tree/paper
(A) YOLOv4-CSP: baseline
(B) A + anchor free: -1.159%
(C) B + multi positive: -1.171%
(D) A + decoupled head: +0.509%

And I have tried those values specifically (1x1, 0.7x1.4, 1.4x0.7); as I remember, it got 0.x% lower AP.
@WongKinYiu
Thank you. Looks like the decoupled head doesn't help too much. When you implement it in Darknet, I'd like to see how it affects the FPS of yolov4-csp.
@AlexeyAB @WongKinYiu
I've tested yolov4-csp-swish and yolov4-csp-x-swish on a V100 with tkDNN:
FP32:
FP16:
Edit:
@AdamCuellar Thanks! Darknet uses mixed-precision FP16/FP32 if we use CUDNN_HALF=1, so tkDNN is faster, as I expected: https://github.com/AlexeyAB/darknet#geforce-rtx-2080-ti
- `yolov4-csp-swish.cfg` with `CUDNN_HALF=1`, `cuda_graph=1` and the `-benchmark` flag: 82 FPS
- `yolov4-csp-x-swish.cfg` with `CUDNN_HALF=1`, `cuda_graph=1` and the `-benchmark` flag: 56 FPS

> Hello, I think YOLOv4-tiny (3l) is far faster and better than YOLOX-tiny, nanodet, PP-YOLO-tiny...
> - YOLOv5s: 36.7% AP, 115 fps (model inference only)
> - YOLOXs: 39.6% AP, 102 fps (model inference only)
> - YOLOv4s: 38.4% AP, 143 fps (end-to-end inference)
> - YOLOv4-tiny (3l): 38.6% AP, 182 fps (end-to-end inference)
@WongKinYiu Hmm, I'm afraid you made a mistake. I think yolov4-tiny (3l)'s 38.6 is mAP@0.5, while all the other results are mAP@0.5:0.95. In your paper, I found the mAP for yolov4-tiny (3l) is 28.7%. So yolov4-tiny should not be compared with these models; instead, it should be compared with YOLOX-tiny, PP-YOLO-tiny, and nanodet.
> @AdamCuellar
> Thanks.
> I just started training my implementation today; currently it does not converge as fast as the original YOLOR. I will implement these features in a Darknet-compatible version if their performance is stable enough.
YOLOX batch size affects a lot. With small batches, it might not converge at all.
@WongKinYiu
Which code base should I use? I tried to run test.py with your YOLOR repo and got the following error on the main branch:
```
line 341, in non_max_suppression
    i = torch.ops.torchvision.nms(boxes, scores, iou_thres)
RuntimeError: Trying to create tensor with negative dimension -1674348992: [-1674348992]
```
For the paper branch I get:
```
line 762, in _legacy_load
    magic_number = pickle_module.load(f, **pickle_load_args)
pickle.UnpicklingError: invalid load key, '\x00'.
```
Is there something I may be doing wrong? This is the command I use:
```
python test.py --weights yolov4-csp-decouple-epoch153.weights --img-size 640 --task test --device 1 --save-json --batch-size 32
```
I also added `--cfg` for the main branch. I've also tried a 0.05 conf threshold and a batch size of 1.
Darknet is running but a bit slow.
@AdamCuellar Hello, just use darknet and run it as you did in https://github.com/AlexeyAB/darknet/issues/7928#issuecomment-887503022.
Okay, I will test tomorrow. I don't have access to V100 at the moment.
@WongKinYiu @AlexeyAB
yolov4-csp-decouple:
I think 0.5%+ AP is not worth the decrease in speed. What do you both think?
Yes, it's mainly because the darknet decoder needs `[x y w h o c] * anchors` as input. If we modify the decoder to accept `[x y w h o] * anchors, [c] * anchors` as input, the inference speed may be okay.
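As a sketch of that layout mismatch (illustrative tensor shapes, not darknet code): the decoder expects all per-anchor predictions interleaved along the channel axis, so a decoupled head's two outputs have to be re-interleaved first:

```python
import torch

# A decoupled head yields [x y w h o] * anchors and [c] * anchors separately;
# darknet's decoder wants one [x y w h o c] * anchors tensor, so the two
# outputs must be re-interleaved along the channel axis.
A, C, H, W = 3, 80, 20, 20                  # anchors, classes, grid size
reg = torch.randn(1, A * 5, H, W)           # box + objectness branch
cls = torch.randn(1, A * C, H, W)           # class branch

coupled = torch.cat(
    [reg.view(1, A, 5, H, W), cls.view(1, A, C, H, W)], dim=2
).view(1, A * (5 + C), H, W)
print(coupled.shape)  # torch.Size([1, 255, 20, 20])
```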
@WongKinYiu Okay I see, will you be implementing this?
If anchor-free can work, this issue will not exist. I will keep looking for a way to make the anchor-free model work.
> Hello, I think YOLOv4-tiny (3l) is far faster and better than YOLOX-tiny, nanodet, PP-YOLO-tiny...
> - YOLOv5s: 36.7% AP, 115 fps (model inference only)
> - YOLOXs: 39.6% AP, 102 fps (model inference only)
> - YOLOv4s: 38.4% AP, 143 fps (end-to-end inference)
> - YOLOv4-tiny (3l): 38.6% AP, 182 fps (end-to-end inference)

> @WongKinYiu Hmm, I'm afraid you made a mistake. I think yolov4-tiny (3l)'s 38.6 is mAP@0.5, while all the other results are mAP@0.5:0.95. In your paper, I found the mAP for yolov4-tiny (3l) is 28.7%. So yolov4-tiny should not be compared with these models; instead, it should be compared with YOLOX-tiny, PP-YOLO-tiny, and nanodet.
I tested YOLOX-tiny:

On the validation data, YOLOX-tiny gets AP@0.5:0.95: 0.3227 and AP@0.5: 0.493.

- input resolution 320x320: 28.7% AP
- input resolution 640x640: 38.6% AP
For YOLOX-tiny, the test resolution is 416x416, according to: https://github.com/Megvii-BaseDetection/YOLOX/blob/a3f1c644aa5a2617a205c43d4b2e72e180ab6eff/exps/default/yolox_tiny.py

For YOLOX-tiny training, the default resolution is 640x640, but it also uses random input sizes of (10,20)*32, so the training resolution ranges from 320 to 640.
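In code terms, that random-size rule is roughly the following (a paraphrase, not the YOLOX source):

```python
import random

# YOLOX-style multi-scale training: every few iterations, pick the input size
# as a random multiple of the stride 32, between 10*32 and 20*32.
size = random.randint(10, 20) * 32  # 320, 352, ..., 640
```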
I also tested YOLOX-Nano:
The inference resolution: 416x416. On the validation data, I got mAP@0.5:0.95: 0.2387 and AP@0.5: 0.39.
YOLOX-Nano uses `depthwise = True`, which is the main difference.
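For context, `depthwise = True` roughly means swapping regular convolutions for depthwise-separable ones; a minimal sketch of the parameter difference (illustrative, not the YOLOX code):

```python
import torch.nn as nn

# Regular 3x3 conv vs. a depthwise-separable replacement: the separable form
# has far fewer parameters, which is why YOLOX-Nano is so small (and why its
# AP drops).
def regular(c_in, c_out):
    return nn.Conv2d(c_in, c_out, 3, padding=1, bias=False)

def separable(c_in, c_out):
    return nn.Sequential(
        nn.Conv2d(c_in, c_in, 3, padding=1, groups=c_in, bias=False),  # depthwise
        nn.Conv2d(c_in, c_out, 1, bias=False),                         # pointwise
    )

params = lambda m: sum(p.numel() for p in m.parameters())
print(params(regular(128, 128)), params(separable(128, 128)))  # 147456 vs 17536
```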
@AlexeyAB Would love to hear your thoughts on YOLOX. It looks like some of the higher-performing variants available via this repo are not mentioned.
pdf: https://arxiv.org/pdf/2107.08430.pdf