AdamCuellar opened this issue 3 years ago
Yes, YOLOX is interesting work.
@AlexeyAB
I've tested yolov4-csp-swish.cfg using darknet and got the following:
I'm assuming the AP difference is due to running in darknet rather than PyTorch. I'd also like to test FPS using tkDNN, but it currently doesn't support swish. If I have some time, I'll try to add it and see what kind of FPS it gets.
@AdamCuellar
Also, you can try adding `use_cuda_graph = 1` to the `[net]` section in the `yolov4-csp-swish.cfg` file here: https://github.com/AlexeyAB/darknet/blob/d669680879f72e58a5bc4d8de98c2e3c0aab0b62/cfg/yolov4-csp-swish.cfg#L17 and measure FPS with and without the `-benchmark` flag; it should be about +7% faster. Note that it doesn't work for training, so comment it out when training.
@AlexeyAB
As you suggested, adding `use_cuda_graph` increased AVG FPS:
- without `-benchmark` flag: 75.6 -> 76.3
- with `-benchmark` flag: 76.3 -> 81.5
The numbers fluctuate a bit more with this setting but after a couple of trials I saw these averages more frequently.
YOLOX-X (51.2%) is about the same as yolov4-csp-x-swish (51.5%), so is it worth implementing in darknet?
@toplinuxsir Maybe we can just think about adopting some features from YOLOX.
@AdamCuellar
@AlexeyAB Great! Is YOLOR the same as https://github.com/AlexeyAB/darknet/blob/master/cfg/yolov4-csp-x-swish.cfg? Is it already implemented in darknet? Thanks!
@AlexeyAB Hello, I have finished implementing 1) decoupled head, 2) anchor-free, and 3) multi positives in the PyTorch version. SimOTA needs a code refactor. I will share the results after training finishes.
@AdamCuellar Hello, could you help examine the speed of yolov4-csp-x-swish with the same settings as https://github.com/AlexeyAB/darknet/issues/7928#issuecomment-883574378? Thanks in advance.
@WongKinYiu Yes, I will run yolov4-csp-x-swish as well. If you would like me to run the PyTorch code you just implemented I can do that as well if you provide the code.
@AdamCuellar
Thanks.
I just started training my implementation today; currently it does not converge as fast as the original YOLOR. I will implement these features in a Darknet-compatible version if their performance is stable enough.
@AlexeyAB @WongKinYiu
yolov4-csp-x-swish with `use_cuda_graph=1`:
- 54.6 AVG FPS without `-benchmark` flag
- 56.1 AVG FPS with `-benchmark` flag
@WongKinYiu Okay, let us know how it goes, I'm very interested.
(A) YOLOv4-CSP: baseline
(B) A + anchor free: -1.159%
(C) B + multi positive: -1.171%
(D) A + decoupled head: training at 258/300 epochs, maybe +0.5%

I feel the ground-truth assignment problem of anchor-free methods is more complex than that of anchor-based methods.
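For illustration, a rough sketch (not the actual YOLOR code) of what "multi positives" means here, assuming a YOLOX-style center-radius assignment; the helper name and radius are illustrative:

```python
# Sketch of YOLOX-style "multi positives": besides the cell containing the
# ground-truth center, nearby cells within a radius also count as positives,
# which is part of what makes anchor-free assignment more complex.
def positive_cells(cx, cy, stride, grid, radius=2.5):
    """Grid cells treated as positive for a GT box centered at (cx, cy) px."""
    gx, gy = cx / stride, cy / stride  # GT center in grid units
    cells = []
    for j in range(max(0, int(gy - radius)), min(grid, int(gy + radius) + 1)):
        for i in range(max(0, int(gx - radius)), min(grid, int(gx + radius) + 1)):
            # keep cells whose center lies within `radius` of the GT center
            if abs(i + 0.5 - gx) <= radius and abs(j + 0.5 - gy) <= radius:
                cells.append((i, j))
    return cells

print(positive_cells(cx=100.0, cy=100.0, stride=8, grid=80))  # ~5x5 block of cells
```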
@WongKinYiu Hi, does this mean that these approaches are not suitable for YOLOR, or are there some implementation issues?
Also, the benefits of anchor-free approaches are not obvious. I understand why people don't want to re-calculate anchors and re-assign them to different [yolo]-layers. But I don't see an issue with hardcoded anchors like 1x1, 0.5x1, 1x0.5: they are there, they increase accuracy, and you shouldn't need to change them or even know about them.
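As a concrete sketch of that idea (illustrative only, not darknet code): three fixed aspect ratios scaled by a per-[yolo]-layer base size, so users never need to retune anchors:

```python
# Hardcoded anchors from fixed aspect ratios (1x1, 0.5x1, 1x0.5), scaled by a
# hypothetical base size per [yolo] layer -- no k-means re-calculation needed.
RATIOS = [(1.0, 1.0), (0.5, 1.0), (1.0, 0.5)]

def anchors_for_layer(base_px):
    return [(base_px * rw, base_px * rh) for rw, rh in RATIOS]

for base in (64, 128, 256):  # illustrative sizes for small/medium/large layers
    print(anchors_for_layer(base))
```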
They are suitable for YOLOR, but its implicit representations would be added in a different position; I integrated them into YOLOv4-CSP first for a fair comparison.
I have some ideas to improve the integration of YOLOv4 and the anchor-free approach, but it will be different from what YOLOX did. I will keep working on that, though not as a high priority, and I will implement the decoupled head in a Darknet-compatible version first.
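For reference, a rough PyTorch sketch of a decoupled head in the YOLOX sense (illustrative only, not the actual implementation): classification and box/objectness predictions go through separate branches instead of one shared convolution:

```python
import torch.nn as nn

# Sketch of a decoupled detection head: separate branches for classification
# and for box regression + objectness. Channel widths and anchor count are
# illustrative, not taken from any released config.
class DecoupledHead(nn.Module):
    def __init__(self, c_in=256, num_classes=80, num_anchors=3):
        super().__init__()
        self.cls_branch = nn.Sequential(
            nn.Conv2d(c_in, c_in, 3, padding=1), nn.SiLU(),
            nn.Conv2d(c_in, num_anchors * num_classes, 1),  # class scores
        )
        self.reg_branch = nn.Sequential(
            nn.Conv2d(c_in, c_in, 3, padding=1), nn.SiLU(),
            nn.Conv2d(c_in, num_anchors * 5, 1),  # x, y, w, h, objectness
        )

    def forward(self, x):
        return self.cls_branch(x), self.reg_branch(x)
```

The extra convolutions per scale are where the FPS cost discussed later in this thread comes from.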
@WongKinYiu What do you think about YOLOX-tiny (YOLOX-nano) compared with YOLOv4-tiny?
Hello, I think YOLOv4-tiny (3l) is far faster and better than YOLOX-tiny, nanodet, PP-YOLO-tiny...
- YOLOv5s: 36.7% AP, 115 fps (model inference only)
- YOLOXs: 39.6% AP, 102 fps (model inference only)
- YOLOv4s: 38.4% AP, 143 fps (end-to-end inference)
- YOLOv4-tiny (3l): 38.6% AP, 182 fps (end-to-end inference)
> Hello, I think YOLOv4-tiny (3l) is far faster and better than YOLOX-tiny, nanodet, PP-YOLO-tiny...
> - YOLOv5s: 36.7% AP, 115 fps (model inference only)
> - YOLOXs: 39.6% AP, 102 fps (model inference only)
> - YOLOv4s: 38.4% AP, 143 fps (end-to-end inference)
> - YOLOv4-tiny (3l): 38.6% AP, 182 fps (end-to-end inference)

@WongKinYiu Thanks for the quick reply. This is what they put in the paper:
- EfficientDet: few parameters, slow speed on GPU (3.9M, 36.4% AP)
- EfficientDet-Lite: more parameters, faster speed on GPU (4.3M, 26.4% AP)

You could add the EfficientDet-D0 to D3 models to this figure, and you will see what happens: https://github.com/google/automl/tree/master/efficientdet
Got it, thanks a lot.
> (A) YOLOv4-CSP: baseline
> (B) A + anchor free: -1.159%
> (C) B + multi positive: -1.171%
> (D) A + decoupled head: training at 258/300 epochs, maybe +0.5%
>
> I feel the ground-truth assignment problem of anchor-free methods is more complex than that of anchor-based methods.
@WongKinYiu What version of your pytorch implementations do you use for implementing new ideas? I'd like to experiment and compare fairly.
@AlexeyAB I agree hard-coded anchors should work just fine. I rarely see a major improvement over the COCO anchors even with different datasets. Have you tried those values specifically (1x1, 0.5x1, 1x0.5)?
@AdamCuellar Hello, I use https://github.com/WongKinYiu/yolor/tree/paper
(A) YOLOv4-CSP: baseline
(B) A + anchor free: -1.159%
(C) B + multi positive: -1.171%
(D) A + decoupled head: +0.509%

And I have tried those values specifically (1x1, 0.7x1.4, 1.4x0.7); as I remember, it got 0.x% lower AP.
@WongKinYiu
Thank you. Looks like the decoupled head doesn't help too much. When you implement it in Darknet, I'd like to see how it affects the FPS of yolov4-csp.
@AlexeyAB @WongKinYiu
I've tested yolov4-csp-swish and yolov4-csp-x-swish on a V100 with tkDNN:
FP32:
FP16:
Edit:
@AdamCuellar Thanks! Darknet uses mixed-precision FP16/FP32 if we use CUDNN_HALF=1, so tkDNN is faster, as I expected: https://github.com/AlexeyAB/darknet#geforce-rtx-2080-ti
- `yolov4-csp-swish.cfg` with `CUDNN_HALF=1`, `cuda_graph=1` and the `-benchmark` flag: 82 FPS
- `yolov4-csp-x-swish.cfg` with `CUDNN_HALF=1`, `cuda_graph=1` and the `-benchmark` flag: 56 FPS

> Hello, I think YOLOv4-tiny (3l) is far faster and better than YOLOX-tiny, nanodet, PP-YOLO-tiny...
> - YOLOv5s: 36.7% AP, 115 fps (model inference only)
> - YOLOXs: 39.6% AP, 102 fps (model inference only)
> - YOLOv4s: 38.4% AP, 143 fps (end-to-end inference)
> - YOLOv4-tiny (3l): 38.6% AP, 182 fps (end-to-end inference)
@WongKinYiu Hmm, I'm afraid you made a mistake. I think yolov4-tiny (3l)'s 38.6 is mAP@0.5, while all the other results are mAP@0.5:0.95. In your paper, I found the mAP for yolov4-tiny (3l) is 28.7%. So yolov4-tiny should not be compared with these models; instead, it should be compared with YOLOX-tiny, PP-YOLO-tiny, and nanodet.
> @AdamCuellar
> Thanks.
> I just started training my implementation today; currently it does not converge as fast as the original YOLOR. I will implement these features in a Darknet-compatible version if their performance is stable enough.
YOLOX batch size affects a lot. With small batches, it might not converge at all.
@WongKinYiu
Which code base should I use? I tried to run test.py with your YOLOR repo and got the following error on the main branch:
```
line 341, in non_max_suppression
    i = torch.ops.torchvision.nms(boxes, scores, iou_thres)
RuntimeError: Trying to create tensor with negative dimension -1674348992: [-1674348992]
```
For the paper branch I get:
```
line 762, in _legacy_load
    magic_number = pickle_module.load(f, **pickle_load_args)
pickle.UnpicklingError: invalid load key, '\x00'.
```
Is there something I may be doing wrong? This is the command I use:
```
python test.py --weights yolov4-csp-decouple-epoch153.weights --img-size 640 --task test --device 1 --save-json --batch-size 32
```
I also added `--cfg` for the main branch. I've also tried a 0.05 conf threshold and a batch size of 1.
Darknet is running but a bit slow.
@AdamCuellar Hello, just use darknet and run it as you did in https://github.com/AlexeyAB/darknet/issues/7928#issuecomment-887503022.
Okay, I will test tomorrow. I don't have access to V100 at the moment.
@WongKinYiu @AlexeyAB
yolov4-csp-decouple:
I think 0.5%+ AP is not worth the decrease in speed. What do you both think?
Yes, it's mainly because the darknet decoder needs `[x y w h o c] * anchors` as input. If we modify the decoder to accept `[x y w h o] * anchors, [c] * anchors` as input, the inference speed may be okay.
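As a sketch of that layout mismatch (illustrative tensor shapes, not darknet code): the decoder expects all per-anchor predictions interleaved along the channel axis, so a decoupled head's two outputs have to be re-interleaved first:

```python
import torch

# A decoupled head yields [x y w h o] * anchors and [c] * anchors separately;
# darknet's decoder wants one [x y w h o c] * anchors tensor, so the two
# outputs must be re-interleaved along the channel axis.
A, C, H, W = 3, 80, 20, 20                  # anchors, classes, grid size
reg = torch.randn(1, A * 5, H, W)           # box + objectness branch
cls = torch.randn(1, A * C, H, W)           # class branch

coupled = torch.cat(
    [reg.view(1, A, 5, H, W), cls.view(1, A, C, H, W)], dim=2
).view(1, A * (5 + C), H, W)
print(coupled.shape)  # torch.Size([1, 255, 20, 20])
```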
@WongKinYiu Okay I see, will you be implementing this?
If anchor-free can work, this issue will not exist. I will keep looking for a way to make the anchor-free model work.
> Hello, I think YOLOv4-tiny (3l) is far faster and better than YOLOX-tiny, nanodet, PP-YOLO-tiny...
> - YOLOv5s: 36.7% AP, 115 fps (model inference only)
> - YOLOXs: 39.6% AP, 102 fps (model inference only)
> - YOLOv4s: 38.4% AP, 143 fps (end-to-end inference)
> - YOLOv4-tiny (3l): 38.6% AP, 182 fps (end-to-end inference)

> @WongKinYiu Hmm, I'm afraid you made a mistake. I think yolov4-tiny (3l)'s 38.6 is mAP@0.5, while all the other results are mAP@0.5:0.95. In your paper, I found the mAP for yolov4-tiny (3l) is 28.7%. So yolov4-tiny should not be compared with these models; instead, it should be compared with YOLOX-tiny, PP-YOLO-tiny, and nanodet.
I tested YOLOX-tiny:

On the validation data, YOLOX-tiny gets AP@0.5:0.95: 0.3227 and AP@0.5: 0.493.

- input resolution 320x320: 28.7% AP
- input resolution 640x640: 38.6% AP
For YOLOX-tiny, the test resolution is 416x416, according to: https://github.com/Megvii-BaseDetection/YOLOX/blob/a3f1c644aa5a2617a205c43d4b2e72e180ab6eff/exps/default/yolox_tiny.py

For YOLOX-tiny training, the default resolution is 640x640, but it also uses random input sizes of (10,20)*32, so the training resolution ranges from 320 to 640.
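In code terms, that random-size rule is roughly the following (a paraphrase, not the YOLOX source):

```python
import random

# YOLOX-style multi-scale training: every few iterations, pick the input size
# as a random multiple of the stride 32, between 10*32 and 20*32.
size = random.randint(10, 20) * 32  # 320, 352, ..., 640
```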
I also tested YOLOX-Nano:
The inference resolution: 416x416. On the validation data, I got mAP@0.5:0.95: 0.2387 and AP@0.5: 0.39.
YOLOX-Nano uses `depthwise = True`, which is the main difference.
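For context, `depthwise = True` roughly means swapping regular convolutions for depthwise-separable ones; a minimal sketch of the parameter difference (illustrative, not the YOLOX code):

```python
import torch.nn as nn

# Regular 3x3 conv vs. a depthwise-separable replacement: the separable form
# has far fewer parameters, which is why YOLOX-Nano is so small (and why its
# AP drops).
def regular(c_in, c_out):
    return nn.Conv2d(c_in, c_out, 3, padding=1, bias=False)

def separable(c_in, c_out):
    return nn.Sequential(
        nn.Conv2d(c_in, c_in, 3, padding=1, groups=c_in, bias=False),  # depthwise
        nn.Conv2d(c_in, c_out, 1, bias=False),                         # pointwise
    )

params = lambda m: sum(p.numel() for p in m.parameters())
print(params(regular(128, 128)), params(separable(128, 128)))  # 147456 vs 17536
```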
@AlexeyAB Would love to hear your thoughts on YOLOX. It looks like some of the higher-performing variants available via this repo are not mentioned.
pdf: https://arxiv.org/pdf/2107.08430.pdf