SOTA claims vs leaderboards misalignment

LifeIsStrange commented 1 year ago

@WongKinYiu @AlexeyAB Hi friendly pings

YOLOv7 surpasses all known object detectors in both speed and accuracy in the range from 5 FPS to 160 FPS and has the highest accuracy 56.8% AP among all known real-time object detectors with 30 FPS or higher on GPU V100.

Weird claim when you actually rank #20 on COCO. If we exclude all models with extra training data, you still rank #11. The #1 without extra data is Dual-Swin-L (HTC, multi-scale) with 60.1 box AP; with extra data it is DINO (Swin-L, multi-scale) with 63.3 box AP.

AlexeyAB commented 1 year ago

They are much slower than 5 FPS on GPU Tesla V100, and they are not Real-time.

Dual-Swin-L (HTC) 1600x1600 - 59.1% AP - 1.5 FPS V100 - isn't real-time - is 2000% FPS slower than YOLOv7-e6e
Dual-Swin-L(HTC, multi-scale) - 60.1% AP - 0.3 FPS V100 - isn't real-time - is 12000% FPS slower than YOLOv7-e6e
DINO-5scale-R50 (10 FPS, 51.0% AP) is less accurate and 1500% FPS slower than YOLOv7 (161 FPS, 51.2% AP)
DINO (Swin-L, multi-scale) with 63.3 box AP - uses additional training datasets (so not a fair comparison), has no publicly available code or models, is slower than 1 FPS - isn't real-time - is ~10000% slower than YOLOv7-e6e

Both Dual-Swin-L (HTC) and DINO-5scale (R50) are included in Table 9: https://arxiv.org/abs/2207.02696
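For readers who want to sanity-check the slowdown figures above, here is a minimal sketch that turns the quoted FPS numbers into speed ratios. The 36 FPS value for YOLOv7-e6e is an assumption taken from later in this thread, and `speed_ratio` is a hypothetical helper name, not part of the YOLOv7 code.

```python
def speed_ratio(fast_fps: float, slow_fps: float) -> float:
    """Return how many times faster the first detector runs than the second."""
    if slow_fps <= 0:
        raise ValueError("slow_fps must be positive")
    return fast_fps / slow_fps


# FPS figures quoted in this thread (Tesla V100).
# Assumption: 36 FPS for YOLOv7-e6e, as mentioned later in the discussion.
YOLOV7_E6E_FPS = 36.0
YOLOV7_FPS = 161.0

comparisons = {
    "Dual-Swin-L (HTC) 1600x1600": (YOLOV7_E6E_FPS, 1.5),
    "Dual-Swin-L (HTC, multi-scale)": (YOLOV7_E6E_FPS, 0.3),
    "DINO-5scale-R50": (YOLOV7_FPS, 10.0),
}

for name, (yolo_fps, other_fps) in comparisons.items():
    ratio = speed_ratio(yolo_fps, other_fps)
    print(f"{name}: about {ratio:.1f}x slower than the YOLOv7 variant")
```

Under these assumptions the ratios come out at roughly 24x, 120x, and 16x, which is in the same ballpark as the rounded "2000%", "12000%", and "1500%" figures quoted above.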

LifeIsStrange commented 1 year ago

@AlexeyAB Great answer! I can see the significant value proposition of this implementation now :) So how about you update the abstract from

YOLOv7 surpasses all known object detectors

to

YOLOv7 surpasses all known real-time object detectors

Bonus question: how does it compare to the recently announced YOLOv6? https://github.com/meituan/YOLOv6

AlexeyAB commented 1 year ago

YOLOv7 surpasses all known object detectors

to

YOLOv7 surpasses all known real-time object detectors

Real-time is 30 FPS or higher.

YOLOv7 surpasses not only real-time detectors from 30 to 160 FPS, but also non-real-time detectors in the range from 4 to 30 FPS.

more

how does it compare to the recently announced YOLOv6? https://github.com/meituan/YOLOv6

Page 11: https://arxiv.org/pdf/2207.02696.pdf

[Image: yolov6_bad]

LifeIsStrange commented 1 year ago

@AlexeyAB Fair enough, I wish every paper would defend its value as well as you did, in an evidence-based way :). However, it seems to me that YOLOR-D6 beats YOLOv7 (in some FPS range at least). YOLOR-D6 is not YOLOv6; it achieves 57.3% AP, which is 0.5 percentage points more than YOLOv7, and runs at 34 FPS while YOLOv7 runs at 36 FPS, if I understand correctly. Still YOLOR-D6 is using extra training data indeed. But at the end of the day, end users want a fast model with the best accuracy and will generally accept extra training data for pragmatism's sake. Hence the following questions: Do you plan on making a YOLOv7 version with improved accuracy by leveraging extra training data? Secondly, I believe you can improve the state of the art without significantly altering performance by being the first to apply the following very easy-to-use innovations to object detection: https://github.com/lessw2020/Ranger21 or https://github.com/lessw2020/Ranger-Deep-Learning-Optimizer https://arxiv.org/abs/2106.13731

It includes generally applicable innovations that improve accuracy, such as Mish: https://github.com/digantamisra98/Mish The Mish activation function is in most cases the best activation function, often yielding a 0.5-1% accuracy increase for free. Ranger can in addition use gradient centralization, https://github.com/Yonghongwei/Gradient-Centralization which generally also gives free gains. It can then use a synergistic combination of optimizers, such as RAdam in place of Adam https://github.com/LiyuanLucasLiu/RAdam plus the complementary LookAhead https://github.com/michaelrzhang/lookahead and others.

This library makes the integration and selection of optimization passes easy. It is a tragedy that those innovations are generally ignored despite their huge potential to raise SOTA for free in key tasks.
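To make two of the ideas mentioned above concrete, here is a minimal pure-Python sketch of the Mish activation, x * tanh(softplus(x)), and of gradient centralization for a 2D weight gradient. This is not code from Ranger21 or YOLOv7: the function names are hypothetical, and the row-per-output-unit layout is an assumption that follows the usual PyTorch weight convention.

```python
import math
from typing import List


def softplus(x: float) -> float:
    """Numerically stable softplus: log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def mish(x: float) -> float:
    """Mish activation: x * tanh(softplus(x))."""
    return x * math.tanh(softplus(x))


def centralize_gradient(grad: List[List[float]]) -> List[List[float]]:
    """Gradient centralization for a 2D gradient.

    Assumption: each row holds the gradient for one output unit, so the
    mean is removed per row and every row ends up with zero mean.
    """
    centered = []
    for row in grad:
        mean = sum(row) / len(row)
        centered.append([g - mean for g in row])
    return centered


if __name__ == "__main__":
    print([round(mish(v), 4) for v in (-2.0, 0.0, 1.0, 5.0)])
    print(centralize_gradient([[1.0, 2.0, 3.0], [4.0, 4.0, 4.0]]))
```

Mish is smooth and close to the identity for large positive inputs, while the centralized gradient rows each sum to zero. Whether either change yields a measurable gain for YOLOv7 is not something this sketch can show.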

AlexeyAB commented 1 year ago

Still YOLOR-D6 is using extra training data indeed. But at the end of the day, end users want a fast model with the best accuracy and will generally accept extra training data for pragmatism's sake.

If you train your own model on your custom dataset, you will get higher accuracy with YOLOv7 than with YOLOR. And YOLOv7 is faster.

silvada95 commented 1 year ago

What is the definition that you use to define a detector as real-time or not? I have seen a lot of authors mention it in their work, but with no definition at all...

SteTala97 commented 1 year ago

What is the definition that you use to define a detector as real-time or not?

AlexeyAB commented on Jul 10, 2022:

Real-time is 30 FPS or higher.

So, real-time is 30 FPS or higher. This commonly reflects the fact that if your input comes from a 30 FPS camera, or you are processing a video captured by a 30 FPS camera (the most common video frame rate), the model keeps up with the stream and adds no delay between one frame and the next. Of course this also means that if the input rate of your system is e.g. 10 FPS, a model that runs at 10 FPS can be considered "real-time" for your application.
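As a small sketch of that definition, the check below treats a model as real-time when its throughput is at least the input frame rate. The helper names `is_real_time` and `frame_budget_ms` are hypothetical, and the 30 FPS default is simply the threshold quoted in this thread.

```python
def frame_budget_ms(input_fps: float) -> float:
    """Time available per frame, in milliseconds, at the given input rate."""
    if input_fps <= 0:
        raise ValueError("input_fps must be positive")
    return 1000.0 / input_fps


def is_real_time(model_fps: float, input_fps: float = 30.0) -> bool:
    """True if the model processes frames at least as fast as they arrive."""
    return model_fps >= input_fps


if __name__ == "__main__":
    print(f"Budget at 30 FPS: {frame_budget_ms(30.0):.1f} ms per frame")
    print(is_real_time(36.0))                   # faster than a 30 FPS camera
    print(is_real_time(1.5))                    # far slower than a 30 FPS camera
    print(is_real_time(10.0, input_fps=10.0))   # matches a 10 FPS input
```

This only compares average throughput; it ignores latency spikes, batching, and pre- or post-processing time, which a real deployment would also need to measure.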

WongKinYiu / yolov7

SOTA claims vs leaderboards misalignment #40