NicholasZollo opened this issue 2 years ago
I guess you are running batch-32 inference. For batch-32 inference, the paper reports an average inference time of 2.8 ms for YOLOv7 and 1.7 ms for YOLOv5m.
I tried running batch-size-1 inference; it increased the inference time for both models, but YOLOv7 still did not run faster than YOLOv5m. Is my method of speed testing correct, i.e. running test.py (YOLOv7) and val.py (YOLOv5) with the --task speed flag?
What inference times do you get on yolov7-tiny, yolov7, yolov5n, yolov5s, yolov5m, and yolov5l?
I am experiencing the same thing. I used the settings below:
python test.py --data data/test_yolo.yaml --img 640 --batch 1 --conf 0.001 --iou 0.65 --device cpu --weights yolov7.pt --name yolov7_640_val
and the result:
Speed: 439.6/1.3/440.9 ms inference/NMS/total per 640x640 image at batch-size 1
While for YOLOv5, running the command below on the same images gives the following results:
python \yolov5\detect.py --source inference/images --device cpu
detect: weights=..\FFD\FFD_pipeline\yolov5\yolov5s.pt, source=inference/images, data=..\FFD\FFD_pipeline\yolov5\data\coco128.yaml, imgsz=[640, 640], conf_thres=0.25, iou_thres=0.45, max_det=1000, device=cpu, view_img=False, save_txt=False, save_conf=False, save_crop=False, nosave=False, classes=None, agnostic_nms=False, augment=False, visualize=False, update=False, project=..\FFD\FFD_pipeline\yolov5\runs\detect, name=exp, exist_ok=False, line_thickness=3, hide_labels=False, hide_conf=False, half=False, dnn=False
YOLOv5 2022-7-5 Python-3.8.13 torch-1.11.0+cpu CPU
Fusing layers...
YOLOv5s summary: 213 layers, 7225885 parameters, 0 gradients
image 1/2 D:\Code\yolov7\inference\images\horses.jpg: 448x640 5 horses, Done. (0.107s)
image 2/2 D:\Code\yolov7\inference\images\horses1.jpg: 448x640 5 horses, Done. (0.089s)
Speed: 0.0ms pre-process, 98.0ms inference, 1.5ms NMS per image at shape (1, 3, 640, 640)
Results saved to exp9
Any idea?
CPU inference time is usually proportional to FLOPs.
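As a rough sanity check of that proportionality, one can compare the models' FLOPs against the CPU timings reported above. The GFLOPs figures below are the commonly cited 640x640 values and may differ slightly between releases, so treat them as approximate:

```python
# Rough check: does the CPU timing gap track the FLOPs gap?
# GFLOPs are the commonly cited 640x640 figures (approximate).
yolov5s_gflops = 16.5    # YOLOv5s @ 640
yolov7_gflops = 104.7    # YOLOv7 @ 640

yolov5s_cpu_ms = 98.0    # measured above
yolov7_cpu_ms = 439.6    # measured above

flops_ratio = yolov7_gflops / yolov5s_gflops   # ~6.3x more compute
time_ratio = yolov7_cpu_ms / yolov5s_cpu_ms    # ~4.5x slower observed

print(f"FLOPs ratio: {flops_ratio:.1f}x, observed CPU time ratio: {time_ratio:.1f}x")
```

The two ratios are in the same ballpark, which is consistent with CPU inference being compute-bound rather than benefiting from any architecture-specific speedups.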
Using the pretrained weights (except yolov7-tiny) on the COCO 2017 dataset with a Tesla T4 GPU:
yolov5: python val.py --data data/coco.yaml --weights [model] --batch-size 1 --imgsz 640 --task speed --device 0
yolov5n: mAP@.5 0.535, mAP@.5:.95 0.359; 0.2ms pre-process, 4.5ms inference, 0.7ms NMS
yolov5s: mAP@.5 0.616, mAP@.5:.95 0.439; 0.2ms pre-process, 4.7ms inference, 0.7ms NMS
yolov5m: mAP@.5 0.672, mAP@.5:.95 0.509; 0.2ms pre-process, 6.8ms inference, 0.7ms NMS
yolov5l: mAP@.5 0.701, mAP@.5:.95 0.546; 0.2ms pre-process, 10.6ms inference, 0.7ms NMS
yolov7: python test.py --data data/coco.yaml --weights [model] --batch-size 1 --img-size 640 --task speed --device 0
yolov7-tiny: mAP@.5 0.349, mAP@.5:.95 0.236; 5.0/0.7/5.6 ms inference/NMS/total (trained for only 36 epochs)
yolov7: mAP@.5 0.616, mAP@.5:.95 0.46; 11.9/0.7/12.6 ms inference/NMS/total
I did notice that the displayed mAP values are not consistent with the pycocotools-evaluated mAP (which is the one that matches the claimed values in the paper), so that part may not be important. However, the speed is coming out worse than claimed. There is some variation in the inference times, but it is minor.
I cannot reproduce your results, but we have tested YOLOv7-tiny on both PyTorch and Darknet, and they showed consistent results.
Maybe you could run the experiment on Darknet to check whether your PyTorch performance for YOLOv7 is normal:
darknet.exe detector demo cfg/coco.data cfg/yolov7-tiny.cfg yolov7-tiny.weights test.mp4 -benchmark
Also, your posted results are really strange: the T4 GPU is slower than the V100, yet your T4 inference time is about 30% faster than the official u5 V100 inference time. Your T4 performance is also more than twice as fast as the official u5 reported benchmark.
Other people have also helped us benchmark on TensorRT; YOLOv7-tiny runs about twice as fast as YOLOv5s.
I have used my laptop (GPU: GTX 1650) to run yolov7 and yolov5-l. At first, yolov7 (150ms/image) seemed slower than yolov5-l (70ms/image).
But then I found this issue. When I set half=False, yolov7 becomes faster (60~70ms/image), which is close to yolov5-l.
In my opinion, some NVIDIA GPUs do not support half-precision inference well, and using half inference on them can be harmful. On such devices you need to set half=False to get faster inference.
Besides, in terms of parameters and model size, yolov7 is smaller than yolov5-l, so yolov7 is more efficient.
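To put a number on the effect described in this comment, here is a small sketch using the approximate GTX 1650 latencies reported above (the 65 ms figure is the midpoint of the stated 60~70 ms range):

```python
# Approximate per-image latencies from this comment (GTX 1650, ms/image).
yolov7_half_ms = 150.0    # with half-precision inference (test.py default)
yolov7_float_ms = 65.0    # with half=False, midpoint of the reported 60~70 ms
yolov5l_ms = 70.0         # yolov5-l, for comparison

# On this GPU, forcing FP32 makes yolov7 roughly 2.3x faster than its own
# FP16 run, the opposite of what FP16 gives on GPUs with fast tensor cores.
fp16_penalty = yolov7_half_ms / yolov7_float_ms
print(f"FP16 penalty on this GPU: ~{fp16_penalty:.1f}x")
```

In other words, on hardware without good FP16 support the default half-precision path can more than double the latency, which alone explains the ordering flip seen here.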
I am also confused. The attached image compares inference speed between yolov7 and yolov5s6, and between yolov7-tiny and yolov5n6. The inference speed of yolov7 is 0.152s while yolov5s6 is 0.011s; yolov7-tiny is 0.039s while yolov5n6 is 0.007s.
Please help me explain these results. Thanks.
I am also confused by this. I do not believe yolov7 is faster than yolov5.
When tested in an identical environment on an NVIDIA T4 GPU:
YOLOv7 (51.2% AP, 12.7ms) is 1.5x faster and +6.3% AP more accurate than YOLOv5s6 (44.9% AP, 18.7ms)
!python test.py --data data/coco.yaml --img 640 --batch 1 --conf 0.001 --iou 0.65 --device 0 --weights yolov7.pt --name yolov7_640_val
...
Speed: 12.6/0.9/13.5 ms inference/NMS/total per 640x640 image at batch-size 1
...
Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.512
!python val.py --data data/coco.yaml --img 1280 --batch 1 --conf 0.001 --iou 0.65 --device 0 --weights yolov5s6.pt --name yolov5s6_1280_val
...
Speed: 0.7ms pre-process, 18.7ms inference, 1.7ms NMS per image at shape (1, 3, 1280, 1280)
...
Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.449
YOLOv7 (51.2% AP, 12.6ms) has almost the same accuracy but is 4x faster than YOLOv5m6 (51.3% AP, 49.1ms)
!python test.py --data data/coco.yaml --img 640 --batch 1 --conf 0.001 --iou 0.65 --device 0 --weights yolov7.pt --name yolov7_640_val
...
Speed: 12.6/0.9/13.5 ms inference/NMS/total per 640x640 image at batch-size 1
...
Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.512
!python val.py --data data/coco.yaml --img 1280 --batch 1 --conf 0.001 --iou 0.65 --device 0 --weights yolov5m6.pt --name yolov5m6_1280_val
...
Speed: 0.6ms pre-process, 49.1ms inference, 1.7ms NMS per image at shape (1, 3, 1280, 1280)
...
Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.513
Moreover, YOLOv7-w6 1280x1280 (54.6% AP, 29ms) has comparable accuracy but is 6.6x faster than YOLOv5x6 1280x1280 (55.0% AP, 192ms)
I tested YOLOv7 on an NVIDIA GeForce GTX 1080 Ti and an NVIDIA GeForce RTX 3070. On the 3070, YOLOv7 inference is approximately 50% slower than on the 1080 Ti. Consider this in speed tests.
We ran the inference in OpenCV using the ONNX converted models for a single image of size 640x640. All YOLOv7 versions seem to be slower than YOLOv4 and YOLOv5l. Any idea why this is the case?
It is strange that you get 56 FPS (18ms) for yolov7.pt
on a Titan RTX (130 TFLOPs-TC), while a T4 GPU (65 TFLOPs-TC) reaches a higher 79 FPS (12.6ms), even though the Titan RTX is twice as powerful a GPU: https://colab.research.google.com/gist/AlexeyAB/857c4859a7a27abca8775245884d1ecf/yolov7trtlinaom.ipynb
YOLOv7 (51.2% AP, 12.6ms) has almost the same accuracy but is 4x faster than YOLOv5m6 (51.3% AP, 49.1ms)
There seems to be something wrong with the ONNX converter or the ONNX inference code.
Have you integrated NMS into the YOLOv7 ONNX model as shown in our readme file, and did you evaluate YOLOv5 without NMS?
What batch size, float precision, tensor cores, export code, inference code, number of test images, warmup, NMS, etc. did you use?
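The warmup point in particular matters for every benchmark in this thread. A minimal timing harness along those lines might look like the sketch below, where `run_inference` is a placeholder for whatever forward pass is being timed:

```python
import time

def benchmark(run_inference, n_warmup=10, n_iters=100):
    """Time a callable with warmup, returning mean latency in ms.

    On GPU you would also need to synchronize the device before reading
    the clock (e.g. torch.cuda.synchronize() in PyTorch); otherwise you
    measure kernel-launch time, not inference time.
    """
    for _ in range(n_warmup):          # warmup: JIT, cuDNN autotune, caches
        run_inference()
    start = time.perf_counter()
    for _ in range(n_iters):
        run_inference()
    elapsed = time.perf_counter() - start
    return elapsed / n_iters * 1000.0  # ms per inference

# Placeholder workload standing in for a model forward pass.
mean_ms = benchmark(lambda: sum(i * i for i in range(10_000)))
print(f"mean latency: {mean_ms:.2f} ms")
```

Timing a single cold inference, as some reports above appear to do, includes one-time initialization cost and can easily differ by an order of magnitude from the steady-state latency.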
@AlexeyAB, regarding https://github.com/WongKinYiu/yolov7/issues/400#issue-1325396557: we tried OpenCV inference but got the error mentioned in that issue. Also, when inferencing with ONNX Runtime, we got low FPS.
batch size = 1,
float precision = 16,
tensor cores = 576,
export code = https://github.com/WongKinYiu/yolov7/blob/main/export.py,
inference code = the OpenCV function readNetFromONNX(); we measure the elapsed time for a single inference, repeat that for a set of ~500 images, and take the average.
Because of the error mentioned in that issue, we omitted the --grid flag from the export command given in the readme.
I compared the speed and mAP of yolov7 and yolov5s6 on coco128 using an RTX 2060 (the same situation as the T4: both have tensor cores).
for yolov7:
python test.py --data data/coco128.yaml --img 640 --batch 1 --conf 0.001 --iou 0.65 --device 0 --weights yolov7.pt
for yolov5s6:
python val.py --data data/coco128.yaml --img 1280 --batch 1 --conf 0.001 --iou 0.65 --device 0 --weights yolov5s6.pt
The conclusion is that the mAP of yolov7 is better: yolov7 at 640 input can exceed the mAP of yolov5s6 at 1280 input. That is why the paper only compares 640 inference time against 1280 inference time and does not compare at the same resolution. This may be one reason the inference speed of yolov7 came out slower than yolov5 in the comparisons above, since those used the same resolution.
More importantly, yolov7 uses half-precision inference by default, while yolov5 does not. So in the experimental results above, yolov7 seems faster than yolov5s6, but that is just an illusion created by the precision difference.
for yolov5s6 half:
python val.py --data data/coco128.yaml --img 1280 --batch 1 --conf 0.001 --iou 0.65 --device 0 --weights yolov5s6.pt --half
So yolov5 still has excellent speed performance at 1280 input, but it is undeniable that the mAP of yolov7 at 640 is also excellent.
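The resolution caveat above can be made concrete with a quick sketch: convolution cost grows roughly with the number of input pixels, so a 640-vs-1280 comparison is not an equal-compute comparison.

```python
# Convolution FLOPs scale roughly linearly with input pixel count, so a
# network evaluated at 1280x1280 does about 4x the conv work it would at 640x640.
pixels_640 = 640 * 640
pixels_1280 = 1280 * 1280
compute_scale = pixels_1280 / pixels_640
print(f"1280 input implies ~{compute_scale:.0f}x the compute of 640 input")  # ~4x
```

This is why the paper's 640-vs-1280 pairing can show yolov7 winning on speed while same-resolution tests show the opposite ordering.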
As a supplement, here is yolov7's inference speed under FP32.
Modify the half_precision parameter of the test function in test.py to False and run:
python test.py --data data/coco128.yaml --img 640 --batch 1 --conf 0.001 --iou 0.65 --device 0 --weights yolov7.pt
Regarding the earlier report that YOLOv7 runs about 50% slower on an RTX 3070 than on a GTX 1080 Ti: did you ever figure out a fix?
I tried to compare the inference speed of yolov7 and yolov5m trained on a custom dataset, running on a Tesla T4 16GB GPU. The paper claims that yolov7 should be significantly faster here; however, in my testing the inference time of yolov7 was twice that of yolov5m. The inference time I'm getting seems to be proportional only to the FLOPs of the model. For the test I used the --task speed flag on test.py for yolov7 and val.py for yolov5. I made sure they were running on the GPU, not the CPU, but this was still the case.