Latency Report - Hardware

levipereira commented 3 months ago

I have been trying to find the hardware specifications used for measuring latency, specifically the GPU, GPU clock speed, and CPU model. Unfortunately, I couldn't find this information. Could anyone provide details on the hardware setup used for these measurements?

Burhan-Q commented 3 months ago

See section 4.1 in the publication:

Moreover, the latencies of all models are tested on T4 GPU with TensorRT FP16, following [71].

levipereira commented 3 months ago

Thank you for pointing me to it; I looked but missed it. I have quantized YOLOv9 using QAT (quantization-aware training) with minimal loss of accuracy. I will try to do the same with YOLOv10, but I need to understand all of its modules first.

See the YOLOv9 results here: https://github.com/levipereira/yolov9-qat?tab=readme-ov-file#latencythroughput-report---tensorrt

levipereira commented 3 months ago

@jameslahm I added a report for the RTX 4090. (I will enable QAT later and check the results.)

Below is a report comparing the latest YOLO models (YOLOv9 and YOLOv10) on an RTX 4090. Both reach very similar detection accuracy, so the main differences are latency and YOLOv10's NMS-free design, which removes the separate NMS post-processing step.

Device

GPU
Device	NVIDIA GeForce RTX 4090
Compute Capability	8.9
SMs	128
Device Global Memory	24207 MiB
Application Compute Clock Rate	2.58 GHz
Application Memory Clock Rate	10.501 GHz

TensorRT version: 8.5.3

Model Name	Throughput (qps)	Latency, 99th percentile (ms)
yolov10n	2039	0.49
yolov10s	1539	0.65
yolov10m	971	1.03
yolov10b	854	1.17
yolov10l	689	1.45
yolov10x	501	1.99

yolov9-c-converted	825	1.21
yolov9-e-converted	357	2.80
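
The two columns in the table can be cross-checked against each other. If inference runs strictly one request at a time, throughput in qps should be close to 1000 divided by the latency in ms. The sketch below applies that check to the rows above. It is only an approximation: the reported latency is the 99th percentile rather than the mean, the batch size is not stated in this thread, and `check_consistency` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Rows copied from the table above: model, throughput (qps), 99% latency (ms).
results='yolov10n 2039 0.49
yolov10s 1539 0.65
yolov10m 971 1.03
yolov10b 854 1.17
yolov10l 689 1.45
yolov10x 501 1.99
yolov9-c-converted 825 1.21
yolov9-e-converted 357 2.80'

# Hypothetical helper: for strictly serial inference, throughput is roughly
# 1000 / latency_ms. Print the implied throughput next to the measured one,
# plus the relative difference in percent.
check_consistency() {
  awk '{
    implied = 1000 / $3
    printf "%-20s measured=%4d qps  implied=%4.0f qps  diff=%+.1f%%\n",
           $1, $2, implied, 100 * (implied - $2) / $2
  }'
}

printf '%s\n' "$results" | check_consistency
```

For every row the implied and measured throughput agree to within about half a percent, which suggests the 99th-percentile latency is very close to the typical latency in these runs.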

YOLOv9 relies on NMS as a separate post-processing step, which adds latency on top of the model's inference time. YOLOv10, in contrast, is trained to be NMS-free, so its output needs no NMS pass. The latency reported for YOLOv10 therefore already covers everything needed to obtain final detections, whereas the YOLOv9 figures in this report exclude the NMS cost.

  trtexec \
    --onnx="yolov10n.onnx" \
    --fp16 \
    --saveEngine="yolov10n.engine" \
    --timingCacheFile="yolov10n.engine.timing.cache" \
    --warmUp=500 \
    --duration=10 \
    --useCudaGraph \
    --useSpinWait \
    --noDataTransfers

Important: the --useSpinWait flag makes trtexec synchronize on GPU events by spin-waiting, which gives more stable latency measurements at the cost of higher CPU usage.
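
To reproduce the whole table rather than a single model, the command above can be wrapped in a loop. The following is a minimal sketch, not the script used for the report. It assumes each model has an ONNX file named `<model>.onnx` in the current directory (only `yolov10n.onnx` appears in this thread); the `benchmark` function, the `DRY_RUN` variable, and the `.trtexec.log` file names are hypothetical.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Model names taken from the table; the matching .onnx file names are assumed.
MODELS=(yolov10n yolov10s yolov10m yolov10b yolov10l yolov10x
        yolov9-c-converted yolov9-e-converted)

# Hypothetical helper: build the trtexec command for one model with the same
# flags as the command above, then print it (DRY_RUN=1) or run it (DRY_RUN=0).
benchmark() {
  local model="$1"
  local cmd=(trtexec
    --onnx="${model}.onnx"
    --fp16
    --saveEngine="${model}.engine"
    --timingCacheFile="${model}.engine.timing.cache"
    --warmUp=500
    --duration=10
    --useCudaGraph
    --useSpinWait
    --noDataTransfers)
  if [[ "${DRY_RUN:-1}" == "1" ]]; then
    echo "${cmd[*]}"
  else
    # Keep the full trtexec output per model for later inspection.
    "${cmd[@]}" | tee "${model}.trtexec.log"
  fi
}

# Dry run by default, so the commands can be reviewed before execution.
for model in "${MODELS[@]}"; do
  benchmark "$model"
done
```

Running the script as-is only prints the eight commands. Setting `DRY_RUN=0` executes them one after another, so the GPU is never shared between two measurements.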

jameslahm commented 3 months ago

Thanks for your efforts and detailed evaluation!

THU-MIG / yolov10

Latency Report - Hardware #44
