THU-MIG / yolov10

YOLOv10: Real-Time End-to-End Object Detection
https://arxiv.org/abs/2405.14458
GNU Affero General Public License v3.0

Latency Report - Hardware #44

Closed by levipereira 3 months ago

levipereira commented 3 months ago

I have been trying to find the hardware specifications used for measuring latency, specifically the GPU, GPU clock speed, and CPU model. Unfortunately, I couldn't find this information. Could anyone provide details on the hardware setup used for these measurements?

Burhan-Q commented 3 months ago

See section 4.1 in the publication:

Moreover, the latencies of all models are tested on T4 GPU with TensorRT FP16, following [71].

levipereira commented 3 months ago

Thank you for pointing me to it; I missed it even though I looked for it. I have quantized YOLOv9 using quantization-aware training (QAT) with minimal loss of accuracy. I will try to do the same with YOLOv10, but I need to understand all the modules first.

See the YOLOv9 results: https://github.com/levipereira/yolov9-qat?tab=readme-ov-file#latencythroughput-report---tensorrt

levipereira commented 3 months ago

@jameslahm I have added a report on the RTX 4090. (I will enable QAT next and check the results.)

Below is a report comparing the latest YOLO models (YOLOv9 and YOLOv10) on an RTX 4090. Although both have very similar accuracy, the main benefits of YOLOv10 in this comparison are its lower latency and the fact that it needs no separate NMS step.

Device

GPU

| Property | Value |
| --- | --- |
| Device | NVIDIA GeForce RTX 4090 |
| Compute Capability | 8.9 |
| SMs | 128 |
| Device Global Memory | 24207 MiB |
| Application Compute Clock Rate | 2.58 GHz |
| Application Memory Clock Rate | 10.501 GHz |

TensorRT version: 8.5.3

| Model Name | Throughput (qps) | Latency, 99th percentile (ms) |
| --- | --- | --- |
| yolov10n | 2039 | 0.49 |
| yolov10s | 1539 | 0.65 |
| yolov10m | 971 | 1.03 |
| yolov10b | 854 | 1.17 |
| yolov10l | 689 | 1.45 |
| yolov10x | 501 | 1.99 |
| yolov9-c-converted | 825 | 1.21 |
| yolov9-e-converted | 357 | 2.80 |
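As a rough way to read the table, the sketch below computes latency ratios between model pairs and the average time per query implied by the throughput column. The numbers are copied from the table; the function names and the model pairings (yolov9-c against yolov10b, yolov9-e against yolov10x) are assumptions chosen for illustration only.

```python
# Figures copied from the RTX 4090 table: (throughput in qps, p99 latency in ms).
RESULTS = {
    "yolov10n": (2039, 0.49),
    "yolov10s": (1539, 0.65),
    "yolov10m": (971, 1.03),
    "yolov10b": (854, 1.17),
    "yolov10l": (689, 1.45),
    "yolov10x": (501, 1.99),
    "yolov9-c-converted": (825, 1.21),
    "yolov9-e-converted": (357, 2.80),
}


def latency_ratio(baseline: str, candidate: str) -> float:
    """Baseline p99 latency divided by candidate p99 latency (>1: candidate is faster)."""
    return RESULTS[baseline][1] / RESULTS[candidate][1]


def mean_time_per_query_ms(model: str) -> float:
    """Average time per query implied by throughput: 1000 ms / qps."""
    return 1000.0 / RESULTS[model][0]


if __name__ == "__main__":
    # Hypothetical pairings; pick whichever models are comparable for your use case.
    pairs = [("yolov9-c-converted", "yolov10b"), ("yolov9-e-converted", "yolov10x")]
    for v9, v10 in pairs:
        print(f"{v10} vs {v9}: p99 latency ratio {latency_ratio(v9, v10):.2f}")
    for name in RESULTS:
        print(f"{name}: {mean_time_per_query_ms(name):.2f} ms per query on average")
```

Keep in mind that the YOLOv9 rows exclude NMS time, so these ratios likely understate the gap once YOLOv9 post-processing is added.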

YOLOv9 has no built-in NMS, so it requires a separate NMS post-processing step, which adds latency. YOLOv10, in contrast, is designed to run without NMS: the model outputs final detections directly, so no post-processing time is added. The latency reported for YOLOv10 therefore needs no NMS time added on top, whereas for YOLOv9 the NMS latency is not included in our report.
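To make concrete what kind of post-processing is left out of the YOLOv9 figures, here is a minimal sketch of greedy NMS in plain Python. It is a generic textbook version, not the implementation used in the YOLOv9 repository; the box format `(x1, y1, x2, y2)`, the 0.5 IoU threshold, and the function names are all assumptions.

```python
from typing import List, Sequence

Box = Sequence[float]  # assumed format: (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes: List[Box], scores: List[float], iou_threshold: float = 0.5) -> List[int]:
    """Greedy NMS: return indices of kept boxes, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep: List[int] = []
    for i in order:
        # Keep a box only if it does not overlap too much with any already-kept box.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


if __name__ == "__main__":
    boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
    scores = [0.9, 0.8, 0.7]
    print(nms(boxes, scores))
```

In the example, the second box overlaps the first with an IoU of 81/119 (about 0.68), so it is suppressed and only the first and third boxes are kept.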

  trtexec \
    --onnx="yolov10n.onnx" \
    --fp16 \
    --saveEngine="yolov10n.engine" \
    --timingCacheFile="yolov10n.engine.timing.cache" \
    --warmUp=500 \
    --duration=10 \
    --useCudaGraph \
    --useSpinWait \
    --noDataTransfers

Important: the --useSpinWait flag makes trtexec synchronize using spin-wait mode, which gives more stable latency measurements.
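To repeat the measurement for every model in the table, the command above can be generated per model. The following is a sketch only: it assumes `trtexec` is on the PATH and that each ONNX file is named `<model>.onnx` in the current directory, following the yolov10n example. The helper name `trtexec_args` is hypothetical. The script prints the commands rather than running them; pass the argument list to `subprocess.run` to execute.

```python
import shlex
from typing import List

# Model names taken from the results table.
MODELS = [
    "yolov10n", "yolov10s", "yolov10m", "yolov10b", "yolov10l", "yolov10x",
    "yolov9-c-converted", "yolov9-e-converted",
]


def trtexec_args(model: str) -> List[str]:
    """Build the trtexec argument list used in this report for one model."""
    return [
        "trtexec",
        f"--onnx={model}.onnx",            # assumed file naming: <model>.onnx
        "--fp16",
        f"--saveEngine={model}.engine",
        f"--timingCacheFile={model}.engine.timing.cache",
        "--warmUp=500",                    # warm-up time in milliseconds
        "--duration=10",                   # measurement time in seconds
        "--useCudaGraph",
        "--useSpinWait",
        "--noDataTransfers",               # exclude host/device copies from timing
    ]


if __name__ == "__main__":
    for model in MODELS:
        # Dry run: print a shell-safe command line for each model.
        print(shlex.join(trtexec_args(model)))
```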

jameslahm commented 3 months ago

Thanks for your efforts and detailed evaluation!