THU-MIG / yolov10

YOLOv10: Real-Time End-to-End Object Detection [NeurIPS 2024]
https://arxiv.org/abs/2405.14458
GNU Affero General Public License v3.0
9.9k stars 978 forks

Yolov10n slower than Yolov8n #88

Closed NeuralAIM closed 5 months ago

NeuralAIM commented 5 months ago

Code:

# yolo export model=yolov{v}n.pt format=onnx imgsz=160
import time

import numpy as np
import onnxruntime

onnxruntime.set_default_logger_severity(1)

model_path = "yolov8n.onnx"  # or "yolov10n.onnx"

sess_options = onnxruntime.SessionOptions()
# sess_options.graph_optimization_level = onnxruntime.GraphOptimizationLevel.ORT_ENABLE_ALL  # no speedup
# Note: the InferenceSession keyword is sess_options (not session_options)
session = onnxruntime.InferenceSession(model_path, sess_options=sess_options,
                                       providers=['DmlExecutionProvider'])

frame_count = 0
start_time = time.time()
input_data = np.random.randn(1, 3, 160, 160).astype(np.float32)

while True:
    results = session.run(None, {'images': input_data})
    # print(results)
    frame_count += 1
    # Report FPS once per second, then reset the counter
    if time.time() - start_time >= 1:
        fps = frame_count / (time.time() - start_time)
        print(f"FPS: {fps:.2f}")
        start_time = time.time()
        frame_count = 0

I tried both opset=13 and opset=17.
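As an aside, the loop above can under-report the first second because it includes session warm-up. A hedged variant that warms up first and then times a fixed window (the `measure_fps` helper below is my own sketch, not from the thread):

```python
import time

def measure_fps(run_once, seconds=1.0, warmup=10):
    """Run warm-up iterations first, then time `run_once` for `seconds`."""
    for _ in range(warmup):
        run_once()
    count, start = 0, time.time()
    while time.time() - start < seconds:
        run_once()
        count += 1
    return count / (time.time() - start)

# Usage with onnxruntime (session and input set up as in the script above):
# x = np.random.randn(1, 3, 160, 160).astype(np.float32)
# print(f"FPS: {measure_fps(lambda: session.run(None, {'images': x})):.2f}")
```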

Yolov8n:

FPS: 1179.21
FPS: 1215.69
FPS: 1219.65
FPS: 1211.25
FPS: 1207.61

Yolov10n:

...
2024-05-27 22:09:53.0768502 [W:onnxruntime:, session_state.cc:1166 onnxruntime::VerifyEachNodeIsAssignedToAnEp] Some nodes were not assigned to the preferred execution providers which may or may not have an negative impact on performance. e.g. ORT explicitly assigns shape related ops to CPU to improve perf.
2024-05-27 22:09:53.0796189 [W:onnxruntime:, session_state.cc:1168 onnxruntime::VerifyEachNodeIsAssignedToAnEp] Rerunning with verbose output on a non-minimal build will show node assignments.
FPS: 912.99
FPS: 940.92
FPS: 937.78
FPS: 937.25
FPS: 935.74

With `onnxruntime.set_default_logger_severity(0)`:

2024-05-27 22:11:53.2764365 [I:onnxruntime:, transformer_memcpy.cc:329 onnxruntime::TransformerMemcpyImpl::AddCopyNode] Add MemcpyFromHost after /model.23/Mod_output_0 for DmlExecutionProvider
2024-05-27 22:11:53.2782575 [I:onnxruntime:, transformer_memcpy.cc:329 onnxruntime::TransformerMemcpyImpl::AddCopyNode] Add MemcpyToHost before /model.23/TopK_1_output_1 for DmlExecutionProvider
2024-05-27 22:11:53.2801005 [I:onnxruntime:, graph_transformer.cc:15 onnxruntime::GraphTransformer::Apply] GraphTransformer MemcpyTransformer modified: 1 with status: OK
2024-05-27 22:11:53.2817768 [V:onnxruntime:, session_state.cc:1146 onnxruntime::VerifyEachNodeIsAssignedToAnEp] Node placements
2024-05-27 22:11:53.2832877 [V:onnxruntime:, session_state.cc:1152 onnxruntime::VerifyEachNodeIsAssignedToAnEp]  Node(s) placed on [CPUExecutionProvider]. Number of nodes: 1       
2024-05-27 22:11:53.2850396 [V:onnxruntime:, session_state.cc:1154 onnxruntime::VerifyEachNodeIsAssignedToAnEp]   Mod (/model.23/Mod)
2024-05-27 22:11:53.2863578 [V:onnxruntime:, session_state.cc:1152 onnxruntime::VerifyEachNodeIsAssignedToAnEp]  Node(s) placed on [DmlExecutionProvider]. Number of nodes: 4       
2024-05-27 22:11:53.2879198 [V:onnxruntime:, session_state.cc:1154 onnxruntime::VerifyEachNodeIsAssignedToAnEp]   DmlFusedNode_0_0 (DmlFusedNode_0_0)
2024-05-27 22:11:53.2894663 [V:onnxruntime:, session_state.cc:1154 onnxruntime::VerifyEachNodeIsAssignedToAnEp]   DmlFusedNode_1_5 (DmlFusedNode_1_5)
2024-05-27 22:11:53.2908286 [V:onnxruntime:, session_state.cc:1154 onnxruntime::VerifyEachNodeIsAssignedToAnEp]   MemcpyFromHost (Memcpy)
2024-05-27 22:11:53.2920996 [V:onnxruntime:, session_state.cc:1154 onnxruntime::VerifyEachNodeIsAssignedToAnEp]   MemcpyToHost (Memcpy_token_1)
2024-05-27 22:11:53.2934901 [W:onnxruntime:, session_state.cc:1166 onnxruntime::VerifyEachNodeIsAssignedToAnEp] Some nodes were not assigned to the preferred execution providers which may or may not have an negative impact on performance. e.g. ORT explicitly assigns shape related ops to CPU to improve perf.

But the larger the image size, the smaller this difference becomes; at 640x640, v10 starts to pull ahead of v8.

NeuralAIM commented 5 months ago

I used the method from #72:

image

After:

image

And get:

FPS: 1148.52
FPS: 1110.67
FPS: 1162.14
FPS: 1155.30
FPS: 1163.33

But it's still slower than YOLOv8n. (Tested on both a 3070 Ti and a 4070 laptop GPU; same results.)

NeuralAIM commented 5 months ago

When using half, the inference speed does not increase, although the model file becomes half the size: yolo export model=yolov10n.pt format=onnx imgsz=160 half=True device=0

Speed:

FPS: 1135.99
FPS: 1125.64
FPS: 1139.78
FPS: 1139.14
FPS: 1138.75
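One plausible explanation for the lack of an fp16 speedup (my assumption; the thread doesn't confirm it): a half=True export often declares a float16 input tensor, and feeding float32 either fails or silently forces casts that erase the gain. A small sketch for matching the feed dtype to the model's declared input type (`ONNX_TO_NUMPY` and `input_dtype` are my own names):

```python
import numpy as np

# Map the ONNX type string reported by onnxruntime to a NumPy dtype.
ONNX_TO_NUMPY = {
    "tensor(float16)": np.float16,
    "tensor(float)": np.float32,
}

def input_dtype(session):
    """Return the NumPy dtype matching the model's first input tensor."""
    return ONNX_TO_NUMPY[session.get_inputs()[0].type]

# Usage (onnxruntime assumed installed; model path is a placeholder):
# session = onnxruntime.InferenceSession("yolov10n_half.onnx",
#                                        providers=["DmlExecutionProvider"])
# x = np.random.randn(1, 3, 160, 160).astype(input_dtype(session))
```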
jameslahm commented 5 months ago

Thanks for your interest! We have some suggestions and questions:

  1. Could you please try adding `simplify` when exporting the ONNX?
  2. Does the warning `Some nodes were not assigned to the preferred execution providers which may or may not have an negative impact on performance. e.g. ORT explicitly assigns shape related ops to CPU to improve perf.` still appear after removing the postprocessing of YOLOv10?
  3. The comparison in the first comment is unfair because the YOLOv10 export includes the postprocessing while the YOLOv8 export does not. Could you please incorporate the postprocessing into the YOLOv8 ONNX and evaluate again?
  4. The second comment shows that, when considering only the model forward pass, YOLOv10's FPS is slightly lower than YOLOv8's. This is consistent with our benchmark (in Latency$^f$: YOLOv8n 1.77ms vs. YOLOv10n 1.79ms).
  5. Regarding the output speed not increasing with half: does DmlExecutionProvider support FP16?
laugh12321 commented 5 months ago

@NeuralAIM @jameslahm

I added the EfficientNMS plugin (conf 0.25, iou 0.65, max_det 100) as post-processing for both YOLOv10n (https://github.com/THU-MIG/yolov10/pull/29) and YOLOv8n (https://github.com/laugh12321/TensorRT-YOLO), so that their input/output node counts and dimensions are identical. I then ran performance tests with trtexec --fp16 on a system with an RTX 2080Ti 22GB GPU, an AMD Ryzen 7 5700X 8-core processor, and 128GB RAM. The results are as follows.

YOLOv10n

trtexec --onnx=yolov10n.onnx --saveEngine=yolov10n.engine --fp16
[05/28/2024-09:31:49] [I] === Performance summary ===
[05/28/2024-09:31:49] [I] Throughput: 452.717 qps
[05/28/2024-09:31:49] [I] Latency: min = 1.95093 ms, max = 3.67439 ms, mean = 2.20258 ms, median = 2.13184 ms, percentile(90%) = 2.44647 ms, percentile(95%) = 2.63928 ms, percentile(99%) = 2.93274 ms
[05/28/2024-09:31:49] [I] Enqueue Time: min = 1.09033 ms, max = 3.86938 ms, mean = 2.10023 ms, median = 2.07922 ms, percentile(90%) = 2.35352 ms, percentile(95%) = 2.63934 ms, percentile(99%) = 3.11755 ms
[05/28/2024-09:31:49] [I] H2D Latency: min = 0.376801 ms, max = 0.538086 ms, mean = 0.385375 ms, median = 0.378418 ms, percentile(90%) = 0.401978 ms, percentile(95%) = 0.410156 ms, percentile(99%) = 0.456177 ms
[05/28/2024-09:31:49] [I] GPU Compute Time: min = 1.55652 ms, max = 3.25937 ms, mean = 1.79524 ms, median = 1.72491 ms, percentile(90%) = 2.03784 ms, percentile(95%) = 2.2323 ms, percentile(99%) = 2.49048 ms
[05/28/2024-09:31:49] [I] D2H Latency: min = 0.0078125 ms, max = 0.123169 ms, mean = 0.0219708 ms, median = 0.0233154 ms, percentile(90%) = 0.0400391 ms, percentile(95%) = 0.0466309 ms, percentile(99%) = 0.0657959 ms
[05/28/2024-09:31:49] [I] Total Host Walltime: 3.00629 s
[05/28/2024-09:31:49] [I] Total GPU Compute Time: 2.44332 s
[05/28/2024-09:31:49] [W] * Throughput may be bound by Enqueue Time rather than GPU Compute and the GPU may be under-utilized.
[05/28/2024-09:31:49] [W]   If not already in use, --useCudaGraph (utilize CUDA graphs where possible) may increase the throughput.
[05/28/2024-09:31:49] [W] * GPU compute time is unstable, with coefficient of variance = 11.1813%.
[05/28/2024-09:31:49] [W]   If not already in use, locking GPU clock frequency or adding --useSpinWait may improve the stability.
[05/28/2024-09:31:49] [I] Explanations of the performance metrics are printed in the verbose logs.
[05/28/2024-09:31:49] [I]

YOLOv8n

trtexec --onnx=yolov8n.onnx --saveEngine=yolov8n.engine --fp16
[05/28/2024-09:37:20] [I] === Performance summary ===
[05/28/2024-09:37:20] [I] Throughput: 507.55 qps
[05/28/2024-09:37:20] [I] Latency: min = 1.73242 ms, max = 2.74728 ms, mean = 1.93398 ms, median = 1.90881 ms, percentile(90%) = 2.03448 ms, percentile(95%) = 2.1394 ms, percentile(99%) = 2.45068 ms
[05/28/2024-09:37:20] [I] Enqueue Time: min = 0.944092 ms, max = 2.87231 ms, mean = 1.86921 ms, median = 1.87592 ms, percentile(90%) = 1.95801 ms, percentile(95%) = 2.00525 ms, percentile(99%) = 2.55115 ms
[05/28/2024-09:37:20] [I] H2D Latency: min = 0.377319 ms, max = 0.496002 ms, mean = 0.385183 ms, median = 0.378784 ms, percentile(90%) = 0.402039 ms, percentile(95%) = 0.404053 ms, percentile(99%) = 0.443115 ms
[05/28/2024-09:37:20] [I] GPU Compute Time: min = 1.31958 ms, max = 2.3558 ms, mean = 1.52645 ms, median = 1.50269 ms, percentile(90%) = 1.62378 ms, percentile(95%) = 1.72607 ms, percentile(99%) = 2.02539 ms
[05/28/2024-09:37:20] [I] D2H Latency: min = 0.0078125 ms, max = 0.109375 ms, mean = 0.0223493 ms, median = 0.0247803 ms, percentile(90%) = 0.0297852 ms, percentile(95%) = 0.0429688 ms, percentile(99%) = 0.0550537 ms
[05/28/2024-09:37:20] [I] Total Host Walltime: 3.00266 s
[05/28/2024-09:37:20] [I] Total GPU Compute Time: 2.32631 s
[05/28/2024-09:37:20] [W] * Throughput may be bound by Enqueue Time rather than GPU Compute and the GPU may be under-utilized.
[05/28/2024-09:37:20] [W]   If not already in use, --useCudaGraph (utilize CUDA graphs where possible) may increase the throughput.
[05/28/2024-09:37:20] [W] * GPU compute time is unstable, with coefficient of variance = 6.9708%.
[05/28/2024-09:37:20] [W]   If not already in use, locking GPU clock frequency or adding --useSpinWait may improve the stability.
[05/28/2024-09:37:20] [I] Explanations of the performance metrics are printed in the verbose logs.
[05/28/2024-09:37:20] [I]
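The two trtexec summaries above can be compared programmatically. A hedged sketch that pulls the throughput figure out of a trtexec log (log format as shown above; `parse_throughput` is my own helper name):

```python
import re

def parse_throughput(log_text):
    """Extract the qps figure from a trtexec '=== Performance summary ===' log."""
    match = re.search(r"Throughput:\s*([\d.]+)\s*qps", log_text)
    return float(match.group(1)) if match else None

# e.g. parse_throughput(open("yolov10n_trtexec.log").read())
# (the log file name is a placeholder)
```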
jameslahm commented 5 months ago

Thanks. As we discussed in https://github.com/THU-MIG/yolov10/pull/29, there is no need to add the EfficientNMS plugin to YOLOv10. Besides, benchmarking the TensorRT engines with the NMS plugin may be biased: with random input the classification score output is meaningless, so the measured NMS time is not representative.

wsy-yjys commented 5 months ago

@jameslahm Hi, this work is very good, but I am confused about YOLOv10's speed. I used CUDA 11.2, PyTorch 2.0.1, onnx 1.15.0, and TensorRT 10.0.1.6 to export YOLOv8 and YOLOv10 to engines with the commands below, and tested the speed on a 2080Ti 22G, but found that v8's throughput is higher than v10's. Am I missing something?

  1. Export YOLOv8 and test speed
    yolo export model=yolov8n.pt format=onnx opset=13 simplify
    trtexec --onnx=yolov8n.onnx --saveEngine=yolov8n.engine --fp16
    trtexec --fp16 --avgRuns=1000  --useSpinWait --loadEngine=yolov8n.engine
[05/28/2024-01:43:32] [I] === Performance summary ===
[05/28/2024-01:43:32] [I] Throughput: 893.606 qps
[05/28/2024-01:43:32] [I] Latency: min = 2.09741 ms, max = 2.42079 ms, mean = 2.21249 ms, median = 2.20703 ms, percentile(90%) = 2.21738 ms, percentile(95%) = 2.22119 ms, percentile(99%) = 2.40974 ms
[05/28/2024-01:43:32] [I] Enqueue Time: min = 0.345001 ms, max = 0.474365 ms, mean = 0.350971 ms, median = 0.349854 ms, percentile(90%) = 0.355713 ms, percentile(95%) = 0.358154 ms, percentile(99%) = 0.364075 ms
[05/28/2024-01:43:32] [I] H2D Latency: min = 0.761597 ms, max = 0.783752 ms, mean = 0.77322 ms, median = 0.769043 ms, percentile(90%) = 0.780762 ms, percentile(95%) = 0.78125 ms, percentile(99%) = 0.781982 ms
[05/28/2024-01:43:32] [I] GPU Compute Time: min = 0.89917 ms, max = 1.23401 ms, mean = 0.998447 ms, median = 0.994507 ms, percentile(90%) = 1.00989 ms, percentile(95%) = 1.01392 ms, percentile(99%) = 1.22266 ms
[05/28/2024-01:43:32] [I] D2H Latency: min = 0.418945 ms, max = 0.460205 ms, mean = 0.440824 ms, median = 0.432373 ms, percentile(90%) = 0.457764 ms, percentile(95%) = 0.45813 ms, percentile(99%) = 0.45874 ms
[05/28/2024-01:43:32] [I] Total Host Walltime: 3.00356 s
[05/28/2024-01:43:32] [I] Total GPU Compute Time: 2.67983 s
[05/28/2024-01:43:32] [W] * GPU compute time is unstable, with coefficient of variance = 4.13154%.
[05/28/2024-01:43:32] [W]   If not already in use, locking GPU clock frequency or adding --useSpinWait may improve the stability.
[05/28/2024-01:43:32] [I] Explanations of the performance metrics are printed in the verbose logs.
[05/28/2024-01:43:32] [I] 
&&&& PASSED TensorRT.trtexec [TensorRT v100001] # trtexec --fp16 --avgRuns=1000 --useSpinWait --loadEngine=yolov8n.engine
  2. Export YOLOv10 and test speed
    yolo export model=yolov10n.pt format=onnx opset=13 simplify
    trtexec --onnx=yolov10n.onnx --saveEngine=yolov10n.engine --fp16
    trtexec --fp16 --avgRuns=1000 --useSpinWait --loadEngine=yolov10n.engine
[05/28/2024-01:43:58] [I] === Performance summary ===
[05/28/2024-01:43:58] [I] Throughput: 870.097 qps
[05/28/2024-01:43:58] [I] Latency: min = 1.83008 ms, max = 2.08215 ms, mean = 1.92808 ms, median = 1.92896 ms, percentile(90%) = 1.93481 ms, percentile(95%) = 1.93713 ms, percentile(99%) = 1.94312 ms
[05/28/2024-01:43:58] [I] Enqueue Time: min = 0.448425 ms, max = 0.541504 ms, mean = 0.455154 ms, median = 0.453735 ms, percentile(90%) = 0.460876 ms, percentile(95%) = 0.463135 ms, percentile(99%) = 0.470215 ms
[05/28/2024-01:43:58] [I] H2D Latency: min = 0.770386 ms, max = 0.779297 ms, mean = 0.775283 ms, median = 0.775269 ms, percentile(90%) = 0.775879 ms, percentile(95%) = 0.776001 ms, percentile(99%) = 0.776428 ms
[05/28/2024-01:43:58] [I] GPU Compute Time: min = 1.05078 ms, max = 1.30176 ms, mean = 1.14739 ms, median = 1.14868 ms, percentile(90%) = 1.1543 ms, percentile(95%) = 1.15637 ms, percentile(99%) = 1.1626 ms
[05/28/2024-01:43:58] [I] D2H Latency: min = 0.00341797 ms, max = 0.00720215 ms, mean = 0.00540887 ms, median = 0.00549316 ms, percentile(90%) = 0.00622559 ms, percentile(95%) = 0.00646973 ms, percentile(99%) = 0.00683594 ms
[05/28/2024-01:43:58] [I] Total Host Walltime: 3.00311 s
[05/28/2024-01:43:58] [I] Total GPU Compute Time: 2.99813 s
[05/28/2024-01:43:58] [I] Explanations of the performance metrics are printed in the verbose logs.
[05/28/2024-01:43:58] [I] 
&&&& PASSED TensorRT.trtexec [TensorRT v100001] # trtexec --fp16 --avgRuns=1000 --useSpinWait --loadEngine=yolov10n.engine
jameslahm commented 5 months ago

Thanks. Please note that the exported YOLOv10 ONNX includes the postprocessing step, so the latency you measured is the end-to-end latency. The exported YOLOv8 ONNX does not contain the postprocessing step (i.e., the NMS), so the latency you measured there covers only the model forward pass. YOLOv8's end-to-end latency needs to add the extra time of NMS postprocessing, so it is unfair to compare an end-to-end latency against a forward-only latency.

laugh12321 commented 5 months ago


Thank you for the response. In https://github.com/THU-MIG/yolov10/pull/29#issuecomment-2131005872, the tests of YOLOv10s using v10postprocess versus EfficientNMS for post-processing showed that, even with random inputs, their latency difference stayed within 0.18ms (percentile(99%)). Meanwhile, when both YOLOv8n and YOLOv10n used EfficientNMS for post-processing, the latency difference was 0.48ms (percentile(99%)).

jameslahm commented 5 months ago

Thanks. We want to clarify that benchmarking the TensorRT engines with the NMS plugin may be biased due to the random input, i.e., the result may vary from run to run. Besides, in the benchmark by @wsy-yjys, whose device is similar to yours, the end-to-end latency of YOLOv10 is 1000 / 870.097 = 1.149ms, while the model-forward latency of YOLOv8 is 1000 / 893.606 = 1.119ms. To consider only YOLOv10's model forward latency, the postprocessing time must be excluded from the 1.149ms end-to-end figure, so the model-forward latency gap between YOLOv10 and YOLOv8 is smaller than 0.03ms.
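The arithmetic in this comment is just the reciprocal of the reported trtexec throughput; a small sketch of the conversion (`qps_to_latency_ms` is my own helper name):

```python
def qps_to_latency_ms(qps):
    """Convert trtexec throughput (queries per second) to per-inference latency in ms."""
    return 1000.0 / qps

v10_e2e = qps_to_latency_ms(870.097)  # YOLOv10n end-to-end (includes postprocessing)
v8_fwd = qps_to_latency_ms(893.606)   # YOLOv8n forward only (no NMS)

# The difference is an upper bound on the forward-latency gap, since
# v10_e2e still contains YOLOv10's postprocessing time.
print(round(v10_e2e, 3), round(v8_fwd, 3), round(v10_e2e - v8_fwd, 3))
```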


laugh12321 commented 5 months ago

@jameslahm Thank you for your clarification. Perhaps my previous comparison method was unfair. I have now removed the v10postprocess post-processing from YOLOv10, retaining only the model's forward pass so that it is consistent with YOLOv8.

The modification to v10Detect is as follows:

    def forward(self, x):
        one2one = self.forward_feat([xi.detach() for xi in x], self.one2one_cv2, self.one2one_cv3)
        if not self.export:
            one2many = super().forward(x)

        if not self.training:
            one2one = self.inference(one2one)
            if not self.export:
                return {"one2many": one2many, "one2one": one2one}
            else:
                assert self.max_det != -1
                return one2one
                # boxes, scores, labels = ops.v10postprocess(one2one.permute(0, 2, 1), self.max_det, self.nc)
                # return torch.cat([boxes, scores.unsqueeze(-1), labels.unsqueeze(-1)], dim=-1)
        else:
            return {"one2many": one2many, "one2one": one2one}
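For context on what this change strips out: the commented-out ops.v10postprocess call above selects the top max_det predictions without NMS. A rough NumPy sketch of that style of top-k post-processing (illustrative only; the function name and details here are mine, not the repo's implementation):

```python
import numpy as np

def topk_postprocess(preds, max_det, nc):
    """Select the top-`max_det` detections from raw predictions.

    preds: (batch, num_anchors, 4 + nc) array -- 4 box coordinates
    followed by `nc` per-class scores. Returns (boxes, scores, labels)
    with shapes (batch, max_det, 4), (batch, max_det), (batch, max_det).
    Illustrative sketch only, not the repo's ops.v10postprocess.
    """
    boxes, cls_scores = preds[..., :4], preds[..., 4:]
    assert cls_scores.shape[-1] == nc
    # Best class score and its label for each anchor.
    scores = cls_scores.max(axis=-1)
    labels = cls_scores.argmax(axis=-1)
    # Indices of the max_det highest-scoring anchors, best first.
    order = np.argsort(-scores, axis=-1)[:, :max_det]
    batch_idx = np.arange(preds.shape[0])[:, None]
    return boxes[batch_idx, order], scores[batch_idx, order], labels[batch_idx, order]
```

Because the one-to-one head is NMS-free, this top-k selection is essentially the only post-processing that sits outside the exported graph.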

Subsequent trtexec --fp16 test results are as follows:

yolov8n


[05/28/2024-11:01:47] [I] === Performance summary ===
[05/28/2024-11:01:47] [I] Throughput: 558.147 qps
[05/28/2024-11:01:47] [I] Latency: min = 1.8656 ms, max = 3.60669 ms, mean = 2.08368 ms, median = 2.03232 ms, percentile(90%) = 2.25513 ms, percentile(95%) = 2.46301 ms, percentile(99%) = 2.7439 ms
[05/28/2024-11:01:47] [I] Enqueue Time: min = 0.743652 ms, max = 3.02124 ms, mean = 1.6494 ms, median = 1.67993 ms, percentile(90%) = 1.9364 ms, percentile(95%) = 2.16943 ms, percentile(99%) = 2.58569 ms
[05/28/2024-11:01:47] [I] H2D Latency: min = 0.381104 ms, max = 0.745361 ms, mean = 0.417279 ms, median = 0.411377 ms, percentile(90%) = 0.44043 ms, percentile(95%) = 0.460205 ms, percentile(99%) = 0.498779 ms
[05/28/2024-11:01:47] [I] GPU Compute Time: min = 1.17566 ms, max = 2.76221 ms, mean = 1.41826 ms, median = 1.36017 ms, percentile(90%) = 1.6062 ms, percentile(95%) = 1.81885 ms, percentile(99%) = 2.08856 ms
[05/28/2024-11:01:47] [I] D2H Latency: min = 0.217285 ms, max = 0.296387 ms, mean = 0.248139 ms, median = 0.253357 ms, percentile(90%) = 0.258606 ms, percentile(95%) = 0.259033 ms, percentile(99%) = 0.272217 ms
[05/28/2024-11:01:47] [I] Total Host Walltime: 3.00458 s
[05/28/2024-11:01:47] [I] Total GPU Compute Time: 2.37843 s
[05/28/2024-11:01:47] [W] * Throughput may be bound by Enqueue Time rather than GPU Compute and the GPU may be under-utilized.
[05/28/2024-11:01:47] [W]   If not already in use, --useCudaGraph (utilize CUDA graphs where possible) may increase the throughput.
[05/28/2024-11:01:47] [W] * GPU compute time is unstable, with coefficient of variance = 12.2596%.
[05/28/2024-11:01:47] [W]   If not already in use, locking GPU clock frequency or adding --useSpinWait may improve the stability.
[05/28/2024-11:01:47] [I] Explanations of the performance metrics are printed in the verbose logs.
[05/28/2024-11:01:47] [I]

yolov10n


[05/28/2024-11:08:37] [I] === Performance summary ===
[05/28/2024-11:08:37] [I] Throughput: 482.106 qps
[05/28/2024-11:08:37] [I] Latency: min = 2.07495 ms, max = 3.29332 ms, mean = 2.35743 ms, median = 2.30127 ms, percentile(90%) = 2.58521 ms, percentile(95%) = 2.69824 ms, percentile(99%) = 2.89526 ms
[05/28/2024-11:08:37] [I] Enqueue Time: min = 0.985107 ms, max = 3.60083 ms, mean = 1.97798 ms, median = 1.97943 ms, percentile(90%) = 2.37891 ms, percentile(95%) = 2.71021 ms, percentile(99%) = 3.12744 ms
[05/28/2024-11:08:37] [I] H2D Latency: min = 0.380432 ms, max = 0.539886 ms, mean = 0.41434 ms, median = 0.4104 ms, percentile(90%) = 0.436829 ms, percentile(95%) = 0.447388 ms, percentile(99%) = 0.48822 ms
[05/28/2024-11:08:37] [I] GPU Compute Time: min = 1.40918 ms, max = 2.62343 ms, mean = 1.69528 ms, median = 1.63431 ms, percentile(90%) = 1.94995 ms, percentile(95%) = 2.05469 ms, percentile(99%) = 2.245 ms
[05/28/2024-11:08:37] [I] D2H Latency: min = 0.217285 ms, max = 0.306885 ms, mean = 0.24781 ms, median = 0.252197 ms, percentile(90%) = 0.258606 ms, percentile(95%) = 0.258972 ms, percentile(99%) = 0.27124 ms
[05/28/2024-11:08:37] [I] Total Host Walltime: 3.00349 s
[05/28/2024-11:08:37] [I] Total GPU Compute Time: 2.45476 s
[05/28/2024-11:08:37] [W] * Throughput may be bound by Enqueue Time rather than GPU Compute and the GPU may be under-utilized.
[05/28/2024-11:08:37] [W]   If not already in use, --useCudaGraph (utilize CUDA graphs where possible) may increase the throughput.
[05/28/2024-11:08:37] [W] * GPU compute time is unstable, with coefficient of variance = 10.2077%.
[05/28/2024-11:08:37] [W]   If not already in use, locking GPU clock frequency or adding --useSpinWait may improve the stability.
[05/28/2024-11:08:37] [I] Explanations of the performance metrics are printed in the verbose logs.
[05/28/2024-11:08:37] [I]
jameslahm commented 5 months ago

Thanks. Is your environment similar to @wsy-yjys 's? What are the results if you follow @wsy-yjys 's benchmark command?

laugh12321 commented 5 months ago

Thanks. Is your environment similar to @wsy-yjys 's? What are the results if you follow @wsy-yjys 's benchmark command?

System Information


Benchmark

YOLOv8n


trtexec --fp16 --avgRuns=1000 --useSpinWait --loadEngine=yolov8n.engine
[05/28/2024-11:31:29] [I] === Trace details ===
[05/28/2024-11:31:29] [I] Trace averages of 1000 runs:
[05/28/2024-11:31:29] [I] Average on 1000 runs - GPU latency: 1.39377 ms - Host latency: 2.05883 ms (enqueue 1.69882 ms)
[05/28/2024-11:31:29] [I]
[05/28/2024-11:31:29] [I] === Performance summary ===
[05/28/2024-11:31:29] [I] Throughput: 554.603 qps
[05/28/2024-11:31:29] [I] Latency: min = 1.83154 ms, max = 3.55713 ms, mean = 2.06661 ms, median = 2.02911 ms, percentile(90%) = 2.2157 ms, percentile(95%) = 2.30354 ms, percentile(99%) = 2.56082 ms
[05/28/2024-11:31:29] [I] Enqueue Time: min = 0.816162 ms, max = 2.88916 ms, mean = 1.67346 ms, median = 1.69965 ms, percentile(90%) = 1.82739 ms, percentile(95%) = 1.96362 ms, percentile(99%) = 2.4436 ms
[05/28/2024-11:31:29] [I] H2D Latency: min = 0.380249 ms, max = 0.531006 ms, mean = 0.414946 ms, median = 0.408752 ms, percentile(90%) = 0.439331 ms, percentile(95%) = 0.450165 ms, percentile(99%) = 0.481445 ms
[05/28/2024-11:31:29] [I] GPU Compute Time: min = 1.16016 ms, max = 2.95239 ms, mean = 1.40298 ms, median = 1.36151 ms, percentile(90%) = 1.5708 ms, percentile(95%) = 1.66321 ms, percentile(99%) = 1.90479 ms
[05/28/2024-11:31:29] [I] D2H Latency: min = 0.217041 ms, max = 0.296143 ms, mean = 0.248693 ms, median = 0.252441 ms, percentile(90%) = 0.258484 ms, percentile(95%) = 0.258789 ms, percentile(99%) = 0.270142 ms
[05/28/2024-11:31:29] [I] Total Host Walltime: 3.00395 s
[05/28/2024-11:31:29] [I] Total GPU Compute Time: 2.33736 s
[05/28/2024-11:31:29] [W] * Throughput may be bound by Enqueue Time rather than GPU Compute and the GPU may be under-utilized.
[05/28/2024-11:31:29] [W]   If not already in use, --useCudaGraph (utilize CUDA graphs where possible) may increase the throughput.
[05/28/2024-11:31:29] [W] * GPU compute time is unstable, with coefficient of variance = 9.62775%.
[05/28/2024-11:31:29] [W]   If not already in use, locking GPU clock frequency or adding --useSpinWait may improve the stability.
[05/28/2024-11:31:29] [I] Explanations of the performance metrics are printed in the verbose logs.
[05/28/2024-11:31:29] [I]

YOLOv10n without v10postprocess


trtexec --fp16 --avgRuns=1000 --useSpinWait --loadEngine=yolov10n.engine
[05/28/2024-11:29:50] [I] === Trace details ===
[05/28/2024-11:29:50] [I] Trace averages of 1000 runs:
[05/28/2024-11:29:50] [I] Average on 1000 runs - GPU latency: 1.68408 ms - Host latency: 2.35322 ms (enqueue 2.07476 ms)
[05/28/2024-11:29:50] [I]
[05/28/2024-11:29:50] [I] === Performance summary ===
[05/28/2024-11:29:50] [I] Throughput: 469.203 qps
[05/28/2024-11:29:50] [I] Latency: min = 2.14148 ms, max = 3.26465 ms, mean = 2.34296 ms, median = 2.32764 ms, percentile(90%) = 2.42896 ms, percentile(95%) = 2.51807 ms, percentile(99%) = 2.75903 ms
[05/28/2024-11:29:50] [I] Enqueue Time: min = 1.10425 ms, max = 3.29858 ms, mean = 2.06007 ms, median = 2.06177 ms, percentile(90%) = 2.1449 ms, percentile(95%) = 2.18994 ms, percentile(99%) = 2.4021 ms
[05/28/2024-11:29:50] [I] H2D Latency: min = 0.38208 ms, max = 0.513916 ms, mean = 0.416314 ms, median = 0.411987 ms, percentile(90%) = 0.436279 ms, percentile(95%) = 0.437988 ms, percentile(99%) = 0.45166 ms
[05/28/2024-11:29:50] [I] GPU Compute Time: min = 1.46167 ms, max = 2.57251 ms, mean = 1.67436 ms, median = 1.65729 ms, percentile(90%) = 1.75928 ms, percentile(95%) = 1.85248 ms, percentile(99%) = 2.0999 ms
[05/28/2024-11:29:50] [I] D2H Latency: min = 0.217041 ms, max = 0.281982 ms, mean = 0.25229 ms, median = 0.253296 ms, percentile(90%) = 0.256439 ms, percentile(95%) = 0.258667 ms, percentile(99%) = 0.259552 ms
[05/28/2024-11:29:50] [I] Total Host Walltime: 3.00296 s
[05/28/2024-11:29:50] [I] Total GPU Compute Time: 2.35917 s
[05/28/2024-11:29:50] [W] * Throughput may be bound by Enqueue Time rather than GPU Compute and the GPU may be under-utilized.
[05/28/2024-11:29:50] [W]   If not already in use, --useCudaGraph (utilize CUDA graphs where possible) may increase the throughput.
[05/28/2024-11:29:50] [W] * GPU compute time is unstable, with coefficient of variance = 5.97861%.
[05/28/2024-11:29:50] [W]   If not already in use, locking GPU clock frequency or adding --useSpinWait may improve the stability.
[05/28/2024-11:29:50] [I] Explanations of the performance metrics are printed in the verbose logs.
[05/28/2024-11:29:50] [I]
jameslahm commented 5 months ago

Thanks. What are the versions of your TensorRT, CUDA, PyTorch, and onnx? Are they similar to @wsy-yjys 's?

laugh12321 commented 5 months ago

Thanks. What are the versions of your TensorRT, CUDA, PyTorch, and onnx? Are they similar to @wsy-yjys 's?

Thank you for your question. Here are the versions of my environment (screenshots omitted).

jameslahm commented 5 months ago

Thanks. How do you obtain the onnx and TensorRT engine?

laugh12321 commented 5 months ago

Thanks. How do you obtain the onnx and TensorRT engine?

Similar to the steps of @wsy-yjys

yolo export model=model.pt format=onnx opset=13 simplify
trtexec --onnx=model.onnx --saveEngine=model.engine --fp16
jnulzl commented 5 months ago

Softwares:

  • CUDA: v11.7
  • TensorRT: v8.6.1
  • cuDNN: v8.6.0
  • PyTorch: v2.2.0

trtexec --onnx=yolov10n_without_topK.onnx --saveEngine=yolov10n_without_topK.engine --fp16 --workspace=1024 --dumpProfile

YOLOv10n (without postprocess, model output shape: 1x8400x84) on a Tesla T4: avg iteration time 1.8752 ms

jameslahm commented 5 months ago

Thanks for the evaluation. It is similar to our benchmark results.

Softwares:

  • CUDA: v11.7
  • TensorRT: v8.6.1
  • cuDNN: v8.6.0
  • PyTorch: v2.2.0

trtexec --onnx=yolov10n_without_topK.onnx --saveEngine=yolov10n_without_topK.engine --fp16 --workspace=1024 --dumpProfile

YOLOv10n (without postprocess, model output shape: 1x8400x84) on a Tesla T4: avg iteration time 1.8752 ms

jameslahm commented 5 months ago

@laugh12321 Thanks. We followed your steps to run the benchmark on our local 2080Ti device: we changed the forward of v10Detect as you did, and obtained the ONNX and TensorRT engines following your steps. The environment is TensorRT 8.5.1.7 and CUDA 11.4. The results are below:

YOLOv8-N:

[05/28/2024-15:53:11] [I] === Performance summary ===
[05/28/2024-15:53:11] [I] Throughput: 940.095 qps
[05/28/2024-15:53:11] [I] Latency: min = 1.85425 ms, max = 5.05402 ms, mean = 2.0092 ms, median = 2.04193 ms, percentile(90%) = 2.07666 ms, percentile(95%) = 2.08423 ms, percentile(99%) = 2.10303 ms
[05/28/2024-15:53:11] [I] Enqueue Time: min = 0.556152 ms, max = 4.2431 ms, mean = 0.611582 ms, median = 0.615051 ms, percentile(90%) = 0.638916 ms, percentile(95%) = 0.644531 ms, percentile(99%) = 0.670776 ms
[05/28/2024-15:53:11] [I] H2D Latency: min = 0.522949 ms, max = 0.92804 ms, mean = 0.576852 ms, median = 0.584961 ms, percentile(90%) = 0.612549 ms, percentile(95%) = 0.616699 ms, percentile(99%) = 0.624512 ms
[05/28/2024-15:53:11] [I] GPU Compute Time: min = 0.94812 ms, max = 3.79083 ms, mean = 1.05916 ms, median = 1.06946 ms, percentile(90%) = 1.09229 ms, percentile(95%) = 1.09766 ms, percentile(99%) = 1.10651 ms
[05/28/2024-15:53:11] [I] D2H Latency: min = 0.341553 ms, max = 0.703735 ms, mean = 0.373191 ms, median = 0.375244 ms, percentile(90%) = 0.400635 ms, percentile(95%) = 0.402222 ms, percentile(99%) = 0.404068 ms
[05/28/2024-15:53:11] [I] Total Host Walltime: 3.00289 s
[05/28/2024-15:53:11] [I] Total GPU Compute Time: 2.99001 s
[05/28/2024-15:53:11] [W] * GPU compute time is unstable, with coefficient of variance = 7.57166%.
[05/28/2024-15:53:11] [W]   If not already in use, locking GPU clock frequency or adding --useSpinWait may improve the stability.
[05/28/2024-15:53:11] [I] Explanations of the performance metrics are printed in the verbose logs.
[05/28/2024-15:53:11] [I] 
&&&& PASSED TensorRT.trtexec [TensorRT v8501] # trtexec --fp16 --avgRuns=1000 --useSpinWait --loadEngine=yolov8n.engine

And YOLOv10-N:

[05/28/2024-15:42:19] [I] === Performance summary ===
[05/28/2024-15:42:19] [I] Throughput: 871.879 qps
[05/28/2024-15:42:19] [I] Latency: min = 1.91602 ms, max = 4.78894 ms, mean = 2.06769 ms, median = 2.08716 ms, percentile(90%) = 2.15527 ms, percentile(95%) = 2.19171 ms, percentile(99%) = 2.42987 ms
[05/28/2024-15:42:19] [I] Enqueue Time: min = 0.756104 ms, max = 4.34082 ms, mean = 0.822439 ms, median = 0.828339 ms, percentile(90%) = 0.861267 ms, percentile(95%) = 0.870361 ms, percentile(99%) = 0.894348 ms
[05/28/2024-15:42:19] [I] H2D Latency: min = 0.523071 ms, max = 0.919312 ms, mean = 0.571011 ms, median = 0.58374 ms, percentile(90%) = 0.609375 ms, percentile(95%) = 0.615051 ms, percentile(99%) = 0.624268 ms
[05/28/2024-15:42:19] [I] GPU Compute Time: min = 1.00964 ms, max = 3.59668 ms, mean = 1.14097 ms, median = 1.13452 ms, percentile(90%) = 1.19946 ms, percentile(95%) = 1.22504 ms, percentile(99%) = 1.47461 ms
[05/28/2024-15:42:19] [I] D2H Latency: min = 0.329224 ms, max = 0.681885 ms, mean = 0.355714 ms, median = 0.354248 ms, percentile(90%) = 0.373779 ms, percentile(95%) = 0.380859 ms, percentile(99%) = 0.389648 ms
[05/28/2024-15:42:19] [I] Total Host Walltime: 2.68156 s
[05/28/2024-15:42:19] [I] Total GPU Compute Time: 2.66759 s
[05/28/2024-15:42:19] [W] * GPU compute time is unstable, with coefficient of variance = 9.68331%.
[05/28/2024-15:42:19] [W]   If not already in use, locking GPU clock frequency or adding --useSpinWait may improve the stability.
[05/28/2024-15:42:19] [I] Explanations of the performance metrics are printed in the verbose logs.
[05/28/2024-15:42:19] [I] 
&&&& PASSED TensorRT.trtexec [TensorRT v8501] # trtexec --fp16 --avgRuns=1000 --useSpinWait --loadEngine=yolov10n.engine

The latency gap is about 0.083ms on 2080Ti.

Thanks. How do you obtain the onnx and TensorRT engine?

Similar to the steps of @wsy-yjys

yolo export model=model.pt format=onnx opset=13 simplify
trtexec --onnx=model.onnx --saveEngine=model.engine --fp16
laugh12321 commented 5 months ago

@jameslahm Thank you for your verification, which suggests that there might be an issue with my device. When I test using trtexec --fp16 --avgRuns=1000 --useSpinWait, the throughput for both YOLOv8 and YOLOv10 is around 500; only when I add --useCudaGraph does the throughput reach around 900. The discrepancy between the two models might be amplified by my device.

YOLOv8-N

[05/28/2024-16:15:12] [I] === Performance summary ===
[05/28/2024-16:15:12] [I] Throughput: 932.761 qps
[05/28/2024-16:15:12] [I] Latency: min = 1.60461 ms, max = 2.67409 ms, mean = 1.67041 ms, median = 1.61734 ms, percentile(90%) = 1.85669 ms, percentile(95%) = 2.05627 ms, percentile(99%) = 2.20776 ms
[05/28/2024-16:15:12] [I] Enqueue Time: min = 0.0493164 ms, max = 0.113892 ms, mean = 0.0547657 ms, median = 0.0541992 ms, percentile(90%) = 0.0567627 ms, percentile(95%) = 0.0582275 ms, percentile(99%) = 0.0695801 ms
[05/28/2024-16:15:12] [I] H2D Latency: min = 0.37793 ms, max = 0.433105 ms, mean = 0.382 ms, median = 0.381317 ms, percentile(90%) = 0.384277 ms, percentile(95%) = 0.386475 ms, percentile(99%) = 0.400604 ms
[05/28/2024-16:15:12] [I] GPU Compute Time: min = 1.00781 ms, max = 2.07545 ms, mean = 1.06991 ms, median = 1.01782 ms, percentile(90%) = 1.25696 ms, percentile(95%) = 1.45612 ms, percentile(99%) = 1.60828 ms
[05/28/2024-16:15:12] [I] D2H Latency: min = 0.217529 ms, max = 0.229248 ms, mean = 0.218498 ms, median = 0.218384 ms, percentile(90%) = 0.219116 ms, percentile(95%) = 0.219482 ms, percentile(99%) = 0.222412 ms
[05/28/2024-16:15:12] [I] Total Host Walltime: 3.00291 s
[05/28/2024-16:15:12] [I] Total GPU Compute Time: 2.99682 s
[05/28/2024-16:15:12] [W] * GPU compute time is unstable, with coefficient of variance = 13.3728%.
[05/28/2024-16:15:12] [W]   If not already in use, locking GPU clock frequency or adding --useSpinWait may improve the stability.
[05/28/2024-16:15:12] [I] Explanations of the performance metrics are printed in the verbose logs.
[05/28/2024-16:15:12] [I]
&&&& PASSED TensorRT.trtexec [TensorRT v100001] # C:\Program Files\NVIDIA GPU Computing Toolkit\TensorRT\v10.0.1.6\bin\trtexec.exe --fp16 --avgRuns=1000 --useSpinWait --loadEngine=yolov8n.engine --useCudaGraph

YOLOv10-N

[05/28/2024-16:14:45] [I] === Performance summary ===
[05/28/2024-16:14:45] [I] Throughput: 845.612 qps
[05/28/2024-16:14:45] [I] Latency: min = 1.7168 ms, max = 2.79471 ms, mean = 1.78192 ms, median = 1.72723 ms, percentile(90%) = 1.96423 ms, percentile(95%) = 2.17529 ms, percentile(99%) = 2.39041 ms
[05/28/2024-16:14:45] [I] Enqueue Time: min = 0.0612793 ms, max = 0.210449 ms, mean = 0.0764301 ms, median = 0.0759277 ms, percentile(90%) = 0.0820923 ms, percentile(95%) = 0.092041 ms, percentile(99%) = 0.098114 ms
[05/28/2024-16:14:45] [I] H2D Latency: min = 0.378845 ms, max = 0.439453 ms, mean = 0.383169 ms, median = 0.38208 ms, percentile(90%) = 0.385254 ms, percentile(95%) = 0.389038 ms, percentile(99%) = 0.404114 ms
[05/28/2024-16:14:45] [I] GPU Compute Time: min = 1.11743 ms, max = 2.19418 ms, mean = 1.18026 ms, median = 1.12646 ms, percentile(90%) = 1.35986 ms, percentile(95%) = 1.57471 ms, percentile(99%) = 1.78943 ms
[05/28/2024-16:14:45] [I] D2H Latency: min = 0.217285 ms, max = 0.241455 ms, mean = 0.218489 ms, median = 0.218262 ms, percentile(90%) = 0.218994 ms, percentile(95%) = 0.219482 ms, percentile(99%) = 0.22522 ms
[05/28/2024-16:14:45] [I] Total Host Walltime: 3.00374 s
[05/28/2024-16:14:45] [I] Total GPU Compute Time: 2.99786 s
[05/28/2024-16:14:45] [W] * GPU compute time is unstable, with coefficient of variance = 12.3332%.
[05/28/2024-16:14:45] [W]   If not already in use, locking GPU clock frequency or adding --useSpinWait may improve the stability.
[05/28/2024-16:14:45] [I] Explanations of the performance metrics are printed in the verbose logs.
[05/28/2024-16:14:45] [I]
&&&& PASSED TensorRT.trtexec [TensorRT v100001] # C:\Program Files\NVIDIA GPU Computing Toolkit\TensorRT\v10.0.1.6\bin\trtexec.exe --fp16 --avgRuns=1000 --useSpinWait --loadEngine=yolov10n.engine --useCudaGraph
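Across all the summaries in this thread, the spread between median and p99 matters as much as the mean. A minimal Python sketch that reports the same style of statistics for any callable (the `infer` argument is a hypothetical stand-in for `session.run` or an engine execution; this is not how trtexec itself measures):

```python
import time
import numpy as np

def benchmark(infer, warmup=50, runs=1000):
    """Time `infer()` and report trtexec-style latency statistics in ms."""
    for _ in range(warmup):          # let clocks and caches settle first
        infer()
    lat = np.empty(runs)
    for i in range(runs):
        t0 = time.perf_counter()
        infer()
        lat[i] = (time.perf_counter() - t0) * 1e3
    return {
        "min": lat.min(), "max": lat.max(),
        "mean": lat.mean(), "median": np.median(lat),
        "p90": np.percentile(lat, 90),
        "p95": np.percentile(lat, 95),
        "p99": np.percentile(lat, 99),
    }

# Dummy workload; replace with e.g.
# lambda: session.run(None, {"images": input_data})
stats = benchmark(lambda: sum(range(1000)), warmup=10, runs=200)
print({k: round(v, 4) for k, v in stats.items()})
```

Unlike the FPS loop in the first post, this exposes tail latency, which is where the two models' logs above differ most.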
zhangyunming commented 5 months ago

Softwares:

  • CUDA: v11.7
  • TensorRT: v8.6.1
  • cuDNN: v8.6.0
  • PyTorch: v2.2.0

trtexec --onnx=yolov10n_without_topK.onnx --saveEngine=yolov10n_without_topK.engine --fp16 --workspace=1024 --dumpProfile

YOLOv10n (without postprocess, model output shape: 1x8400x84) on a Tesla T4: avg iteration time 1.8752 ms

But YOLOv6-3.0, on the same T4 card with TensorRT FP16, reaches 779 qps, i.e. 1000/779 = 1.283 ms per iteration,

which is faster than v10n's 1.8752 ms.

jameslahm commented 5 months ago

Thanks. It may be caused by different ways of measuring the latency. For example, the input may be a real image or random input, and the trtexec flags may also differ. For the benchmark details, please refer to our paper or RT-DETR.

Softwares:

  • CUDA: v11.7
  • TensorRT: v8.6.1
  • cuDNN: v8.6.0
  • PyTorch: v2.2.0

trtexec --onnx=yolov10n_without_topK.onnx --saveEngine=yolov10n_without_topK.engine --fp16 --workspace=1024 --dumpProfile

YOLOv10n (without postprocess, model output shape: 1x8400x84) on a Tesla T4: avg iteration time 1.8752 ms

but YOLOv6-3.0, on the same T4 card with TensorRT FP16, reaches 779 qps, i.e. 1000/779 = 1.283 ms per iteration,

which is faster than v10n's 1.8752 ms
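As a side note on the arithmetic above: trtexec throughput (qps) converts to mean time per query as 1000/qps, though throughput folds in host-side pipelining, so this only approximates single-inference latency. A tiny helper (hypothetical name) for the conversion:

```python
def qps_to_ms(qps: float) -> float:
    """Mean time per query in milliseconds for a given throughput (qps)."""
    return 1000.0 / qps

# The YOLOv6-3.0 figure quoted above: 779 qps is roughly 1.28 ms per iteration.
print(f"{qps_to_ms(779):.3f} ms")
```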

jameslahm commented 5 months ago

Please feel free to reopen this issue if you have further questions.

jasonlytehouse commented 4 months ago

I've seen in https://github.com/ultralytics/ultralytics/issues/13825 that Python 3.12 performance with v10 is reportedly much better than 3.11 and below; I'm still verifying whether this is actually true.

Rbrq03 commented 2 months ago

@jasonlytehouse Do you have any result on this verification?

jasonlytehouse commented 2 months ago

Work took over and I had to move on from this, so unfortunately no. Feel free to test for yourself.