NVIDIA / TensorRT

NVIDIA® TensorRT™ is an SDK for high-performance deep learning inference on NVIDIA GPUs. This repository contains the open source components of TensorRT.
https://developer.nvidia.com/tensorrt
Apache License 2.0

slower inference speed of TensorRT 10.0 on GPU Tesla T4 #3896

Closed. HSDai closed this issue 2 days ago.

HSDai commented 2 months ago

Description

I converted NAFNet from ONNX to TensorRT on a Tesla T4 with TensorRT 10.0. However, the inference speed is much slower than that of the engine built with TensorRT 8.6.

TensorRT 10.0:
[05/24/2024-14:43:21] [I] === Trace details ===
[05/24/2024-14:43:21] [I] Trace averages of 10 runs:
[05/24/2024-14:43:21] [I] Average on 10 runs - GPU latency: 539.803 ms - Host latency: 546.901 ms (enqueue 4.92217 ms)
[05/24/2024-14:43:21] [I]
[05/24/2024-14:43:21] [I] === Performance summary ===
[05/24/2024-14:43:21] [I] Throughput: 1.64966 qps
[05/24/2024-14:43:21] [I] Latency: min = 542.295 ms, max = 550.235 ms, mean = 546.901 ms, median = 546.891 ms, percentile(90%) = 550.032 ms, percentile(95%) = 550.235 ms, percentile(99%) = 550.235 ms
[05/24/2024-14:43:21] [I] Enqueue Time: min = 3.92992 ms, max = 5.48389 ms, mean = 4.92217 ms, median = 5.14417 ms, percentile(90%) = 5.33893 ms, percentile(95%) = 5.48389 ms, percentile(99%) = 5.48389 ms
[05/24/2024-14:43:21] [I] H2D Latency: min = 3.60913 ms, max = 4.47997 ms, mean = 3.70715 ms, median = 3.62408 ms, percentile(90%) = 3.63037 ms, percentile(95%) = 4.47997 ms, percentile(99%) = 4.47997 ms
[05/24/2024-14:43:21] [I] GPU Compute Time: min = 535.282 ms, max = 543.216 ms, mean = 539.803 ms, median = 539.882 ms, percentile(90%) = 543.027 ms, percentile(95%) = 543.216 ms, percentile(99%) = 543.216 ms
[05/24/2024-14:43:21] [I] D2H Latency: min = 3.38086 ms, max = 3.40747 ms, mean = 3.3907 ms, median = 3.38916 ms, percentile(90%) = 3.39551 ms, percentile(95%) = 3.40747 ms, percentile(99%) = 3.40747 ms
[05/24/2024-14:43:21] [I] Total Host Walltime: 6.06185 s
[05/24/2024-14:43:21] [I] Total GPU Compute Time: 5.39803 s
[05/24/2024-14:43:21] [I] Explanations of the performance metrics are printed in the verbose logs.
[05/24/2024-14:43:21] [I] &&&& PASSED TensorRT.trtexec [TensorRT v100001] # ./trtexec --loadEngine=nafnetcc75_t4_float32_v10.trtmodel --shapes=input:1x1920x1920x3 --device=3

TensorRT 8.6:
[05/24/2024-14:44:43] [I] === Trace details ===
[05/24/2024-14:44:43] [I] Trace averages of 10 runs:
[05/24/2024-14:44:43] [I] Average on 10 runs - GPU latency: 143.531 ms - Host latency: 150.62 ms (enqueue 4.77478 ms)
[05/24/2024-14:44:43] [I] Average on 10 runs - GPU latency: 141.829 ms - Host latency: 148.839 ms (enqueue 5.34015 ms)
[05/24/2024-14:44:43] [I]
[05/24/2024-14:44:43] [I] === Performance summary ===
[05/24/2024-14:44:43] [I] Throughput: 6.59775 qps
[05/24/2024-14:44:43] [I] Latency: min = 147.611 ms, max = 165.985 ms, mean = 149.754 ms, median = 148.669 ms, percentile(90%) = 151.169 ms, percentile(95%) = 151.494 ms, percentile(99%) = 165.985 ms
[05/24/2024-14:44:43] [I] Enqueue Time: min = 2.2744 ms, max = 5.82202 ms, mean = 5.09928 ms, median = 5.2124 ms, percentile(90%) = 5.76062 ms, percentile(95%) = 5.77234 ms, percentile(99%) = 5.82202 ms
[05/24/2024-14:44:43] [I] H2D Latency: min = 3.60007 ms, max = 4.53885 ms, mean = 3.65205 ms, median = 3.61035 ms, percentile(90%) = 3.63367 ms, percentile(95%) = 3.63477 ms, percentile(99%) = 4.53885 ms
[05/24/2024-14:44:43] [I] GPU Compute Time: min = 140.629 ms, max = 158.058 ms, mean = 142.711 ms, median = 141.668 ms, percentile(90%) = 144.174 ms, percentile(95%) = 144.487 ms, percentile(99%) = 158.058 ms
[05/24/2024-14:44:43] [I] D2H Latency: min = 3.38074 ms, max = 3.40759 ms, mean = 3.3908 ms, median = 3.38867 ms, percentile(90%) = 3.40186 ms, percentile(95%) = 3.40405 ms, percentile(99%) = 3.40759 ms
[05/24/2024-14:44:43] [I] Total Host Walltime: 3.48604 s
[05/24/2024-14:44:43] [I] Total GPU Compute Time: 3.28235 s
[05/24/2024-14:44:43] [W] * GPU compute time is unstable, with coefficient of variance = 2.41332%.
[05/24/2024-14:44:43] [W] If not already in use, locking GPU clock frequency or adding --useSpinWait may improve the stability.
[05/24/2024-14:44:43] [I] Explanations of the performance metrics are printed in the verbose logs.
[05/24/2024-14:44:43] [I] &&&& PASSED TensorRT.trtexec [TensorRT v8601] # ./trtexec --loadEngine=nafnetcc75_t4_float32_v86.trtmodel --shapes=input:1x1920x1920x3 --device=3

Detailed logs: trt10.log, trt8.6.log

Environment


TensorRT Version: 10.0

NVIDIA GPU: Tesla T4

NVIDIA Driver Version:

CUDA Version:

CUDNN Version:

Operating System:

Python Version (if applicable):

Tensorflow Version (if applicable):

PyTorch Version (if applicable):

Baremetal or Container (if so, version):

Relevant Files

Model link: onnx.zip

trt10.zip

trt86.zip

Steps To Reproduce

./trtexec --onnx=color_consistency_nafnet.onnx --saveEngine=nafnetcc75_t4_float32_v10.trtmodel --inputIOFormats=fp32:chw --outputIOFormats=fp32:chw --device=3 --minShapes=input:1x64x64x3 --optShapes=input:1x1024x1024x3 --maxShapes=input:1x1920x1920x3

./trtexec --loadEngine=nafnetcc75_t4_float32_v10.trtmodel --shapes=input:1x1920x1920x3 --device=3

Commands or scripts:

Have you tried the latest release?:

Can this model run on other frameworks? For example run ONNX model with ONNXRuntime (polygraphy run <model.onnx> --onnxrt):

lix19937 commented 1 month ago

You can compare the per-layer time profiles and fusion tactics of the two engines.
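For example, one way to get the fused-layer information out of each engine is the engine inspector in the TensorRT Python API. A minimal sketch (engine file names are taken from the commands in this issue; run it once per engine under the TensorRT version that built it, since engines are not portable across major versions, and the most detail is reported when the engine was built with --profilingVerbosity=detailed):

```python
# Dump the layer/fusion information of a built engine as JSON so that the
# TRT 8.6 and TRT 10.0 engines can be diffed. Usage (hypothetical script name):
#   python dump_layer_info.py nafnetcc75_t4_float32_v10.trtmodel layers_v10.json
import sys
import tensorrt as trt

engine_path, out_path = sys.argv[1], sys.argv[2]

logger = trt.Logger(trt.Logger.WARNING)
with open(engine_path, "rb") as f:
    engine = trt.Runtime(logger).deserialize_cuda_engine(f.read())

# The engine inspector reports each (possibly fused) layer in the final plan.
inspector = engine.create_engine_inspector()
layer_json = inspector.get_engine_information(trt.LayerInformationFormat.JSON)

with open(out_path, "w") as f:
    f.write(layer_json)
print(f"Wrote layer/fusion info for {engine_path} to {out_path}")
```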

HSDai commented 1 month ago

You can compare the per-layer time profiles and fusion tactics of the two engines.

I can't get a performance profile because there are errors when I execute trtexec with --dumpProfile. Logs: dumpProfile.log, without dumpProfile.log

But it works fine when the engine is built with TensorRT 8.6: dumpProfile_v86.log

Could this be related to the slower inference speed? How can I find out the reason? Thank you very much.
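As an alternative to trtexec --dumpProfile, per-layer timings can also be collected through the TensorRT Python API with an IProfiler. A minimal sketch, assuming the TRT 10.0 engine and the 1x1920x1920x3 fp32 input from the commands above, with the cuda-python package used for buffer allocation (buffers are left uninitialized, which is fine when only timings are of interest):

```python
import numpy as np
import tensorrt as trt
from cuda import cudart  # pip install cuda-python

class LayerTimer(trt.IProfiler):
    """Accumulates the reported execution time of each layer."""
    def __init__(self):
        super().__init__()
        self.times = {}
    def report_layer_time(self, layer_name, ms):
        self.times[layer_name] = self.times.get(layer_name, 0.0) + ms

logger = trt.Logger(trt.Logger.WARNING)
with open("nafnetcc75_t4_float32_v10.trtmodel", "rb") as f:
    engine = trt.Runtime(logger).deserialize_cuda_engine(f.read())

context = engine.create_execution_context()
context.set_input_shape("input", (1, 1920, 1920, 3))

profiler = LayerTimer()
context.profiler = profiler

# Allocate a device buffer for every I/O tensor (all fp32 here, per the
# --inputIOFormats/--outputIOFormats used when building the engine).
for i in range(engine.num_io_tensors):
    name = engine.get_tensor_name(i)
    nbytes = trt.volume(context.get_tensor_shape(name)) * np.dtype(np.float32).itemsize
    _, ptr = cudart.cudaMalloc(nbytes)
    context.set_tensor_address(name, ptr)

# Run once on the default stream; with a profiler attached, TensorRT reports
# the per-layer times once the inference completes.
context.execute_async_v3(0)
cudart.cudaDeviceSynchronize()

for layer, ms in sorted(profiler.times.items(), key=lambda kv: -kv[1]):
    print(f"{ms:9.3f} ms  {layer}")
```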

zerollzeng commented 1 month ago

Thanks, I can repro the issue and filed internal bug 4672320 to track this.

zerollzeng commented 1 month ago

You can try adding --builderOptimizationLevel=5 to work around (WAR) this; we are still working on the real fix.

HSDai commented 1 month ago

You can try adding --builderOptimizationLevel=5 to work around (WAR) this; we are still working on the real fix.

Thank you, that's helpful!

geraldstanje commented 1 month ago

Hi, is there a profiler you can run for Triton Inference Server?

geraldstanje1 commented 1 month ago

https://aws.amazon.com/blogs/machine-learning/host-ml-models-on-amazon-sagemaker-using-triton-onnx-models/

HSDai commented 1 month ago

Hi, is there a profiler you can run for Triton Inference Server?

No, I haven't used Triton Inference Server before.

nvpohanh commented 1 month ago

We are actively investigating this issue. Meanwhile, you can work around this regression by setting the optimization level to 5 in the builder config or by adding the --builderOptimizationLevel=5 flag to the trtexec command. Thanks
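For the builder-config route, a minimal sketch with the TensorRT Python API (the ONNX file name and input shapes are taken from the reproduction commands above; the output engine name is just an example):

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(0)  # explicit batch is the default in TRT 10

# Parse the ONNX model from this issue.
parser = trt.OnnxParser(network, logger)
with open("color_consistency_nafnet.onnx", "rb") as f:
    if not parser.parse(f.read()):
        for i in range(parser.num_errors):
            print(parser.get_error(i))
        raise SystemExit("ONNX parse failed")

config = builder.create_builder_config()
config.builder_optimization_level = 5  # the workaround suggested above

# Same dynamic-shape profile as the trtexec --minShapes/--optShapes/--maxShapes flags.
profile = builder.create_optimization_profile()
profile.set_shape("input", (1, 64, 64, 3), (1, 1024, 1024, 3), (1, 1920, 1920, 3))
config.add_optimization_profile(profile)

engine_bytes = builder.build_serialized_network(network, config)
with open("nafnetcc75_t4_float32_v10_opt5.trtmodel", "wb") as f:
    f.write(engine_bytes)
```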

zerollzeng commented 2 days ago

Fixed in TRT 10.3, closed.