NVIDIA / TensorRT

NVIDIA® TensorRT™ is an SDK for high-performance deep learning inference on NVIDIA GPUs. This repository contains the open source components of TensorRT.
https://developer.nvidia.com/tensorrt
Apache License 2.0

Inference slower on A40 than A30 #4042

Open decadance-dance opened 2 months ago

decadance-dance commented 2 months ago

Description

I am moving from an A30 to an A40, so I needed to rebuild my ONNX model for the A40. I rebuilt it with the same trtexec version, the same command, and the same model, via the same Docker image as on the A30. The image: nvcr.io/nvidia/tensorrt:24.06-py3. The command:

trtexec --onnx=model.onnx \
        --maxShapes=input:4x3x1024x1024 \
        --minShapes=input:1x3x1024x1024 \
        --optShapes=input:2x3x1024x1024 \
        --fp16 \
        --saveEngine=model.plan

I benchmarked my models on both GPUs using Triton Inference Server 2.47.0 and got the following results. A30:

Inferences/Second vs. Client p95 Batch Latency
Concurrency: 1, throughput: 15.4337 infer/sec, latency 74841 usec
Concurrency: 2, throughput: 30.3723 infer/sec, latency 77563 usec
Concurrency: 3, throughput: 35.0317 infer/sec, latency 94443 usec
Concurrency: 4, throughput: 37.0215 infer/sec, latency 132680 usec

A40:

Inferences/Second vs. Client p95 Batch Latency
Concurrency: 1, throughput: 12.2598 infer/sec, latency 88722 usec
Concurrency: 2, throughput: 24.3012 infer/sec, latency 82894 usec
Concurrency: 3, throughput: 28.9309 infer/sec, latency 104551 usec
Concurrency: 4, throughput: 30.2839 infer/sec, latency 160710 usec
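
To put the gap in numbers, here is a quick script over the throughput columns of the two tables (values copied verbatim from the benchmark output):

```python
# Throughput (infer/sec) at concurrency 1-4, from the Triton benchmarks above.
a30 = [15.4337, 30.3723, 35.0317, 37.0215]
a40 = [12.2598, 24.3012, 28.9309, 30.2839]

for conc, (t30, t40) in enumerate(zip(a30, a40), start=1):
    slowdown = (1 - t40 / t30) * 100
    print(f"concurrency {conc}: A40 is {slowdown:.1f}% slower")
```

So the A40 engine is consistently about 17-21% slower than the A30 one across all concurrency levels.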

Environment

TensorRT Version: 10.1.0.27

NVIDIA GPU: A40

NVIDIA Driver Version: 555.58.02

CUDA Version: 12.1

Operating System: Ubuntu 22.04

lix19937 commented 2 months ago

Try to add --builderOptimizationLevel=5.
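
Applied to the command from the original post, that would look like this (a sketch; same model and paths assumed):

```shell
trtexec --onnx=model.onnx \
        --minShapes=input:1x3x1024x1024 \
        --optShapes=input:2x3x1024x1024 \
        --maxShapes=input:4x3x1024x1024 \
        --fp16 \
        --builderOptimizationLevel=5 \
        --saveEngine=model.plan
```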

decadance-dance commented 2 months ago

@lix19937 I added this flag but got:

[08/01/2024-08:23:19] [W] [TRT] UNSUPPORTED_STATE: Skipping tactic 0 due to insufficient memory on requested size of 68727865856 detected for tactic 0x0000000000000018.
[08/01/2024-08:23:20] [W] [TRT] Tactic Device request: 65544MB Available: 45525MB. Device memory is insufficient to use tactic.
[08/01/2024-08:23:20] [W] [TRT] UNSUPPORTED_STATE: Skipping tactic 1 due to insufficient memory on requested size of 68727865856 detected for tactic 0x0000000000000019.
[08/01/2024-08:23:20] [W] [TRT] Tactic Device request: 65544MB Available: 45525MB. Device memory is insufficient to use tactic.
[08/01/2024-08:23:20] [W] [TRT] UNSUPPORTED_STATE: Skipping tactic 2 due to insufficient memory on requested size of 68727865856 detected for tactic 0x000000000000001a.
[08/01/2024-08:23:20] [W] [TRT] Tactic Device request: 65544MB Available: 45525MB. Device memory is insufficient to use tactic.
[08/01/2024-08:23:20] [W] [TRT] UNSUPPORTED_STATE: Skipping tactic 3 due to insufficient memory on requested size of 68727865856 detected for tactic 0x000000000000001b.
[08/01/2024-08:23:20] [W] [TRT] Tactic Device request: 65544MB Available: 45525MB. Device memory is insufficient to use tactic.
[08/01/2024-08:23:20] [W] [TRT] UNSUPPORTED_STATE: Skipping tactic 4 due to insufficient memory on requested size of 68727865856 detected for tactic 0x000000000000001f.

Why is 45 GB of VRAM insufficient?

decadance-dance commented 2 months ago

@lix19937 Despite the GPU memory issues, I rebuilt the model with --builderOptimizationLevel=5, but got results quite close to the previous ones:

Inferences/Second vs. Client p95 Batch Latency
Concurrency: 1, throughput: 12.5453 infer/sec, latency 81882 usec
Concurrency: 2, throughput: 24.5096 infer/sec, latency 95886 usec
Concurrency: 3, throughput: 28.1778 infer/sec, latency 109527 usec
Concurrency: 4, throughput: 29.4522 infer/sec, latency 168369 usec

So I think either the flag had no effect at all, or the skipped tactics affected the result.

lix19937 commented 2 months ago

Why is 45 GB of VRAM insufficient?

Yes. You can try increasing the workspace size.
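
In TensorRT 10's trtexec, the workspace memory pool is controlled with --memPoolSize. A hedged sketch (the 40960 MiB value is just an example sized below the A40's ~46 GB of VRAM; note that tactics requesting more than physical device memory, like the 64 GiB requests in the log above, will be skipped regardless):

```shell
trtexec --onnx=model.onnx \
        --minShapes=input:1x3x1024x1024 \
        --optShapes=input:2x3x1024x1024 \
        --maxShapes=input:4x3x1024x1024 \
        --fp16 \
        --builderOptimizationLevel=5 \
        --memPoolSize=workspace:40960M \
        --saveEngine=model.plan
```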

lix19937 commented 2 months ago

BTW, since the A30 and A40 are different hardware, you should keep the clock frequencies stable, compare the power limits, and use the Nsight Systems tools to profile resource utilization.
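
A sketch of those checks with standard NVIDIA tooling (the clock values are examples: the A40's boost clock is 1740 MHz, the A30's is 1440 MHz; adjust per card):

```shell
# Lock GPU clocks to a fixed frequency so runs are comparable (root required).
sudo nvidia-smi --lock-gpu-clocks=1740,1740   # on the A40; use 1440 on the A30

# Compare the enforced power limits on the two cards.
nvidia-smi --query-gpu=name,power.limit --format=csv

# Profile an inference run with Nsight Systems to inspect kernel timings
# and GPU utilization.
nsys profile -o trt_profile trtexec --loadEngine=model.plan \
    --shapes=input:2x3x1024x1024

# Restore default clock behavior afterwards.
sudo nvidia-smi --reset-gpu-clocks
```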