NVIDIA / TensorRT

NVIDIA® TensorRT™ is an SDK for high-performance deep learning inference on NVIDIA GPUs. This repository contains the open source components of TensorRT.
https://developer.nvidia.com/tensorrt
Apache License 2.0

TensorRT FP32 inference slower than PyTorch on Tesla T4 for GroundingDINO #3611

Closed shuchang0714 closed 2 months ago

shuchang0714 commented 6 months ago

Description

I converted GroundingDINO from PyTorch to TensorRT on an A100, which sped up inference by 50%. However, when I deploy the same model on a T4 and rebuild the engine there, TensorRT FP32 inference is slower than PyTorch.

Environment

TensorRT Version: 8.6.1.6

NVIDIA GPU: Tesla T4

NVIDIA Driver Version: 535.129.03

CUDA Version: 11.7

CUDNN Version: 8.6

Operating System:

Python Version (if applicable):

Tensorflow Version (if applicable):

PyTorch Version (if applicable): 1.8

Baremetal or Container (if so, version):

Relevant Files

Inference time for PyTorch is around 650 ms.

The engine build log is as follows:

[01/18/2024-06:58:22] [I] === Model Options ===

[01/18/2024-06:58:22] [I] Format: ONNX

[01/18/2024-06:58:22] [I] Model: /workspace/groundingdino.onnx

[01/18/2024-06:58:22] [I] Output:

[01/18/2024-06:58:22] [I] === Build Options ===

[01/18/2024-06:58:22] [I] Max batch: explicit batch

[01/18/2024-06:58:22] [I] Memory Pools: workspace: default, dlaSRAM: default, dlaLocalDRAM: default, dlaGlobalDRAM: default

[01/18/2024-06:58:22] [I] minTiming: 1

[01/18/2024-06:58:22] [I] avgTiming: 8

[01/18/2024-06:58:22] [I] Precision: FP32

[01/18/2024-06:58:22] [I] LayerPrecisions:

[01/18/2024-06:58:22] [I] Layer Device Types:

[01/18/2024-06:58:22] [I] Calibration:

[01/18/2024-06:58:22] [I] Refit: Disabled

[01/18/2024-06:58:22] [I] Version Compatible: Disabled

[01/18/2024-06:58:22] [I] TensorRT runtime: full

[01/18/2024-06:58:22] [I] Lean DLL Path:

[01/18/2024-06:58:22] [I] Tempfile Controls: { in_memory: allow, temporary: allow }

[01/18/2024-06:58:22] [I] Exclude Lean Runtime: Disabled

[01/18/2024-06:58:22] [I] Sparsity: Disabled

[01/18/2024-06:58:22] [I] Safe mode: Disabled

[01/18/2024-06:58:22] [I] Build DLA standalone loadable: Disabled

[01/18/2024-06:58:22] [I] Allow GPU fallback for DLA: Disabled

[01/18/2024-06:58:22] [I] DirectIO mode: Disabled

[01/18/2024-06:58:22] [I] Restricted mode: Disabled

[01/18/2024-06:58:22] [I] Skip inference: Disabled

[01/18/2024-06:58:22] [I] Save engine: /workspace/groundingdino.trt

[01/18/2024-06:58:22] [I] Load engine:

[01/18/2024-06:58:22] [I] Profiling verbosity: 0

[01/18/2024-06:58:22] [I] Tactic sources: Using default tactic sources

[01/18/2024-06:58:22] [I] timingCacheMode: local

[01/18/2024-06:58:22] [I] timingCacheFile:

[01/18/2024-06:58:22] [I] Heuristic: Disabled

[01/18/2024-06:58:22] [I] Preview Features: Use default preview flags.

[01/18/2024-06:58:22] [I] MaxAuxStreams: -1

[01/18/2024-06:58:22] [I] BuilderOptimizationLevel: -1

[01/18/2024-06:58:22] [I] Input(s)s format: fp32:CHW

[01/18/2024-06:58:22] [I] Output(s)s format: fp32:CHW

[01/18/2024-06:58:22] [I] Input build shape: bert_output=1x1x768+1x6x768+1x256x768

[01/18/2024-06:58:22] [I] Input build shape: img=1x3x800x1200+1x3x800x1200+1x3x800x1200

[01/18/2024-06:58:22] [I] Input build shape: attention_mask=1x1+1x6+1x256

[01/18/2024-06:58:22] [I] Input build shape: position_ids=1x1+1x6+1x256

[01/18/2024-06:58:22] [I] Input build shape: object_mask=1x256+1x256+1x256

[01/18/2024-06:58:22] [I] Input build shape: text_token_mask=1x1x1+1x6x6+1x256x256

[01/18/2024-06:58:22] [I] Input calibration shapes: model

[01/18/2024-06:58:22] [I] === System Options ===

[01/18/2024-06:58:22] [I] Device: 0

[01/18/2024-06:58:22] [I] DLACore:

[01/18/2024-06:58:22] [I] Plugins:

[01/18/2024-06:58:22] [I] setPluginsToSerialize:

[01/18/2024-06:58:22] [I] dynamicPlugins:

[01/18/2024-06:58:22] [I] ignoreParsedPluginLibs: 0

[01/18/2024-06:58:22] [I]

[01/18/2024-06:58:22] [I] === Inference Options ===

[01/18/2024-06:58:22] [I] Batch: Explicit

[01/18/2024-06:58:22] [I] Input inference shape: text_token_mask=1x6x6

[01/18/2024-06:58:22] [I] Input inference shape: object_mask=1x256

[01/18/2024-06:58:22] [I] Input inference shape: position_ids=1x6

[01/18/2024-06:58:22] [I] Input inference shape: attention_mask=1x6

[01/18/2024-06:58:22] [I] Input inference shape: bert_output=1x6x768

[01/18/2024-06:58:22] [I] Input inference shape: img=1x3x800x1200

[01/18/2024-06:58:22] [I] Iterations: 10

[01/18/2024-06:58:22] [I] Duration: 3s (+ 200ms warm up)

[01/18/2024-06:58:22] [I] Sleep time: 0ms

[01/18/2024-06:58:22] [I] Idle time: 0ms

[01/18/2024-06:58:22] [I] Inference Streams: 1

[01/18/2024-06:58:22] [I] ExposeDMA: Disabled

[01/18/2024-06:58:22] [I] Data transfers: Enabled

[01/18/2024-06:58:22] [I] Spin-wait: Disabled

[01/18/2024-06:58:22] [I] Multithreading: Disabled

[01/18/2024-06:58:22] [I] CUDA Graph: Disabled

[01/18/2024-06:58:22] [I] Separate profiling: Disabled

[01/18/2024-06:58:22] [I] Time Deserialize: Disabled

[01/18/2024-06:58:22] [I] Time Refit: Disabled

[01/18/2024-06:58:22] [I] NVTX verbosity: 0

[01/18/2024-06:58:22] [I] Persistent Cache Ratio: 0

[01/18/2024-06:58:22] [I] Inputs:

[01/18/2024-06:58:22] [I] === Reporting Options ===

[01/18/2024-06:58:22] [I] Verbose: Disabled

[01/18/2024-06:58:22] [I] Averages: 10 inferences

[01/18/2024-06:58:22] [I] Percentiles: 90,95,99

[01/18/2024-06:58:22] [I] Dump refittable layers:Disabled

[01/18/2024-06:58:22] [I] Dump output: Disabled

[01/18/2024-06:58:22] [I] Profile: Disabled

[01/18/2024-06:58:22] [I] Export timing to JSON file:

[01/18/2024-06:58:22] [I] Export output to JSON file:

[01/18/2024-06:58:22] [I] Export profile to JSON file:

[01/18/2024-06:58:22] [I]

[01/18/2024-06:58:24] [I] === Device Information ===

[01/18/2024-06:58:24] [I] Selected Device: Tesla T4

[01/18/2024-06:58:24] [I] Compute Capability: 7.5

[01/18/2024-06:58:24] [I] SMs: 40

[01/18/2024-06:58:24] [I] Device Global Memory: 14930 MiB

[01/18/2024-06:58:24] [I] Shared Memory per SM: 64 KiB

[01/18/2024-06:58:24] [I] Memory Bus Width: 256 bits (ECC enabled)

[01/18/2024-06:58:24] [I] Application Compute Clock Rate: 1.59 GHz

[01/18/2024-06:58:24] [I] Application Memory Clock Rate: 5.001 GHz

[01/18/2024-06:58:24] [I]

[01/18/2024-06:58:24] [I] Note: The application clock rates do not reflect the actual clock rates that the GPU is currently running at.

[01/18/2024-06:58:24] [I]

[01/18/2024-06:58:24] [I] TensorRT version: 8.6.1

[01/18/2024-06:58:24] [I] Loading standard plugins

[01/18/2024-06:58:25] [I] [TRT] [MemUsageChange] Init CUDA: CPU +14, GPU +0, now: CPU 19, GPU 105 (MiB)

[01/18/2024-06:58:32] [I] [TRT] [MemUsageChange] Init builder kernel library: CPU +896, GPU +174, now: CPU 991, GPU 279 (MiB)

[01/18/2024-06:58:32] [I] Start parsing network model.

[01/18/2024-06:58:32] [I] [TRT] ----------------------------------------------------------------

[01/18/2024-06:58:32] [I] [TRT] Input filename: /workspace/groundingdino.onnx

[01/18/2024-06:58:32] [I] [TRT] ONNX IR version: 0.0.8

[01/18/2024-06:58:32] [I] [TRT] Opset version: 16

[01/18/2024-06:58:32] [I] [TRT] Producer name: pytorch

[01/18/2024-06:58:32] [I] [TRT] Producer version: 1.13.1

[01/18/2024-06:58:32] [I] [TRT] Domain:

[01/18/2024-06:58:32] [I] [TRT] Model version: 0

[01/18/2024-06:58:32] [I] [TRT] Doc string:

[01/18/2024-06:58:32] [I] [TRT] ----------------------------------------------------------------

[01/18/2024-06:58:33] [W] [TRT] onnx2trt_utils.cpp:374: Your ONNX model has been generated with INT64 weights, while TensorRT does not natively support INT64. Attempting to cast down to INT32.

[01/18/2024-06:58:33] [W] [TRT] onnx2trt_utils.cpp:400: One or more weights outside the range of INT32 was clamped

[01/18/2024-06:58:35] [I] Finished parsing network model. Parse time: 3.2296

[01/18/2024-06:58:35] [I] [TRT] BuilderFlag::kTF32 is set but hardware does not support TF32. Disabling TF32.

[01/18/2024-06:58:40] [I] [TRT] Graph optimization time: 3.76623 seconds.

[01/18/2024-06:58:40] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 2470, GPU 559 (MiB)

[01/18/2024-06:58:40] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +0, GPU +10, now: CPU 2470, GPU 569 (MiB)

[01/18/2024-06:58:40] [W] [TRT] TensorRT was linked against cuDNN 8.9.0 but loaded cuDNN 8.6.0

[01/18/2024-06:58:40] [I] [TRT] BuilderFlag::kTF32 is set but hardware does not support TF32. Disabling TF32.

[01/18/2024-06:58:40] [I] [TRT] Local timing cache in use. Profiling results in this builder pass will not be stored.

[01/18/2024-07:02:38] [I] [TRT] Detected 6 inputs and 2 output network tensors.

[01/18/2024-07:02:42] [I] [TRT] Total Host Persistent Memory: 43424

[01/18/2024-07:02:42] [I] [TRT] Total Device Persistent Memory: 475648

[01/18/2024-07:02:42] [I] [TRT] Total Scratch Memory: 780636672

[01/18/2024-07:02:42] [I] [TRT] [MemUsageStats] Peak memory usage of TRT CPU/GPU memory allocators: CPU 50 MiB, GPU 1109 MiB

[01/18/2024-07:02:42] [I] [TRT] [BlockAssignment] Started assigning block shifts. This will take 343 steps to complete.

[01/18/2024-07:02:43] [I] [TRT] [BlockAssignment] Algorithm ShiftNTopDown took 246.787ms to assign 54 blocks to 343 nodes requiring 942283264 bytes.

[01/18/2024-07:02:43] [I] [TRT] Total Activation Memory: 942278144

[01/18/2024-07:02:43] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 3598, GPU 1187 (MiB)

[01/18/2024-07:02:43] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +0, GPU +8, now: CPU 3598, GPU 1195 (MiB)

[01/18/2024-07:02:43] [W] [TRT] TensorRT was linked against cuDNN 8.9.0 but loaded cuDNN 8.6.0

[01/18/2024-07:02:44] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in building engine: CPU +0, GPU +502, now: CPU 0, GPU 502 (MiB)

[01/18/2024-07:02:44] [I] Engine built in 259.893 sec.

[01/18/2024-07:02:45] [I] [TRT] Loaded engine size: 513 MiB

[01/18/2024-07:02:45] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 2163, GPU 935 (MiB)

[01/18/2024-07:02:45] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +1, GPU +8, now: CPU 2164, GPU 943 (MiB)

[01/18/2024-07:02:45] [W] [TRT] TensorRT was linked against cuDNN 8.9.0 but loaded cuDNN 8.6.0

[01/18/2024-07:02:45] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +501, now: CPU 0, GPU 501 (MiB)

[01/18/2024-07:02:45] [I] Engine deserialized in 0.493383 sec.

[01/18/2024-07:02:45] [I] [TRT] [MS] Running engine with multi stream info

[01/18/2024-07:02:45] [I] [TRT] [MS] Number of aux streams is 7

[01/18/2024-07:02:45] [I] [TRT] [MS] Number of total worker streams is 8

[01/18/2024-07:02:45] [I] [TRT] [MS] The main stream provided by execute/enqueue calls is the first worker stream

[01/18/2024-07:02:45] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 2163, GPU 935 (MiB)

[01/18/2024-07:02:46] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +1, GPU +8, now: CPU 2164, GPU 943 (MiB)

[01/18/2024-07:02:46] [W] [TRT] TensorRT was linked against cuDNN 8.9.0 but loaded cuDNN 8.6.0

[01/18/2024-07:02:46] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +900, now: CPU 0, GPU 1401 (MiB)

[01/18/2024-07:02:46] [I] Setting persistentCacheLimit to 0 bytes.

[01/18/2024-07:02:46] [I] Using random values for input img

[01/18/2024-07:02:46] [I] Input binding for img with dimensions 1x3x800x1200 is created.

[01/18/2024-07:02:46] [I] Using random values for input bert_output

[01/18/2024-07:02:46] [I] Input binding for bert_output with dimensions 1x6x768 is created.

[01/18/2024-07:02:46] [I] Using random values for input attention_mask

[01/18/2024-07:02:46] [I] Input binding for attention_mask with dimensions 1x6 is created.

[01/18/2024-07:02:46] [I] Using random values for input position_ids

[01/18/2024-07:02:46] [I] Input binding for position_ids with dimensions 1x6 is created.

[01/18/2024-07:02:46] [I] Using random values for input text_token_mask

[01/18/2024-07:02:46] [I] Input binding for text_token_mask with dimensions 1x6x6 is created.

[01/18/2024-07:02:46] [I] Using random values for input object_mask

[01/18/2024-07:02:46] [I] Input binding for object_mask with dimensions 1x256 is created.

[01/18/2024-07:02:46] [I] Output binding for logits with dimensions 1x900x256 is created.

[01/18/2024-07:02:46] [I] Output binding for boxes with dimensions 1x900x4 is created.

[01/18/2024-07:02:46] [I] Starting inference

[01/18/2024-07:02:53] [I] Warmup completed 1 queries over 200 ms

[01/18/2024-07:02:53] [I] Timing trace has 10 queries over 6.24858 s

[01/18/2024-07:02:53] [I]

[01/18/2024-07:02:53] [I] === Trace details ===

[01/18/2024-07:02:53] [I] Trace averages of 10 runs:

[01/18/2024-07:02:53] [I] Average on 10 runs - GPU latency: 590.353 ms - Host latency: 592.789 ms (enqueue 589.294 ms)

[01/18/2024-07:02:53] [I]

[01/18/2024-07:02:53] [I] === Performance summary ===

[01/18/2024-07:02:53] [I] Throughput: 1.60036 qps

[01/18/2024-07:02:53] [I] Latency: min = 588.442 ms, max = 595.703 ms, mean = 592.789 ms, median = 592.715 ms, percentile(90%) = 595.005 ms, percentile(95%) = 595.703 ms, percentile(99%) = 595.703 ms

[01/18/2024-07:02:53] [I] Enqueue Time: min = 576.717 ms, max = 593.965 ms, mean = 589.294 ms, median = 590.206 ms, percentile(90%) = 593.607 ms, percentile(95%) = 593.965 ms, percentile(99%) = 593.965 ms

[01/18/2024-07:02:53] [I] H2D Latency: min = 2.2395 ms, max = 2.43123 ms, mean = 2.26939 ms, median = 2.25073 ms, percentile(90%) = 2.26904 ms, percentile(95%) = 2.43123 ms, percentile(99%) = 2.43123 ms

[01/18/2024-07:02:53] [I] GPU Compute Time: min = 586.03 ms, max = 593.103 ms, mean = 590.353 ms, median = 590.308 ms, percentile(90%) = 592.596 ms, percentile(95%) = 593.103 ms, percentile(99%) = 593.103 ms

[01/18/2024-07:02:53] [I] D2H Latency: min = 0.147949 ms, max = 0.171143 ms, mean = 0.166638 ms, median = 0.168518 ms, percentile(90%) = 0.170166 ms, percentile(95%) = 0.171143 ms, percentile(99%) = 0.171143 ms

[01/18/2024-07:02:53] [I] Total Host Walltime: 6.24858 s

[01/18/2024-07:02:53] [I] Total GPU Compute Time: 5.90353 s

[01/18/2024-07:02:53] [W] * Throughput may be bound by Enqueue Time rather than GPU Compute and the GPU may be under-utilized.

[01/18/2024-07:02:53] [W] If not already in use, --useCudaGraph (utilize CUDA graphs where possible) may increase the throughput.

[01/18/2024-07:02:53] [I] Explanations of the performance metrics are printed in the verbose logs.

[01/18/2024-07:02:53] [I]

&&&& PASSED TensorRT.trtexec [TensorRT v8601] # ./trtexec --onnx=/workspace/groundingdino.onnx --saveEngine=/workspace/groundingdino.trt --minShapes=img:1x3x800x1200,bert_output:1x1x768,attention_mask:1x1,position_ids:1x1,text_token_mask:1x1x1,object_mask:1x256 --optShapes=img:1x3x800x1200,bert_output:1x6x768,attention_mask:1x6,position_ids:1x6,text_token_mask:1x6x6,object_mask:1x256 --maxShapes=img:1x3x800x1200,bert_output:1x256x768,attention_mask:1x256,position_ids:1x256,text_token_mask:1x256x256,object_mask:1x256
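The trtexec command recorded in the last line of the log repeats every input name in three separate shape flags, which is easy to get out of sync by hand. Below is a minimal sketch of assembling the same command from one table of shapes. The helper names `shape_arg` and `build_trtexec_command` are hypothetical, not part of TensorRT. The sketch uses only flags that appear in the log itself: `--onnx`, `--saveEngine`, `--minShapes`, `--optShapes`, `--maxShapes`, and the `--useCudaGraph` option that the trtexec warning suggests.

```python
import shlex

# (min, opt, max) shapes per input, copied from the trtexec command in the log.
PROFILES = {
    "img": ("1x3x800x1200", "1x3x800x1200", "1x3x800x1200"),
    "bert_output": ("1x1x768", "1x6x768", "1x256x768"),
    "attention_mask": ("1x1", "1x6", "1x256"),
    "position_ids": ("1x1", "1x6", "1x256"),
    "text_token_mask": ("1x1x1", "1x6x6", "1x256x256"),
    "object_mask": ("1x256", "1x256", "1x256"),
}


def shape_arg(profiles, index):
    """Join name:shape pairs for one profile slot (0=min, 1=opt, 2=max)."""
    return ",".join(f"{name}:{shapes[index]}" for name, shapes in profiles.items())


def build_trtexec_command(onnx_path, engine_path, profiles, use_cuda_graph=False):
    """Hypothetical helper: return the trtexec argument list for these shapes."""
    cmd = [
        "./trtexec",
        f"--onnx={onnx_path}",
        f"--saveEngine={engine_path}",
        f"--minShapes={shape_arg(profiles, 0)}",
        f"--optShapes={shape_arg(profiles, 1)}",
        f"--maxShapes={shape_arg(profiles, 2)}",
    ]
    if use_cuda_graph:
        # Suggested by the trtexec warning about enqueue-bound throughput.
        cmd.append("--useCudaGraph")
    return cmd


command = build_trtexec_command(
    "/workspace/groundingdino.onnx",
    "/workspace/groundingdino.trt",
    PROFILES,
)
print(shlex.join(command))
```

Passing `use_cuda_graph=True` appends the flag from the warning. Whether it helps for this model is not established anywhere in this thread.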

zerollzeng commented 6 months ago

@nvpohanh Is this expected? (torch 650ms vs trt 590.308 ms)
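To put the two numbers side by side, here is a small sketch that pulls the relevant figures out of the trtexec performance summary and compares them with the reported PyTorch time. The function name `parse_metric` is hypothetical. The two log lines are copied from the log above with the timestamp prefix and percentiles trimmed, and 650 ms is the approximate PyTorch figure from the issue description. Note that how the PyTorch time was measured is not stated, and trtexec ran on random inputs, so the comparison is only indicative.

```python
import re

# Two summary lines from the trtexec log (timestamps and percentiles trimmed).
LOG_SUMMARY = """\
Enqueue Time: min = 576.717 ms, max = 593.965 ms, mean = 589.294 ms, median = 590.206 ms
GPU Compute Time: min = 586.03 ms, max = 593.103 ms, mean = 590.353 ms, median = 590.308 ms
"""


def parse_metric(log_text, name):
    """Return the min/max/mean/median values (in ms) for one summary metric."""
    line = re.search(rf"^{re.escape(name)}: (.*)$", log_text, flags=re.MULTILINE)
    if line is None:
        raise ValueError(f"metric not found: {name}")
    pairs = re.findall(r"(\w+) = ([\d.]+) ms", line.group(1))
    return {key: float(value) for key, value in pairs}


torch_ms = 650.0  # approximate PyTorch latency reported in the issue
gpu = parse_metric(LOG_SUMMARY, "GPU Compute Time")
enqueue = parse_metric(LOG_SUMMARY, "Enqueue Time")

speedup = torch_ms / gpu["median"]
latency_reduction = 1.0 - gpu["median"] / torch_ms
# Enqueue time nearly equal to GPU compute time is what the trtexec
# warning about enqueue-bound throughput is pointing at.
enqueue_ratio = enqueue["mean"] / gpu["mean"]

print(f"TensorRT speedup over PyTorch: {speedup:.2f}x")
print(f"Latency reduction: {latency_reduction:.1%}")
print(f"Enqueue / GPU compute (mean): {enqueue_ratio:.3f}")
```

On these figures the TensorRT engine comes out about 1.10x faster than PyTorch on the T4 (roughly 9% lower latency), far short of the 50% gain reported on the A100 but not slower.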

zerollzeng commented 6 months ago

T4 is a pretty old GPU; maybe we just don't have many optimized kernels for it?

lix19937 commented 6 months ago

Different AI frameworks have different layer kernel implementations on different GPU architectures.

ttyio commented 2 months ago

Closing since there has been no activity for more than 3 weeks, per our policy. Thanks all!