NVIDIA® TensorRT™ is an SDK for high-performance deep learning inference on NVIDIA GPUs. This repository contains the open source components of TensorRT.
Apache License 2.0

Tensorrt fp32 inference slower than pytorch on tesla T4 for GroundingDINO #3611

Closed shuchang0714 closed 2 months ago

shuchang0714 commented 6 months ago


I converted GroundingDINO from PyTorch to TensorRT on an A100, where TensorRT speeds up inference by about 50%. However, when I deploy the same model on a T4 and rebuild the engine there, TensorRT FP32 inference is slower than PyTorch.


TensorRT Version: 8.6.1


NVIDIA Driver Version:535.129.03

CUDA Version:11.7

CUDNN Version:8.6

Operating System:

Python Version (if applicable):

Tensorflow Version (if applicable):

PyTorch Version (if applicable):1.8

Baremetal or Container (if so, version):

Relevant Files

Inference time for PyTorch is around 650 ms.
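The issue does not say how the PyTorch figure was measured. GPU work is launched asynchronously, so a wall-clock measurement is only comparable to the trtexec numbers if the timer stops after the device has finished. The harness below is a minimal sketch of that idea, not the author's script: `time_inference`, `run_once`, and `sync` are hypothetical names, and the defaults of 1 warm-up query and 10 timed iterations mirror what the trtexec log reports. For a PyTorch model, `sync` would typically be `torch.cuda.synchronize`.

```python
import statistics
import time
from typing import Callable, Dict, Optional


def time_inference(
    run_once: Callable[[], object],
    sync: Optional[Callable[[], None]] = None,
    warmup: int = 1,
    iterations: int = 10,
) -> Dict[str, float]:
    """Return latency statistics in milliseconds for `run_once`.

    `sync` is called after every run so that asynchronously queued
    device work is included in the measurement. It defaults to a no-op.
    """
    sync = sync or (lambda: None)

    # Warm-up runs are executed but not timed.
    for _ in range(warmup):
        run_once()
    sync()

    samples_ms = []
    for _ in range(iterations):
        start = time.perf_counter()
        run_once()
        sync()  # wait for pending work before stopping the clock
        samples_ms.append((time.perf_counter() - start) * 1000.0)

    return {
        "min": min(samples_ms),
        "max": max(samples_ms),
        "mean": statistics.mean(samples_ms),
        "median": statistics.median(samples_ms),
    }


if __name__ == "__main__":
    # Stand-in CPU workload; replace with a real model call.
    print(time_inference(lambda: sum(range(100_000))))
```

With a real model, the call might look like `time_inference(lambda: model(*inputs), sync=torch.cuda.synchronize)`, where `model` and `inputs` are placeholders for the caller's own objects.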

The trtexec log for building the engine is as follows:

[01/18/2024-06:58:22] [I] === Model Options ===

[01/18/2024-06:58:22] [I] Format: ONNX

[01/18/2024-06:58:22] [I] Model: /workspace/groundingdino.onnx

[01/18/2024-06:58:22] [I] Output:

[01/18/2024-06:58:22] [I] === Build Options ===

[01/18/2024-06:58:22] [I] Max batch: explicit batch

[01/18/2024-06:58:22] [I] Memory Pools: workspace: default, dlaSRAM: default, dlaLocalDRAM: default, dlaGlobalDRAM: default

[01/18/2024-06:58:22] [I] minTiming: 1

[01/18/2024-06:58:22] [I] avgTiming: 8

[01/18/2024-06:58:22] [I] Precision: FP32

[01/18/2024-06:58:22] [I] LayerPrecisions:

[01/18/2024-06:58:22] [I] Layer Device Types:

[01/18/2024-06:58:22] [I] Calibration:

[01/18/2024-06:58:22] [I] Refit: Disabled

[01/18/2024-06:58:22] [I] Version Compatible: Disabled

[01/18/2024-06:58:22] [I] TensorRT runtime: full

[01/18/2024-06:58:22] [I] Lean DLL Path:

[01/18/2024-06:58:22] [I] Tempfile Controls: { in_memory: allow, temporary: allow }

[01/18/2024-06:58:22] [I] Exclude Lean Runtime: Disabled

[01/18/2024-06:58:22] [I] Sparsity: Disabled

[01/18/2024-06:58:22] [I] Safe mode: Disabled

[01/18/2024-06:58:22] [I] Build DLA standalone loadable: Disabled

[01/18/2024-06:58:22] [I] Allow GPU fallback for DLA: Disabled

[01/18/2024-06:58:22] [I] DirectIO mode: Disabled

[01/18/2024-06:58:22] [I] Restricted mode: Disabled

[01/18/2024-06:58:22] [I] Skip inference: Disabled

[01/18/2024-06:58:22] [I] Save engine: /workspace/groundingdino.trt

[01/18/2024-06:58:22] [I] Load engine:

[01/18/2024-06:58:22] [I] Profiling verbosity: 0

[01/18/2024-06:58:22] [I] Tactic sources: Using default tactic sources

[01/18/2024-06:58:22] [I] timingCacheMode: local

[01/18/2024-06:58:22] [I] timingCacheFile:

[01/18/2024-06:58:22] [I] Heuristic: Disabled

[01/18/2024-06:58:22] [I] Preview Features: Use default preview flags.

[01/18/2024-06:58:22] [I] MaxAuxStreams: -1

[01/18/2024-06:58:22] [I] BuilderOptimizationLevel: -1

[01/18/2024-06:58:22] [I] Input(s)s format: fp32:CHW

[01/18/2024-06:58:22] [I] Output(s)s format: fp32:CHW

[01/18/2024-06:58:22] [I] Input build shape: bert_output=1x1x768+1x6x768+1x256x768

[01/18/2024-06:58:22] [I] Input build shape: img=1x3x800x1200+1x3x800x1200+1x3x800x1200

[01/18/2024-06:58:22] [I] Input build shape: attention_mask=1x1+1x6+1x256

[01/18/2024-06:58:22] [I] Input build shape: position_ids=1x1+1x6+1x256

[01/18/2024-06:58:22] [I] Input build shape: object_mask=1x256+1x256+1x256

[01/18/2024-06:58:22] [I] Input build shape: text_token_mask=1x1x1+1x6x6+1x256x256

[01/18/2024-06:58:22] [I] Input calibration shapes: model

[01/18/2024-06:58:22] [I] === System Options ===

[01/18/2024-06:58:22] [I] Device: 0

[01/18/2024-06:58:22] [I] DLACore:

[01/18/2024-06:58:22] [I] Plugins:

[01/18/2024-06:58:22] [I] setPluginsToSerialize:

[01/18/2024-06:58:22] [I] dynamicPlugins:

[01/18/2024-06:58:22] [I] ignoreParsedPluginLibs: 0

[01/18/2024-06:58:22] [I]

[01/18/2024-06:58:22] [I] === Inference Options ===

[01/18/2024-06:58:22] [I] Batch: Explicit

[01/18/2024-06:58:22] [I] Input inference shape: text_token_mask=1x6x6

[01/18/2024-06:58:22] [I] Input inference shape: object_mask=1x256

[01/18/2024-06:58:22] [I] Input inference shape: position_ids=1x6

[01/18/2024-06:58:22] [I] Input inference shape: attention_mask=1x6

[01/18/2024-06:58:22] [I] Input inference shape: bert_output=1x6x768

[01/18/2024-06:58:22] [I] Input inference shape: img=1x3x800x1200

[01/18/2024-06:58:22] [I] Iterations: 10

[01/18/2024-06:58:22] [I] Duration: 3s (+ 200ms warm up)

[01/18/2024-06:58:22] [I] Sleep time: 0ms

[01/18/2024-06:58:22] [I] Idle time: 0ms

[01/18/2024-06:58:22] [I] Inference Streams: 1

[01/18/2024-06:58:22] [I] ExposeDMA: Disabled

[01/18/2024-06:58:22] [I] Data transfers: Enabled

[01/18/2024-06:58:22] [I] Spin-wait: Disabled

[01/18/2024-06:58:22] [I] Multithreading: Disabled

[01/18/2024-06:58:22] [I] CUDA Graph: Disabled

[01/18/2024-06:58:22] [I] Separate profiling: Disabled

[01/18/2024-06:58:22] [I] Time Deserialize: Disabled

[01/18/2024-06:58:22] [I] Time Refit: Disabled

[01/18/2024-06:58:22] [I] NVTX verbosity: 0

[01/18/2024-06:58:22] [I] Persistent Cache Ratio: 0

[01/18/2024-06:58:22] [I] Inputs:

[01/18/2024-06:58:22] [I] === Reporting Options ===

[01/18/2024-06:58:22] [I] Verbose: Disabled

[01/18/2024-06:58:22] [I] Averages: 10 inferences

[01/18/2024-06:58:22] [I] Percentiles: 90,95,99

[01/18/2024-06:58:22] [I] Dump refittable layers:Disabled

[01/18/2024-06:58:22] [I] Dump output: Disabled

[01/18/2024-06:58:22] [I] Profile: Disabled

[01/18/2024-06:58:22] [I] Export timing to JSON file:

[01/18/2024-06:58:22] [I] Export output to JSON file:

[01/18/2024-06:58:22] [I] Export profile to JSON file:

[01/18/2024-06:58:22] [I]

[01/18/2024-06:58:24] [I] === Device Information ===

[01/18/2024-06:58:24] [I] Selected Device: Tesla T4

[01/18/2024-06:58:24] [I] Compute Capability: 7.5

[01/18/2024-06:58:24] [I] SMs: 40

[01/18/2024-06:58:24] [I] Device Global Memory: 14930 MiB

[01/18/2024-06:58:24] [I] Shared Memory per SM: 64 KiB

[01/18/2024-06:58:24] [I] Memory Bus Width: 256 bits (ECC enabled)

[01/18/2024-06:58:24] [I] Application Compute Clock Rate: 1.59 GHz

[01/18/2024-06:58:24] [I] Application Memory Clock Rate: 5.001 GHz

[01/18/2024-06:58:24] [I]

[01/18/2024-06:58:24] [I] Note: The application clock rates do not reflect the actual clock rates that the GPU is currently running at.

[01/18/2024-06:58:24] [I]

[01/18/2024-06:58:24] [I] TensorRT version: 8.6.1

[01/18/2024-06:58:24] [I] Loading standard plugins

[01/18/2024-06:58:25] [I] [TRT] [MemUsageChange] Init CUDA: CPU +14, GPU +0, now: CPU 19, GPU 105 (MiB)

[01/18/2024-06:58:32] [I] [TRT] [MemUsageChange] Init builder kernel library: CPU +896, GPU +174, now: CPU 991, GPU 279 (MiB)

[01/18/2024-06:58:32] [I] Start parsing network model.

[01/18/2024-06:58:32] [I] [TRT] ----------------------------------------------------------------

[01/18/2024-06:58:32] [I] [TRT] Input filename: /workspace/groundingdino.onnx

[01/18/2024-06:58:32] [I] [TRT] ONNX IR version: 0.0.8

[01/18/2024-06:58:32] [I] [TRT] Opset version: 16

[01/18/2024-06:58:32] [I] [TRT] Producer name: pytorch

[01/18/2024-06:58:32] [I] [TRT] Producer version: 1.13.1

[01/18/2024-06:58:32] [I] [TRT] Domain:

[01/18/2024-06:58:32] [I] [TRT] Model version: 0

[01/18/2024-06:58:32] [I] [TRT] Doc string:

[01/18/2024-06:58:32] [I] [TRT] ----------------------------------------------------------------

[01/18/2024-06:58:33] [W] [TRT] onnx2trt_utils.cpp:374: Your ONNX model has been generated with INT64 weights, while TensorRT does not natively support INT64. Attempting to cast down to INT32.

[01/18/2024-06:58:33] [W] [TRT] onnx2trt_utils.cpp:400: One or more weights outside the range of INT32 was clamped

[01/18/2024-06:58:35] [I] Finished parsing network model. Parse time: 3.2296

[01/18/2024-06:58:35] [I] [TRT] BuilderFlag::kTF32 is set but hardware does not support TF32. Disabling TF32.

[01/18/2024-06:58:40] [I] [TRT] Graph optimization time: 3.76623 seconds.

[01/18/2024-06:58:40] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 2470, GPU 559 (MiB)

[01/18/2024-06:58:40] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +0, GPU +10, now: CPU 2470, GPU 569 (MiB)

[01/18/2024-06:58:40] [W] [TRT] TensorRT was linked against cuDNN 8.9.0 but loaded cuDNN 8.6.0

[01/18/2024-06:58:40] [I] [TRT] BuilderFlag::kTF32 is set but hardware does not support TF32. Disabling TF32.

[01/18/2024-06:58:40] [I] [TRT] Local timing cache in use. Profiling results in this builder pass will not be stored.

[01/18/2024-07:02:38] [I] [TRT] Detected 6 inputs and 2 output network tensors.

[01/18/2024-07:02:42] [I] [TRT] Total Host Persistent Memory: 43424

[01/18/2024-07:02:42] [I] [TRT] Total Device Persistent Memory: 475648

[01/18/2024-07:02:42] [I] [TRT] Total Scratch Memory: 780636672

[01/18/2024-07:02:42] [I] [TRT] [MemUsageStats] Peak memory usage of TRT CPU/GPU memory allocators: CPU 50 MiB, GPU 1109 MiB

[01/18/2024-07:02:42] [I] [TRT] [BlockAssignment] Started assigning block shifts. This will take 343 steps to complete.

[01/18/2024-07:02:43] [I] [TRT] [BlockAssignment] Algorithm ShiftNTopDown took 246.787ms to assign 54 blocks to 343 nodes requiring 942283264 bytes.

[01/18/2024-07:02:43] [I] [TRT] Total Activation Memory: 942278144

[01/18/2024-07:02:43] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 3598, GPU 1187 (MiB)

[01/18/2024-07:02:43] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +0, GPU +8, now: CPU 3598, GPU 1195 (MiB)

[01/18/2024-07:02:43] [W] [TRT] TensorRT was linked against cuDNN 8.9.0 but loaded cuDNN 8.6.0

[01/18/2024-07:02:44] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in building engine: CPU +0, GPU +502, now: CPU 0, GPU 502 (MiB)

[01/18/2024-07:02:44] [I] Engine built in 259.893 sec.

[01/18/2024-07:02:45] [I] [TRT] Loaded engine size: 513 MiB

[01/18/2024-07:02:45] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 2163, GPU 935 (MiB)

[01/18/2024-07:02:45] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +1, GPU +8, now: CPU 2164, GPU 943 (MiB)

[01/18/2024-07:02:45] [W] [TRT] TensorRT was linked against cuDNN 8.9.0 but loaded cuDNN 8.6.0

[01/18/2024-07:02:45] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +501, now: CPU 0, GPU 501 (MiB)

[01/18/2024-07:02:45] [I] Engine deserialized in 0.493383 sec.

[01/18/2024-07:02:45] [I] [TRT] [MS] Running engine with multi stream info

[01/18/2024-07:02:45] [I] [TRT] [MS] Number of aux streams is 7

[01/18/2024-07:02:45] [I] [TRT] [MS] Number of total worker streams is 8

[01/18/2024-07:02:45] [I] [TRT] [MS] The main stream provided by execute/enqueue calls is the first worker stream

[01/18/2024-07:02:45] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 2163, GPU 935 (MiB)

[01/18/2024-07:02:46] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +1, GPU +8, now: CPU 2164, GPU 943 (MiB)

[01/18/2024-07:02:46] [W] [TRT] TensorRT was linked against cuDNN 8.9.0 but loaded cuDNN 8.6.0

[01/18/2024-07:02:46] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +900, now: CPU 0, GPU 1401 (MiB)

[01/18/2024-07:02:46] [I] Setting persistentCacheLimit to 0 bytes.

[01/18/2024-07:02:46] [I] Using random values for input img

[01/18/2024-07:02:46] [I] Input binding for img with dimensions 1x3x800x1200 is created.

[01/18/2024-07:02:46] [I] Using random values for input bert_output

[01/18/2024-07:02:46] [I] Input binding for bert_output with dimensions 1x6x768 is created.

[01/18/2024-07:02:46] [I] Using random values for input attention_mask

[01/18/2024-07:02:46] [I] Input binding for attention_mask with dimensions 1x6 is created.

[01/18/2024-07:02:46] [I] Using random values for input position_ids

[01/18/2024-07:02:46] [I] Input binding for position_ids with dimensions 1x6 is created.

[01/18/2024-07:02:46] [I] Using random values for input text_token_mask

[01/18/2024-07:02:46] [I] Input binding for text_token_mask with dimensions 1x6x6 is created.

[01/18/2024-07:02:46] [I] Using random values for input object_mask

[01/18/2024-07:02:46] [I] Input binding for object_mask with dimensions 1x256 is created.

[01/18/2024-07:02:46] [I] Output binding for logits with dimensions 1x900x256 is created.

[01/18/2024-07:02:46] [I] Output binding for boxes with dimensions 1x900x4 is created.

[01/18/2024-07:02:46] [I] Starting inference

[01/18/2024-07:02:53] [I] Warmup completed 1 queries over 200 ms

[01/18/2024-07:02:53] [I] Timing trace has 10 queries over 6.24858 s

[01/18/2024-07:02:53] [I]

[01/18/2024-07:02:53] [I] === Trace details ===

[01/18/2024-07:02:53] [I] Trace averages of 10 runs:

[01/18/2024-07:02:53] [I] Average on 10 runs - GPU latency: 590.353 ms - Host latency: 592.789 ms (enqueue 589.294 ms)

[01/18/2024-07:02:53] [I]

[01/18/2024-07:02:53] [I] === Performance summary ===

[01/18/2024-07:02:53] [I] Throughput: 1.60036 qps

[01/18/2024-07:02:53] [I] Latency: min = 588.442 ms, max = 595.703 ms, mean = 592.789 ms, median = 592.715 ms, percentile(90%) = 595.005 ms, percentile(95%) = 595.703 ms, percentile(99%) = 595.703 ms

[01/18/2024-07:02:53] [I] Enqueue Time: min = 576.717 ms, max = 593.965 ms, mean = 589.294 ms, median = 590.206 ms, percentile(90%) = 593.607 ms, percentile(95%) = 593.965 ms, percentile(99%) = 593.965 ms

[01/18/2024-07:02:53] [I] H2D Latency: min = 2.2395 ms, max = 2.43123 ms, mean = 2.26939 ms, median = 2.25073 ms, percentile(90%) = 2.26904 ms, percentile(95%) = 2.43123 ms, percentile(99%) = 2.43123 ms

[01/18/2024-07:02:53] [I] GPU Compute Time: min = 586.03 ms, max = 593.103 ms, mean = 590.353 ms, median = 590.308 ms, percentile(90%) = 592.596 ms, percentile(95%) = 593.103 ms, percentile(99%) = 593.103 ms

[01/18/2024-07:02:53] [I] D2H Latency: min = 0.147949 ms, max = 0.171143 ms, mean = 0.166638 ms, median = 0.168518 ms, percentile(90%) = 0.170166 ms, percentile(95%) = 0.171143 ms, percentile(99%) = 0.171143 ms

[01/18/2024-07:02:53] [I] Total Host Walltime: 6.24858 s

[01/18/2024-07:02:53] [I] Total GPU Compute Time: 5.90353 s

[01/18/2024-07:02:53] [W] * Throughput may be bound by Enqueue Time rather than GPU Compute and the GPU may be under-utilized.

[01/18/2024-07:02:53] [W] If not already in use, --useCudaGraph (utilize CUDA graphs where possible) may increase the throughput.

[01/18/2024-07:02:53] [I] Explanations of the performance metrics are printed in the verbose logs.

[01/18/2024-07:02:53] [I]

&&&& PASSED TensorRT.trtexec [TensorRT v8601] # ./trtexec --onnx=/workspace/groundingdino.onnx --saveEngine=/workspace/groundingdino.trt --minShapes=img:1x3x800x1200,bert_output:1x1x768,attention_mask:1x1,position_ids:1x1,text_token_mask:1x1x1,object_mask:1x256 --optShapes=img:1x3x800x1200,bert_output:1x6x768,attention_mask:1x6,position_ids:1x6,text_token_mask:1x6x6,object_mask:1x256 --maxShapes=img:1x3x800x1200,bert_output:1x256x768,attention_mask:1x256,position_ids:1x256,text_token_mask:1x256x256,object_mask:1x256
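The shape flags in the command above are long and easy to mistype when rebuilding the engine on another GPU. As a rough sketch, they can be generated from a single table of per-input shapes. The helper names `shape_flag` and `build_trtexec_args` are hypothetical; the paths, input names, and dimensions are copied from the command in the log, and `--useCudaGraph` is the flag that the trtexec warning itself suggests.

```python
from typing import Dict, List, Sequence, Tuple

# Per-input shapes copied from the logged command: name -> (min, opt, max).
SHAPES: Dict[str, Tuple[str, str, str]] = {
    "img": ("1x3x800x1200", "1x3x800x1200", "1x3x800x1200"),
    "bert_output": ("1x1x768", "1x6x768", "1x256x768"),
    "attention_mask": ("1x1", "1x6", "1x256"),
    "position_ids": ("1x1", "1x6", "1x256"),
    "text_token_mask": ("1x1x1", "1x6x6", "1x256x256"),
    "object_mask": ("1x256", "1x256", "1x256"),
}


def shape_flag(flag: str, shapes: Dict[str, Tuple[str, str, str]], index: int) -> str:
    """Format one of --minShapes / --optShapes / --maxShapes."""
    spec = ",".join(f"{name}:{dims[index]}" for name, dims in shapes.items())
    return f"--{flag}={spec}"


def build_trtexec_args(
    onnx_path: str,
    engine_path: str,
    shapes: Dict[str, Tuple[str, str, str]],
    extra_flags: Sequence[str] = (),
) -> List[str]:
    """Assemble the trtexec argument list in the same order as the logged command."""
    return [
        "./trtexec",
        f"--onnx={onnx_path}",
        f"--saveEngine={engine_path}",
        shape_flag("minShapes", shapes, 0),
        shape_flag("optShapes", shapes, 1),
        shape_flag("maxShapes", shapes, 2),
        *extra_flags,
    ]


if __name__ == "__main__":
    args = build_trtexec_args(
        "/workspace/groundingdino.onnx",
        "/workspace/groundingdino.trt",
        SHAPES,
        extra_flags=["--useCudaGraph"],  # suggested by the warning in the log
    )
    print(" ".join(args))
```

The returned list can be passed directly to `subprocess.run` on a machine where trtexec is installed. Whether CUDA graphs help this particular model is not established anywhere in the thread.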

zerollzeng commented 6 months ago

@nvpohanh Is this expected? (torch 650ms vs trt 590.308 ms)
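To put the two figures side by side, the sketch below parses the "GPU Compute Time" line from the log (truncated here before the percentile fields) and computes the relative latency reduction against the reported PyTorch time. The function names are hypothetical, and the comparison is only indicative: the trtexec run used random input values, and the issue does not describe how the 650 ms PyTorch figure was obtained.

```python
import re
from typing import Dict

# Prefix of the "GPU Compute Time" line from the log; percentile fields omitted.
GPU_COMPUTE_LINE = (
    "[01/18/2024-07:02:53] [I] GPU Compute Time: min = 586.03 ms, "
    "max = 593.103 ms, mean = 590.353 ms, median = 590.308 ms"
)

# PyTorch latency as reported in the issue ("around 650ms").
TORCH_MS = 650.0


def parse_stats(line: str) -> Dict[str, float]:
    """Extract the min/max/mean/median values (in ms) from a trtexec summary line."""
    pairs = re.findall(r"(min|max|mean|median) = ([0-9.]+) ms", line)
    return {name: float(value) for name, value in pairs}


def relative_latency_reduction(baseline_ms: float, candidate_ms: float) -> float:
    """Fraction by which the candidate latency is lower than the baseline."""
    return (baseline_ms - candidate_ms) / baseline_ms


if __name__ == "__main__":
    stats = parse_stats(GPU_COMPUTE_LINE)
    reduction = relative_latency_reduction(TORCH_MS, stats["median"])
    speedup = TORCH_MS / stats["median"]
    print(f"median {stats['median']} ms, reduction {reduction:.1%}, speedup {speedup:.2f}x")
```

On these numbers the TensorRT engine is roughly 9% faster than the PyTorch figure on the T4, far below the 50% gain reported on the A100.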

zerollzeng commented 6 months ago

T4 is a pretty old GPU; maybe we just don't have many optimized kernels for it?

lix19937 commented 6 months ago

Different AI frameworks use different kernel implementations for each layer, and those implementations differ across GPU architectures.

ttyio commented 2 months ago

Closing since there has been no activity for more than 3 weeks, per our policy. Thanks, all!