Build the Engine: Use build_engine to convert an ONNX model into a TensorRT engine.
Run Inference: Use TRTModel to perform inference on cropped image patches.
Observed Result: Even as the batch size is increased, the inference time per patch remains high. Running multiple engines for parallel inference does not improve performance either.
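To make the comparison above repeatable, the per-patch time can be measured for different batch sizes and thread counts with a small harness. This is only a sketch: `time_per_patch` and `fake_infer` are hypothetical names, and `fake_infer` is a stand-in that sleeps instead of calling the real `TRTModel`, whose interface is not shown in this report.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def time_per_patch(infer_batch, batches, workers=1):
    """Run every batch through infer_batch on `workers` threads.

    Returns the mean wall-clock seconds per patch, so results for
    different batch sizes and worker counts are directly comparable.
    """
    n_patches = sum(len(batch) for batch in batches)
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # list() forces all submitted batches to finish before timing stops.
        list(pool.map(infer_batch, batches))
    return (time.perf_counter() - start) / n_patches


def fake_infer(batch):
    """Stand-in for the real engine call (assumption): fixed cost per call."""
    time.sleep(0.01)
    return batch


if __name__ == "__main__":
    patches = list(range(16))
    for batch_size in (1, 4):
        groups = [patches[i:i + batch_size] for i in range(0, len(patches), batch_size)]
        for workers in (1, 2):
            seconds = time_per_patch(fake_infer, groups, workers)
            print(f"batch={batch_size} workers={workers}: {seconds * 1000:.2f} ms/patch")
```

With the real engine in place of `fake_infer`, a per-patch time that stays flat as the batch size or worker count grows would be consistent with the GPU already being saturated by a single batch, although other causes are possible.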
Profiling Results:
Transfer to device: 0.48 ms
Inference time: 784.75 ms
Transfer to host: 0.67 ms
Total time for a single patch (256x256): 19-22 seconds on average
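As a quick sanity check on these numbers, the three measured stages can be summed and compared with the reported total. The sketch below assumes that the stage timings and the 19-22 second total refer to the same patch, which the report does not state explicitly.

```python
# Stage timings copied from the profiling results above (milliseconds).
transfer_to_device_ms = 0.48
inference_ms = 784.75
transfer_to_host_ms = 0.67

measured_ms = transfer_to_device_ms + inference_ms + transfer_to_host_ms


def unaccounted_ms(total_seconds):
    """Time in the reported total that the three measured stages do not explain."""
    return total_seconds * 1000.0 - measured_ms


def measured_share(total_seconds):
    """Fraction of the reported total covered by the three measured stages."""
    return measured_ms / (total_seconds * 1000.0)


if __name__ == "__main__":
    print(f"measured stages: {measured_ms:.2f} ms")
    for total_seconds in (19.0, 22.0):
        print(
            f"total {total_seconds:.0f} s: "
            f"{measured_share(total_seconds):.1%} measured, "
            f"{unaccounted_ms(total_seconds) / 1000.0:.2f} s unaccounted"
        )
```

Under that assumption, the measured stages cover only about 4% of the total, so most of the per-patch time would be spent outside the transfer and inference calls (for example in cropping, pre- or post-processing, merging, or engine loading).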
I am seeking optimization suggestions for improving multi-batch processing or multi-threaded parallel inference in TensorRT.
Description
I am encountering performance bottlenecks while running multi-threaded inference on high-resolution images using TensorRT. The model involves breaking the image into patches to manage GPU memory, performing inference on each patch, and then merging the results. However, the inference time per patch is still high, even when increasing the batch size. Additionally, loading multiple engines onto the GPU to parallelize the inference does not yield the expected speedup. I am seeking advice on optimizing the inference process for faster execution, either by improving batch processing or enabling better parallelism in TensorRT.
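The patch workflow described above (split, infer per batch, merge) can be sketched as follows. This is a minimal illustration rather than the code from `inference.py`: all function names are hypothetical, images are plain nested lists, and `infer_batch` is any callable that maps a list of patches to a list of same-sized outputs.

```python
from typing import Callable, Iterator, List, Tuple

Box = Tuple[int, int, int, int]  # (top, left, height, width)


def patch_boxes(height: int, width: int, patch: int = 256) -> List[Box]:
    """Cover the image with non-overlapping patches; edge patches may be smaller."""
    return [
        (top, left, min(patch, height - top), min(patch, width - left))
        for top in range(0, height, patch)
        for left in range(0, width, patch)
    ]


def batches(items: List[Box], batch_size: int) -> Iterator[List[Box]]:
    """Yield consecutive groups of at most batch_size boxes."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def crop(image, box: Box):
    """Cut one patch out of a 2D nested-list image."""
    top, left, h, w = box
    return [row[left:left + w] for row in image[top:top + h]]


def merge(height: int, width: int, boxes: List[Box], outputs):
    """Paste per-patch outputs back into a full-size 2D nested list."""
    canvas = [[0] * width for _ in range(height)]
    for (top, left, h, w), out in zip(boxes, outputs):
        for r in range(h):
            canvas[top + r][left:left + w] = out[r][:w]
    return canvas


def run_tiled(image, infer_batch: Callable, patch: int = 256, batch_size: int = 8):
    """Split the image, run infer_batch once per group of patches, merge results."""
    height, width = len(image), len(image[0])
    boxes = patch_boxes(height, width, patch)
    outputs = []
    for group in batches(boxes, batch_size):
        # One call per batch; a fixed-shape engine would need edge patches padded.
        outputs.extend(infer_batch([crop(image, box) for box in group]))
    return merge(height, width, boxes, outputs)
```

Passing an identity function as `infer_batch` should return the original image unchanged, which is a convenient way to check the split and merge logic separately from the engine.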
Environment
TensorRT Version: 10.5.0
GPU Type: RTX 3050 Ti 4GB
Nvidia Driver Version: 535.183.01
CUDA Version: 12.2
CUDNN Version: N/A
Operating System + Version: Ubuntu 20.04
Python Version: 3.11
TensorFlow Version: N/A
PyTorch Version: N/A
Baremetal or Container (if container, which image + tag): Baremetal
Relevant Files
build_engine.py
inference.py
Steps To Reproduce
Use build_engine to convert an ONNX model into a TensorRT engine.
Use TRTModel to perform inference on cropped image patches.