How to use the TensorRT C++ API for high-performance GPU machine-learning inference.
Supports models with single / multiple inputs and single / multiple outputs with batching.
Project Overview Video
Code Deep-Dive Video
I read all the NVIDIA TensorRT docs so that you don't have to!
This project demonstrates how to use the TensorRT C++ API for high-performance GPU inference on image data. It covers building a TensorRT engine from an ONNX model, running FP32 / FP16 / INT8 inference, and batching inputs.
The following instructions assume you are using Ubuntu 20.04 or 22.04. You will need to supply your own ONNX model for this sample code, or you can download the sample model (see the Sanity Check section below).
```bash
sudo apt install build-essential
sudo snap install cmake --classic
sudo apt install libspdlog-dev libfmt-dev
```
The `libspdlog-dev` and `libfmt-dev` packages are used for logging. To build OpenCV with CUDA support, run the `build_opencv.sh` script provided in `./scripts/`.
If required, update the `CUDNN_INCLUDE_DIR` and `CUDNN_LIBRARY` variables in the script to point to your cuDNN installation.

Open the `CMakeLists.txt` file and replace the `TODO` with the path to your TensorRT installation.

Build the project:

```bash
mkdir build
cd build
cmake ..
make -j$(nproc)
```

Run the benchmark, passing either an ONNX model or a previously generated TensorRT engine file:

```bash
./run_inference_benchmark --onnx_model ../models/yolov8n.onnx
./run_inference_benchmark --trt_model ../models/yolov8n.engine.NVIDIAGeForceRTX3080LaptopGPU.fp16.1.1
```
For the sanity check, download the YOLOv8n model from here and export it to ONNX (you will need to run `pip3 install ultralytics` first):

```python
from ultralytics import YOLO

model = YOLO("./yolov8n.pt")
model.fuse()
model.info(verbose=False)  # Print model information
model.export(format="onnx", opset=12)  # Export the model to ONNX using opset 12
```
Place the exported model, `yolov8n.onnx`, in the `./models/` directory. Running inference on `./inputs/team.jpg` should produce the following feature vector:

```
3.41113 16.5312 20.8828 29.8984 43.7266 54.9609 62.0625 65.8594 70.0312 72.9531 ...
```
Enabling INT8 precision can further speed up inference, at the cost of some accuracy due to the reduced dynamic range. For INT8 precision, you must supply calibration data that is representative of the real data the model will see. It is advised to use 1,000+ calibration images. To enable INT8 inference with the YoloV8 sanity-check model, take the following steps (a configuration sketch follows this list):
- Change `options.precision = Precision::FP16;` to `options.precision = Precision::INT8;` in `main.cpp`.
- Set `options.calibrationDataDirectoryPath = "";` in `main.cpp` to the path of the directory containing your calibration data.
- Download calibration data, for example the COCO val2017 images: `wget http://images.cocodataset.org/zips/val2017.zip`
- Verify that the preprocessing in the `Int8EntropyCalibrator2::getBatch` method in `engine.cpp` (see the `TODO`) is correct for your model.
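For orientation, here is a minimal sketch of how these INT8 settings might look in `main.cpp`. The field names (`precision`, `calibrationDataDirectoryPath`, `calibrationBatchSize`) are the ones referenced in this README; the surrounding code, the example path, and the batch-size value are illustrative assumptions rather than the exact struct layout.

```cpp
#include "engine.h"  // project header (see src/engine.h in the project structure below)

// Sketch only: configure the engine options for INT8 calibration and inference.
void configureInt8(Options &options) {
    options.precision = Precision::INT8;                  // was Precision::FP16
    options.calibrationDataDirectoryPath = "../val2017";  // assumed path: directory of 1K+ representative images
    options.calibrationBatchSize = 32;                    // assumed value: reduce if a batch does not fit in GPU memory
}
```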
The calibration cache is saved to disk (with a `.calibration` extension) so that it can be reused on subsequent model optimizations. If you'd like to regenerate the calibration data, you must delete this cache file. You may also need to reduce `Options.calibrationBatchSize` so that the entire calibration batch fits in your GPU memory.

Benchmarks were run on an RTX 3050 Ti Laptop GPU with an 11th Gen Intel(R) Core(TM) i9-11900H @ 2.50GHz.
| Model | Precision | Batch Size | Avg Inference Time |
|---|---|---|---|
| yolov8n | FP32 | 1 | 4.732 ms |
| yolov8n | FP16 | 1 | 2.493 ms |
| yolov8n | INT8 | 1 | 2.009 ms |
| yolov8x | FP32 | 1 | 76.63 ms |
| yolov8x | FP16 | 1 | 25.08 ms |
| yolov8x | INT8 | 1 | 11.62 ms |
Wondering how to integrate this library into your project? Or perhaps how to read the outputs of the YoloV8 model to extract meaningful information? If so, check out my two latest projects, YOLOv8-TensorRT-CPP and YOLOv9-TensorRT-CPP, which demonstrate how to use the TensorRT C++ API to run YoloV8/9 inference (supports object detection, semantic segmentation, and body pose estimation). They make use of this project in the backend!
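Before diving into those projects, here is a hedged sketch of how an engine built with this library is typically wired up. `Options`, `Precision`, and the `Engine` template parameter are named elsewhere in this README; the constructor signature and the calls shown in comments (`buildLoadNetwork`, `runInference`) are placeholders to illustrate the flow, not the exact API.

```cpp
#include "engine.h"  // project header from src/

int main() {
    // Configure the engine (field names are the ones used in this README).
    Options options;
    options.precision = Precision::FP16;
    options.optBatchSize = 1;

    // The template parameter selects the model's output data type (see the V5.0/V6.0 changelog notes).
    // The constructor taking Options is an assumption for this sketch.
    Engine<float> engine(options);

    // Illustrative placeholders for the actual calls:
    // engine.buildLoadNetwork("../models/yolov8n.onnx"); // build (or load a cached) TensorRT engine
    // engine.runInference(inputs, outputs);              // run inference on preprocessed images
    return 0;
}
```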
```
project-root/
├── include/
│   ├── engine/
│   │   ├── EngineRunInference.inl
│   │   ├── EngineUtilities.inl
│   │   └── EngineBuildLoadNetwork.inl
│   ├── util/...
│   ├── ...
├── src/
│   ├── ...
│   ├── engine.cpp
│   ├── engine.h
│   └── main.cpp
├── CMakeLists.txt
└── README.md
```
The bulk of the engine implementation lives in `include/engine`. I have written lots of comments throughout the code which should make it easy to understand what is going on. The inference code is in `include/engine/EngineRunInference.inl`, and the engine building / loading code is in `include/engine/EngineBuildLoadNetwork.inl`.

The implementation uses the `spdlog` library for logging. You can change the log level by setting the environment variable `LOG_LEVEL` to one of the following values: `trace`, `debug`, `info`, `warn`, `error`, `critical`, `off`.
If you have issues creating the TensorRT engine file from the ONNX model, consider setting the environment variable `LOG_LEVEL` to `trace` and re-running the application. This should give you more information on where exactly the build process is failing.
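If you prefer to set the log level from code while debugging (for example, when launching from an IDE), a POSIX `setenv` call before any engine code runs may work; this is only a sketch, and whether it takes effect depends on when the library reads the variable. Exporting `LOG_LEVEL` in the shell, as described above, is the documented approach.

```cpp
#include <cstdlib>  // setenv (POSIX)

int main() {
    // Sketch: set LOG_LEVEL programmatically before constructing any engine/logging objects.
    // Whether this is picked up depends on when the library reads the variable.
    setenv("LOG_LEVEL", "trace", 1 /* overwrite */);

    // ... construct Options / Engine and build the network as usual ...
    return 0;
}
```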
If this project was helpful to you, I would appreciate it if you could give it a star. That will encourage me to keep it up to date and solve issues quickly. I also do consulting work if you require more specific help. Connect with me on LinkedIn.
Loic Tetrel 💻 | thomaskleiven 💻 | WiCyn 💻
V6.0

V5.0

- The `Engine` class has been modified to take a template parameter which specifies the model's output data type. The implementation now supports outputs of type `float`, `__half`, `int8_t`, `int32_t`, `bool`, and `uint8_t`.
- Ensure the `Options` have been set correctly for your model (for example, if your model has been compiled for FP32 but you try running FP16 inference, it will fail, potentially without a verbose error).

V4.1
V4.0

V3.0

- The implementation now uses the `IExecutionContext::enqueueV3()` API.
- The executable has been renamed from `driver` to `run_inference_benchmark` and must now be passed the path to the ONNX model as a command line argument.
- Removed `Options.doesSupportDynamicBatchSize`. The implementation now auto-detects supported batch sizes.
- Removed `Options.maxWorkspaceSize`. The implementation no longer limits GPU memory during model construction, allowing it to use as much of the memory pool as is available for intermediate layers.

v2.2
V2.1

V2.0

- `Options.optBatchSizes` has been removed, replaced by `Options.optBatchSize`.

Thanks goes to these wonderful people (emoji key):
This project follows the all-contributors specification. Contributions of any kind welcome!