This project showcases inference with PyTorch CNN models such as ResNet50, EfficientNet, and MobileNet, and their optimization using ONNX, OpenVINO, and NVIDIA TensorRT. The script runs inference on a user-specified image and displays the top-K predictions. Benchmarking covers the PyTorch CPU, ONNX CPU, OpenVINO CPU, PyTorch CUDA, TensorRT-FP32, and TensorRT-FP16 configurations.
The project is Dockerized for easy deployment:
- CPU-only deployment: supports the PyTorch CPU, ONNX CPU, and OpenVINO CPU models only.
- GPU (CUDA) deployment: supports all configurations (PyTorch CPU, ONNX CPU, OpenVINO CPU, PyTorch CUDA, TensorRT-FP32, and TensorRT-FP16).

Please look at the Steps to Run section for Docker instructions.
git clone https://github.com/DimaBir/ResNetTensorRT.git
CPU-only Deployment:
docker build -t cpu_img .
Running:
docker run -it --rm cpu_img /bin/bash
GPU (CUDA) Deployment:
docker build --build-arg ENVIRONMENT=gpu --build-arg BASE_IMAGE=nvcr.io/nvidia/tensorrt:23.08-py3 -t gpu_img .
Running:
docker run --gpus all -it --rm gpu_img
python main.py [--mode all]

Arguments:

- --image_path: (Optional) Path to the image you want to run predictions on.
- --topk: (Optional) Number of top predictions to show. Defaults to 5 if not provided.
- --mode: (Optional) The model's mode for exporting and running. Choices are: onnx, ov, cpu, cuda, tensorrt, and all. If not provided, it defaults to all.

Example:

python main.py --topk 3 --mode=all --image_path="./inference/cat3.jpg"

This command runs predictions on the chosen image (./inference/cat3.jpg), shows the top 3 predictions, and runs all available models. Note: the plot is created only for --mode=all; results are plotted and saved to ./inference/plot.png.
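For reference, a minimal argparse setup consistent with the flags above could look like the sketch below. It illustrates the interface only; it is not the project's actual main.py, and the real default for --image_path is not shown here.

```python
# Hypothetical sketch of the CLI described above; the topk and mode defaults
# mirror the documented behavior, the rest is illustrative.
import argparse

parser = argparse.ArgumentParser(description="Run CNN inference benchmarks.")
parser.add_argument("--image_path", type=str, default=None,
                    help="Path to the image to run predictions on.")
parser.add_argument("--topk", type=int, default=5,
                    help="Number of top predictions to show.")
parser.add_argument("--mode", type=str, default="all",
                    choices=["onnx", "ov", "cpu", "cuda", "tensorrt", "all"],
                    help="Model mode for exporting and running.")
args = parser.parse_args()
print(args.image_path, args.topk, args.mode)
```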
Here is an example of the input image to run predictions and benchmarks on:
Average Inference Time: This plot showcases the average time taken for inference across different model types and optimization techniques. The y-axis represents the model type (e.g., PyTorch CPU, TensorRT FP16, etc.), and the x-axis represents the average inference time in milliseconds. The shorter the bar, the faster the inference time.
Throughput: This plot compares the throughput achieved by different model types. Throughput is measured in terms of the number of images processed per second. The y-axis represents the model type, and the x-axis represents the throughput. A higher bar indicates better throughput, meaning the model can process more images in a given time frame.
These plots offer a comprehensive view of the performance improvements achieved by various inference optimization techniques, especially when leveraging TensorRT with different precision types like FP16 and FP32.
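As a rough illustration of how such plots can be produced, the sketch below draws both bar charts with matplotlib. The results dict, the function name, and the batch-size-1 throughput formula are assumptions, not the project's actual plotting code.

```python
# Hypothetical plotting helper: results maps model name -> average inference
# time in ms; throughput is derived assuming batch size 1.
import matplotlib.pyplot as plt

def plot_benchmarks(results: dict, out_path: str = "./inference/plot.png") -> None:
    names = list(results.keys())
    avg_ms = [results[n] for n in names]
    throughput = [1000.0 / ms for ms in avg_ms]  # images per second (batch size 1)

    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))
    ax1.barh(names, avg_ms)
    ax1.set_xlabel("Average inference time (ms)")
    ax2.barh(names, throughput)
    ax2.set_xlabel("Throughput (images/s)")
    fig.tight_layout()
    fig.savefig(out_path)

# Example call using two of the numbers reported below.
plot_benchmarks({"PyTorch_cpu": 31.93, "TRT_fp16": 0.75})
```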
#1: 15% Egyptian cat
#2: 14% tiger cat
#3: 9% tabby
#4: 2% doormat
#5: 2% lynx
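Top-K percentages like the ones above are typically obtained by applying softmax to the model's logits and taking the K highest probabilities. A minimal sketch (the model, the preprocessed input tensor, and the ImageNet class-name list are placeholders supplied by the caller):

```python
# Sketch of printing top-K predictions as percentages, as in the output above.
import torch

def print_topk(model: torch.nn.Module, input_tensor: torch.Tensor,
               categories: list, k: int = 5) -> None:
    with torch.no_grad():
        logits = model(input_tensor)              # shape: [1, num_classes]
        probs = torch.softmax(logits[0], dim=0)   # logits -> probabilities
        top_prob, top_idx = torch.topk(probs, k)  # highest-probability classes
    for rank, (p, idx) in enumerate(zip(top_prob, top_idx), start=1):
        print(f"#{rank}: {100 * p.item():.0f}% {categories[idx.item()]}")
```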
- PyTorch_cpu: 31.93 ms, the average batch time when running the PyTorch model on the CPU device.
- PyTorch_cuda: 5.70 ms, the average batch time when running the PyTorch model on the CUDA device.
- TRT_fp32: 1.69 ms, the average batch time when running the model with TensorRT using float32 precision.
- TRT_fp16: 0.75 ms, the average batch time when running the model with TensorRT using float16 precision.
- ONNX: 16.25 ms, the average batch inference time when running the PyTorch model converted to ONNX on the CPU device.
- OpenVINO: 15.00 ms, the average batch inference time when running the ONNX model converted to OpenVINO on the CPU device.
Here you can see the flow for each model and benchmark.
In the provided code, we perform inference with the native PyTorch framework on both CPU and GPU (CUDA) configurations. This serves as the baseline for comparing the performance improvements gained from the other optimization techniques.
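A rough sketch of that baseline is shown below; loading the model from torchvision and the device handling are illustrative assumptions, not the project's exact code.

```python
# Illustrative PyTorch baseline on CPU or CUDA (torchvision >= 0.13 weights API).
import torch
from torchvision import models

device = "cuda" if torch.cuda.is_available() else "cpu"
model = models.resnet50(weights=models.ResNet50_Weights.DEFAULT).to(device).eval()

dummy = torch.randn(1, 3, 224, 224, device=device)  # one 224x224 RGB image
with torch.no_grad():
    logits = model(dummy)
print(logits.shape)  # torch.Size([1, 1000])
```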
TensorRT offers significant performance improvements by optimizing the neural network model. This code uses TensorRT's capabilities to run benchmarks in FP32 (single precision) and FP16 (half precision) modes.
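One possible route from PyTorch to a TensorRT engine is the Torch-TensorRT compiler, sketched below; whether this project uses that route or builds engines through TensorRT's ONNX parser is an assumption made here, not a statement about its implementation.

```python
# Hedged sketch: compiling the model with Torch-TensorRT for FP16 inference.
import torch
import torch_tensorrt
from torchvision import models

model = models.resnet50(weights=models.ResNet50_Weights.DEFAULT).cuda().eval()

trt_model = torch_tensorrt.compile(
    model,
    inputs=[torch_tensorrt.Input((1, 3, 224, 224))],
    enabled_precisions={torch.half},  # {torch.float} would build the FP32 engine
)

with torch.no_grad():
    out = trt_model(torch.randn(1, 3, 224, 224, device="cuda"))
print(out.shape)  # torch.Size([1, 1000])
```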
The code includes an exporter that converts the PyTorch ResNet-50 model to ONNX format so that it can be run with ONNX Runtime, providing a flexible, cross-platform way to deploy the model.
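A minimal sketch of that export-and-run path follows; the file name, input/output names, and opset version are assumptions.

```python
# Sketch: export the model to ONNX and run it with ONNX Runtime on CPU.
import numpy as np
import onnxruntime as ort
import torch
from torchvision import models

model = models.resnet50(weights=models.ResNet50_Weights.DEFAULT).eval()
torch.onnx.export(
    model, torch.randn(1, 3, 224, 224), "resnet50.onnx",
    input_names=["input"], output_names=["output"], opset_version=17,
)

session = ort.InferenceSession("resnet50.onnx", providers=["CPUExecutionProvider"])
x = np.random.rand(1, 3, 224, 224).astype(np.float32)
outputs = session.run(None, {"input": x})
print(outputs[0].shape)  # (1, 1000)
```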
OpenVINO is a toolkit from Intel that optimizes deep learning model inference for Intel CPUs, GPUs, and other hardware. We convert the ONNX model to OpenVINO's format in the code and then run benchmarks using the OpenVINO runtime.
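The sketch below shows one way to run the exported model with the OpenVINO runtime on CPU. Unlike the project, which converts the ONNX model to OpenVINO's format, this sketch reads the ONNX file directly (which OpenVINO also supports); the file name and the use of a recent (2023+) openvino Python API are assumptions.

```python
# Hedged sketch: load the ONNX model with OpenVINO and run it on the CPU device.
import numpy as np
import openvino as ov

core = ov.Core()
ov_model = core.read_model("resnet50.onnx")   # OpenVINO can read ONNX directly
compiled = core.compile_model(ov_model, "CPU")

x = np.random.rand(1, 3, 224, 224).astype(np.float32)
result = compiled([x])[compiled.output(0)]    # run inference, take the first output
print(result.shape)  # (1, 1000)
```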