DimaBir / ResNetTensorRT

Optimized Inference with ResNet-50: A demonstration of inference performance using PyTorch, TensorRT, ONNX, and OpenVINO. Includes benchmarks, predictions, and model exporters.
GNU General Public License v3.0

Table of Contents

  1. Overview
  2. Requirements
  3. Results
  4. Benchmark Implementation Details
  5. Author
  6. References

Overview

This project showcases inference with PyTorch CNN models such as ResNet-50, EfficientNet, and MobileNet, and their optimization using ONNX, OpenVINO, and NVIDIA TensorRT. The script runs inference on a user-specified image and displays the top-K predictions. Benchmarking covers configurations such as PyTorch CPU, ONNX CPU, OpenVINO CPU, PyTorch CUDA, TensorRT-FP32, and TensorRT-FP16.

The project is Dockerized for easy deployment:

  1. CPU-only Deployment - Suitable for non-GPU systems (supports PyTorch CPU, ONNX CPU, and OpenVINO CPU models only).
  2. GPU Deployment - Optimized for NVIDIA GPUs (supports all models: PyTorch CPU, ONNX CPU, OpenVINO CPU, PyTorch CUDA, TensorRT-FP32, and TensorRT-FP16).

See the Steps to Run section for Docker instructions.

Requirements

Steps to Run

Building the Docker Image

  1. CPU-only Deployment:

    docker build -t cpu_img .

    Running:

    docker run -it --rm cpu_img /bin/bash
  2. GPU (CUDA) Deployment:

    docker build --build-arg ENVIRONMENT=gpu --build-arg BASE_IMAGE=nvcr.io/nvidia/tensorrt:23.08-py3 -t gpu_img .

    Running:

    docker run --gpus all -it --rm gpu_img

Run the Script inside the Container

python main.py [--mode all]

Arguments

Example Command

python main.py --topk 3 --mode=all --image_path="./inference/cat3.jpg"

This command runs predictions on the chosen image (./inference/cat3.jpg), shows the top 3 predictions, and benchmarks all available models. Note: the comparison plot is created only for --mode=all; results are plotted and saved to ./inference/plot.png.
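
For context, the top-K output corresponds to the standard PyTorch classification pattern sketched below. This is a minimal illustration assuming torchvision's pretrained ResNet-50 and standard ImageNet preprocessing, not the project's exact code:

    import torch
    from PIL import Image
    from torchvision import models, transforms

    # Standard ImageNet preprocessing (assumed to match the project's pipeline).
    preprocess = transforms.Compose([
        transforms.Resize(256),
        transforms.CenterCrop(224),
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.485, 0.456, 0.406],
                             std=[0.229, 0.224, 0.225]),
    ])

    model = models.resnet50(weights=models.ResNet50_Weights.DEFAULT).eval()
    image = preprocess(Image.open("./inference/cat3.jpg")).unsqueeze(0)

    with torch.no_grad():
        probs = torch.softmax(model(image), dim=1)

    # Top-3 class probabilities and indices, mirroring the --topk 3 output.
    top = torch.topk(probs, k=3)
    for p, idx in zip(top.values[0], top.indices[0]):
        print(f"{p.item():.0%}  class {idx.item()}")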

Results

Example Input

Here is an example of the input image to run predictions and benchmarks on:

Plot details:

  1. Average Inference Time: This plot showcases the average time taken for inference across different model types and optimization techniques. The y-axis represents the model type (e.g., PyTorch CPU, TensorRT FP16, etc.), and the x-axis represents the average inference time in milliseconds. The shorter the bar, the faster the inference time.

  2. Throughput: This plot compares the throughput achieved by different model types, measured as the number of images processed per second. The y-axis represents the model type, and the x-axis represents the throughput. A longer bar indicates higher throughput, meaning the model can process more images in a given time frame.

These plots offer a comprehensive view of the performance improvements achieved by various inference optimization techniques, especially when leveraging TensorRT with different precision types like FP16 and FP32.

CPU Results

Prediction results

#1: 15% Egyptian cat
#2: 14% tiger cat
#3: 9% tabby
#4: 2% doormat
#5: 2% lynx

PC Setup (Linux)

GPU (CUDA) Results

Inference Benchmark Results

Results explanation

Prediction results

#1: 15% Egyptian cat
#2: 14% tiger cat
#3: 9% tabby
#4: 2% doormat
#5: 2% lynx

PC Setup

CPU Results (M1 Pro)

Prediction results

#1: 15% Egyptian cat
#2: 14% tiger cat
#3: 9% tabby
#4: 2% doormat
#5: 2% lynx

M1 Pro Setup

Benchmark Implementation Details

Here you can see the flow for each model and benchmark.

PyTorch CPU & CUDA

In the provided code, inference is performed with the native PyTorch framework on both CPU and GPU (CUDA) configurations. This serves as the baseline against which the performance gains of the other optimization techniques are measured (a minimal timing sketch follows the flow below).

Flow:

  1. The ResNet-50 model is loaded from torchvision and, if available, transferred to the GPU.
  2. Inference is performed on the provided image using the specified model.
  3. Benchmark results, including average inference time, are logged for the CPU and CUDA setups.
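
A minimal sketch of such a baseline timing loop is shown below; the warm-up count, iteration count, and input shape are illustrative assumptions, not the project's exact settings:

    import time
    import torch
    from torchvision import models

    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model = models.resnet50(weights=models.ResNet50_Weights.DEFAULT).eval().to(device)
    dummy = torch.randn(1, 3, 224, 224, device=device)

    with torch.no_grad():
        for _ in range(10):               # warm-up iterations
            model(dummy)
        if device.type == "cuda":
            torch.cuda.synchronize()      # flush queued GPU work before timing
        start = time.perf_counter()
        for _ in range(100):
            model(dummy)
        if device.type == "cuda":
            torch.cuda.synchronize()
        elapsed = time.perf_counter() - start

    print(f"{device}: {elapsed / 100 * 1e3:.2f} ms/image, {100 / elapsed:.1f} images/s")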

TensorRT FP32 & FP16

TensorRT delivers significant performance improvements by optimizing the neural network model. The code uses TensorRT to run benchmarks in both FP32 (single-precision) and FP16 (half-precision) modes; a conversion sketch follows the flow below.

Flow:

  1. Load the ResNet-50 model.
  2. Convert the PyTorch model to TensorRT format with the specified precision.
  3. Perform inference on the provided image.
  4. Log the benchmark results for the specified TensorRT precision mode.
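
One common way to implement step 2 is Torch-TensorRT (the torch_tensorrt package); the sketch below is an illustration under that assumption, and the repository's converter may take a different path:

    import torch
    import torch_tensorrt  # assumed installed (pip install torch-tensorrt)
    from torchvision import models

    model = models.resnet50(weights=models.ResNet50_Weights.DEFAULT).eval().cuda()

    # Compile the model into a TensorRT-backed module.
    # Use {torch.float} for FP32 or {torch.half} for FP16.
    trt_model = torch_tensorrt.compile(
        model,
        inputs=[torch_tensorrt.Input((1, 3, 224, 224), dtype=torch.float32)],
        enabled_precisions={torch.half},
    )

    x = torch.randn(1, 3, 224, 224, device="cuda")
    with torch.no_grad():
        out = trt_model(x)
    print(out.shape)  # torch.Size([1, 1000])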

ONNX

The code includes an exporter that converts the PyTorch ResNet-50 model to ONNX format so it can be run with ONNX Runtime, providing a flexible, cross-platform deployment path (an export sketch follows the flow below).

Flow:

  1. The ResNet-50 model is loaded.
  2. Using the ONNX exporter utility, the PyTorch model is converted to ONNX format.
  3. An ONNX Runtime session is created.
  4. Inference is performed on the provided image using the ONNX model.
  5. Benchmark results are logged for the ONNX model.
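
A minimal sketch of the export-and-run flow with torch.onnx and ONNX Runtime; the file name and opset version here are illustrative choices, not necessarily the project's:

    import onnxruntime as ort
    import torch
    from torchvision import models

    model = models.resnet50(weights=models.ResNet50_Weights.DEFAULT).eval()
    dummy = torch.randn(1, 3, 224, 224)

    # Export the PyTorch model to ONNX.
    torch.onnx.export(model, dummy, "resnet50.onnx",
                      input_names=["input"], output_names=["output"],
                      opset_version=17)

    # Create a CPU ONNX Runtime session and run one inference.
    session = ort.InferenceSession("resnet50.onnx",
                                   providers=["CPUExecutionProvider"])
    logits = session.run(None, {"input": dummy.numpy()})[0]
    print(logits.shape)  # (1, 1000)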

OpenVINO

OpenVINO is an Intel toolkit that optimizes deep learning inference for Intel CPUs, GPUs, and other hardware. The code converts the ONNX model to OpenVINO's format and runs benchmarks with the OpenVINO runtime, as sketched after the flow below.

Flow:

  1. The ONNX model (created in the previous step) is loaded.
  2. Convert the ONNX model to OpenVINO's IR format.
  3. Create an inference engine using OpenVINO's runtime.
  4. Perform inference on the provided image using the OpenVINO model.
  5. Benchmark results, including average inference time, are logged for the OpenVINO model.
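
A minimal sketch using the OpenVINO Python API (2023+); the ONNX file name is assumed from the previous step, and the project may drive the conversion differently:

    import numpy as np
    import openvino as ov  # OpenVINO 2023+ Python API

    # Convert the ONNX model from the previous step to an OpenVINO model.
    # ov.save_model(ov_model, "resnet50.xml") would write the IR files to disk.
    ov_model = ov.convert_model("resnet50.onnx")

    # Compile for the CPU device and run a single inference.
    compiled = ov.compile_model(ov_model, device_name="CPU")
    dummy = np.random.rand(1, 3, 224, 224).astype(np.float32)
    result = compiled(dummy)[compiled.output(0)]
    print(result.shape)  # (1, 1000)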

Author

DimaBir

References