NVIDIA / TensorRT-Model-Optimizer

TensorRT Model Optimizer is a unified library of state-of-the-art model optimization techniques such as quantization, pruning, distillation, etc. It compresses deep learning models for downstream deployment frameworks like TensorRT-LLM or TensorRT to optimize inference speed on NVIDIA GPUs.
https://nvidia.github.io/TensorRT-Model-Optimizer
# NVIDIA TensorRT Model Optimizer

#### A Library to Quantize and Compress Deep Learning Models for Optimized Inference on GPUs

[![Documentation](https://img.shields.io/badge/Documentation-latest-brightgreen.svg?style=flat)](https://nvidia.github.io/TensorRT-Model-Optimizer) [![version](https://img.shields.io/pypi/v/nvidia-modelopt?label=Release)](https://pypi.org/project/nvidia-modelopt/) [![license](https://img.shields.io/badge/License-MIT-blue)](./LICENSE)

[Examples](#examples) | [Documentation](https://nvidia.github.io/TensorRT-Model-Optimizer) | [Benchmark Results](./benchmark.md) | [Roadmap](https://github.com/NVIDIA/TensorRT-Model-Optimizer/issues/108) | [ModelOpt-Windows](./examples/windows/README.md)

Model Optimizer Overview

Minimizing inference costs is a significant challenge as generative AI models continue to grow in complexity and size. The NVIDIA TensorRT Model Optimizer (referred to as Model Optimizer, or ModelOpt) is a library of state-of-the-art model optimization techniques, including quantization, sparsity, distillation, and pruning, for compressing models. It accepts a torch or ONNX model as input and provides Python APIs that let users stack different model optimization techniques to produce an optimized quantized checkpoint. Integrated within the NVIDIA AI software ecosystem, the quantized checkpoint generated by Model Optimizer is ready for deployment in downstream inference frameworks like TensorRT-LLM or TensorRT. ModelOpt is also integrated with NVIDIA NeMo and Megatron-LM for training-in-the-loop optimization techniques. For enterprise users, 8-bit quantization with Stable Diffusion is also available on NVIDIA NIM.

Model Optimizer for both Linux and Windows is available for free to all developers on NVIDIA PyPI. This repository is for sharing examples and GPU-optimized recipes, as well as collecting feedback from the community.

Installation / Docker

The easiest way to get started with Model Optimizer and its additional dependencies (e.g. TensorRT-LLM deployment) is to start from our Docker image.

After installing the NVIDIA Container Toolkit, run the following commands to build the Model Optimizer Docker container, which has all the necessary dependencies pre-installed for running the examples.

# Clone the ModelOpt repository
git clone https://github.com/NVIDIA/TensorRT-Model-Optimizer.git
cd TensorRT-Model-Optimizer

# Build the docker (will be tagged `docker.io/library/modelopt_examples:latest`)
# You may customize `docker/Dockerfile` to include or exclude certain dependencies you may or may not need.
bash docker/build.sh

# Run the docker image
docker run --gpus all -it --shm-size 20g --rm docker.io/library/modelopt_examples:latest bash

# Check installation (inside the docker container)
python -c "import modelopt; print(modelopt.__version__)"

See the installation guide for more details on alternate pre-built docker images or installation in a local environment.

NOTE: Unless specified otherwise, all example READMEs assume that you run the examples inside the ModelOpt Docker image described above. If you are not using the ModelOpt Docker image, the example-specific dependencies must be installed separately from each example's requirements.txt file.

Techniques

Quantization

Quantization is an effective model optimization technique for large models. Quantization with Model Optimizer can compress model size by 2x-4x, speeding up inference while preserving model quality. Model Optimizer enables highly performant quantization formats, including FP8, INT8, and INT4, and supports advanced algorithms such as SmoothQuant, AWQ, and Double Quantization with easy-to-use Python APIs. Both post-training quantization (PTQ) and quantization-aware training (QAT) are supported.
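To make the idea concrete, here is a minimal, library-free sketch of symmetric per-tensor INT8 quantization. It is not Model Optimizer's API or its calibration algorithm; the function names `quantize_int8` and `dequantize` are hypothetical and exist only for illustration. Storing each weight as one INT8 byte instead of a 4-byte FP32 value is one way a 4x size reduction can arise.

```python
def quantize_int8(weights):
    """Symmetric per-tensor INT8 quantization (illustrative helper, not a ModelOpt API).

    Returns the integer codes and the scale needed to map them back to floats.
    """
    amax = max(abs(w) for w in weights)
    # Map the largest magnitude to 127; fall back to 1.0 for an all-zero tensor.
    scale = amax / 127.0 if amax > 0 else 1.0
    codes = [max(-127, min(127, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate float values from integer codes."""
    return [c * scale for c in codes]


weights = [0.5, -1.27, 0.02, 1.0]
q, scale = quantize_int8(weights)          # q == [50, -127, 2, 100]
restored = dequantize(q, scale)
# Rounding to the nearest code bounds the per-weight error by half a step.
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q, scale, max_error)
```

Real quantization flows typically choose scales per channel or per block and calibrate them on sample data; this sketch uses a single scale for the whole tensor to keep the arithmetic visible.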

Sparsity

Sparsity is a technique to further reduce the memory footprint of deep learning models and accelerate inference. Model Optimizer provides Python APIs to apply weight sparsity to a given model. It also supports the NVIDIA 2:4 sparsity pattern and various sparsification methods, such as NVIDIA ASP and SparseGPT.
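The 2:4 pattern mentioned above means that in every group of four consecutive weights, at most two are nonzero. The sketch below enforces that pattern with a simple magnitude rule (keep the two largest-magnitude weights per group). It is an illustration only: the helper name `apply_2_4_sparsity` is hypothetical, and this is not a reproduction of the ASP or SparseGPT procedures.

```python
def apply_2_4_sparsity(weights):
    """Zero the two smallest-magnitude weights in each group of four.

    Illustrative magnitude-based rule, not a ModelOpt API.
    """
    if len(weights) % 4 != 0:
        raise ValueError("length must be a multiple of 4")
    pruned = []
    for start in range(0, len(weights), 4):
        group = weights[start:start + 4]
        # Indices of the two largest-magnitude entries in this group.
        keep = sorted(range(4), key=lambda i: abs(group[i]), reverse=True)[:2]
        pruned.extend(w if i in keep else 0.0 for i, w in enumerate(group))
    return pruned


weights = [0.9, -0.1, 0.05, -0.7, 0.2, 0.3, -0.4, 0.1]
sparse = apply_2_4_sparsity(weights)
print(sparse)  # [0.9, 0.0, 0.0, -0.7, 0.0, 0.3, -0.4, 0.0]
```

Because the pattern is fixed per group of four, exactly half of the weights are zero, which is what allows hardware to skip them in a regular way.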

Pruning

Pruning reduces model size and accelerates inference by removing unnecessary weights. Model Optimizer provides Python APIs to prune Linear and Conv layers, as well as Transformer attention heads, MLP, embedding hidden size, and number of layers (depth).
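As a rough illustration of structured pruning of an MLP's hidden size, the sketch below drops hidden units whose incoming weight rows have the smallest L2 norm and removes the matching columns from the next layer so the shapes stay consistent. The helper `prune_hidden_units` and the norm-based ranking are assumptions made for this example; they are not Model Optimizer's pruning API or its selection algorithm.

```python
import math


def prune_hidden_units(w_in, w_out, keep):
    """Keep the `keep` hidden units with the largest L2 norm (illustrative only).

    w_in:  hidden x in_features  (one row per hidden unit)
    w_out: out_features x hidden (one column per hidden unit)
    """
    norms = [math.sqrt(sum(v * v for v in row)) for row in w_in]
    ranked = sorted(range(len(w_in)), key=lambda i: norms[i], reverse=True)
    kept = sorted(ranked[:keep])  # preserve the original unit order
    new_w_in = [w_in[i] for i in kept]
    # Drop the columns of the next layer that consumed the removed units.
    new_w_out = [[row[i] for i in kept] for row in w_out]
    return new_w_in, new_w_out, kept


w_in = [[1.0, 2.0], [0.1, 0.0], [0.0, -3.0], [0.2, 0.1]]
w_out = [[1.0, 5.0, 2.0, 7.0]]
small_in, small_out, kept = prune_hidden_units(w_in, w_out, keep=2)
print(kept, small_in, small_out)  # [0, 2] [[1.0, 2.0], [0.0, -3.0]] [[1.0, 2.0]]
```

Unlike the sparsity example, this changes the layer shapes themselves, so the pruned model is smaller and faster without any special sparse kernels.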

Distillation

Knowledge distillation can increase the accuracy and/or convergence speed of a desired model architecture by using a more powerful model's learned features to guide a student model's objective function into imitating it.
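One common way to make a student imitate a teacher is to penalize the divergence between their temperature-softened output distributions. The sketch below computes that loss in plain Python; the function names and the default temperature are assumptions for illustration, and this is not presented as the loss Model Optimizer uses.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return kl * temperature ** 2


teacher = [4.0, 1.0, 0.5]
matching_student = [4.0, 1.0, 0.5]
poor_student = [0.5, 1.0, 4.0]
loss_match = distillation_loss(matching_student, teacher)
loss_poor = distillation_loss(poor_student, teacher)
print(loss_match, loss_poor)
```

The loss is zero when the student reproduces the teacher's distribution and grows as the two diverge; in training it is usually combined with the ordinary task loss.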

Examples

Support Matrix

Benchmark

Please find the benchmarks in benchmark.md.

Quantized Checkpoints

Quantized checkpoints on the Hugging Face model hub are ready for TensorRT-LLM and vLLM deployments. More models are coming soon.

Roadmap

Please see our product roadmap: https://github.com/NVIDIA/TensorRT-Model-Optimizer/issues/108

Release Notes

Please see the Model Optimizer Changelog.