NVIDIA / TensorRT-Model-Optimizer

TensorRT Model Optimizer is a unified library of state-of-the-art model optimization techniques such as quantization, pruning, distillation, etc. It compresses deep learning models for downstream deployment frameworks like TensorRT-LLM or TensorRT to optimize inference speed on NVIDIA GPUs.
https://nvidia.github.io/TensorRT-Model-Optimizer

How to reproduce the Stable Diffusion acceleration with TensorRT #57

Closed luchangli03 closed 3 months ago

luchangli03 commented 3 months ago

I have tried using TensorRT-Model-Optimizer to quantize Stable Diffusion 1.5 and compared the performance with PyTorch, and I found that TensorRT can even be slower than PyTorch.

My testing environment: V100 GPU, CUDA 11.8, PyTorch 2.4, TensorRT 10.2.

The UNet time per iteration is 55 ms for PyTorch (without torch.compile), while for TensorRT it is 48 ms for FP16 and 58 ms for INT8. It's strange that INT8 is even slower than the PyTorch baseline, and neither FP16 nor INT8 gives much acceleration over PyTorch.

So which TensorRT version should I use, and can you give a more detailed workflow to reproduce the results shown in https://developer.nvidia.com/blog/accelerate-generative-ai-inference-performance-with-nvidia-tensorrt-model-optimizer-now-publicly-available/? Thank you very much.
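For reference, this is roughly what I ran, as a minimal sketch rather than the exact script (the model id, calibration prompts, input shapes, and step counts below are placeholders):

```python
# Minimal sketch of the two pieces used: the PyTorch UNet latency measurement
# and the modelopt INT8 calibration step that precedes ONNX/TensorRT export.
import torch
from diffusers import StableDiffusionPipeline
import modelopt.torch.quantization as mtq

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")
unet = pipe.unet.eval()

# --- PyTorch baseline: per-iteration UNet latency (no torch.compile) ---
sample = torch.randn(2, 4, 64, 64, dtype=torch.float16, device="cuda")
timestep = torch.tensor([500], device="cuda")
emb = torch.randn(2, 77, 768, dtype=torch.float16, device="cuda")
start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
with torch.no_grad():
    for _ in range(5):  # warm-up
        unet(sample, timestep, encoder_hidden_states=emb)
    start.record()
    for _ in range(50):
        unet(sample, timestep, encoder_hidden_states=emb)
    end.record()
torch.cuda.synchronize()
print(f"UNet latency: {start.elapsed_time(end) / 50:.1f} ms/iter")

# --- modelopt INT8 calibration before exporting the UNet to ONNX/TensorRT ---
def forward_loop(model):
    # A few full denoising runs so modelopt can collect calibration statistics.
    pipe.unet = model
    for prompt in ["a photo of a cat", "a scenic mountain landscape"]:
        pipe(prompt, num_inference_steps=20)

mtq.quantize(unet, mtq.INT8_DEFAULT_CFG, forward_loop)
```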

luchangli03 commented 3 months ago

The default TensorRT conversion can't provide much acceleration; an extra plugin, as shown in https://zhuanlan.zhihu.com/p/600216796, is needed.
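To be clear, by "default conversion" I mean the plain ONNX export plus standard TensorRT engine build, without any custom attention plugins; roughly the sketch below (model id, file paths, and opset version are placeholders):

```python
# Rough sketch of the "default" conversion path: a plain torch.onnx.export of
# the UNet followed by a standard TensorRT build, with no custom plugins.
import torch
from diffusers import UNet2DConditionModel

unet = UNet2DConditionModel.from_pretrained(
    "runwayml/stable-diffusion-v1-5", subfolder="unet", torch_dtype=torch.float16
).to("cuda").eval()

class UNetWrapper(torch.nn.Module):
    """Unpack the diffusers output dataclass so ONNX export sees a plain tensor."""
    def __init__(self, unet):
        super().__init__()
        self.unet = unet
    def forward(self, sample, timestep, encoder_hidden_states):
        return self.unet(sample, timestep, encoder_hidden_states, return_dict=False)[0]

sample = torch.randn(2, 4, 64, 64, dtype=torch.float16, device="cuda")
timestep = torch.tensor([500], device="cuda")
emb = torch.randn(2, 77, 768, dtype=torch.float16, device="cuda")

torch.onnx.export(
    UNetWrapper(unet),
    (sample, timestep, emb),
    "unet.onnx",
    input_names=["sample", "timestep", "encoder_hidden_states"],
    output_names=["latent"],
    opset_version=17,
)
# Then build a plain engine, e.g.:
#   trtexec --onnx=unet.onnx --fp16 --saveEngine=unet_fp16.plan
```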