NVIDIA / TensorRT-Model-Optimizer

TensorRT Model Optimizer is a unified library of state-of-the-art model optimization techniques such as quantization, pruning, distillation, etc. It compresses deep learning models for downstream deployment frameworks like TensorRT-LLM or TensorRT to optimize inference speed on NVIDIA GPUs.
https://nvidia.github.io/TensorRT-Model-Optimizer

How to reproduce the Stable Diffusion acceleration with TensorRT #57

Closed luchangli03 closed 3 months ago

luchangli03 commented 3 months ago

I have tried using TensorRT-Model-Optimizer to quantize Stable Diffusion 1.5 and compared the performance with PyTorch, and I found that TensorRT can even be slower than PyTorch.

My testing environment: V100 GPU, CUDA 11.8, PyTorch 2.4, TensorRT 10.2.

The UNet time per iteration is 55 ms for PyTorch (without torch.compile), while for TensorRT it is 48 ms for FP16 and 58 ms for INT8. It's strange that INT8 is even slower than the PyTorch baseline, and neither FP16 nor INT8 gives much acceleration over PyTorch.

So which TensorRT version should I use, and can you give a more detailed workflow to reproduce the results shown in https://developer.nvidia.com/blog/accelerate-generative-ai-inference-performance-with-nvidia-tensorrt-model-optimizer-now-publicly-available/? Thank you very much.
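For reference, this is roughly what I ran, as a minimal sketch rather than the exact script (the model id, calibration prompts, input shapes, and step counts below are placeholders):

```python
# Minimal sketch of the two pieces used: the PyTorch UNet latency measurement
# and the modelopt INT8 calibration step that precedes ONNX/TensorRT export.
import torch
from diffusers import StableDiffusionPipeline
import modelopt.torch.quantization as mtq

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")
unet = pipe.unet.eval()

# --- PyTorch baseline: per-iteration UNet latency (no torch.compile) ---
sample = torch.randn(2, 4, 64, 64, dtype=torch.float16, device="cuda")
timestep = torch.tensor([500], device="cuda")
emb = torch.randn(2, 77, 768, dtype=torch.float16, device="cuda")
start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
with torch.no_grad():
    for _ in range(5):  # warm-up
        unet(sample, timestep, encoder_hidden_states=emb)
    start.record()
    for _ in range(50):
        unet(sample, timestep, encoder_hidden_states=emb)
    end.record()
torch.cuda.synchronize()
print(f"UNet latency: {start.elapsed_time(end) / 50:.1f} ms/iter")

# --- modelopt INT8 calibration before exporting the UNet to ONNX/TensorRT ---
def forward_loop(model):
    # A few full denoising runs so modelopt can collect calibration statistics.
    pipe.unet = model
    for prompt in ["a photo of a cat", "a scenic mountain landscape"]:
        pipe(prompt, num_inference_steps=20)

mtq.quantize(unet, mtq.INT8_DEFAULT_CFG, forward_loop)
```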

luchangli03 commented 3 months ago

The default TensorRT conversion can't provide much acceleration; an extra plugin, as shown in https://zhuanlan.zhihu.com/p/600216796, is needed.
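To be clear, by "default conversion" I mean the plain ONNX export plus standard TensorRT engine build, without any custom attention plugins; roughly the sketch below (model id, file paths, and opset version are placeholders):

```python
# Rough sketch of the "default" conversion path: a plain torch.onnx.export of
# the UNet followed by a standard TensorRT build, with no custom plugins.
import torch
from diffusers import UNet2DConditionModel

unet = UNet2DConditionModel.from_pretrained(
    "runwayml/stable-diffusion-v1-5", subfolder="unet", torch_dtype=torch.float16
).to("cuda").eval()

class UNetWrapper(torch.nn.Module):
    """Unpack the diffusers output dataclass so ONNX export sees a plain tensor."""
    def __init__(self, unet):
        super().__init__()
        self.unet = unet
    def forward(self, sample, timestep, encoder_hidden_states):
        return self.unet(sample, timestep, encoder_hidden_states, return_dict=False)[0]

sample = torch.randn(2, 4, 64, 64, dtype=torch.float16, device="cuda")
timestep = torch.tensor([500], device="cuda")
emb = torch.randn(2, 77, 768, dtype=torch.float16, device="cuda")

torch.onnx.export(
    UNetWrapper(unet),
    (sample, timestep, emb),
    "unet.onnx",
    input_names=["sample", "timestep", "encoder_hidden_states"],
    output_names=["latent"],
    opset_version=17,
)
# Then build a plain engine, e.g.:
#   trtexec --onnx=unet.onnx --fp16 --saveEngine=unet_fp16.plan
```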