TensorRT Model Optimizer is a unified library of state-of-the-art model optimization techniques such as quantization, pruning, distillation, etc. It compresses deep learning models for downstream deployment frameworks like TensorRT-LLM or TensorRT to optimize inference speed on NVIDIA GPUs.
I have tried using TensorRT Model Optimizer to quantize Stable Diffusion 1.5 and compared the performance against PyTorch, and I found that TensorRT is actually slower than PyTorch.
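The quantization step I ran looks roughly like the sketch below, adapted from the repo's diffusers example. The config choice (`mtq.INT8_DEFAULT_CFG`), the model ID, and the calibration prompts are simplified placeholders, not the exact settings from the example:

```python
import torch
from diffusers import StableDiffusionPipeline
import modelopt.torch.quantization as mtq

# Load SD 1.5 in FP16 (model ID is a placeholder for whatever checkpoint you use).
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")
unet = pipe.unet

def forward_loop(model):
    # Run a few denoising steps on calibration prompts so the quantizer
    # can collect activation statistics. Prompts/step count are illustrative.
    pipe.unet = model
    for prompt in ["a photo of a cat"]:
        pipe(prompt, num_inference_steps=8)

# Quantize the UNet; the repo's example defines its own config,
# INT8_DEFAULT_CFG is used here only as a stand-in.
quantized_unet = mtq.quantize(unet, mtq.INT8_DEFAULT_CFG, forward_loop)
```

After this I exported the quantized UNet to ONNX and built a TensorRT engine from it.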
My testing environment: V100 GPU, CUDA 11.8, PyTorch 2.4, TensorRT 10.2.
The UNet time per iteration is 55 ms for PyTorch (FP16, no torch.compile), while TensorRT gives 48 ms for FP16 and 58 ms for INT8. It is strange that INT8 is even slower than FP16, and neither FP16 nor INT8 shows much speedup over PyTorch.
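For reference, this is roughly how I measured the per-iteration UNet time on the PyTorch side; the input shapes assume 512x512 SD 1.5, batch size 2 accounts for classifier-free guidance, and the iteration counts are illustrative:

```python
import torch

unet = pipe.unet.eval()
latent = torch.randn(2, 4, 64, 64, dtype=torch.float16, device="cuda")
timestep = torch.tensor([999], device="cuda")
text_emb = torch.randn(2, 77, 768, dtype=torch.float16, device="cuda")

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)

with torch.no_grad():
    # Warm-up iterations before timing.
    for _ in range(10):
        unet(latent, timestep, encoder_hidden_states=text_emb)
    torch.cuda.synchronize()
    start.record()
    for _ in range(100):
        unet(latent, timestep, encoder_hidden_states=text_emb)
    end.record()
    torch.cuda.synchronize()

print(f"UNet time per iteration: {start.elapsed_time(end) / 100:.1f} ms")
```

The TensorRT engines were timed the same way on identical input shapes.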
So which TensorRT version should I use, and could you give a more detailed workflow to reproduce the results shown in https://developer.nvidia.com/blog/accelerate-generative-ai-inference-performance-with-nvidia-tensorrt-model-optimizer-now-publicly-available/ ?
Thank you very much.