NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0
8.75k stars 1k forks
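
As a quick illustration of the workflow in the description above, here is a minimal sketch using the high-level Python LLM API. This assumes a recent TensorRT-LLM release where `LLM` and `SamplingParams` are exported at the top level; the model name and prompt are placeholders, and the TensorRT engine is built implicitly on first load.

```python
from tensorrt_llm import LLM, SamplingParams

def main():
    # Loading a Hugging Face model triggers the engine build under the hood.
    # The model name here is illustrative only.
    llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")

    params = SamplingParams(max_tokens=64, temperature=0.8)

    # generate() accepts a list of prompts and returns one output per prompt.
    for output in llm.generate(["Hello, my name is"], params):
        print(output.outputs[0].text)

if __name__ == "__main__":
    main()
```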

SmoothQuant support for MoE models like Mixtral #1241

Open vinod-sarvam opened 8 months ago

vinod-sarvam commented 8 months ago

Hi,

When can we expect TRT-LLM to support SmoothQuant (W8A8) quantization for MoE models like Mixtral? Is it planned on your roadmap? Clarity on this would be highly beneficial.

Tracin commented 8 months ago

Hi @vinod-sarvam, that has not been decided yet. We will support W8A8 (in FP8, not INT8) soon and can discuss INT8 SmoothQuant after that.

vinod-sarvam commented 8 months ago

Thanks @Tracin. Is FP8 already supported for Mixtral-type MoE models? If not, when is it expected?

nv-guomingz commented 1 week ago

Hi @vinod-sarvam, please try our latest code base; FP8 is now supported for MoE models.

Do you still have any further issues or questions? If not, we'll close this soon.
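
To make the suggestion above concrete, here is a hedged sketch of trying FP8 on a Mixtral-style MoE model with the Python LLM API. It assumes a recent release where `QuantConfig` and `QuantAlgo` are exposed under `tensorrt_llm.llmapi`; exact fields can differ by version, the model name, parallelism setting, and prompt are illustrative, and FP8 requires Hopper-class (SM89+) GPUs.

```python
from tensorrt_llm import LLM
from tensorrt_llm.llmapi import QuantConfig, QuantAlgo

# FP8 weights and activations (the W8A8-in-FP8 variant discussed above);
# the KV-cache setting is optional and can be dropped.
quant_config = QuantConfig(
    quant_algo=QuantAlgo.FP8,
    kv_cache_quant_algo=QuantAlgo.FP8,
)

llm = LLM(
    model="mistralai/Mixtral-8x7B-Instruct-v0.1",  # illustrative model id
    quant_config=quant_config,
    tensor_parallel_size=2,  # Mixtral 8x7B typically needs multiple GPUs
)

for out in llm.generate(["The capital of France is"]):
    print(out.outputs[0].text)
```

Calibration for FP8 is handled under the hood during the quantized engine build, so no separate calibration step appears in this sketch.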