NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

[Feature Request] Mixtral Offloading #849

Open shixianc opened 10 months ago

shixianc commented 10 months ago

There's a new caching technique described in the paper https://arxiv.org/abs/2312.17238 (github: https://github.com/dvmazur/mixtral-offloading). They introduce an LRU cache for expert weights, based on activation patterns they observed, and also use a speculative guess to pre-load experts before the next layer's computation. The results look quite promising. Can we support this for Mixtral? It would help a lot for running on smaller GPUs.
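
For reference, a minimal sketch of the two ideas in plain PyTorch (all names here, like `ExpertCache` and `guess_next_experts`, are hypothetical illustrations, not TRT-LLM or mixtral-offloading APIs): an LRU cache of expert weights on the GPU, plus a speculative prefetch that applies the next layer's router to the current hidden states to guess which experts it will select.

```python
from collections import OrderedDict

import torch


class ExpertCache:
    """LRU cache keeping a fixed number of expert weights resident on the GPU.

    Evicted experts stay in (ideally pinned) host memory and are copied
    back in on demand.
    """

    def __init__(self, cpu_experts: dict, capacity: int, device: str = "cuda"):
        self.cpu_experts = cpu_experts     # expert_id -> weight tensor on CPU
        self.capacity = capacity           # max experts resident on GPU
        self.device = device
        self.gpu_experts = OrderedDict()   # expert_id -> weight tensor on GPU

    def get(self, expert_id: int) -> torch.Tensor:
        """Return GPU weights for expert_id, loading and evicting as needed."""
        if expert_id in self.gpu_experts:
            self.gpu_experts.move_to_end(expert_id)  # mark most recently used
            return self.gpu_experts[expert_id]
        if len(self.gpu_experts) >= self.capacity:
            self.gpu_experts.popitem(last=False)     # evict least recently used
        weights = self.cpu_experts[expert_id].to(self.device, non_blocking=True)
        self.gpu_experts[expert_id] = weights
        return weights

    def prefetch(self, expert_ids) -> None:
        """Speculatively pull in experts guessed for the next layer."""
        for expert_id in expert_ids:
            self.get(expert_id)


def guess_next_experts(hidden_states, next_layer_gate, top_k: int = 2):
    """Apply the *next* layer's router to the current hidden states to guess
    which experts it will pick, so they can be prefetched ahead of time."""
    logits = next_layer_gate(hidden_states)        # [tokens, num_experts]
    return torch.unique(logits.topk(top_k, dim=-1).indices).tolist()
```

The prefetch overlaps host-to-device copies with the current layer's compute (hence `non_blocking=True`), so a correct guess hides most of the transfer latency.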

ncomly-nvidia commented 9 months ago

Thanks for highlighting this - that's a very good suggestion to save memory. We'll evaluate what it would take to support it in TRT-LLM.

shiqingzhangCSU commented 6 months ago

You may also consider MoE-Infinity, a cost-efficient mixture-of-experts (MoE) serving system that implements activation-aware expert offloading. :)
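
As a rough illustration of the general idea (not MoE-Infinity's actual implementation), "activation-aware" offloading can be pictured as tracking per-expert activation frequency and keeping only the hottest experts resident on the GPU; the `ActivationTracker` below is a hypothetical sketch under that assumption.

```python
class ActivationTracker:
    """Exponentially-decayed activation counts per expert; the serving layer
    can pin the hottest experts on GPU and offload the cold ones."""

    def __init__(self, num_experts: int, decay: float = 0.95):
        self.scores = [0.0] * num_experts
        self.decay = decay

    def record(self, expert_ids) -> None:
        # Decay old counts so recent request patterns dominate the ranking.
        self.scores = [s * self.decay for s in self.scores]
        for expert_id in expert_ids:
            self.scores[expert_id] += 1.0

    def hottest(self, k: int):
        # Expert ids with the highest recent activation frequency.
        return sorted(range(len(self.scores)), key=lambda e: -self.scores[e])[:k]
```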