Open shixianc opened 10 months ago
Thanks for highlighting this - that's a very good suggestion to save memory. We'll evaluate what it would take to support it in TRT-LLM.
You may also want to consider MoE-Infinity, a cost-efficient mixture-of-experts (MoE) serving system that implements activation-aware expert offloading. :)
There's a new caching technique described in the paper https://arxiv.org/abs/2312.17238 (GitHub: https://github.com/dvmazur/mixtral-offloading). They introduce an LRU cache that keeps experts resident on the GPU based on reuse patterns they observed, and they also make a speculative guess to pre-load the next layer's experts before its computation starts. The results look quite promising. Can we support this for Mixtral? It would help a lot with running on smaller GPUs.
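To illustrate the idea, here is a rough Python sketch of an expert LRU cache combined with speculative prefetching. All names here are hypothetical and are not the mixtral-offloading or TRT-LLM API; it only shows the caching/prefetch logic, not the actual weight management.

```python
# Minimal sketch (hypothetical names): experts live in host memory and are copied
# to the GPU on demand. An LRU cache keeps the most recently used experts resident,
# and a speculative prefetch warms the cache for the next layer before it runs.
from collections import OrderedDict


class ExpertLRUCache:
    """Keeps up to `capacity` experts resident on the GPU."""

    def __init__(self, capacity, load_fn, evict_fn):
        self.capacity = capacity
        self.load_fn = load_fn      # (layer, expert_id) -> GPU weights (H2D copy)
        self.evict_fn = evict_fn    # called when an expert is dropped from the GPU
        self.cache = OrderedDict()  # (layer, expert_id) -> GPU weights

    def get(self, layer, expert_id):
        key = (layer, expert_id)
        if key in self.cache:
            self.cache.move_to_end(key)           # mark as most recently used
            return self.cache[key]
        weights = self.load_fn(layer, expert_id)  # cache miss: copy to GPU
        self.cache[key] = weights
        if len(self.cache) > self.capacity:
            old_key, old_weights = self.cache.popitem(last=False)  # evict LRU
            self.evict_fn(old_key, old_weights)
        return weights

    def prefetch(self, layer, expert_ids):
        """Speculatively load experts predicted for an upcoming layer."""
        for expert_id in expert_ids:
            self.get(layer, expert_id)


# Hypothetical usage inside a MoE forward pass: while layer i computes, guess
# which experts layer i+1 will pick (e.g. by applying its router to the current
# hidden states) and prefetch them so the copies can overlap with compute.
def moe_forward(layers, hidden, cache, predict_next_experts):
    for i, layer in enumerate(layers):
        chosen = layer.route(hidden)                   # experts actually needed now
        experts = [cache.get(i, e) for e in chosen]
        hidden = layer.apply_experts(hidden, experts)
        if i + 1 < len(layers):
            guessed = predict_next_experts(hidden, layers[i + 1])
            cache.prefetch(i + 1, guessed)             # speculative pre-load
    return hidden
```

The key point is that a hot subset of experts stays on the GPU, and mispredicted prefetches only cost an extra transfer, so smaller GPUs only pay the offloading price on cache misses.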