SJTU-IPADS / PowerInfer

High-speed Large Language Model Serving on PCs with Consumer-grade GPUs

Combined with LLM in a flash #39

Closed. qwopqwop200 closed this issue 9 months ago.

qwopqwop200 commented 9 months ago

https://arxiv.org/abs/2312.11514 Recently, "LLM in a Flash" was proposed, a method that uses flash storage to run models larger than the available DRAM. If I understand it correctly, the two techniques could be applied together, which would make running very large models much easier.
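To make the combination concrete, here is a minimal Python sketch under my own assumptions (the file name, the sizes, and `predict_active_neurons` are made up and are not PowerInfer's or the paper's API): the FFN weights stay on flash as a memory-mapped file, a predictor guesses which ReLU neurons will fire, and only those rows are paged into DRAM.

```python
import numpy as np

# Rough sketch (not PowerInfer's or the paper's actual code): FFN weights live
# on flash/SSD as a memory-mapped file, and only the rows for neurons predicted
# to be active are read into DRAM for the current token.

HIDDEN, FFN = 4096, 11008                      # illustrative Llama-7B-like sizes
weights = np.memmap("ffn_up.f16.bin", dtype=np.float16,
                    mode="r", shape=(FFN, HIDDEN))
predictor = np.random.randn(HIDDEN, FFN).astype(np.float32)  # stand-in for a trained predictor

def predict_active_neurons(x: np.ndarray) -> np.ndarray:
    """Hypothetical activation predictor: indices of FFN neurons expected to
    produce a nonzero (ReLU) output for this input."""
    return np.nonzero(x @ predictor > 0)[0]

def sparse_ffn_up(x: np.ndarray) -> np.ndarray:
    active = predict_active_neurons(x)
    # Only these rows are read from flash; inactive neurons never leave the SSD,
    # so the DRAM working set stays far below the full matrix size.
    w_active = np.asarray(weights[active], dtype=np.float32)
    out = np.zeros(FFN, dtype=np.float32)
    out[active] = w_active @ x
    return out
```

In principle this could sit under PowerInfer's hot/cold neuron split, with the hot neurons kept resident and only the cold ones paged from flash.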

YixinSong-e commented 9 months ago

Yes, this is in our roadmap. Thank you for your attention. :)

leemgs commented 8 months ago

Can we see this task in the roadmap, e.g. https://github.com/orgs/SJTU-IPADS/projects/2/views/3? Unfortunately, I could not find a related task. ;(

YixinSong-e commented 8 months ago

> Can we see this task in the roadmap, e.g. https://github.com/orgs/SJTU-IPADS/projects/2/views/3? Unfortunately, I could not find a related task. ;(

Hello, although LLM in a flash is not currently in the roadmap, we are indeed working on it. Please stay tuned. At present, our plan is to release it together with relu-mistral-7B.

leemgs commented 8 months ago

> Please stay tuned. At present, our plan is to release it together with relu-mistral-7B.

I found a paper and a GitHub repository that look workable even without a high-performance GPU (e.g. V100, A100, H100), using Mistral-7B. The paper was uploaded to arXiv on January 13, 2024 by researchers from Yandex; the title is "Fast Inference of Mixture-of-Experts Language Models with Offloading". However, that paper deals with moving model weights between DRAM and GDDR (GPU memory), not between flash and DRAM. Still, I think activating only a subset of neurons is the common underlying technique. For example, MoE (Mixture-of-Experts) is a model architecture that activates only part of each layer for a given input.
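As a rough illustration of that common factor, here is a small Python sketch of expert offloading under my own assumptions (the LRU cache, `fetch_expert`, the sizes, and the router weights are made up, not the paper's implementation): expert weights stay offloaded in host RAM, and only the top-k experts the router picks for a token are copied into a small cache standing in for GPU memory.

```python
import numpy as np
from collections import OrderedDict

# Rough sketch (not the paper's code): expert weights stay offloaded in host RAM,
# and only the experts the router selects for the current token are copied into
# a small LRU cache that stands in for GPU memory.

HIDDEN, N_EXPERTS, TOP_K, CACHE_SIZE = 1024, 8, 2, 3
experts_in_ram = [np.random.randn(HIDDEN, HIDDEN).astype(np.float32)
                  for _ in range(N_EXPERTS)]           # "offloaded" expert weights
router_w = np.random.randn(HIDDEN, N_EXPERTS).astype(np.float32)
gpu_cache = OrderedDict()                              # expert id -> weights "on GPU"

def fetch_expert(eid: int) -> np.ndarray:
    """Return an expert's weights from the cache, loading and evicting on a miss."""
    if eid in gpu_cache:
        gpu_cache.move_to_end(eid)            # mark as most recently used
        return gpu_cache[eid]
    if len(gpu_cache) >= CACHE_SIZE:
        gpu_cache.popitem(last=False)         # evict the least recently used expert
    gpu_cache[eid] = experts_in_ram[eid].copy()   # simulate the RAM -> GPU copy
    return gpu_cache[eid]

def moe_layer(x: np.ndarray) -> np.ndarray:
    scores = x @ router_w                      # routing logits, one per expert
    top = np.argsort(scores)[-TOP_K:]          # only TOP_K experts activate
    gates = np.exp(scores[top]) / np.exp(scores[top]).sum()
    # Only the selected experts are ever transferred; the rest never move.
    return sum(g * (fetch_expert(e) @ x) for g, e in zip(gates, top))
```

If I read the paper correctly, caching recently used experts roughly like this (plus speculative prefetch) is the main trick that keeps the offloading fast enough to be usable.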