Closed: qwopqwop200 closed this issue 9 months ago
Yes, this is in our roadmap. Thank you for your attention. :)
Can we see the task in the roadmap, e.g. at https://github.com/orgs/SJTU-IPADS/projects/2/views/3? Unfortunately, I could not find a related task. ;(
Hello, although LLM in a Flash is not currently on the roadmap, we are indeed working on it. Please stay tuned. At present, our plan is to release it together with relu-mistral-7B.
I found a paper and a GitHub repository that make it feasible to run Mistral-7B-class models even without a high-performance GPU (e.g. V100, A100, H100). The paper was uploaded to arXiv on January 13, 2024 by researchers affiliated with Yandex, titled "Fast Inference of Mixture-of-Experts Language Models with Offloading". That paper addresses offloading between DRAM and GPU memory (GDDR), not Flash-to-DRAM, but I am thinking about active neurons, which is the common underlying idea: MoE (Mixture-of-Experts) is a model architecture that activates only a subset of experts in each layer for a given input (see the sketch below).
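To make the "only part of the layer is activated" point concrete, here is a minimal top-k MoE routing sketch. It is illustrative only, not code from either paper or repository; all class names, sizes, and the expert count are made-up assumptions.

```python
import torch
import torch.nn as nn


class TinyMoELayer(nn.Module):
    """Toy top-k Mixture-of-Experts feed-forward layer (illustrative only)."""

    def __init__(self, d_model=64, d_ff=256, num_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, num_experts)
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
             for _ in range(num_experts)]
        )

    def forward(self, x):                               # x: (tokens, d_model)
        scores = self.router(x)                         # (tokens, num_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)  # each token picks k experts
        weights = weights.softmax(dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e in idx[:, slot].unique().tolist():    # only the selected experts ever run
                mask = idx[:, slot] == e
                out[mask] += weights[mask, slot].unsqueeze(1) * self.experts[e](x[mask])
        return out


x = torch.randn(4, 64)
print(TinyMoELayer()(x).shape)                          # torch.Size([4, 64])
```

Because each token routes to only `top_k` of the `num_experts` feed-forward blocks, most expert weights are untouched for any given input, which is what makes offloading the inactive ones to slower memory attractive.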
https://arxiv.org/abs/2312.11514 Recently, "LLM in a Flash" was proposed, a method that uses flash memory to run models that exceed available DRAM. If I'm right, I think these technologies can be applied simultaneously. If that were possible, it would make running very large models much easier.
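For intuition on the flash-offloading side, here is a rough sketch, not the paper's implementation: the large weight matrix stays memory-mapped on flash/SSD, and only the rows predicted to be active are copied into DRAM before the matmul. The file name, shapes, and the predictor are placeholder assumptions.

```python
import numpy as np

d_model, d_ff = 64, 256
rng = np.random.default_rng(0)

# Pretend this file is a large FFN weight matrix stored on flash/SSD (hypothetical file).
np.save("ffn_up.npy", rng.standard_normal((d_ff, d_model)).astype(np.float32))
w_up = np.load("ffn_up.npy", mmap_mode="r")      # memory-mapped: not read into DRAM yet

# A small predictor that fits in DRAM; real systems train one to guess which
# neurons will be active, here it is just random for illustration.
predictor = rng.standard_normal((d_ff, 8)).astype(np.float32)

def active_rows(x, k=32):
    scores = np.abs(predictor @ x[:8])           # cheap score per neuron, no flash reads
    return np.argsort(scores)[-k:]               # keep only the top-k "active" neurons

x = rng.standard_normal(d_model).astype(np.float32)
rows = active_rows(x)
dense = np.asarray(w_up[rows])                   # copy only those rows from flash into DRAM
partial = np.maximum(dense @ x, 0.0)             # ReLU output over the active neurons only
print(partial.shape)                             # (32,)
```

The same "predict which neurons/experts are active, then fetch only those" step is what would let the MoE-offloading and Flash-to-DRAM approaches be combined, if the projects decide to pursue both.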