Open YorkieDev opened 3 months ago
MoE looks promising. Any word on how complex it is to add support for?
Is someone working on it? :pray:
Vision especially would be worth it, but I lack the knowledge to do something like this.
Yes, the vision model is surprisingly good. Having it in GGUF format under llama.cpp would open up enormous possibilities.
ChatLLM.cpp supports the Phi-3.5 MoE model now.
For developers: the MoE sparse MLP is a little different from the one used in Mixtral (a rough sketch of the Mixtral-style baseline it differs from follows below the project description).
https://github.com/foldl/chatllm.cpp
Inference of a bunch of models from less than 1B to more than 300B, for real-time chatting with RAG on your computer (CPU), pure C++ implementation based on @ggerganov's ggml.
| Supported Models | Download Quantized Models |
What's New:
2024-08-28: Phi-3.5 Mini & MoE
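For orientation, here is a minimal numpy sketch of a Mixtral-style top-2 sparse MoE MLP forward pass, i.e. the baseline that Phi-3.5 MoE's block differs from. This is not chatllm.cpp or llama.cpp code, and the names are placeholders; it only shows the overall shape (route, select the top-2 experts, mix their FFN outputs). Phi-3.5 MoE reportedly uses the GRIN/sparsemixer router discussed further down the thread, which computes the two gate weights differently.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def moe_mlp(hidden, gate_w, experts, top_k=2):
    """hidden: [tokens, d_model]; gate_w: [d_model, n_experts];
    experts: list of per-expert FFN callables (e.g. SwiGLU blocks)."""
    logits = hidden @ gate_w                        # router logits [tokens, n_experts]
    weights = softmax(logits)                       # Mixtral-style: softmax over all experts
    top = np.argsort(-weights, axis=-1)[:, :top_k]  # top-k expert indices per token
    out = np.zeros_like(hidden)
    for t in range(hidden.shape[0]):
        w = weights[t, top[t]]
        w = w / w.sum()                             # renormalize over the selected experts
        for k in range(top_k):
            out[t] += w[k] * experts[top[t][k]](hidden[t])
    return out
```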
https://huggingface.co/microsoft/Phi-3.5-MoE-instruct/discussions/4
microsoft/Phi-3.5-MoE-instruct · convert to gguf
Pretty sad to see no support for Phi 3.5 MoE in llama.cpp. Sure, it might have dry writing and is very censored, but in assistant tasks it's much better than all the smaller models combined. It truly has 70B quality in just 6.6B active parameters, so it's much easier to run than even G2 27B (which it beats according to benchmarks).
@Dampfinchen, have you found any way to run Phi 3.5 MoE locally? I'm open to trying out alternatives to llama.cpp.
Also eager to get Phi 3.5-Vision support. Most accurate photo and screenshot descriptions I've seen so far.
@Dampfinchen @sourceholder @arnesund if you are interested in running Phi 3.5 MoE or Phi 3.5 vision with alternatives to llama.cpp, perhaps you could check out mistral.rs.
Just a quick description:
We have support for Phi 3.5 MoE (docs & example: https://github.com/EricLBuehler/mistral.rs/blob/master/docs/PHI3.5MOE.md) and Phi 3.5 vision (docs & examples: https://github.com/EricLBuehler/mistral.rs/blob/master/docs/PHI3V.md).
All models can be run with CUDA, Metal, or CPU SIMD acceleration. We have Flash Attention and PagedAttention support for increased inference performance, and support in-situ quantization in GGUF and HQQ formats.
If you are using the OpenAI API, you can use the provided OpenAI-compatible HTTP server (a superset: we have things like min-p, DRY, etc.). There is also a Python package. For Phi 3.5 MoE and other text models, there is also an interactive chat mode.
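If anyone wants to poke at it through the OpenAI-compatible server, here is a minimal client sketch. The base_url/port and the model id are assumptions on my part; use whatever you actually launch the server with (see the mistral.rs docs above).

```python
# Minimal sketch against an OpenAI-compatible endpoint; the URL, port and model id
# below are placeholders, not mistral.rs defaults.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="not-needed")
resp = client.chat.completions.create(
    model="microsoft/Phi-3.5-MoE-instruct",  # placeholder model id
    messages=[{"role": "user", "content": "Explain what a sparse MoE layer does."}],
    temperature=0.7,
)
print(resp.choices[0].message.content)
```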
Thank you, but I and many others would rather wait for official support.
I wonder what's the holdup? Shouldn't it be possible to copy a lot of the code from Mixtral to Phi 3.5 MoE, given they have a pretty similar architecture with two active experts per token?
No one's taken the task up yet, sadly. There's presently work being done on Phi-3.5 Vision Instruct though, which is something to look forward to considering the reported vision understanding the model has.
phi-3.5-moe-instruct gguf for llama.cpp???
Bumping up thread. :)
Strange that no one is looking into this. Phi-3.5 MoE currently seems to be the best model that can run on a consumer-grade CPU.
@vaibhav1618, FYI - Deepseek V2 Lite (16B) is another good MoE model. 2.4B activated params.
Phi-3.5 MoE seems to be based on https://huggingface.co/microsoft/GRIN-MoE/tree/main. Maybe their technical report at https://arxiv.org/abs/2409.12136 can help with identifying differences to other MoE architectures, which should ease adoption in llama.cpp.
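One quick way to start identifying those differences (my own sketch, not something from the report): dump the MoE-related fields of both HF configs side by side. The key list below is a guess at the relevant fields; attributes a config doesn't have simply print None, and the Mixtral repo may require accepting its license on Hugging Face first.

```python
# Compare MoE-related config fields of Phi-3.5 MoE and Mixtral; the key list is a guess.
from transformers import AutoConfig

keys = ["num_local_experts", "num_experts_per_tok", "num_hidden_layers",
        "hidden_size", "intermediate_size", "num_attention_heads",
        "num_key_value_heads", "rope_theta", "router_jitter_noise"]

for repo in ["microsoft/Phi-3.5-MoE-instruct", "mistralai/Mixtral-8x7B-Instruct-v0.1"]:
    cfg = AutoConfig.from_pretrained(repo, trust_remote_code=True)
    print(repo)
    for k in keys:
        print(f"  {k} = {getattr(cfg, k, None)}")
```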
There's presently work being done on Phi-3.5 Vision Instruct though, which is something to look forward to considering the reported vision understanding the model has
I'm wondering where the work on Phi-3.5 Vision Instruct is being done? Many thanks!
@EricLBuehler, can you recommend a frontend app to use with mistral.rs?
The PR in the transformers repo to support Phi-3.5 MoE has been merged and is featured in release v4.46.0, so maybe llama.cpp can finally add this model architecture?
Oh, and by the way, I just found the documentation for how to add a new model to llama.cpp, after having followed this repo for months now, lol. Here are the docs: https://github.com/ggerganov/llama.cpp/blob/master/docs/development/HOWTO-add-model.md
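From skimming that HOWTO, the Python side of a port boils down to registering the architecture in convert_hf_to_gguf.py, roughly like the sketch below. This is only an illustration of the pattern, not a working port: MODEL_ARCH.PHIMOE doesn't exist yet and would have to be added to gguf-py, the tensor remapping is omitted, and the C++ graph code in llama.cpp itself is the bigger missing piece.

```python
# Illustration of the registration pattern inside llama.cpp's convert_hf_to_gguf.py;
# MODEL_ARCH.PHIMOE is a placeholder that would first need to be defined in gguf-py.
@Model.register("PhiMoEForCausalLM")        # architecture name from the HF config.json
class PhiMoEModel(Model):
    model_arch = gguf.MODEL_ARCH.PHIMOE     # placeholder enum value

    def set_gguf_parameters(self):
        super().set_gguf_parameters()
        # expert counts come straight from the HF config (16 experts, 2 used per token)
        self.gguf_writer.add_expert_count(self.hparams["num_local_experts"])
        self.gguf_writer.add_expert_used_count(self.hparams["num_experts_per_tok"])

    # modify_tensors() would additionally need to stack the per-expert FFN weights
    # into the fused expert tensors llama.cpp expects (as existing MoE classes do).
```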
+1 for MoE support.
+1 for Phi 3.5-Vision support.
Prerequisites
Feature Description
Microsoft has recently dropped two new models in the Phi Family.
3.5 MoE: https://huggingface.co/microsoft/Phi-3.5-MoE-instruct
3.5 Vision: https://huggingface.co/microsoft/Phi-3.5-vision-instruct
It would be nice to see support added to llama.cpp for these two models.
Motivation
Supporting all model releases so the wider community can enjoy these great free models.
Possible Implementation
No response