Open YorkieDev opened 3 months ago
MoE looks promising. Any word on how complex it is to add support for?
Is someone working on it? :pray:
Vision especially would be worth it, but I lack the knowledge to do something like this.
Yes, the vision model is surprisingly good. Having it in GGUF format under llama.cpp would open up enormous possibilities.
ChatLLM.cpp supports the Phi-3.5 MoE model now.
For developers: the MoE sparse MLP is a little different from the one used in Mixtral (a rough sketch of the Mixtral-style baseline it differs from follows below the project description).
https://github.com/foldl/chatllm.cpp
Inference of a bunch of models from less than 1B to more than 300B, for real-time chatting with RAG on your computer (CPU), pure C++ implementation based on @ggerganov's ggml.
| Supported Models | Download Quantized Models |
What's New:
2024-08-28: Phi-3.5 Mini & MoE
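For orientation, here is a minimal numpy sketch of a Mixtral-style top-2 sparse MoE MLP forward pass, i.e. the baseline that Phi-3.5 MoE's block differs from. This is not chatllm.cpp or llama.cpp code, and the names are placeholders; it only shows the overall shape (route, select the top-2 experts, mix their FFN outputs). Phi-3.5 MoE reportedly uses the GRIN/sparsemixer router discussed further down the thread, which computes the two gate weights differently.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def moe_mlp(hidden, gate_w, experts, top_k=2):
    """hidden: [tokens, d_model]; gate_w: [d_model, n_experts];
    experts: list of per-expert FFN callables (e.g. SwiGLU blocks)."""
    logits = hidden @ gate_w                        # router logits [tokens, n_experts]
    weights = softmax(logits)                       # Mixtral-style: softmax over all experts
    top = np.argsort(-weights, axis=-1)[:, :top_k]  # top-k expert indices per token
    out = np.zeros_like(hidden)
    for t in range(hidden.shape[0]):
        w = weights[t, top[t]]
        w = w / w.sum()                             # renormalize over the selected experts
        for k in range(top_k):
            out[t] += w[k] * experts[top[t][k]](hidden[t])
    return out
```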
https://huggingface.co/microsoft/Phi-3.5-MoE-instruct/discussions/4
microsoft/Phi-3.5-MoE-instruct · convert to gguf
Pretty sad to see no support for Phi 3.5 MoE in llama.cpp. Sure, it might have dry writing and is very censored, but in assistant tasks it's much better than all the smaller models combined. It truly has 70B quality in just 6.6B active parameters, so it's much easier to run than even G2 27B (which it beats according to benchmarks).
@Dampfinchen, have you found any way to run Phi 3.5 MoE locally? I'm open to trying out alternatives to llama.cpp.
Also eager to get Phi 3.5-Vision support. Most accurate photo and screenshot descriptions I've seen so far.
@Dampfinchen @sourceholder @arnesund if you are interested in running Phi 3.5 MoE or Phi 3.5 vision with alternatives to llama.cpp, perhaps you could check out mistral.rs.
Just a quick description:
We have support for Phi 3.5 MoE (docs & example: https://github.com/EricLBuehler/mistral.rs/blob/master/docs/PHI3.5MOE.md) and Phi 3.5 vision (docs & examples: https://github.com/EricLBuehler/mistral.rs/blob/master/docs/PHI3V.md).
All models can be run with CUDA, Metal, or CPU SIMD acceleration. We have Flash Attention and PagedAttention support for increased inference performance, and support in-situ quantization in GGUF and HQQ formats.
If you are using the OpenAI API, you can use the provided OpenAI-compatible HTTP server (a superset: we have things like min-p, DRY, etc.). There is also a Python package. For Phi 3.5 MoE and other text models, there is also an interactive chat mode.
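If anyone wants to poke at it through the OpenAI-compatible server, here is a minimal client sketch. The base_url/port and the model id are assumptions on my part; use whatever you actually launch the server with (see the mistral.rs docs above).

```python
# Minimal sketch against an OpenAI-compatible endpoint; the URL, port and model id
# below are placeholders, not mistral.rs defaults.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="not-needed")
resp = client.chat.completions.create(
    model="microsoft/Phi-3.5-MoE-instruct",  # placeholder model id
    messages=[{"role": "user", "content": "Explain what a sparse MoE layer does."}],
    temperature=0.7,
)
print(resp.choices[0].message.content)
```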
Thank you, but I and many others would rather wait for official support.
I wonder what's the holdup? Shouldn't it be possible to copy a lot of the code from Mixtral to Phi 3.5 MoE, given they have a pretty similar architecture with two active experts per token?
No one's taken the task up yet, sadly. There's presently work being done on Phi-3.5 Vision Instruct though, which is something to look forward to considering the reported vision understanding the model has.
phi-3.5-moe-instruct gguf for llama.cpp???
Bumping up thread. :)
Strange that no one is looking into this. Phi-3.5 MoE currently seems to be the best model that can run on a consumer-grade CPU.
@vaibhav1618, FYI - Deepseek V2 Lite (16B) is another good MoE model. 2.4B activated params.
Phi-3.5 MoE seems to be based on https://huggingface.co/microsoft/GRIN-MoE/tree/main. Maybe their technical report at https://arxiv.org/abs/2409.12136 can help with identifying differences to other MoE architectures, which should ease adoption in llama.cpp.
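One quick way to start identifying those differences (my own sketch, not something from the report): dump the MoE-related fields of both HF configs side by side. The key list below is a guess at the relevant fields; attributes a config doesn't have simply print None, and the Mixtral repo may require accepting its license on Hugging Face first.

```python
# Compare MoE-related config fields of Phi-3.5 MoE and Mixtral; the key list is a guess.
from transformers import AutoConfig

keys = ["num_local_experts", "num_experts_per_tok", "num_hidden_layers",
        "hidden_size", "intermediate_size", "num_attention_heads",
        "num_key_value_heads", "rope_theta", "router_jitter_noise"]

for repo in ["microsoft/Phi-3.5-MoE-instruct", "mistralai/Mixtral-8x7B-Instruct-v0.1"]:
    cfg = AutoConfig.from_pretrained(repo, trust_remote_code=True)
    print(repo)
    for k in keys:
        print(f"  {k} = {getattr(cfg, k, None)}")
```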
There's presently work being done on Phi-3.5 Vision Instruct though, which is something to look forward to considering the reported vision understanding the model has
I'm wondering where the work on Phi-3.5 Vision Instruct is being done? Many thanks!
@EricLBuehler, can you recommend a frontend app to use with mistral.rs?
The PR in the transformers repo to support Phi-3.5 MoE has been merged and is featured in release v4.46.0, so maybe llama.cpp can finally add this model architecture?
Oh, and by the way, I just found the documentation for how to add a new model to llama.cpp, after having followed this repo for months now, lol. Here are the docs: https://github.com/ggerganov/llama.cpp/blob/master/docs/development/HOWTO-add-model.md
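From skimming that HOWTO, the Python side of a port boils down to registering the architecture in convert_hf_to_gguf.py, roughly like the sketch below. This is only an illustration of the pattern, not a working port: MODEL_ARCH.PHIMOE doesn't exist yet and would have to be added to gguf-py, the tensor remapping is omitted, and the C++ graph code in llama.cpp itself is the bigger missing piece.

```python
# Illustration of the registration pattern inside llama.cpp's convert_hf_to_gguf.py;
# MODEL_ARCH.PHIMOE is a placeholder that would first need to be defined in gguf-py.
@Model.register("PhiMoEForCausalLM")        # architecture name from the HF config.json
class PhiMoEModel(Model):
    model_arch = gguf.MODEL_ARCH.PHIMOE     # placeholder enum value

    def set_gguf_parameters(self):
        super().set_gguf_parameters()
        # expert counts come straight from the HF config (16 experts, 2 used per token)
        self.gguf_writer.add_expert_count(self.hparams["num_local_experts"])
        self.gguf_writer.add_expert_used_count(self.hparams["num_experts_per_tok"])

    # modify_tensors() would additionally need to stack the per-expert FFN weights
    # into the fused expert tensors llama.cpp expects (as existing MoE classes do).
```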
+1 for MoE support.
+1 for Phi 3.5-Vision support.
Prerequisites
Feature Description
Microsoft has recently dropped two new models in the Phi Family.
3.5 MoE: https://huggingface.co/microsoft/Phi-3.5-MoE-instruct
3.5 Vision: https://huggingface.co/microsoft/Phi-3.5-vision-instruct
It would be nice to see support added to llama.cpp for these two models.
Motivation
Supporting all model releases so the wider community can enjoy these great free models.
Possible Implementation
No response