Blaizzy / mlx-vlm

MLX-VLM is a package for running Vision LLMs locally on your Mac using MLX.
MIT License

Feature Request: Support for `phi-3-vision-128k-instruct` #28

Closed: JosefAlbers closed this issue 4 months ago

JosefAlbers commented 5 months ago

Hi, I've been exploring this repo for the past couple of days, and I find your work here really amazing. Are there any plans to add support for the Phi-3-vision-128k-instruct model to this library? I'd be happy to contribute in any way I can to help make this happen.

Blaizzy commented 5 months ago

Hey @JosefAlbers

Thank you!

Awesome, that model is on the roadmap after Paligemma #24.

Please feel free to submit a PR to support it :)

Blaizzy commented 5 months ago

@JosefAlbers

Paligemma is done, thanks!

Do you want to take on Phi-3-vision?

JosefAlbers commented 5 months ago

Yes, I'd love to! Just a heads-up, I'm new to mlx, so I might need a little guidance along the way.

Blaizzy commented 5 months ago

No problem, I'm here to help :)

ChristianWeyer commented 5 months ago

> @JosefAlbers
>
> Paligemma is done, thanks!
>
> Do you want to take on Phi-3-vision?

Is there a list of officially supported models?

Blaizzy commented 5 months ago

@ChristianWeyer not yet.

But at the moment we support the following architectures:

  • Llava (Clip + Llama)
  • Paligemma (Siglip + Gemma)
  • Idefics2 (Siglip + Mistral)
  • NanoLlava (Siglip + Qwen2)

Blaizzy commented 5 months ago

There are still many more to add.

ChristianWeyer commented 5 months ago

> @ChristianWeyer not yet.
>
> But at the moment we support the following architectures:
>
> • Llava (Clip + Llama)
> • Paligemma (Siglip + Gemma)
> • Idefics2 (Siglip + Mistral)
> • NanoLlava (Siglip + Qwen2)

Which high-quality Llava model can we use? Any recommendations (from HF)?

Blaizzy commented 5 months ago

Here you go:

https://huggingface.co/mlx-community?search_models=llava
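
For a quick smoke test with one of those checkpoints, something along these lines should work (a minimal sketch; the checkpoint name and the exact `generate` arguments are assumptions and may differ between mlx-vlm versions):

```python
# Minimal sketch: run a quantized Llava checkpoint from mlx-community.
# The checkpoint name and generate() arguments are assumptions and may
# differ between mlx-vlm versions.
from mlx_vlm import load, generate

model, processor = load("mlx-community/llava-1.5-7b-4bit")
output = generate(
    model,
    processor,
    "http://images.cocodataset.org/val2017/000000039769.jpg",  # image
    "Describe this image.",                                    # prompt
    max_tokens=100,
    verbose=True,
)
print(output)
```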

ChristianWeyer commented 5 months ago

Thanks. These are not good enough for our use cases ;-).

Blaizzy commented 5 months ago

Could you please open a new issue and explain your use case?

JosefAlbers commented 5 months ago

@Blaizzy, I have a working demo of Phi-3-vision support for MLX: https://github.com/JosefAlbers/Phi-3-Vision-MLX

It handles text and image inputs, generating the expected outputs. With the new su-scaled RoPE, it seems to work reasonably well even with extremely long contexts.
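
For context, su-scaled RoPE divides each rotary inverse frequency by a per-dimension factor and applies a global attention scale to compensate for positions beyond the original training window, which is what lets the model stretch to very long contexts. A minimal sketch of the idea (the function name and default values are placeholders; the real per-dimension factors come from the checkpoint's config.json):

```python
# Sketch of su-scaled RoPE: per-dimension frequency rescaling plus a global
# attention scale for positions beyond the original training window.
# Placeholder defaults; real factors come from the model's config.json.
import math
import mlx.core as mx

def su_rope(x, offset=0, base=10000.0, factors=None,
            max_position=131072, original_max_position=4096):
    dims = x.shape[-1]
    half = dims // 2
    # Per-dimension rescaling of the rotary inverse frequencies.
    factors = mx.array(factors) if factors is not None else mx.ones((half,))
    inv_freq = 1.0 / (factors * mx.power(base, mx.arange(0, dims, 2) / dims))
    # Global scale compensating for the extended context (Phi-3 recipe).
    scale = math.sqrt(1 + math.log(max_position / original_max_position)
                      / math.log(original_max_position))
    positions = mx.arange(offset, offset + x.shape[-2])
    freqs = positions[:, None] * inv_freq[None, :]
    cos, sin = mx.cos(freqs) * scale, mx.sin(freqs) * scale
    x1, x2 = x[..., :half], x[..., half:]
    return mx.concatenate([x1 * cos - x2 * sin, x2 * cos + x1 * sin], axis=-1)
```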

Just a heads-up for now. I'll circle back when it's more polished and ready for feedback.

Blaizzy commented 5 months ago

I love the speed!

Awesome, looking forward to the polished version :)

JosefAlbers commented 5 months ago

@Blaizzy Thanks so much, I've learned a ton about MLX and VLMs by studying the well-written and well-documented code in your repo. I'll keep you posted on my progress and will definitely reach out when I have a more polished version ready for your feedback!

Blaizzy commented 5 months ago

Most welcome!

I'm happy I could be of help.

Let me know when you're ready.

lin72h commented 5 months ago

You guys are heroes!

JosefAlbers commented 5 months ago

@Blaizzy, I'd really appreciate it! I'm just about to start working on a PR for adding su-RoPE support to mlx-lm. Once that is merged, I think I can craft a version of phi-3-vision that fits seamlessly into the mlx-vlm framework.

In the meantime, I've been experimenting with the model on various inputs and LLM/VLM techniques in my own repo, and I'm really amazed by how well it handles both text and image prompts. I'm excited to get your feedback!

@lin72h, thanks a lot!

Blaizzy commented 5 months ago

Most welcome, it's my pleasure!

> I'm just about to start working on a PR for adding su-RoPE support to mlx-lm. Once that is merged,

@JosefAlbers Why do the round trip when we can have it here?

Note: mlx-lm is only for language models, hence the "lm". Unless there are other language models that use su-RoPE, it's not going to be merged there.

JosefAlbers commented 5 months ago

> @JosefAlbers Why do the round trip when we can have it here?
>
> Note: mlx-lm is only for language models, hence the "lm". Unless there are other language models that use su-RoPE, it's not going to be merged there.

@Blaizzy Right, I'll see if I can port phi3_v into mlx_vlm today.
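
For anyone following along, a hypothetical sketch of where such a port would live, mirroring how the existing architectures are laid out in this repo (the phi3_v file names are assumptions until the PR lands):

```
mlx_vlm/models/phi3_v/
├── __init__.py
├── phi3_v.py     # combined model: splices vision features into the token embeddings
├── language.py   # Phi-3 decoder, including su-scaled RoPE
└── vision.py     # image encoder
```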