Blaizzy opened 5 months ago
Next release of Llava-Next
TODO: update the `TextConfig` defaults to avoid errors with Llava-v1.6-vicuna:

```python
from dataclasses import dataclass
from typing import Dict, Optional, Union

@dataclass
class TextConfig:
    model_type: str
    hidden_size: int = 4096
    num_hidden_layers: int = 32
    intermediate_size: int = 11008
    num_attention_heads: int = 32
    rms_norm_eps: float = 1e-05
    vocab_size: int = 32064
    num_key_value_heads: int = 32
    rope_theta: float = 1000000
    rope_traditional: bool = False
    rope_scaling: Optional[Dict[str, Union[float, str]]] = None
```
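Defaults like these matter because the `text_config` section of a Hugging Face `config.json` can omit keys for some checkpoints. A common pattern for that (a minimal sketch, not necessarily mlx-vlm's exact helper; the trimmed field list is illustrative) is a `from_dict` constructor that drops unknown keys and lets the dataclass defaults fill the gaps:

```python
import inspect
from dataclasses import dataclass
from typing import Dict, Optional, Union

@dataclass
class TextConfig:
    model_type: str
    hidden_size: int = 4096
    num_hidden_layers: int = 32
    vocab_size: int = 32064
    rope_theta: float = 1000000
    rope_scaling: Optional[Dict[str, Union[float, str]]] = None

    @classmethod
    def from_dict(cls, params: dict) -> "TextConfig":
        # Keep only keys that match the dataclass fields; config.json
        # often carries extra keys the config class does not know about.
        known = inspect.signature(cls).parameters
        return cls(**{k: v for k, v in params.items() if k in known})

# A config that omits hidden_size etc. picks up the defaults instead
# of raising a TypeError.
cfg = TextConfig.from_dict(
    {"model_type": "llama", "vocab_size": 32064, "some_extra_key": 1}
)
```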
Thanks for the great repo. This should also be on the list: https://github.com/THUDM/CogVLM2 I am now just reading the code, and trying to free some time for the conversion routine.
Hey @BoltzmannEntropy and @jrp2014,
Thanks for the suggestions!
I have added them to the backlog
MiniCPM-V v2.6
Do you have a link to Florence-2?
Is the above list the ultimate and up-to-date list of supported models @Blaizzy? Thanks for your hard work!
Hey @ChristianWeyer, it's mostly up-to-date, just missing Qwen2-VL.
@s-smits here you go:
https://huggingface.co/microsoft/Florence-2-large/blob/main/modeling_florence2.py
[x] Phi-3-vision
Thanks! I guess Phi-3-vision includes 3.5?
Yes, they have the same arch so there are no changes needed :)
Hey @Blaizzy, thanks for this great framework. Is there any priority for InternVL? I can see it is present in your list; I just wanted to know if it is planned in the near term. I want to run the model on my MacBook, and mlx-vlm looks to be the best way to do that.
Qwen2-VL-72B would be amazing!
This recipe seems to work for Qwen2-VL-2B-Instruct:
```shell
python -m mlx_vlm.generate \
  --model Qwen/Qwen2-VL-2B-Instruct \
  --max-tokens 100 \
  --temp 0.0 \
  --image django-roadmap.png \
  --prompt "Describe image in detail, include all text"
```
My results here: https://gist.github.com/simonw/9e02d425cacb902260ec1307e0671e17
Yep, they just merged Qwen2-VL support this weekend.
Molmo please
Nvidia just dropped multimodal NVLM-D-72B. The benchmark looks pretty good.
Yep, that's a pretty awesome model! It's on my radar because we can run it in 4-bit quant.
Pixtral-12B now has Base model. https://huggingface.co/mistralai/Pixtral-12B-Base-2409
Hey @Blaizzy, could you add ColQwen support? Since Qwen2-VL is already supported and ColQwen is just an additional linear layer on top, this seems like low-hanging fruit, especially as Col* is a really hot topic right now.
I could really use this for my projects (e.g. local private document search + qa) 😊
Working on Idefics 3 here: https://github.com/Blaizzy/mlx-vlm/pull/124
@Benjoyo, ColQwen and ColPali are awesome models.
At the moment, I'm working on refactoring and some optimisations, so new model ports by me are on hold.
However, I appreciate any PRs. I'm here to review and help when needed.
Thank you very much, @pcuenca!
It means a lot 🚀
I left a few comments.
Instructions:
If the model you want is not listed, please suggest it and I will add it.