Closed aiaicode closed 1 year ago
It would depend on having access to high-quality multimodal models. I don't know if one exists yet that's in the same league as LLaMA.
Hopefully Llama 3 will be that.
Yesterday LLaVA-RLHF was announced. It's the first open-source RLHF-trained multimodal model. And we previously had Idefics from HF. After introducing GGUF support in clip.cpp, now it's possible to implement multimodal inference by combining it with llama.cpp. Architecturally LLaVA is much simpler than Idefics, but if Idefics' performance is considerably better than LLaVA-RLHF, I can start with it as well. WDYT?
We should make a PoC (either as a separate repo or as an example in this repo) to implement LLaVA
I started to work on LLaVA in another repo, but it's extremely difficult to manage llama.cpp and clip.cpp together while depending on two different versions of ggml, so it would be much easier for me if it's ok to implement it in this repo.
Thank you @monatis ! You legend.
Hi, do I understand correctly that multimodal support has now been added? How do I run such a model using a CLI? Say I have a photo to analyze and have downloaded the zhiqings/LLaVA-RLHF-7b-v1.5-224 model from Hugging Face.
I'm really new to the field; I recently compiled llama.cpp locally and played around with it. Can you point me to some materials/tutorials?
PS. When I saw the project I was quickly overwhelmed. I could work on documentation of how to use it, but I am so new. Do contributors meet to discuss the development or something?
@ChrisW-priv Not sure this is still relevant to you but this is actually documented in the original MR:
```sh
./bin/llava -m ggml-model-q5_k.gguf --mmproj mmproj-model-f16.gguf --image path/to/an/image.jpg
```
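A slightly fuller sketch of the same invocation, adding a text prompt. Note this is an illustrative example, not from the original thread: the extra flags (`-p` for the prompt, `--temp` for sampling temperature, `-ngl` for GPU offload) existed in the llava example at the time, but binary names and options have changed across llama.cpp versions (e.g. the tool was later renamed `llava-cli`), so check `--help` for your build. The model filenames are placeholders for your own converted GGUF files.

```shell
# Run the LLaVA example with an image and a question about it.
# -m       : the quantized language-model GGUF
# --mmproj : the multimodal projector (vision encoder) GGUF
# --image  : the image to describe
# -p       : the user prompt
# --temp   : sampling temperature (lower = more deterministic)
# -ngl     : number of layers to offload to the GPU (if built with GPU support)
./bin/llava \
  -m ggml-model-q5_k.gguf \
  --mmproj mmproj-model-f16.gguf \
  --image path/to/an/image.jpg \
  -p "Describe this image in detail." \
  --temp 0.1 \
  -ngl 32
```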
Now that OpenAI is adding voice and image to ChatGPT, and this will probably become the new norm, wouldn't it be a good idea for llama.cpp to add this to the roadmap as well, if possible?