Probably, since it's based on LLaMA. Doesn't look like there are architectural changes. There isn't really an easy way to submit images to models like that as far as I know though.
The server shouldn't have a problem with images; the question is what format the model expects images in. In the model description they add image URLs to the prompt, and that's certainly not how it works directly: there has to be a pipeline for downloading the images and somehow converting them into raw data.
Looks like they are using a helper model as an image encoder, if I'm reading it correctly. That's a bit above my abilities, but I could add an image upload API to the server, to make it easier for others to possibly implement that.
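For illustration, here is a minimal sketch of such a download-and-decode step in Python; the choice of urllib and Pillow is an assumption for the example, not what the server currently does:

```python
import io
import urllib.request

from PIL import Image


def fetch_image(url: str) -> Image.Image:
    """Download an image from a URL and decode it into raw RGB pixel data."""
    with urllib.request.urlopen(url) as resp:
        encoded = resp.read()                              # raw JPEG/PNG/... bytes
    return Image.open(io.BytesIO(encoded)).convert("RGB")  # decoded pixel data


# The decoded pixels would then go through the model-specific preprocessing
# (resizing, normalization) discussed further down in this thread.
```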
Hey! Victor from the IDEFICS team here. Let me know if I can be helpful! In our HF transformers implementation, images can be passed as PIL image objects or through an image URL (in which case images are downloaded and returned as PIL images). On the API side, images are passed as URLs that are fetched server-side. I don't know enough about the inner workings of llama.cpp, but I imagine image fetching could be handled in a similar way.
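For reference, a rough sketch of the transformers-side usage described above, adapted from the IDEFICS model card; the checkpoint name, placeholder URL, prompt layout, and exact processor arguments are assumptions and may differ between transformers versions:

```python
import torch
from transformers import AutoProcessor, IdeficsForVisionText2Text

checkpoint = "HuggingFaceM4/idefics-9b-instruct"  # assumed checkpoint name
processor = AutoProcessor.from_pretrained(checkpoint)
model = IdeficsForVisionText2Text.from_pretrained(checkpoint, torch_dtype=torch.bfloat16)

# Prompts interleave text and images; an image can be a PIL.Image object or a URL,
# in which case the processor downloads it and converts it to a PIL image.
prompts = [
    [
        "User: What is in this picture?",
        "https://example.com/some-image.jpg",  # placeholder URL
        "<end_of_utterance>",
        "\nAssistant:",
    ]
]
inputs = processor(prompts, return_tensors="pt")
generated_ids = model.generate(**inputs, max_new_tokens=64)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])
```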
@VictorSanh To clarify, llama.cpp is C++ and would have to replicate whatever the Python side does. The problem I was thinking about is what "byte format" the model expects images in. Uploading an image through the UI and processing it into an appropriate binary blob is not a problem; the problem is how exactly that image blob should be presented to the model itself. Typically it's text in, text out: models usually know what JSON is, for example, so they can interpret JSON in the plaintext of the prompt, but there must be some way of "representing" an image, whether it goes through the tokenizer or bypasses it, and that low-level implementation detail is what's unclear to me.
I found the IDEFICS part of the transformers source code, and from my limited understanding of HF transformers, it looks to me like images are processed via a CLIP-type thing and "inserted" into the model somewhat similarly to how LoRA works?
I might be very wrong, but that was my first intuition when I saw the code.
For reference: https://github.com/huggingface/transformers/blob/main/src/transformers/models/idefics/vision.py
@staviq, that is more or less the idea.
We start from the pixel values (images are resized to 224 x 224), which are then normalized with fixed mean and std values.
That tensor (224 x 224 x 3) then goes into a vision tower that is essentially a CLIP model, which gives a vector of size image_seq_len x hidden_size, where image_seq_len is typically 256 or 257 and hidden_size is typically 768 for these model sizes.
That vector gives a representation of the image, which is then fused (or "inserted") into the language model part similarly to an encoder-decoder setup (with cross-attention). Here the encoder is the vision encoder, and the decoder is the LLaMA language model.
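To make the shapes concrete, here is a small sketch of that preprocessing step; the normalization constants are the standard OpenAI CLIP values and the helper name is made up, so treat this as an illustration rather than the exact IDEFICS code:

```python
import numpy as np
from PIL import Image

# Standard CLIP per-channel mean/std; assumed here, check the model's
# preprocessor config for the exact values IDEFICS uses.
MEAN = np.array([0.48145466, 0.4578275, 0.40821073], dtype=np.float32)
STD = np.array([0.26862954, 0.26130258, 0.27577711], dtype=np.float32)


def preprocess(img: Image.Image) -> np.ndarray:
    """Resize to 224x224 and normalize with fixed mean/std, as described above."""
    img = img.convert("RGB").resize((224, 224))
    x = np.asarray(img, dtype=np.float32) / 255.0  # (224, 224, 3), values in [0, 1]
    x = (x - MEAN) / STD                           # channel-wise normalization
    return x.transpose(2, 0, 1)                    # (3, 224, 224) for the vision tower


# The CLIP-style vision tower then maps this (3, 224, 224) tensor to a sequence of
# patch embeddings of shape (image_seq_len, hidden_size), e.g. (257, 768): one
# embedding per image patch plus a CLS token. Those embeddings are what the LLaMA
# decoder cross-attends to.
```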
I'm working on a project to support large multimodal models like IDEFICS: https://github.com/monatis/lmm.cpp
I've already implemented CLIP in GGML, and most large multimodal models (LLaVA, IDEFICS, etc.) use CLIP as an image encoder, so it should be relatively easy to combine the two.
The plan is to release it this week (hopefully no later than Friday), and based on feedback, community adoption, and of course Gerganov's decision, it could be moved into this repository as an example as well.
This issue was closed because it has been inactive for 14 days since being marked as stale.
Can llama.cpp support the new multimodal model IDEFICS? IDEFICS is based on LLaMA-1 7B and LLaMA-1 65B.