ggerganov / llama.cpp

LLM inference in C/C++
MIT License

Support IDEFICS models #2803

Closed generalsvr closed 7 months ago

generalsvr commented 1 year ago

Can llama.cpp support the new multimodal model IDEFICS? IDEFICS is based on LLaMA-1 7B and LLaMA-1 65B.

KerfuffleV2 commented 1 year ago

Probably, since it's based on LLaMA. Doesn't look like there are architectural changes. There isn't really an easy way to submit images to models like that as far as I know though.

staviq commented 1 year ago

> There isn't really an easy way to submit images to models like that as far as I know though.

The server shouldn't have a problem with images. The question is what format the model expects images in: in the model description they add URLs to the prompt, and that's certainly not how it works internally, so there has to be a pipeline for downloading images and somehow converting them into raw data.

klosax commented 1 year ago

Paper: https://arxiv.org/abs/2204.14198

staviq commented 1 year ago

> Paper: https://arxiv.org/abs/2204.14198

Looks like they are using a helper model as an image encoder, if I'm reading it correctly. That's a bit above my abilities, but I could add an image-upload API to the server to make it easier for others to implement the rest.

VictorSanh commented 1 year ago

> There isn't really an easy way to submit images to models like that as far as I know though.

> The server shouldn't have a problem with images. The question is what format the model expects images in: in the model description they add URLs to the prompt, and that's certainly not how it works internally, so there has to be a pipeline for downloading images and somehow converting them into raw data.

Hey! Victor from the IDEFICS team here. Let me know if I can be helpful! In our HF transformers implementation, images can be passed as PIL Image objects or through an image URL (in which case images are downloaded and returned as PIL Images). On the API side, images are passed as URLs that are fetched server-side. I don't know enough about the inner workings of llama.cpp, but I imagine image fetching could be handled in a similar way.
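For illustration, the server-side URL-to-pixels path described above could look roughly like the minimal Python sketch below. This is not llama.cpp or IDEFICS code; it only assumes `requests` and Pillow, and the URL is a hypothetical placeholder.

```python
# Minimal sketch of the server-side fetch described above: download an image
# URL and decode it into an RGB PIL Image, ready for the model's preprocessor.
import io

import requests
from PIL import Image


def fetch_image(url: str) -> Image.Image:
    """Download an image and return it as an RGB PIL Image."""
    resp = requests.get(url, timeout=10)
    resp.raise_for_status()
    return Image.open(io.BytesIO(resp.content)).convert("RGB")


image = fetch_image("https://example.com/cat.png")  # hypothetical URL
# From here, the HF processor converts the PIL pixels into normalized tensors;
# a llama.cpp port would need an equivalent C/C++ image-decoding step.
```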

staviq commented 1 year ago

@VictorSanh To clarify, llama.cpp is C++ and would have to replicate whatever the Python code does. The problem I was thinking about is what "byte format" the model expects images to be in. Uploading an image through the UI and processing it into an appropriate binary blob is not a problem; the problem is how exactly that image blob should be presented to the model itself. Typically it's "text" in, "text" out: models typically know what JSON is, for example, so they can interpret JSON in the plain text of the prompt. But there must be a way of "representing" an image, whether it goes through the tokenizer or bypasses it, and that low-level implementation part is what's unclear to me.

staviq commented 1 year ago

I found the IDEFICS part of the transformers source code, and from my limited understanding of HF transformers, it looks to me like images are processed via a CLIP-type model and "inserted" into the model somewhat similarly to how LoRA works?

I might be very wrong, but that was my first intuition when I saw the code.

For reference: https://github.com/huggingface/transformers/blob/main/src/transformers/models/idefics/vision.py

VictorSanh commented 1 year ago

@staviq, that is more or less the idea. We start from the pixel values (images are resized to 224 x 224), which are then normalized with fixed mean and std values. That tensor (224 x 224 x 3) goes into a vision tower that is essentially a CLIP model, which gives a tensor of shape image_seq_len x hidden_size, where image_seq_len is typically 256 or 257 and hidden_size is typically 768 for these model sizes. That tensor is a representation of the image, which is then fused (or "inserted") into the language model part similar to an encoder-decoder (with cross-attentions). Here the encoder is the vision encoder, and the decoder is the LLaMA language model.
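To make the shapes concrete, here is a small NumPy sketch of that pipeline. It is not the transformers implementation: the mean/std constants are the standard CLIP normalization values (an assumption here), and the vision tower is stubbed out since only the tensor shapes matter.

```python
# Shape-level sketch of the pipeline described above: resize + normalize the
# image, then map it to image_seq_len x hidden_size embeddings that are fused
# into the LLaMA decoder via cross-attention.
import numpy as np
from PIL import Image

# Assumed values: the standard CLIP normalization constants.
MEAN = np.array([0.48145466, 0.4578275, 0.40821073])
STD = np.array([0.26862954, 0.26130258, 0.27577711])


def preprocess(image: Image.Image) -> np.ndarray:
    """Resize to 224x224 and normalize each channel with fixed mean/std."""
    pixels = np.asarray(image.resize((224, 224)).convert("RGB"), dtype=np.float32) / 255.0
    return ((pixels - MEAN) / STD).transpose(2, 0, 1)  # -> (3, 224, 224)


def vision_tower_stub(pixel_values: np.ndarray,
                      image_seq_len: int = 257,
                      hidden_size: int = 768) -> np.ndarray:
    """Stand-in for the CLIP vision tower: pixels -> (image_seq_len, hidden_size)."""
    assert pixel_values.shape == (3, 224, 224)
    return np.zeros((image_seq_len, hidden_size), dtype=np.float32)


image_embeds = vision_tower_stub(preprocess(Image.new("RGB", (640, 480))))
print(image_embeds.shape)  # (257, 768): the representation fused via cross-attention
```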

monatis commented 1 year ago

I'm working on a project to support large multimodal models like IDEFICS: https://github.com/monatis/lmm.cpp

I've previously implemented CLIP in GGML, and most large multimodal models like LLaVA, IDEFICS, etc. use CLIP as an image encoder, so it should be relatively easy to combine the two.

The plan is to release it this week (hopefully no later than Friday), and based on feedback, community adoption, and of course Gerganov's decision, it could also be moved into this repository as an example.

github-actions[bot] commented 7 months ago

This issue was closed because it has been inactive for 14 days since being marked as stale.