ggerganov / llama.cpp

LLM inference in C/C++
MIT License

Feature Request: Support for Meta Chameleon 7B and 34B #7995

Open arch-btw opened 2 weeks ago

arch-btw commented 2 weeks ago


Feature Description

"Meta Chameleon is a family of models that can combine text and images as input and output any combination of text and images with a single unified architecture for both encoding and decoding. While most current late-fusion models use diffusion-based learning, Meta Chameleon uses tokenization for text and images. This enables a more unified approach and makes the model easier to design, maintain, and scale. The possibilities are endless—imagine generating creative captions for images or using a mix of text prompts and images to create an entirely new scene."

Motivation

This would be a great addition to llama.cpp!

The image features look interesting, but it can also simply do text -> text and many other combinations:

https://github.com/ggerganov/llama.cpp/assets/57669023/23e92f5a-e782-4bb7-ab66-c20fe113d514

EliEron commented 2 weeks ago

Here are some relevant links:

~~Given it's a completely new architecture~~, and a multimodal one at that, I imagine adding support for it will not be easy. But I'm also very excited to see this supported.

Edit: According to Meta researcher Armen Aghajanyan the architecture is actually similar:

> Similar architecture to LLaMa (apart from QK-norm), get fast inference working.
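For readers unfamiliar with the QK-norm mentioned above: it normalizes the query and key vectors (typically with an RMS-style norm plus a learned gain) before the attention dot product, which bounds the attention logits. A minimal plain-Python sketch, with the learned gain omitted and toy vectors assumed:

```python
import math

def rms_norm(x, eps=1e-6):
    """RMS-normalize a vector (learned gain omitted for brevity)."""
    scale = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / scale for v in x]

def attn_score(q, k, use_qk_norm=True):
    """Scaled dot-product score for one query/key pair.
    With QK-norm, q and k are normalized *before* the dot product,
    which keeps the logit magnitude bounded by sqrt(d)."""
    if use_qk_norm:
        q, k = rms_norm(q), rms_norm(k)
    d = len(q)
    return sum(a * b for a, b in zip(q, k)) / math.sqrt(d)

q, k = [0.5, -1.0, 2.0, 0.25], [1.5, 0.5, -0.5, 1.0]
print(attn_score(q, k), attn_score(q, k, use_qk_norm=False))
```

For llama.cpp this would mainly mean an extra per-head normalization step in the attention graph, rather than a structural change.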

0wwafa commented 2 weeks ago

Yes! Please do! Also because as of now there is no way to run it on CPU only.

SolvAI commented 1 week ago

yum yum yum :p

ann-brown commented 1 week ago

Since it uses an architecture similar to Medusa, is the self-speculative decoding side of inference likely to be supportable at the same time? It sounded like it could run without that, but it'd be neat to have it available too.

jacobkahn commented 1 week ago

Let me know if we can answer any questions about the architecture, inference, etc. Our reference implementation in https://github.com/facebookresearch/chameleon should be clear. Differences from the Llama architecture are minor.

typedrat commented 1 week ago

Considering the VQGAN is public, it should be possible for llama.cpp to reinstate the image output capabilities.
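Conceptually, restoring image output means mapping the model's emitted image tokens back through the VQGAN: look up each token's codebook embedding to rebuild the latent grid, then run the VQGAN's convolutional decoder over that grid. A toy sketch of just the codebook-lookup step (codebook size, latent dimension, and grid shape here are illustrative, not the actual Chameleon/VQGAN values or API):

```python
# Toy codebook lookup: rebuild a latent grid from image token ids.
# A real VQGAN would then run a conv decoder over this grid to get pixels;
# the sizes below are tiny illustrative values, not Chameleon's.
import random

CODEBOOK_SIZE, LATENT_DIM, GRID = 16, 4, 2
random.seed(0)
codebook = [[random.uniform(-1, 1) for _ in range(LATENT_DIM)]
            for _ in range(CODEBOOK_SIZE)]

def tokens_to_latents(tokens, grid=GRID):
    """Map grid*grid image token ids to a 2-D grid of codebook vectors."""
    assert len(tokens) == grid * grid
    return [[codebook[tokens[r * grid + c]] for c in range(grid)]
            for r in range(grid)]

latents = tokens_to_latents([3, 7, 0, 12])
print(len(latents), len(latents[0]), len(latents[0][0]))  # 2 2 4
```

Since this lookup-plus-decoder path is deterministic given the tokens, the main llama.cpp work would be generating the image token stream; the VQGAN decode could even run in a separate tool.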

chigkim commented 1 week ago

+1000! I'd love to run Chameleon with llama.cpp!