FoundationVision / LlamaGen

Autoregressive Model Beats Diffusion: 🦙 Llama for Scalable Image Generation
https://arxiv.org/abs/2406.06525
MIT License

Question: why not try using an image tokenizer and a ready-made LLM (Llama 3 etc.) with LoRA? #31

Open lucasjinreal opened 1 week ago

lucasjinreal commented 1 week ago

Hi, for text-to-image generation, why not try using an existing LLM plus the tokenizer's decoder, and train the whole model with LoRA?

Just like SEED does.

That way it at least wouldn't harm the LLM's chat ability, while gaining the power to generate images.

kabachuha commented 1 week ago

@lucasjinreal Basically Meta's Chameleon? https://arxiv.org/pdf/2405.09818

Btw, they released the code and the (requestable) checkpoint just yesterday: https://github.com/facebookresearch/chameleon

And unlike SEED's CLIP-based outputs, they use a full encoder-decoder VQGAN, so it's similar to this repo with regard to input-output tokenization.

lucasjinreal commented 1 week ago

Chameleon does essentially treat image and text in a unified way (similar to this idea, but more complete).

What I mean is to leverage currently existing models and assemble a system out of ready-made components: specifically, use a commonly used image tokenizer such as a VQVAE, train an existing LLM such as Llama 3, and let the LLM itself generate images by predicting image tokens that the image tokenizer's decoder turns back into pixels. The LLM's weights could be kept untouched, with only LoRA applied to learn the new token vocabulary.
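For concreteness, a rough, untested sketch of what I mean, assuming a HuggingFace-style Llama 3 checkpoint and PEFT for LoRA; the codebook size and the `vq_model.decode_code` call at the end are placeholders for whatever image tokenizer is actually used:

```python
# Rough sketch (not tested): frozen Llama 3 base + LoRA, with one new
# text token per VQ codebook entry. Model name, codebook size, and
# `vq_model.decode_code` are assumptions, not the repo's actual API.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = "meta-llama/Meta-Llama-3-8B"                 # assumed base LLM
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base, torch_dtype=torch.bfloat16)

# Append the VQ codebook as new tokens, e.g. <img_0> ... <img_16383>.
codebook_size = 16384                               # depends on the image tokenizer
tokenizer.add_tokens([f"<img_{i}>" for i in range(codebook_size)])
model.resize_token_embeddings(len(tokenizer))       # grows embedding and LM head

# Freeze base weights; train LoRA adapters plus the new embedding/head rows.
lora_cfg = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    modules_to_save=["embed_tokens", "lm_head"],
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()

# At inference: prompt with text, sample <img_*> tokens autoregressively,
# map them back to codebook ids, and decode with the VQ decoder, e.g.
#   ids = torch.tensor([int(t[5:-1]) for t in generated_img_tokens])
#   image = vq_model.decode_code(ids[None])         # hypothetical decoder API
```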

Do you think it's workable?

(My doubt about this approach is that SEED actually trained its image tokenizer with text conditioning, using a causal Q-Former and CLIP to add text alignment, but most image tokenizers didn't, for example yours and Open-MAGVIT2, etc.)

kabachuha commented 1 week ago

I thought about this at some point too, and I think it can fully work (just generate some new tokens + text). The only concern is that you still have to tune the model's inner workings, because image generation is not a trivial task (plus a whole new kind of 2D positional embedding has to be interpreted), and that can overwrite a lot of the original Llama's parameters, harming its conversational capabilities. So a balanced training mix (text instruction following / image switching) might be needed.

And while expanding the model's input vocabulary embedding is simple (see StackOverflow), expanding the output vocabulary also requires growing the transformer's final head, and you may have to relearn pre-existing tokens before you get any meaningful text generation.
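A minimal sketch of that vocabulary-expansion step, assuming the HuggingFace `transformers` API (model name and token count are placeholders):

```python
# Minimal sketch of the input/output vocabulary expansion discussed above.
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "meta-llama/Meta-Llama-3-8B"                 # assumed checkpoint
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

print("input embedding:", model.get_input_embeddings().weight.shape)
print("output head:    ", model.get_output_embeddings().weight.shape)

# Adding image tokens and resizing grows BOTH the input embedding matrix
# and the output lm_head; the newly appended rows are randomly initialized
# and have to be trained before they produce anything useful.
tokenizer.add_tokens([f"<img_{i}>" for i in range(16384)])
model.resize_token_embeddings(len(tokenizer))

print("after resize:   ", model.get_output_embeddings().weight.shape)
```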

So with enough resources it's possible, I think

lucasjinreal commented 1 week ago

Thanks for sharing these insightful thoughts.

I don't have enough experience with training image generation, but for understanding I'm thinking I could make that work first, then try to add generation.

But I am confused about the image tokenizers here. For instance, Open-MAGVIT2 uses pure convolution, without any self-attention and without any causality either, so how can an LLM really make use of these tokens for understanding? Is it practical?

I also noticed OmniTokenizer, which comes from your group as well; it uses ViTs, but its FID is not as good as Open-MAGVIT2's. On this point, how can one really evaluate an image tokenizer's true ability on both understanding and generation? (Note that the Open-MAGVIT2 decoder doesn't use any transformers either, just pure convolution.)

PeizeSun commented 1 week ago

Hi~ LoRA fine-tuning an existing LLM to enable image generation ability is a very promising direction, but how to NOT harm the LLM's text ability is quite challenging. In my opinion, just using the fine-tuning data of an MLLM, like LLaVA, is not enough. Fine-tuning an LLM into an MLLM still keeps the task within the scope of text generation. However, the image generation task will severely change the original LLM.

PeizeSun commented 1 week ago

Evaluating tokenizers is an open problem. It really depends on which tasks you focus on. For a universal tokenizer, I think we need to:

lucasjinreal commented 1 week ago

Hi, if one simply uses OmniTokenizer for understanding, would it work? There has been some work, like SEED and LaVIT, that takes the same route (different from LLaVA's conditional embedding features): they learn codebook IDs directly. But since both SEED and LaVIT trained their tokenizers with CLIP and text conditioning, they can transfer to MLLM understanding without any doubt. With OmniTokenizer, however, it looks like the tokenizer itself doesn't hold any semantic knowledge (or I don't know how to prove it).

PeizeSun commented 1 week ago

I am not aware of the details of OmniTokenizer; maybe you can contact its first author?

lucasjinreal commented 1 week ago

Thanks for the reply. OmniTokenizer also comes from FoundationVision.