OpenGVLab / LLaMA-Adapter

[ICLR 2024] Fine-tuning LLaMA to follow Instructions within 1 Hour and 1.2M Parameters
GNU General Public License v3.0

Questions about implementation of llama-adapter-v2's multi-modal ability and training #38

Closed · PanQiWei closed 1 year ago

PanQiWei commented 1 year ago

Hi! I'm attempting to implement the multi-modal ability of llama-adapter-v2 myself, and I've already written most of the code using transformers and peft. But there are some details I'm not sure about, and I would be really grateful if anyone could help answer these questions. ❤

  1. Is the early fusion coded like this: embeds = text_embeds + image_projection(vision_model(vision_tokens))? (See the sketch after this list.)
  2. If there are no vision_tokens, can adapter_prompt in the first layer still be used to compute the adaption_output and add it to the original attention_output?
  3. Does the training procedure work like this: one step uses only instruction-following data, the next step uses only image-caption data, and so on?
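
For question 1, here is a minimal sketch of what I mean by early fusion, assuming the visual encoder returns a single global feature per image; vision_model, image_projection, and the tensor shapes are placeholders of mine, not identifiers from the official repo:

```python
import torch
import torch.nn as nn

class EarlyFusion(nn.Module):
    """Sketch only: add a projected image feature to every text token embedding."""

    def __init__(self, vision_model: nn.Module, vision_dim: int, llm_dim: int):
        super().__init__()
        self.vision_model = vision_model                       # e.g. a frozen CLIP image encoder
        self.image_projection = nn.Linear(vision_dim, llm_dim)

    def forward(self, text_embeds: torch.Tensor, vision_tokens: torch.Tensor) -> torch.Tensor:
        # (batch, vision_dim) -> (batch, llm_dim)
        image_feats = self.image_projection(self.vision_model(vision_tokens))
        # Broadcast the per-image feature over the text sequence and add it
        # to the token embeddings ("early fusion").
        return text_embeds + image_feats.unsqueeze(1)
```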

And I'm really looking forward to seeing the official multi-modal code released.

PanQiWei commented 1 year ago

I just made my attempted implementation of multi-modal llama-adapter-v2 public here. It is just for learning purposes, and if anything is implemented incorrectly, I would really appreciate anyone pointing it out.

theAdamColton commented 1 year ago

I'm also looking forward to seeing the training code. There are some ambiguities in the paper; the way the V2 models are trained is not immediately obvious from it.

As for 3), I would guess that they probably use batches that mix instruction-following and image-caption items.

PanQiWei commented 1 year ago

> I'm also looking forward to seeing the training code. There are some ambiguities in the paper; the way the V2 models are trained is not immediately obvious from it.
>
> As for 3), I would guess that they probably use batches that mix instruction-following and image-caption items.

I didn't mix instruction-following data and image-caption data in one batch, because the weight updates are disjoint (at least I think so).
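
Roughly, my schedule looks like the sketch below (the loader and optimizer names are just placeholders, and I assume an HF-style model whose output exposes .loss):

```python
from itertools import cycle

def train(model, instruct_loader, caption_loader, instruct_opt, caption_opt, num_steps):
    # Alternate updates: even steps use a text-only instruction-following batch,
    # odd steps use an image-caption batch, so each batch only touches its own
    # (disjoint) group of trainable parameters.
    instruct_iter, caption_iter = cycle(instruct_loader), cycle(caption_loader)
    for step in range(num_steps):
        if step % 2 == 0:
            batch, opt = next(instruct_iter), instruct_opt   # instruction-following step
        else:
            batch, opt = next(caption_iter), caption_opt     # image-caption step
        loss = model(**batch).loss    # assumes a transformers-style output with .loss
        opt.zero_grad()
        loss.backward()
        opt.step()
```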

gaopengpjlab commented 1 year ago

A demo and pretrained checkpoint of LLaMA-Adapter V2 will be released in a few days. Sorry for the long wait.

gaopengpjlab commented 1 year ago

Please check out the demo page: http://llama-adapter.opengvlab.com/

gaopengpjlab commented 1 year ago

> If there are no vision_tokens, can adapter_prompt in the first layer still be used to compute the adaption_output and add it to the original attention_output?

If there are no vision tokens (for the GPT4LLM / Alpaca datasets), we generate a pseudo image with all-zero pixels.
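
Conceptually, something like the following sketch (the shapes, the dict-style sample, and the helper name are assumptions for illustration, not the released code):

```python
import torch

def get_image_tensor(sample: dict, image_size: int = 224) -> torch.Tensor:
    # Multi-modal sample: return its real image tensor, e.g. shape (3, H, W).
    if sample.get("image") is not None:
        return sample["image"]
    # Text-only sample (e.g. GPT4LLM / Alpaca): substitute an all-zeros pseudo image
    # so the vision branch still receives an input of the expected shape.
    return torch.zeros(3, image_size, image_size)
```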

gaopengpjlab commented 1 year ago

Pretrained weights have been released. https://github.com/ZrrSkywalker/LLaMA-Adapter/tree/main/llama_adapter_v2_multimodal

PanQiWei commented 1 year ago

Hi! First of all, thank you very much for publishing this great work! I just read the code, and I think the model framework is very similar to X-LLM. Do you think the structure used in your work and theirs will become the standard way to build unified multi-modal LLMs?

gaopengpjlab commented 1 year ago

https://github.com/ZrrSkywalker/LLaMA-Adapter/tree/main/imagebind_LLM

Pretraining/finetuning/inference code has been released. We support image/video/text/audio/point cloud input and bilingual (Chinese/English) responses.

Sorry for the long wait. Hope you enjoy our code.

PanQiWei commented 1 year ago

This is awesome! 🔥 🔥 Thank you so much! ❤️