OpenGVLab / LLaMA-Adapter

[ICLR 2024] Fine-tuning LLaMA to follow Instructions within 1 Hour and 1.2M Parameters
GNU General Public License v3.0

Questions about implementation of llama-adapter-v2's multi-modal ability and training #38

Closed · PanQiWei closed 1 year ago

PanQiWei commented 1 year ago

Hi! I'm attempting to implement the multi-modal ability of llama-adapter-v2 myself, and I've already written most of the code using transformers and peft. But there are some details I'm not sure about, and I would be really grateful if anyone could help answer these questions. ❤

  1. Is the early fusion coded like this: embeds = text_embeds + image_projection(vision_model(vision_tokens))? (See the sketch after this list.)
  2. If there are no vision_tokens, can adapter_prompt in the first layer still be used to compute the adaption_output and add it to the original attention_output?
  3. Does the training procedure work like this: one step uses only instruction-following data, the next step uses only image-caption data, and so on?
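
For question 1, here is a minimal sketch of what I mean by early fusion, assuming the visual encoder returns a single global feature per image; vision_model, image_projection, and the tensor shapes are placeholders of mine, not identifiers from the official repo:

```python
import torch
import torch.nn as nn

class EarlyFusion(nn.Module):
    """Sketch only: add a projected image feature to every text token embedding."""

    def __init__(self, vision_model: nn.Module, vision_dim: int, llm_dim: int):
        super().__init__()
        self.vision_model = vision_model                       # e.g. a frozen CLIP image encoder
        self.image_projection = nn.Linear(vision_dim, llm_dim)

    def forward(self, text_embeds: torch.Tensor, vision_tokens: torch.Tensor) -> torch.Tensor:
        # (batch, vision_dim) -> (batch, llm_dim)
        image_feats = self.image_projection(self.vision_model(vision_tokens))
        # Broadcast the per-image feature over the text sequence and add it
        # to the token embeddings ("early fusion").
        return text_embeds + image_feats.unsqueeze(1)
```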

And I'm really looking forward to seeing the official multi-modal code released.

PanQiWei commented 1 year ago

I just made my attempted implementation of multi-modal llama-adapter-v2 public here. It is just for learning purposes, and if anything is implemented incorrectly, I would really appreciate anyone pointing it out.

theAdamColton commented 1 year ago

I'm also looking forward to seeing the training code. There are some ambiguities in the paper; the way the V2 models are trained is not immediately obvious from it.

As for 3), I would guess that they probably use batches that mix instruction-following and image-caption items.

PanQiWei commented 1 year ago

> I'm also looking forward to seeing the training code. There are some ambiguities in the paper; the way the V2 models are trained is not immediately obvious from it.
>
> As for 3), I would guess that they probably use batches that mix instruction-following and image-caption items.

I didn't mix instruction-following data and image-caption data in one batch, because the weight updates are disjoint (at least I think so).
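
Roughly, my schedule looks like the sketch below (the loader and optimizer names are just placeholders, and I assume an HF-style model whose output exposes .loss):

```python
from itertools import cycle

def train(model, instruct_loader, caption_loader, instruct_opt, caption_opt, num_steps):
    # Alternate updates: even steps use a text-only instruction-following batch,
    # odd steps use an image-caption batch, so each batch only touches its own
    # (disjoint) group of trainable parameters.
    instruct_iter, caption_iter = cycle(instruct_loader), cycle(caption_loader)
    for step in range(num_steps):
        if step % 2 == 0:
            batch, opt = next(instruct_iter), instruct_opt   # instruction-following step
        else:
            batch, opt = next(caption_iter), caption_opt     # image-caption step
        loss = model(**batch).loss    # assumes a transformers-style output with .loss
        opt.zero_grad()
        loss.backward()
        opt.step()
```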

gaopengpjlab commented 1 year ago

A demo and pretrained checkpoint of LLaMA-Adapter V2 will be released in a few days. Sorry for the long wait.

gaopengpjlab commented 1 year ago

Please check out the demo page: http://llama-adapter.opengvlab.com/

gaopengpjlab commented 1 year ago

> If there are no vision_tokens, can adapter_prompt in the first layer still be used to compute the adaption_output and add it to the original attention_output?

If there are no vision tokens (for the GPT4LLM / Alpaca datasets), we generate a pseudo image with all-zero pixels.
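
Conceptually, something like the following sketch (the shapes, the dict-style sample, and the helper name are assumptions for illustration, not the released code):

```python
import torch

def get_image_tensor(sample: dict, image_size: int = 224) -> torch.Tensor:
    # Multi-modal sample: return its real image tensor, e.g. shape (3, H, W).
    if sample.get("image") is not None:
        return sample["image"]
    # Text-only sample (e.g. GPT4LLM / Alpaca): substitute an all-zeros pseudo image
    # so the vision branch still receives an input of the expected shape.
    return torch.zeros(3, image_size, image_size)
```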

gaopengpjlab commented 1 year ago

Pretrained weights have been released. https://github.com/ZrrSkywalker/LLaMA-Adapter/tree/main/llama_adapter_v2_multimodal

PanQiWei commented 1 year ago

Hi! First of all, thank you very much for publishing this great work! I just read the code, and I think the model framework is very similar to X-LLM. Do you think the structure used in your work and theirs will become the standard way to build unified multi-modal LLMs?

gaopengpjlab commented 1 year ago

https://github.com/ZrrSkywalker/LLaMA-Adapter/tree/main/imagebind_LLM

Pretraining/finetuning/inference code has been released. We support image/video/text/audio/point cloud input and bilingual (Chinese/English) responses.

Sorry for the long wait. Hope you enjoy our code.

PanQiWei commented 1 year ago

This is awesome! 🔥 🔥 Thank you so much! ❤️