I just made my attempted implementation of multi-modal LLaMA-Adapter V2 public here. It is just for learning purposes; if there are any incorrect implementation details, I would really appreciate anyone pointing them out.
I'm also looking forward to seeing the training code. There are some ambiguities in the paper; the way the V2 models are trained is not immediately obvious from it.
As for 3.), I would guess that they probably use batches that mix the instruction-following and image-caption items.
I didn't mix instruction-following data and image-caption data in one batch, since the weight updates are disjoint (at least I think so).
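To make the disjoint-update idea concrete, here is a minimal sketch of what I mean: alternating batches from the two sources, each driving its own optimizer over a separate parameter group. The names `model`, `instruction_loader`, `caption_loader`, and the `"adaption"` / `"visual"` name filters are my own placeholders, not the official recipe:

```python
import torch

# Assumed placeholders: `model` is the adapted LLaMA, with language-side
# adaption parameters named "*adaption*" and vision-side parameters
# (projection + gates) named "*visual*".
adaption_params = [p for n, p in model.named_parameters() if "adaption" in n]
visual_params = [p for n, p in model.named_parameters() if "visual" in n]
adaption_opt = torch.optim.AdamW(adaption_params, lr=1e-4)
visual_opt = torch.optim.AdamW(visual_params, lr=1e-4)

def step(batch, optimizer):
    loss = model(**batch).loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Alternate batches from the two sources instead of mixing within one batch;
# each step only updates its own disjoint parameter group.
for instr_batch, caption_batch in zip(instruction_loader, caption_loader):
    step(instr_batch, adaption_opt)   # updates only language-side params
    step(caption_batch, visual_opt)   # updates only vision-side params
```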
The demo and pretrained checkpoint of LLaMA-Adapter V2 will be released in a few days. Sorry for the long wait.
Please check out the demo page: http://llama-adapter.opengvlab.com/
If there are no vision_tokens, can the adapter_prompt in the first layer still be used to compute the adaption_output and add it to the original attention_output?
If there are no vision tokens (for the GPT4LLM / Alpaca datasets), we generate a pseudo image with all-zero pixels.
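For reference, a zero pseudo image along those lines might look like the following. The CLIP-style 3x224x224 input size and the `vision_model` name are my assumptions, not taken from the official code:

```python
import torch

# Text-only samples (e.g. GPT4LLM / Alpaca) get a blank image so the visual
# branch still produces vision tokens for the forward pass.
batch_size = 4
pseudo_image = torch.zeros(batch_size, 3, 224, 224)
# vision_tokens = vision_model(pseudo_image)  # placeholder for the real encoder
```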
Pretrained weights have been released. https://github.com/ZrrSkywalker/LLaMA-Adapter/tree/main/llama_adapter_v2_multimodal
Hi! First of all, thank you very much for publishing this great work! I just read the code, and I think the model framework is very similar to X-LLM. Do you think the structure used in your work and theirs will become the standard way to build unified multi-modal LLMs?
https://github.com/ZrrSkywalker/LLaMA-Adapter/tree/main/imagebind_LLM
The pretraining/finetuning/inference code has been released. We support image/video/text/audio/point cloud input and bilingual (Chinese/English) responses.
Sorry for the long wait. Hope you enjoy our code.
This is awesome! 🔥 🔥 Thank you so much! ❤️
Hi! I attempted to implement the multi-modal ability of llama-adapter-v2 myself, and I've already done most of the code using `transformers` and `peft`. But there are some details I'm not so sure about, and if anyone can help answer these questions, I would be really grateful. ❤ Are the fused input embeddings computed as `embeds = text_embeds + image_projection(vision_model(vision_tokens))`? And I'm really looking forward to seeing the official multi-modal code released.
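For what it's worth, here is one way to read that fusion line; `vision_model`, `image_projection`, the pooling, and the dimensions are illustrative assumptions rather than the official implementation:

```python
import torch
import torch.nn as nn

hidden_dim, clip_dim = 4096, 768  # e.g. LLaMA-7B hidden size, CLIP feature width
image_projection = nn.Linear(clip_dim, hidden_dim)

def fuse(text_embeds, vision_tokens, vision_model):
    # vision_model(vision_tokens) -> (batch, num_patches, clip_dim)
    visual_feats = image_projection(vision_model(vision_tokens))
    # Broadcast a pooled visual feature over the text sequence dimension:
    # one possible reading of `embeds = text_embeds + image_projection(...)`.
    return text_embeds + visual_feats.mean(dim=1, keepdim=True)
```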