AetherCortex / Llama-X

Open Academic Research on Improving LLaMA to SOTA LLM

improve LLaMA for visual understanding like GPT-4 #13

Closed feizc closed 1 year ago

feizc commented 1 year ago

Thanks for the good work!

We have tried to improve the LLaMA model to understand visual information and support multi-modal chatting. We are inspired by the idea that a good ViT, e.g., the CLIP vision encoder, and a well-trained large language model, e.g., LLaMA, linked by a connection network, e.g., an MLP or a Transformer, can cover visual applications, like PaLM-E.
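As a rough illustration of the connection-network idea (not the actual Visual-LLaMA code), the sketch below uses a small MLP to project frozen CLIP patch features into LLaMA's embedding space so the resulting visual tokens can be prepended to the text token embeddings. The class name and the dimensions (1024 for CLIP ViT-L/14 patch features, 4096 for LLaMA-7B hidden size) are assumptions made for this example.

```python
import torch
import torch.nn as nn

class VisualConnector(nn.Module):
    """Hypothetical connection network: maps CLIP vision features
    into the LLaMA embedding space. Dimensions assume CLIP ViT-L/14
    (1024-d) and LLaMA-7B (4096-d); the real code may differ."""

    def __init__(self, clip_dim: int = 1024, llama_dim: int = 4096):
        super().__init__()
        # A simple two-layer MLP as the connection network.
        self.proj = nn.Sequential(
            nn.Linear(clip_dim, llama_dim),
            nn.GELU(),
            nn.Linear(llama_dim, llama_dim),
        )

    def forward(self, clip_features: torch.Tensor) -> torch.Tensor:
        # clip_features: (batch, num_patches, clip_dim) from the CLIP vision encoder.
        # Returns visual "tokens" of shape (batch, num_patches, llama_dim),
        # which are prepended to the text token embeddings fed into LLaMA.
        return self.proj(clip_features)


if __name__ == "__main__":
    connector = VisualConnector()
    dummy_patches = torch.randn(2, 256, 1024)   # stand-in for CLIP patch features
    visual_tokens = connector(dummy_patches)
    print(visual_tokens.shape)                  # torch.Size([2, 256, 4096])
```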

The results on image captioning, VQA, and other multi-modal tasks are promising at the 7B scale, and we call on more people to help test larger models.

Github: https://github.com/feizc/Visual-LLaMA

AetherCortex commented 1 year ago

Hi feizc,

Thanks for kindly reaching out with this solid visual work. We understand your motivation and tried your code today; it offers good insight into improving the visual understanding of the LLaMA model, which is one of the most important capabilities of SOTA and next-generation LLMs. We therefore formally invite you to participate in the research and development of the visual part of Llama-X and look forward to further cooperation in the future. If you are also interested in Llama-X and want to become a core contributor, please check the welcome email from "llama-x@mail.com" and reply with your preferred contact information so that we can have an in-depth discussion.

Thanks,
Llama-X