LLaVA-KD: A Framework of Distilling Multimodal Large Language Models

[Yuxuan Cai1*](https://scholar.google.com/citations?user=J9lTFAUAAAAJ&hl=en&oi=ao), [Jiangning Zhang2,3*](https://zhangzjn.github.io), [Haoyang He2](https://scholar.google.com/citations?hl=zh-CN&user=8NfQv1sAAAAJ), [Xinwei He4](https://scholar.google.com/citations?user=YSIe_24AAAAJ&hl=en&oi=ao), [Ao Tong1](), [Zhenye Gan3](https://scholar.google.com/citations?user=fa4NkScAAAAJ&hl=zh-CN), [Chengjie Wang3](https://scholar.google.com/citations?hl=zh-CN&user=fqte5H4AAAAJ), [Xiang Bai1](https://scholar.google.com/citations?user=UeltiQ4AAAAJ&hl=en&oi=ao) 1Huazhong University of Science and Technology, 2Zhejiang University, 3Youtu Lab, Tencent, 4Huazhong Agricultural University [[`Paper`](https://arxiv.org/pdf/2410.16236)]

Abstract

The success of Large Language Models (LLM) has led researchers to explore Multimodal Large Language Models (MLLM) for unified visual and linguistic understanding. However, the increasing model size and computational complexity of MLLM limit their use in resource-constrained environments. Small-scale MLLM ($s$-MLLM) aims to retain the capabilities of the large-scale model ($l$-MLLM) while reducing computational demands, but this typically causes a significant decline in performance. To address this issue, we propose a novel LLaVA-KD framework to transfer knowledge from $l$-MLLM to $s$-MLLM. Specifically, we introduce Multimodal Distillation (MDist) to minimize the divergence between the visual-textual output distributions of $l$-MLLM and $s$-MLLM, and Relation Distillation (RDist) to transfer $l$-MLLM's ability to model correlations between visual features. Additionally, we propose a three-stage training scheme to fully exploit the potential of $s$-MLLM: (1) Distilled Pre-Training to align visual-textual representations, (2) Supervised Fine-Tuning to equip the model with multimodal understanding, and (3) Distilled Fine-Tuning to further transfer $l$-MLLM capabilities. Our approach significantly improves performance without altering the small model's architecture. Extensive experiments and ablation studies validate the effectiveness of each proposed component.
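For a concrete picture of the two losses, the sketch below shows one way MDist and RDist could be written in PyTorch: a temperature-scaled KL divergence between teacher and student token distributions, and an MSE match between their visual-token similarity matrices. The temperature value, the cosine-similarity relation, and the MSE matching are illustrative assumptions, not the paper's exact formulation.

```python
# Minimal sketch of MDist and RDist (illustrative assumptions: temperature-scaled
# KL for MDist, cosine self-similarity + MSE matching for RDist).
import torch
import torch.nn.functional as F


def mdist_loss(student_logits: torch.Tensor, teacher_logits: torch.Tensor,
               temperature: float = 2.0) -> torch.Tensor:
    """MDist: KL divergence between the l-MLLM (teacher) and s-MLLM (student)
    output distributions over the visual-textual token sequence."""
    t = temperature
    log_p_s = F.log_softmax(student_logits / t, dim=-1)
    p_t = F.softmax(teacher_logits / t, dim=-1)
    return F.kl_div(log_p_s, p_t, reduction="batchmean") * (t * t)


def rdist_loss(student_visual: torch.Tensor, teacher_visual: torch.Tensor) -> torch.Tensor:
    """RDist: match the pairwise correlation (self-similarity) matrices of the
    visual tokens produced by teacher and student."""
    def relation(feats: torch.Tensor) -> torch.Tensor:
        feats = F.normalize(feats, dim=-1)       # (B, N, D) unit-normalized tokens
        return feats @ feats.transpose(-1, -2)   # (B, N, N) cosine similarities
    return F.mse_loss(relation(student_visual), relation(teacher_visual))
```

In the Distilled Pre-Training and Distilled Fine-Tuning stages, such terms would typically be added on top of the standard language-modeling objective.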


Overview

Figure: overview of the LLaVA-KD framework.


📜 Main Results on 10 Popular Benchmarks

Benchmark results compared with SoTA MLLMs. LLaVA-KD achieves highly competitive results against current small-scale MLLMs. AVG: the average over the nine benchmarks other than MMMU, for comprehensive comparison. $^\dagger$: results reproduced using the official code.

Figure: benchmark comparison (comparison_llavakd).


🛠️ Installation

LLaVA-KD Weights

| Model | Vision Encoder | LLM | CKPTs |
|-------|----------------|-----|-------|
| LLaVA-KD-1B | siglip-so400m-patch14-384 | Qwen/Qwen1.5-0.5B | LLaVA-KD-Base-siglip-Qwen1.5-0.5B |
| LLaVA-KD-2B | siglip-so400m-patch14-384 | Qwen/Qwen1.5-1.8B | LLaVA-KD-Base-siglip-Qwen1.5-1.8B |
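If the checkpoints in the table are hosted on the Hugging Face Hub, a small helper along the lines below could fetch them into `./pretrained_ckpt`. The `<org>` prefix for the LLaVA-KD repository and the target folder layout are placeholders; follow the links in the table for the actual repository IDs.

```python
# Hedged download helper: fetch the vision encoder, LLM, and LLaVA-KD weights
# into ./pretrained_ckpt. The <org> prefix is a placeholder, not the real repo ID.
from huggingface_hub import snapshot_download

CHECKPOINTS = {
    "siglip-so400m-patch14-384": "google/siglip-so400m-patch14-384",
    "Qwen1.5-0.5B": "Qwen/Qwen1.5-0.5B",
    "LLaVA-KD-Base-siglip-Qwen1.5-0.5B": "<org>/LLaVA-KD-Base-siglip-Qwen1.5-0.5B",  # placeholder org
}

for folder, repo_id in CHECKPOINTS.items():
    snapshot_download(repo_id=repo_id, local_dir=f"./pretrained_ckpt/{folder}")
```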

:computer: Evaluation

Please evaluate the model according to Evaluation.md.

Quickstart

Download the pre-trained visual encoder, LLM, and LLaVA-KD weights to `./pretrained_ckpt`, then run:

    python quick_inference.py --model_path ./pretrained_ckpt/LLaVAKD_Model_Path --image_file ./image_test/img_test_1.jpg --query "What is that orange thing behind the girl?"
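To run the same query over several test images, the command can be wrapped in a short script; the sketch below simply reuses the flags and placeholder paths shown in the quickstart command above.

```python
# Sketch: batch the quickstart command over every image in ./image_test,
# reusing the flags shown above (paths are the quickstart placeholders).
import subprocess
from pathlib import Path

MODEL_PATH = "./pretrained_ckpt/LLaVAKD_Model_Path"
QUERY = "What is that orange thing behind the girl?"

for image in sorted(Path("./image_test").glob("*.jpg")):
    subprocess.run(
        ["python", "quick_inference.py",
         "--model_path", MODEL_PATH,
         "--image_file", str(image),
         "--query", QUERY],
        check=True,
    )
```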


:ballot_box_with_check: TODO List

:dizzy: Citation

If you find this code useful, please star the repo and cite the paper:

    @article{cai2024llava,
      title={LLaVA-KD: A Framework of Distilling Multimodal Large Language Models},
      author={Cai, Yuxuan and Zhang, Jiangning and He, Haoyang and He, Xinwei and Tong, Ao and Gan, Zhenye and Wang, Chengjie and Bai, Xiang},
      journal={arXiv preprint arXiv:2410.16236},
      year={2024}
    }

💘 Acknowledgements

We thank the authors of TinyLLaVA and LLaVA for their great work, which supported our research.