PKU-YuanGroup / LanguageBind

【ICLR 2024🔥】 Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment
https://arxiv.org/abs/2310.01852
MIT License

What's the difference between LanguageBind and LLaVA-1.5 #26

Closed: OPilgrim closed this issue 4 months ago

OPilgrim commented 5 months ago

Hello! Your LanguageBind is amazing! But I'm new to multimodality, and I was wondering: what's the difference between LanguageBind and LLaVA-1.5? Should I use LLaVA-1.5 or LanguageBind if I want my model to have more reasoning power while handling multimodal input (currently at most three modalities: text, image, and video)? Considering that LanguageBind may be the better choice if other modalities are added later, can LanguageBind be easily combined with LLaVA-1.5, LLaMA, etc.? I'd like to hear your views on these questions.

LinB203 commented 5 months ago

LanguageBind and LLaVA are both multi-modal models, but they have different focuses. LanguageBind is designed to align each individual modality to a shared text embedding space, while LLaVA maps images into an LLM for image-text conversations. I think connecting LanguageBind's per-modality encoders to an LLM is the more promising route to a multi-modal LLM, because those encoders have already been aligned to a common textual semantic space in advance. In addition, our Video-LLaVA is good evidence of this: it achieves strong performance by aligning video and image features in advance through the LanguageBind encoders. https://github.com/PKU-YuanGroup/Video-LLaVA
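
For intuition, here is a minimal sketch of the connection described above, not the actual Video-LLaVA code: because the LanguageBind encoders already produce features in a text-aligned space, a lightweight projection is enough to map them into the LLM's token-embedding space. The class name, MLP design, and dimensions (1024 for the encoder, 4096 for the LLM) are assumptions chosen for illustration.

```python
import torch
import torch.nn as nn

class ModalityProjector(nn.Module):
    """Maps pre-aligned encoder features into the LLM's input embedding space.
    A small MLP usually suffices because the heavy lifting (cross-modal
    alignment) was already done by LanguageBind pretraining."""
    def __init__(self, encoder_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(encoder_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, encoder_features: torch.Tensor) -> torch.Tensor:
        # encoder_features: (batch, num_patches_or_frames, encoder_dim)
        return self.proj(encoder_features)

# Usage: concatenate projected visual tokens with the text token embeddings
# before feeding them to the (frozen or fine-tuned) LLM.
projector = ModalityProjector()
video_features = torch.randn(1, 2048, 1024)  # e.g. 8 frames x 256 patches (illustrative)
visual_tokens = projector(video_features)    # (1, 2048, 4096)
text_embeds = torch.randn(1, 32, 4096)       # embeddings of the text prompt
llm_inputs = torch.cat([visual_tokens, text_embeds], dim=1)
```

The same projector pattern extends to any new modality (audio, depth, thermal, ...) whose encoder was aligned to the shared text space, which is why pre-alignment makes adding modalities to an LLM easier.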

OPilgrim commented 5 months ago

Thank you very much for your reply. It helps me a lot.