OPilgrim closed this issue 4 months ago
LanguageBind and LLaVA are both multi-modal models, but they have different focuses. LanguageBind is designed to align each individual modality to a shared text embedding space, while LLaVA maps images into an LLM for image-text conversation. I think connecting each of LanguageBind's modality encoders to an LLM is more conducive to building a multi-modal LLM, because those encoders have already been aligned to a textual semantic space in advance. In addition, our Video-LLaVA is good evidence of this: it achieves strong performance after pre-aligning videos and images through the LanguageBind encoders. https://github.com/PKU-YuanGroup/Video-LLaVA
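To make the "connect a pre-aligned encoder to an LLM" idea concrete, here is a minimal PyTorch sketch. It is not the actual Video-LLaVA implementation: the dimensions, the `ModalityProjector` name, and the two-layer MLP projector are all assumptions for illustration (LLaVA-1.5 does use an MLP projector, but the details here are made up).

```python
import torch
import torch.nn as nn

class ModalityProjector(nn.Module):
    """Projects features from a modality encoder (e.g. a LanguageBind
    video/image encoder) into the LLM's token-embedding space, so the
    projected features can be prepended to the text tokens."""
    def __init__(self, encoder_dim: int, llm_dim: int):
        super().__init__()
        # Two-layer MLP projector (an assumption for this sketch).
        self.proj = nn.Sequential(
            nn.Linear(encoder_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.proj(x)

# Hypothetical dimensions for illustration only.
encoder_dim, llm_dim = 1024, 4096
projector = ModalityProjector(encoder_dim, llm_dim)

# Stand-in for LanguageBind encoder output: batch of 2, 256 patch tokens.
video_feats = torch.randn(2, 256, encoder_dim)
# Stand-in for embedded text tokens from the LLM's embedding table.
text_embeds = torch.randn(2, 32, llm_dim)

# Project the modality features and prepend them to the text sequence;
# the combined sequence is what the LLM backbone would consume.
inputs = torch.cat([projector(video_feats), text_embeds], dim=1)
print(inputs.shape)  # torch.Size([2, 288, 4096])
```

Because the LanguageBind encoders are already aligned to a text space, training such a projector (and optionally fine-tuning the LLM) is the main remaining alignment work.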
Thank you very much for your reply. It helps me a lot.
Hello! Your LanguageBind is amazing! But I'm new to multimodality, and I was wondering: what is the difference between LanguageBind and LLaVA-1.5? Should I use LLaVA-1.5 or LanguageBind if I want my model to have stronger reasoning while handling multimodal input (currently three modalities at most: text, image, and video)? Considering that LanguageBind may be the better choice if other modalities are added in the future, can LanguageBind be easily combined with LLaVA-1.5, LLaMA, or similar models? I'd like to hear your views on these questions.