PKU-YuanGroup / LanguageBind

【ICLR 2024🔥】 Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment
https://arxiv.org/abs/2310.01852
MIT License

Clarification questions about the framework #50

Open felmoreno1726 opened 2 weeks ago

felmoreno1726 commented 2 weeks ago

I'm trying to understand this work in the context of other works in the ecosystem. For example, I'm interested in video. For the video encoder, there is the LoRA-tuned and the fully fine-tuned variant. Can I use the embeddings from these models with an already trained LLM or other model? Can I use these embeddings with Video-LLaVA? Can I use the LanguageBind encoder as a replacement for the Video-LLaVA video encoder (video tower)?

Also, the Gradio demos only show modality-to-modality comparisons. I'm also trying to understand how to do zero-shot classification. Thank you -- someone who is confused but excited and thankful for the work done.

lennartmoritz commented 1 week ago

You likely cannot just use the embeddings with an arbitrary pretrained LLM. The idea of LanguageBind is to produce embeddings for each modality that are aligned to a specific set of text embeddings (from the CLIP text encoder, I think), so they are only meaningful relative to that text space. I don't really understand what you want to do with the Video-LLaVA embeddings here, but there is another issue in this repo with a similar question that you can look up.

Zero-shot classification refers to validating the performance of LanguageBind on a new dataset that was not used in any way for training the model, so no examples from that dataset have been seen by the model before.
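For the "how" part, here is a minimal sketch of CLIP-style zero-shot classification with the video tower, assuming the loading pattern from this repo's README (names like `LanguageBind`, `transform_dict`, `to_device`, `LanguageBindImageTokenizer`, and the checkpoint identifiers are taken from there and may differ in the version you install; the class labels and video path are made up for illustration). The idea is to embed the video and one text prompt per candidate class, then pick the class whose text embedding is most similar to the video embedding.

```python
import torch
# Imports follow the snippet in the LanguageBind README; verify the exact names
# against the README for your installed version.
from languagebind import LanguageBind, to_device, transform_dict, LanguageBindImageTokenizer

device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')

# Load only the video tower (fully fine-tuned variant). The LoRA-tuned variant
# is named 'LanguageBind_Video' in the README's naming scheme.
clip_type = {'video': 'LanguageBind_Video_FT'}
model = LanguageBind(clip_type=clip_type, cache_dir='./cache_dir').to(device)
model.eval()

# The text tokenizer/encoder is shared across modalities; check the README for
# the exact Hugging Face checkpoint ID.
tokenizer = LanguageBindImageTokenizer.from_pretrained(
    'LanguageBind/LanguageBind_Image', cache_dir='./cache_dir/tokenizer_cache_dir')
modality_transform = {c: transform_dict[c](model.modality_config[c]) for c in clip_type}

# One text prompt per candidate class (hypothetical labels for illustration).
class_names = ['playing guitar', 'riding a bike', 'cooking pasta']
prompts = [f'a video of someone {c}' for c in class_names]

videos = ['assets/video/0.mp4']  # hypothetical path

inputs = {
    'video': to_device(modality_transform['video'](videos), device),
    'language': to_device(tokenizer(prompts, max_length=77, padding='max_length',
                                    truncation=True, return_tensors='pt'), device),
}

with torch.no_grad():
    emb = model(inputs)

# Both embeddings live in the shared, language-aligned space, so a similarity
# against the class prompts gives the zero-shot prediction.
logits = emb['video'] @ emb['language'].T          # [num_videos, num_classes]
probs = torch.softmax(logits, dim=-1)
pred = class_names[probs.argmax(dim=-1).item()]
print(probs.cpu().numpy(), '->', pred)
```

This is the same recipe the paper uses for its zero-shot benchmarks: no training on the target dataset, only prompt engineering over the class names.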