PKU-YuanGroup / LanguageBind

【ICLR 2024🔥】 Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment
https://arxiv.org/abs/2310.01852
MIT License
549 stars 44 forks source link

finetuning on a classification task #35

Open Sravanthgithub opened 3 months ago

Sravanthgithub commented 3 months ago

Hey, I have some data of images and videos and i want these to get alligned with text. My usecase is just a binary classification. So, my texts are nothing but two sentences - 'The data is live' , 'The data is non live'. So, basically i wanted to increase my model's performance by utilising a multi-modality model. How do i do this? Any resources?