Hey, I have some data of images and videos and i want these to get alligned with text. My usecase is just a binary classification. So, my texts are nothing but two sentences - 'The data is live' , 'The data is non live'. So, basically i wanted to increase my model's performance by utilising a multi-modality model. How do i do this? Any resources?
Hey, I have some data of images and videos and i want these to get alligned with text. My usecase is just a binary classification. So, my texts are nothing but two sentences - 'The data is live' , 'The data is non live'. So, basically i wanted to increase my model's performance by utilising a multi-modality model. How do i do this? Any resources?