CAMMA-public / SurgVLP

Learning multi-modal representations by watching hundreds of surgical video lectures

How to use? #3

Open 309020726 opened 2 weeks ago

309020726 commented 2 weeks ago

Hello, how can I use this to implement the "text-based video retrieval" function mentioned in the abstract?

Flaick commented 1 day ago

Hello, thank you for your interest. For text-based video retrieval, you can use the following pseudocode:

```python
import torch
import surgvlp

videos = torch.randn((N, 3, H, W))  # batch of N videos (placeholder tensors)
visual_embeds = model(videos, None, mode='video')['img_emb']
text_tokens = surgvlp.tokenize(['your query'], device=device)
text_query_embed = model(None, text_tokens, mode='text')['text_emb']
logits_query = 100.0 * text_query_embed @ visual_embeds.T  # query-video similarity scores
```

`logits_query` is a tensor of shape (1, N); each entry indicates the similarity of the corresponding video to the given query.
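As a minimal follow-up sketch continuing from the snippet above: assuming the embeddings behave like CLIP-style contrastive embeddings (an assumption on my side, not something stated in the repository), L2-normalizing both sides turns the dot product into a cosine similarity, and `topk` then ranks the videos; the normalization step and the `k=5` cutoff are illustrative choices, not part of the documented API:

```python
import torch.nn.functional as F

# Assumption: L2-normalize both embeddings so the dot product becomes a
# cosine similarity, as is standard for CLIP-style contrastive models.
visual_embeds = F.normalize(visual_embeds, dim=-1)
text_query_embed = F.normalize(text_query_embed, dim=-1)

logits_query = 100.0 * text_query_embed @ visual_embeds.T  # shape (1, N)

# Indices of the k best-matching videos for the query (k=5 is arbitrary).
top_scores, top_indices = logits_query.squeeze(0).topk(k=5)
print(top_indices.tolist())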