CAMMA-public / SurgVLP

Learning multi-modal representations by watching hundreds of surgical video lectures

How to use? #3

Open 309020726 opened 2 weeks ago

309020726 commented 2 weeks ago

Hello, how can I use this to implement the "text-based video retrieval" function mentioned in the abstract?

Flaick commented 1 day ago

Hello, thank you for your interest. For text-based video retrieval, you can use the following pseudocode:

```python
import torch
import surgvlp

videos = torch.randn((N, 3, H, W))  # batch of N videos (placeholder tensors)
visual_embeds = model(videos, None, mode='video')['img_emb']
text_tokens = surgvlp.tokenize(['your query'], device=device)
text_query_embed = model(None, text_tokens, mode='text')['text_emb']
logits_query = 100.0 * text_query_embed @ visual_embeds.T  # query-video similarity scores
```

`logits_query` is a tensor of shape (1, N); each entry indicates the similarity of the corresponding video to the given query.
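As a minimal follow-up sketch continuing from the snippet above: assuming the embeddings behave like CLIP-style contrastive embeddings (an assumption on my side, not something stated in the repository), L2-normalizing both sides turns the dot product into a cosine similarity, and `topk` then ranks the videos; the normalization step and the `k=5` cutoff are illustrative choices, not part of the documented API:

```python
import torch.nn.functional as F

# Assumption: L2-normalize both embeddings so the dot product becomes a
# cosine similarity, as is standard for CLIP-style contrastive models.
visual_embeds = F.normalize(visual_embeds, dim=-1)
text_query_embed = F.normalize(text_query_embed, dim=-1)

logits_query = 100.0 * text_query_embed @ visual_embeds.T  # shape (1, N)

# Indices of the k best-matching videos for the query (k=5 is arbitrary).
top_scores, top_indices = logits_query.squeeze(0).topk(k=5)
print(top_indices.tolist())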