Hello, thank you for your interest. For text-based video retrieval you can use the following pseudo code:
import torch
import surgvlp

# `model` and `device` are assumed to be set up beforehand (e.g. via surgvlp.load)
videos = torch.randn((N, 3, H, W))  # batch of N videos (dummy tensor for illustration)
visual_embeds = model(videos, None, mode='video')['img_emb']
text_tokens = surgvlp.tokenize(['your query'], device=device)
text_query_embed = model(None, text_tokens, mode='text')['text_emb']
logits_query = 100.0 * text_query_embed @ visual_embeds.T
logits_query is then a tensor of shape (N,) whose entries give the similarity of each video to the given query; ranking the videos by these scores gives the retrieval result (see the sketch below).
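To make the retrieval step concrete, here is a minimal sketch of turning those scores into a ranked top-k list. It assumes `visual_embeds` and `text_query_embed` come from the snippet above, and it adds L2-normalisation so the dot product behaves like a cosine similarity (an assumption about the embeddings, not something the original snippet does):

```python
import torch
import torch.nn.functional as F

# visual_embeds: (N, D) video embeddings, text_query_embed: (1, D) query embedding
# (from the snippet above). Normalise so the dot product is a cosine similarity.
visual_embeds = F.normalize(visual_embeds, dim=-1)
text_query_embed = F.normalize(text_query_embed, dim=-1)

# (1, N) similarity between the single text query and the N videos
logits_query = 100.0 * text_query_embed @ visual_embeds.T

# Rank videos by similarity and keep the top-k matches
k = 5
scores, indices = logits_query.squeeze(0).topk(k)
for rank, (idx, score) in enumerate(zip(indices.tolist(), scores.tolist()), start=1):
    print(f'rank {rank}: video {idx} (similarity {score:.2f})')
```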
Hello, how can I use this to implement the "text-based video retrieval" function mentioned in the abstract?