Jiamian-Wang / T-MASS-text-video-retrieval

Official implementation of "Text Is MASS: Modeling as Stochastic Embedding for Text-Video Retrieval (CVPR 2024 Highlight)"

About inference similarity computation in 'sim_matrix_inference_stochastic' #4

Closed musicman217 closed 2 months ago

musicman217 commented 3 months ago

Hello author, thank you for sharing your excellent work.

While trying to understand how the similarity is computed between the fused video embeddings and the num_txts stochastic text embeddings generated from a pooled video embedding, I got stuck on the permute operation vid_embeds_pooled_per_video_id = vid_embeds_pooled_per_video_id.permute(1, 2, 3, 0).

The pooled video embeddings have shape (b, a, 1, 512), which means every video has been fused with all text embeddings in the whole validation set, so each video owns a fused video embeddings, where b is num_vids and a is num_txts (they are equal in validation).

After the permute, the pooled video tensor has shape (a, 1, 512, b), which means every text has been fused with all video embeddings and therefore owns num_vids fused video embeddings.

My question concerns the bmm operation sims = torch.bmm(text_embeds_per_video_id, vid_embeds_pooled_per_video_id), where the shapes are (b, a, 512) for text and (a, 512, b) for video. This batched multiplication means that for a positive pair in the validation set, e.g. the i-th video and the i-th text, the j-th stochastic text embedding of the i-th video computes similarities with all fused video embeddings that were fused with the i-th text embedding.

This combination really confused me for a while. Why does the pooled video tensor need to be permuted to shape (a, 1, 512, b)? Why not directly perform the bmm as (b, a, 512) x (b, 512, a) -> (b, a, a), which would mean that for the i-th video / i-th text pair, each stochastic text embedding -- generated from the i-th video embedding and its corresponding text embedding -- computes similarities with all fused video embeddings generated from the i-th text embedding and all video embeddings?

Sincerely waiting for your reply.

Below is the part of sim_matrix_inference_stochastic that confuses me:

```python

num_txts, num_vids, max_text_per_vid, embed_dim = text_embeds_per_video_id.shape # (b,a=b,1,512)

vid_embeds_pooled_per_video_id = vid_embeds_pooled_per_video_id.permute(1, 2, 3, 0) # (a,1,512,b)
vid_embeds_pooled_per_video_id = vid_embeds_pooled_per_video_id.reshape(num_vids * max_text_per_vid, embed_dim,
                                                                        num_vids) # (a,512,b)
text_embeds_per_video_id = text_embeds_per_video_id.permute(0, 2, 1, 3) # (b,1,a,512)
text_embeds_per_video_id = text_embeds_per_video_id.reshape(num_vids * max_text_per_vid, num_txts, embed_dim) # (b,a,512)

sims = torch.bmm(text_embeds_per_video_id,vid_embeds_pooled_per_video_id)
# (b,a,512)x(a,512,b)->(b=a,a,b) , b=a means num_vids == num_txts in validation set

sims = sims.view(num_vids, max_text_per_vid, num_txts, num_vids) # (b=a,1,a,b)
sims_diag = torch.stack([sims[i, :, :, i] for i in range(sims.shape[0])], dim=-1)
print(f'>>>check sims_diag={sims_diag.shape}')
sims_diag = sims_diag.permute(1, 0, 2)

```
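
For reference, here is a minimal shape check of the two bmm variants I am comparing. This is my own sketch with toy tensors of the shapes described above, not code from the repo:

```python
import torch

# Toy sizes only: b = num_vids, a = num_txts (equal in validation), d = embed_dim
b, a, d = 4, 4, 512

# Dummy tensors with the shapes described above, not the real embeddings
text_embeds_per_video_id = torch.randn(b, a, 1, d)        # (b, a, 1, 512)
vid_embeds_pooled_per_video_id = torch.randn(b, a, 1, d)  # (b, a, 1, 512)

# Variant used in the repo: permute the pooled video tensor to (a, 1, 512, b)
vid = vid_embeds_pooled_per_video_id.permute(1, 2, 3, 0).reshape(a * 1, d, b)  # (a, 512, b)
txt = text_embeds_per_video_id.permute(0, 2, 1, 3).reshape(b * 1, a, d)        # (b, a, 512)
sims_repo = torch.bmm(txt, vid)   # (b, a, b); the batch dims only line up because b == a

# Variant I expected: batch both tensors over videos directly
txt_naive = text_embeds_per_video_id.squeeze(2)                          # (b, a, 512)
vid_naive = vid_embeds_pooled_per_video_id.squeeze(2).permute(0, 2, 1)   # (b, 512, a)
sims_naive = torch.bmm(txt_naive, vid_naive)                             # (b, a, a)

print(sims_repo.shape, sims_naive.shape)  # same sizes here, but with different meanings
```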

musicman217 commented 3 months ago

Meanwhile, regarding the stack operation sims_diag = torch.stack([sims[i, :, :, i] for i in range(sims.shape[0])], dim=-1): for the i-th video it keeps the similarities against the fused video embedding obtained from the i-th video and the i-th text embedding, where those similarities were computed from all stochastic text embeddings generated by the i-th video and all text embeddings.

Finally, we obtain a matrix where matrix(i, j) means: the i-th text generates a stochastic text embedding -- with the help of the j-th video, which is used to generate the radius R -- and its similarity is computed against the fused video embedding generated from the j-th video and the j-th text embedding.
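
To double-check that I read the indexing correctly, here is a toy illustration (small hypothetical sizes, not the repo code) of what the stack picks out:

```python
import torch

# Toy sizes: b = num_vids, a = num_txts (b == a in validation), t = max_text_per_vid
b, a, t = 3, 3, 1
sims = torch.randn(b, t, a, b)  # (num_vids, max_text_per_vid, num_txts, num_vids)

# Keep only the slices where the video index on the first axis matches the
# video index on the last axis, i.e. sims[i, :, :, i]
sims_diag = torch.stack([sims[i, :, :, i] for i in range(b)], dim=-1)  # (t, a, b)
sims_diag = sims_diag.permute(1, 0, 2)                                 # (a, t, b)
print(sims_diag.shape)  # torch.Size([3, 1, 3])
```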

Since the fused video embedding -- generated from the j-th video and the j-th text embedding -- is chosen to represent the original j-th video embedding for retrieval, is there a data leakage problem here?

Jiamian-Wang commented 2 months ago

Hi there,

Thank you for your interest in this work and I'm sorry for my late reply.

To help you better understand how this works, the following figure shows the logic behind the implementation of the function sim_matrix_inference_stochastic(): implementation_logics

I have also added some comments to the corresponding code snippet to match the figure: implementation_logics_code

Here are some additional comments on this question:

  • One way to check for data leakage is to make sure that, given N text-video input pairs, the output is always N by N (rather than N), so that the "paired information" is never used (see the sketch after this list).
  • For both t2v and v2t, each time we use the given query video to process the text; this does not use the "paired information" and thus does not introduce data leakage.
  • In our implementation, we actually compute the whole cube (L59) and consider all possibilities.
  • Empirically, we have previously seen that the performance of a method becomes extremely high (nearly perfect) if there is data leakage.
  • Both figures can be found in /figures/ of this repo.
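
As a rough illustration of the first point, here is a hedged sketch of that check (not code from this repo; the helper name is hypothetical):

```python
import torch

def assert_full_score_matrix(sims: torch.Tensor, num_pairs: int) -> None:
    """Hypothetical sanity check: the matrix fed to the retrieval metrics must
    contain one score for every (text, video) combination, i.e. be N x N.
    A length-N vector would suggest the ground-truth pairing leaked into inference."""
    assert sims.shape == (num_pairs, num_pairs), \
        f"expected ({num_pairs}, {num_pairs}), got {tuple(sims.shape)}"

# Example with toy values: 5 text-video pairs -> a 5 x 5 score matrix
N = 5
sims = torch.randn(N, N)  # e.g. sims_diag reduced over max_text_per_vid
assert_full_score_matrix(sims, N)
```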

Please let me know if the above illustration resolves your concerns. Any further discussion and questions are welcome!

musicman217 commented 2 months ago


Thank you for your patient reply! Now I understand the implementation logic; it perfectly resolved my concerns.