LiuRicky / ts2_net

[ECCV2022] A pytorch implementation for TS2-Net: Token Shift and Selection Transformer for Text-Video Retrieval

About Inverted Softmax #7

Closed tiesanguaixia closed 1 year ago

tiesanguaixia commented 1 year ago

Hi, thank you for your great work! In your modeling.py, lines 290 to 311, you seem to compute two losses, which differs from the original CLIP4Clip:

if self.training:
    loss = 0.
    ### TODO: need to simplify the code to calculate similarity ####
    sim_matrix_semantic, sim_matrix_global = self.get_similarity_logits(
        sequence_output, visual_output, attention_mask, video_mask,
        shaped=True, loose_type=self.loose_type)
    # text2video
    sim_loss1 = self.loss_fct(sim_matrix_semantic)
    # video2text
    sim_loss2 = self.loss_fct(sim_matrix_semantic.T)
    sim_loss_semantic = (sim_loss1 + sim_loss2) / 2
    loss = loss + self.frame_match_weight * sim_loss_semantic

    # text2video
    sim_loss1 = self.loss_fct(sim_matrix_global)
    # video2text
    sim_loss2 = self.loss_fct(sim_matrix_global.T)
    sim_loss_global = (sim_loss1 + sim_loss2) / 2
    loss = loss + (1 - self.frame_match_weight) * sim_loss_global

    return loss
else:
    return None

Did you use Inverted Softmax in the code? What do sim_matrix_semantic and sim_matrix_global mean?
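For reference, my understanding (an assumption based on CLIP4Clip, from which this code is derived, not confirmed from this repo) is that self.loss_fct is the CrossEn loss: a cross-entropy over each row of the similarity matrix, where the diagonal entries are the matched text-video pairs. A minimal sketch of that symmetric usage:

```python
# Hedged sketch of a CrossEn-style loss, as used symmetrically in the
# snippet above. Assumes the i-th text matches the i-th video in the batch.
import torch
import torch.nn.functional as F

def cross_en(sim_matrix: torch.Tensor) -> torch.Tensor:
    """Mean cross-entropy over rows; diagonal entries are the positives."""
    logpt = F.log_softmax(sim_matrix, dim=-1)
    return -logpt.diag().mean()

sim = torch.randn(8, 8)           # similarity of 8 text/video pairs
loss_t2v = cross_en(sim)          # text -> video direction
loss_v2t = cross_en(sim.T)        # video -> text direction
loss = (loss_t2v + loss_v2t) / 2  # symmetric average, as in the snippet
```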

LiuRicky commented 1 year ago


Thanks for your attention. This is because we originally wanted to compute the similarity in a coarse-to-fine manner. But as shown in line 263, we set self.frame_match_weight = 1.0, which means only one loss actually contributes to the forward and backward pass. However, we keep the code from line 290 to 311 so that you can explore how to do matching in a coarse-to-fine manner.
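To make the weighting concrete, here is a minimal sketch (the helper name combine_losses is hypothetical, not from the repo) of why frame_match_weight = 1.0 leaves only one effective loss:

```python
# Sketch of the weighted combination from the snippet above. With
# frame_match_weight = 1.0, the (1 - w) factor zeroes out the global
# term, so only the semantic loss drives the gradients.
def combine_losses(sim_loss_semantic: float, sim_loss_global: float,
                   frame_match_weight: float = 1.0) -> float:
    return (frame_match_weight * sim_loss_semantic
            + (1 - frame_match_weight) * sim_loss_global)

combine_losses(0.7, 1.3)       # -> 0.7 (global loss ignored)
combine_losses(0.7, 1.3, 0.5)  # -> 1.0 (an actual coarse-to-fine mix)
```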

tiesanguaixia commented 1 year ago


Okay, thanks a lot! Could you provide the sh file for ViT-B/16 that reproduces the result in your paper? Do the learning rate, Top-K, and other training parameters differ from those of ViT-B/32? Thank you very much for your work!

LiuRicky commented 1 year ago


It uses the same settings as ViT-B/32; only the pretrained model weights and the patch size change.