farewellthree / STAN

Official PyTorch implementation of the paper "Revisiting Temporal Modeling for CLIP-based Image-to-Video Knowledge Transferring"
Apache License 2.0

Intermediate structure X-CLIP is used for recognition; how was it applied to retrieval? #22

Open Lucky-Light-Sun opened 4 months ago

Lucky-Light-Sun commented 4 months ago

Hi, I noticed that the intermediate-structure method X-CLIP is designed for the recognition task, and its official code does not cover retrieval. How did you obtain the X-CLIP retrieval R@1 metric? If you ran the experiment yourself, could you please share the code? Otherwise, please point me to the paper and code you referred to.

Looking forward to your reply.

Best wishes!


farewellthree commented 4 months ago

The retrieval code for X-CLIP is held by my previous company, and since I left a long time ago it is difficult for me to access it. Additionally, that code was based on MMCV 1.0 and is incompatible with the current version. However, replicating it is simple. We did not use X-CLIP's prompting or MIT (Multi-frame Integration Transformer) modules; we used only the CCT (Cross-frame Communication Transformer) module, which inserts message tokens into the backbone. This only requires slight modifications to the ViT block of CLIP; see the CrossFramelAttentionBlock here.
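For readers who want a starting point, here is a minimal sketch of what a ViT block with cross-frame message tokens might look like. It is not the original retrieval code and not X-CLIP's exact implementation: the class name `MessageTokenBlock`, the batch-first tensor layout, and the use of `nn.GELU` (CLIP itself uses a QuickGELU variant) are assumptions made for illustration, and details such as attention masks and drop-path are left out.

```python
import torch
import torch.nn as nn


class MessageTokenBlock(nn.Module):
    """Hypothetical ViT block with cross-frame message tokens (illustrative only)."""

    def __init__(self, d_model: int, n_head: int, num_frames: int):
        super().__init__()
        self.num_frames = num_frames
        # Message-token branch: one token per frame, attended across frames.
        self.message_fc = nn.Linear(d_model, d_model)
        self.message_ln = nn.LayerNorm(d_model)
        self.message_attn = nn.MultiheadAttention(d_model, n_head, batch_first=True)
        # Standard pre-norm ViT block: self-attention followed by an MLP.
        self.ln_1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_head, batch_first=True)
        self.ln_2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),  # assumption: plain GELU instead of CLIP's QuickGELU
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch * num_frames, tokens, d_model); token 0 is the class token.
        bt, l, d = x.shape
        t = self.num_frames
        b = bt // t

        # 1) Build one message token per frame from that frame's class token.
        msg = self.message_fc(x[:, 0, :]).reshape(b, t, d)

        # 2) Let the message tokens exchange information across the t frames.
        msg_norm = self.message_ln(msg)
        msg = msg + self.message_attn(msg_norm, msg_norm, msg_norm, need_weights=False)[0]

        # 3) Append each frame's message token to that frame's token sequence.
        x = torch.cat([x, msg.reshape(bt, 1, d)], dim=1)

        # 4) Ordinary per-frame self-attention over all tokens plus the message token.
        x_norm = self.ln_1(x)
        x = x + self.attn(x_norm, x_norm, x_norm, need_weights=False)[0]

        # 5) Drop the message token so the sequence length is unchanged, then apply the MLP.
        x = x[:, :l, :]
        x = x + self.mlp(self.ln_2(x))
        return x


# Usage: 2 videos of 8 frames each, 50 tokens per frame, width 64.
block = MessageTokenBlock(d_model=64, n_head=4, num_frames=8).eval()
frames = torch.randn(2 * 8, 50, 64)
with torch.no_grad():
    out = block(frames)
```

Because the message token is removed again in step 5, the block keeps the input shape and could in principle replace the residual attention blocks in CLIP's visual transformer, with the pretrained attention and MLP weights loaded into `attn`, `ln_1`, `ln_2`, and `mlp`. For retrieval, the per-frame class-token features would then still need to be pooled into a video embedding and compared against the text embeddings, which this sketch does not cover.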