HumamAlwassel / XDC

Self-Supervised Learning by Cross-Modal Audio-Video Clustering (NeurIPS 2020)
http://humamalwassel.com/publication/xdc/
MIT License

How to verify the accuracy of XDC #6

Closed FangmingZhou closed 3 years ago

FangmingZhou commented 3 years ago

Thanks for your great work and for sharing it! I am trying to use the pre-trained model 'r2plus1d_18_xdc_ig65m_kinetics' to extract video features for text-video matching.

With the 'irCSN' video features from VMZ, I get an mAP of about 0.18, but with XDC features instead the mAP drops to about 0.08.

Since the FC weights are not provided in the project, I can't quickly verify that I am using the XDC model correctly. (That is, if the FC weights were provided, I could use the model to classify a test video and then check the classification result.)

So, could you help me verify the model so that I can make good use of this GREAT model? :)

HumamAlwassel commented 3 years ago

Hi @FangmingZhou,

Thanks for your interest in our work. The FC layers of XDC are specific to the clustering task solved by XDC, so they cannot be tested on a video because the annotations for that task are not shared. Keep in mind that the irCSN model in VMZ is a much deeper model than r2plus1d_18_xdc_ig65m_kinetics, so it would not be surprising to get better results on the text-video matching task with irCSN.

In any case, make sure of the following before using the XDC pretrained model:

  • load the model following the instructions in the README.md
  • preprocess your dataset the same way we do in the XDC paper (i.e., same fps, spatial crop, etc.)
  • make sure to normalize your input clips using the mean and std from here

Cheers!

FangmingZhou commented 3 years ago

Nice! thank you!
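
For anyone landing here with the same problem, the preprocessing part of the checklist above (spatial crop, then per-channel normalization) can be sketched roughly as below. This is a minimal NumPy sketch, not the authors' pipeline: the function names `center_crop` and `preprocess_clip`, the (T, H, W, C) input layout, the scaling to [0, 1] before normalization, and the channels-first output are all assumptions. The `mean` and `std` arguments are deliberately left to the caller, since the actual values are the ones linked in the reply above and are not reproduced in this thread.

```python
import numpy as np


def center_crop(clip, size):
    """Center-crop a (T, H, W, C) clip to (T, size, size, C)."""
    _, height, width, _ = clip.shape
    top = (height - size) // 2
    left = (width - size) // 2
    return clip[:, top:top + size, left:left + size, :]


def preprocess_clip(clip_uint8, crop_size, mean, std):
    """Crop, scale, and normalize a uint8 clip.

    Assumptions (not confirmed by the thread): frames arrive as
    (T, H, W, C) uint8, pixels are scaled to [0, 1] before the mean/std
    normalization, and the model expects channels first, (C, T, H, W).
    """
    clip = center_crop(clip_uint8, crop_size).astype(np.float32) / 255.0
    mean = np.asarray(mean, dtype=np.float32)  # one value per channel
    std = np.asarray(std, dtype=np.float32)
    clip = (clip - mean) / std                 # broadcasts over the last axis
    return np.transpose(clip, (3, 0, 1, 2))    # (T, H, W, C) -> (C, T, H, W)


# Usage with placeholder numbers: the 0.5 values, the 128x171 frame size,
# and the 112 crop are illustrative only. Substitute the mean/std and the
# crop settings that the XDC authors point to.
dummy_clip = np.zeros((8, 128, 171, 3), dtype=np.uint8)
model_input = preprocess_clip(dummy_clip, crop_size=112,
                              mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5])
print(model_input.shape)
```

Feeding clips that skip any of these steps (or that use a different fps than in pretraining) is a common reason for weak features, so it is worth checking them before comparing against another backbone.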