Jiamian-Wang / T-MASS-text-video-retrieval

Official implementation of "Text Is MASS: Modeling as Stochastic Embedding for Text-Video Retrieval (CVPR 2024 Highlight)"

About the training scripts. #1

Closed Uason-Chen closed 6 months ago

Uason-Chen commented 6 months ago

Thank you to the author for sharing the open-source code. I noticed that the official training scripts use slightly different settings for different datasets. For example, the MSRVTT script uses support_loss_weight, but the other two datasets do not. For the LSMDC dataset, the stochastic prior is set to normal and std is set to 3e-3, but these settings are not applied to the other two datasets. For the DiDeMo dataset, there are no settings for support_loss_weight, stochastic prior, or std. I would like to know whether it is indeed necessary to adjust the training parameters slightly for each dataset, or whether there are errors in the current training scripts.
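For reference, here is a minimal sketch of the per-dataset differences described above, written as a Python dict. The key names (support_loss_weight, stochastic_prior, stochastic_prior_std) mirror the options discussed in this issue and may not match the exact argument names in the official scripts; only the values explicitly quoted here are filled in.

```python
# Illustrative summary of the dataset-specific overrides discussed in this issue.
# Key names are assumptions and may differ from the official argparse flags;
# only values quoted in the question are included.
dataset_overrides = {
    "MSRVTT": {"support_loss_weight": None},       # weight value is set in the official script
    "LSMDC":  {"stochastic_prior": "normal",       # Gaussian prior for the stochastic text embedding
               "stochastic_prior_std": 3e-3},
    "DiDeMo": {},                                  # no extra overrides in the released script
}
```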

Jiamian-Wang commented 6 months ago

Thank you for your interest in this work!

We found that different datasets behave differently in the CLIP joint embedding space, so we provide customized settings to (1) achieve a further performance boost and (2) encourage in-depth analysis and future exploration. The proposed T-MASS may achieve even better performance if more settings are explored.

Please consider starring or forking this repo if you find the code helpful. Much appreciated!

Uason-Chen commented 6 months ago

Thanks for the quick response. My problem is solved. I will close this issue and star the repo.

Jiamian-Wang commented 6 months ago

No worries. Thank you!