Why not share the backbone parameters between Image and Video?

PKU-YuanGroup / LanguageBind

【ICLR 2024🔥】 Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment

https://arxiv.org/abs/2310.01852

MIT License


Why not share the backbone parameters between Image and Video? #28

Closed: SCZwangxiao closed this issue 4 months ago

SCZwangxiao commented 4 months ago

In the code, the image and video encoders are initialized from the same model but trained separately. Does this improve performance compared with sharing one backbone?

LinB203 commented 4 months ago

Thank you for your interest. Decoupling the modalities and training a separate expert model for each usually works better; however, we did not run ablation experiments on this.
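
To make the two setups discussed in this thread concrete, here is a minimal sketch in plain Python. It is not the LanguageBind code: `pretrained`, `sgd_step`, the parameter name `proj`, and the gradient values are all hypothetical stand-ins chosen for illustration. It only shows the mechanical difference between two encoders that start from the same weights but are updated independently, and a single backbone that receives updates from both modalities.

```python
import copy

# Hypothetical stand-in for a pretrained backbone: parameter name -> weights.
pretrained = {"proj": [0.5, -0.2, 0.1]}

def sgd_step(params, grads, lr=0.1):
    """Apply one gradient-descent update to the given parameter dict."""
    for name, grad in grads.items():
        params[name] = [w - lr * g for w, g in zip(params[name], grad)]

# Decoupled setup (as described in the question): both encoders start from
# the same weights, but each holds its own independent copy.
image_encoder = copy.deepcopy(pretrained)
video_encoder = copy.deepcopy(pretrained)

# Shared setup (the alternative being asked about): one backbone for both.
shared_encoder = copy.deepcopy(pretrained)

# Made-up gradients standing in for the image and video training losses.
image_grads = {"proj": [1.0, 0.0, 0.0]}
video_grads = {"proj": [0.0, 1.0, 0.0]}

# Decoupled: each encoder only ever sees gradients from its own modality.
sgd_step(image_encoder, image_grads)
sgd_step(video_encoder, video_grads)

# Shared: the same weights are pulled by both modalities.
sgd_step(shared_encoder, image_grads)
sgd_step(shared_encoder, video_grads)

# The decoupled encoders no longer hold identical weights.
decoupled_diverged = image_encoder["proj"] != video_encoder["proj"]
print(decoupled_diverged)
```

The decoupled setup stores two full copies of the weights, so it doubles the backbone parameter count in exchange for letting each encoder specialize. Whether that trade-off actually improves accuracy is exactly what the ablation mentioned above would measure.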