Closed by SCZwangxiao 4 months ago
In the code, the image and video encoders are initialized from the same model but trained separately. Does this improve performance?
Thank you for your interest. Decoupling the modalities and training a separate expert model for each usually works better; however, we did not run ablation experiments on this.
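For readers unfamiliar with the setup being discussed, here is a minimal sketch of "initialized from the same model, trained separately". It is not the repository's code: `ToyEncoder`, `sgd_step`, and the weight values are hypothetical stand-ins chosen only to show that each modality gets its own copy of the pretrained weights and that the copies diverge once they are updated independently.

```python
import copy


class ToyEncoder:
    """Hypothetical stand-in for a real encoder; holds a flat list of weights."""

    def __init__(self, weights):
        self.weights = list(weights)

    def sgd_step(self, grads, lr=0.1):
        # Plain gradient-descent update on this encoder's own weights.
        self.weights = [w - lr * g for w, g in zip(self.weights, grads)]


# Hypothetical shared pretrained checkpoint.
pretrained = ToyEncoder([1.0, 2.0, 3.0])

# Each modality gets its own deep copy, so no parameters are shared.
image_encoder = copy.deepcopy(pretrained)
video_encoder = copy.deepcopy(pretrained)

# Identical at initialization.
assert image_encoder.weights == video_encoder.weights

# Separate updates (made-up gradients): each encoder moves independently.
image_encoder.sgd_step([1.0, 1.0, 1.0])
video_encoder.sgd_step([-1.0, 0.0, 2.0])
```

The alternative would be to use a single encoder object for both modalities, in which case every update affects both. Which option performs better is exactly the ablation that was not run.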