TencentARC / ViT-Lens

[CVPR 2024] ViT-Lens: Towards Omni-modal Representations
https://ailab-cvc.github.io/seed/vitlens/

Something about Training Methodologies and Experimental Approaches for Video Data #8

Closed: cyChen2003 closed this issue 7 months ago

cyChen2003 commented 7 months ago

I'm thoroughly impressed with your project and I'm eager to apply the model to my video data. However, the current TRAIN_INFERENCE.md does not provide the relevant usage information. Could you kindly publish the associated training methodologies or experimental approaches? Your assistance would be greatly appreciated. Thank you!

StanLei52 commented 7 months ago

Hi, thank you very much for your interest.

As mentioned in our paper, we followed ImageBind's setup to ensure a fair comparison. Therefore, we did not train ViT-Lens on video data. Instead, following ImageBind, we used CLIP-ViT to aggregate the frame (image) representations of a video clip and fused the result with the corresponding audio representation (trained by ViT-Lens). You can find the details in Appendix C of the ImageBind paper. I will allocate time in the future to clean and upload this section of the code.
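
To make this concrete, below is a minimal sketch of that recipe, not the authors' released code: encode each sampled frame with a CLIP-ViT, mean-pool the frame embeddings into one video embedding, and fuse it with the ViT-Lens audio embedding. The Hugging Face CLIP checkpoint name, the mean-pooling aggregation, the weighted-sum fusion with `audio_weight`, and the assumption that the audio embedding is precomputed and lives in the same embedding space are all illustrative choices; see ImageBind Appendix C for the actual protocol.

```python
# Minimal sketch (assumptions noted above): CLIP-ViT frame aggregation + audio fusion.
import torch
from PIL import Image
from transformers import CLIPImageProcessor, CLIPVisionModelWithProjection

device = "cuda" if torch.cuda.is_available() else "cpu"

# Assumed CLIP checkpoint for frame encoding (illustrative, not the exact one used).
processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-large-patch14")
clip_vit = CLIPVisionModelWithProjection.from_pretrained(
    "openai/clip-vit-large-patch14"
).to(device).eval()


@torch.no_grad()
def encode_video(frames: list[Image.Image]) -> torch.Tensor:
    """Mean-pool per-frame CLIP-ViT embeddings into a single clip embedding."""
    inputs = processor(images=frames, return_tensors="pt").to(device)
    frame_embeds = clip_vit(**inputs).image_embeds            # (num_frames, dim)
    frame_embeds = torch.nn.functional.normalize(frame_embeds, dim=-1)
    video_embed = frame_embeds.mean(dim=0)                    # aggregate frames
    return torch.nn.functional.normalize(video_embed, dim=-1)


@torch.no_grad()
def fuse_video_audio(video_embed: torch.Tensor,
                     audio_embed: torch.Tensor,
                     audio_weight: float = 0.5) -> torch.Tensor:
    """Fuse normalized video and audio embeddings by a weighted sum (assumed fusion rule)."""
    audio_embed = torch.nn.functional.normalize(audio_embed, dim=-1)
    fused = video_embed + audio_weight * audio_embed
    return torch.nn.functional.normalize(fused, dim=-1)
```

Usage would be along the lines of `fused = fuse_video_audio(encode_video(sampled_frames), vitlens_audio_embed)`, where `sampled_frames` are PIL frames sampled from the clip and `vitlens_audio_embed` is the embedding produced by the ViT-Lens audio encoder.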