Closed: cyChen2003 closed this issue 7 months ago
Hi, thank you very much for your interest.
As mentioned in our paper, we followed ImageBind's setup to ensure a fair comparison. Therefore, we did not train ViT-Lens on video data; instead, following ImageBind, we use CLIP-ViT to aggregate frame (image) representations into a video-clip representation and fuse it with the corresponding audio representation (trained by ViT-Lens). You can find the details in Appendix C of the ImageBind paper. I will allocate time in the future to clean up and upload this part of the code.
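For reference, here is a minimal PyTorch sketch of this aggregate-and-fuse scheme. The encoder handle `clip_vit`, the mean-pooling over frames, and the normalized-average fusion are illustrative assumptions based on the description above, not the actual ViT-Lens code or ImageBind's exact recipe:

```python
import torch
import torch.nn.functional as F

def encode_video(frames: torch.Tensor, clip_vit) -> torch.Tensor:
    """Encode a clip by aggregating per-frame CLIP-ViT embeddings.

    frames: (num_frames, 3, H, W) batch of preprocessed frames.
    clip_vit: any callable image encoder returning (num_frames, D).
    """
    with torch.no_grad():
        frame_emb = clip_vit(frames)        # (num_frames, D) per-frame embeddings
    clip_emb = frame_emb.mean(dim=0)        # aggregate frames -> (D,) clip embedding
    return F.normalize(clip_emb, dim=-1)

def fuse(video_emb: torch.Tensor, audio_emb: torch.Tensor) -> torch.Tensor:
    """Fuse video and audio embeddings by averaging the L2-normalized vectors.

    audio_emb is assumed to come from the ViT-Lens-trained audio encoder.
    """
    fused = F.normalize(video_emb, dim=-1) + F.normalize(audio_emb, dim=-1)
    return F.normalize(fused, dim=-1)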
I'm thoroughly impressed with your project and eager to apply the model to my video data. However, the current TRAIN_INFERENCE.md does not cover this usage. Could you kindly publish the associated training methodology or experimental setup? Your assistance would be greatly appreciated. Thank you!