Any Updates?

Hi, we have just introduced InternVid, a large-scale video-centric multimodal dataset that enables learning powerful and transferable video-text representations for multimodal understanding and generation. We have also presented ViCLIP (you can treat it as a video-clip), a video-text representation learning model based on ViT-L. It was trained on InternVid using contrastive learning, and it shows leading zero-shot action recognition and competitive video retrieval performance. The dataset contains over 7 million videos with a total duration of nearly 760K hours, yielding 234M video clips accompanied by detailed descriptions totaling 4.1B words. You can find more information in our paper at https://arxiv.org/abs/2307.06942.

LAION-AI / video-clip

Any Updates? #4