Open pritamqu opened 1 month ago
In theory, we can take a fairly long video as input (at least an hour), because we compress the video into 96 tokens before it enters the LLM. In practice, I recommend modifying the code to divide a long video into multiple short clips and process them separately; this gives better results. For example, a 64 s video can be divided into 8 segments, which yields 8×96 tokens for the LLM. We will release our long-video version, VideoChat-NeXT, within a month, so stay tuned.
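A minimal sketch of the segment-and-concatenate scheme described above, assuming a per-clip encoder that compresses each clip to 96 tokens (the `encode_clip` function here is a hypothetical stand-in for the real video encoder; the embedding width of 1024 is also an assumption, chosen only to make the shapes concrete):

```python
import numpy as np

NUM_TOKENS = 96  # tokens per clip after the model's compressor

def encode_clip(frames: np.ndarray) -> np.ndarray:
    """Hypothetical stand-in for the real video encoder + compressor.

    The actual model maps a clip to 96 visual tokens before the LLM;
    here we mean-pool the frames into a fixed (96, dim) array purely
    to illustrate the shapes involved.
    """
    dim = 1024  # assumed embedding width
    pooled = frames.mean(axis=(1, 2, 3)).mean()  # collapse clip to a scalar
    return np.tile(pooled, (NUM_TOKENS, dim))    # (96, dim)

def encode_long_video(frames: np.ndarray, seg_len: int) -> np.ndarray:
    """Split a long video into clips of `seg_len` frames, encode each
    clip independently, and concatenate the per-clip tokens along the
    sequence axis before feeding them to the LLM."""
    segments = [frames[i:i + seg_len] for i in range(0, len(frames), seg_len)]
    tokens = [encode_clip(seg) for seg in segments]
    return np.concatenate(tokens, axis=0)  # (num_segments * 96, dim)

# e.g. a 64 s video sampled at 1 fps -> 64 frames, split into 8 clips of 8
video = np.random.rand(64, 224, 224, 3)
tokens = encode_long_video(video, seg_len=8)
print(tokens.shape)  # (768, 1024) == 8 segments x 96 tokens each
```

The key point is only the bookkeeping: each clip contributes a fixed 96-token block, so the LLM sees `num_segments * 96` visual tokens in temporal order.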
Do you suggest segmenting and concatenating video embeddings like this just for inference, even though the model has not been trained in a similar fashion?
We will update our long video version -VideoChat-NeXT - within a month, so stay tuned
any news?
Could you please confirm which video LLM is best suited for long videos from this list: https://huggingface.co/collections/OpenGVLab/internvideo2-6618ccb574bd2f91410df5cd
My guess is InternVideo2-Chat-8B-InternLM, since its LLM has a larger context window. Also, what is the maximum supported number of frames?