Open pritamqu opened 1 month ago
In theory, we can take a fairly long video as input (at least an hour), because we compress the video into 96 tokens before it enters the LLM. In practice, I recommend modifying the code to divide a long video into multiple short clips and process them separately; this gives better results. For example, a 64 s video can be divided into 8 segments, which yields 8×96 tokens for the LLM. We will release our long-video version, VideoChat-NeXT, within a month, so stay tuned.
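A minimal sketch of the segment-and-concatenate scheme described above, assuming a per-clip encoder that compresses each clip to 96 tokens (the `encode_clip` function here is a hypothetical stand-in for the real video encoder; the embedding width of 1024 is also an assumption, chosen only to make the shapes concrete):

```python
import numpy as np

NUM_TOKENS = 96  # tokens per clip after the model's compressor

def encode_clip(frames: np.ndarray) -> np.ndarray:
    """Hypothetical stand-in for the real video encoder + compressor.

    The actual model maps a clip to 96 visual tokens before the LLM;
    here we mean-pool the frames into a fixed (96, dim) array purely
    to illustrate the shapes involved.
    """
    dim = 1024  # assumed embedding width
    pooled = frames.mean(axis=(1, 2, 3)).mean()  # collapse clip to a scalar
    return np.tile(pooled, (NUM_TOKENS, dim))    # (96, dim)

def encode_long_video(frames: np.ndarray, seg_len: int) -> np.ndarray:
    """Split a long video into clips of `seg_len` frames, encode each
    clip independently, and concatenate the per-clip tokens along the
    sequence axis before feeding them to the LLM."""
    segments = [frames[i:i + seg_len] for i in range(0, len(frames), seg_len)]
    tokens = [encode_clip(seg) for seg in segments]
    return np.concatenate(tokens, axis=0)  # (num_segments * 96, dim)

# e.g. a 64 s video sampled at 1 fps -> 64 frames, split into 8 clips of 8
video = np.random.rand(64, 224, 224, 3)
tokens = encode_long_video(video, seg_len=8)
print(tokens.shape)  # (768, 1024) == 8 segments x 96 tokens each
```

The key point is only the bookkeeping: each clip contributes a fixed 96-token block, so the LLM sees `num_segments * 96` visual tokens in temporal order.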
Do you suggest segmenting and concatenating video embeddings like this just for inference, even though the model has not been trained in a similar fashion?
We will update our long video version -VideoChat-NeXT - within a month, so stay tuned
any news?
Could you please confirm which video LLM is best suited for long videos from this list: https://huggingface.co/collections/OpenGVLab/internvideo2-6618ccb574bd2f91410df5cd
My guess is InternVideo2-Chat-8B-InternLM, since its LLM has a larger context window. Also, what is the maximum supported number of frames?