Vision-CAIR / MiniGPT4-video

Official code for the Goldfish model for long video understanding and MiniGPT4-video for short video understanding
https://vision-cair.github.io/Goldfish_website/
BSD 3-Clause "New" or "Revised" License

Have you tried a small language model, such as TinyLlama or Phi-2? #4

Closed noah003 closed 7 months ago

noah003 commented 7 months ago

As the title describes.

KerolosAtef commented 7 months ago

Hello @noah003, TinyLlama and Phi-2 both have a context window of 2048 tokens, which is not a good fit for videos: we need a large context window to fit more frames. We tried Llama 2 with a 4096-token context window, which accepts 45 frames, and Mistral with 8192, which accepts 90 frames (you can find the details in the implementation details of MiniGPT4-video). You could try TinyLlama or Phi-2 to be efficient in parameter count and speed, but their context would limit you to sampling only 22 frames from each video, which loses more information.

I suggest that if you want to change the LLM, you pick one with a larger context window so it accepts more frames, but keep an eye on memory limits while training.
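For intuition, here is a rough back-of-envelope sketch of how the context window bounds the frame count. The per-frame token cost is an assumption on my part, inferred from the numbers above (4096 tokens → 45 frames, 8192 → 90, 2048 → 22, i.e. about 91 tokens per frame for visual tokens plus interleaved subtitle text); the exact budget also depends on subtitle length and prompt overhead:

```python
# Rough frame budget per LLM context window.
# ASSUMPTION: each sampled frame costs ~91 tokens on average
# (visual tokens + subtitle text), inferred from the reported numbers;
# this is an illustration, not the model's exact token accounting.
TOKENS_PER_FRAME = 91

def max_frames(context_window: int, tokens_per_frame: int = TOKENS_PER_FRAME) -> int:
    """Upper bound on how many frames fit in the given context window."""
    return context_window // tokens_per_frame

for ctx in (2048, 4096, 8192):
    print(f"context {ctx:5d} -> ~{max_frames(ctx)} frames")
# context  2048 -> ~22 frames
# context  4096 -> ~45 frames
# context  8192 -> ~90 frames
```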

noah003 commented 7 months ago


Thanks for your answer