DAMO-NLP-SG / Video-LLaMA

[EMNLP 2023 Demo] Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding
BSD 3-Clause "New" or "Revised" License
2.7k stars, 243 forks

Frame-aware? #142

Closed: jayavanth closed this issue 6 months ago

jayavanth commented 7 months ago

Hello! I wanted to know whether this model is frame-aware. Can I ask questions like "When does the person wearing a yellow jacket appear in this video?" The demo on Hugging Face gives me inaccurate results for such queries.

lixin4ever commented 6 months ago

Thank you for your attention. Technically, it is frame-aware, because we add absolute frame positional embeddings over the frame tokens. However, training data that teaches the model to distinguish between frames is scarce, so this capability is expected to be very weak.
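
For readers unfamiliar with the idea, here is a minimal sketch of what "absolute frame positional embeddings over the frame tokens" could look like. It is plain Python for illustration only and is not the repository's code: the names `make_position_table` and `add_frame_positions` are hypothetical, and the table is filled with random values here, whereas in a trained model these vectors would presumably be learned.

```python
import random


def make_position_table(max_frames, dim, seed=0):
    """Build one vector per frame index (random here; an assumption for the demo)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 0.02) for _ in range(dim)] for _ in range(max_frames)]


def add_frame_positions(frame_tokens, position_table):
    """Add the embedding of frame index t to every token of frame t.

    frame_tokens: list of frames, each a list of tokens, each a list of floats.
    """
    if len(frame_tokens) > len(position_table):
        raise ValueError("more frames than position embeddings")
    encoded = []
    for t, tokens in enumerate(frame_tokens):
        pos = position_table[t]
        encoded.append([[x + p for x, p in zip(token, pos)] for token in tokens])
    return encoded


# Two visually identical frames, each with 2 tokens of dimension 4.
frame = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
video = [frame, frame]

table = make_position_table(max_frames=8, dim=4)
encoded = add_frame_positions(video, table)

# After adding the position vectors, the two frames are no longer identical,
# so downstream layers can in principle tell frame 0 from frame 1.
frames_differ = encoded[0] != encoded[1]
```

The embedding only makes frame order available to the model. Whether the model can answer "when" questions still depends on training data that rewards using it, which is the limitation described above.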