Vision-CAIR / LongVU

https://vision-cair.github.io/LongVU

Question: Video vs Frames? #24

Open cbasavaraj opened 4 days ago

cbasavaraj commented 4 days ago

Hi, I'm running the code on my machine and it works fine. I notice that you are sampling 1 frame per second. Is there no smarter way of sampling frames only when needed, for example at a scene change? I have read the paper and know that such things are done within the model, but is the initial sampling just 1 frame per second?

xiaoqian-shen commented 3 days ago

Yes, we initially sample the video at 1 fps and then decide which frames to reduce based on the extracted feature representations.
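
For anyone skimming this thread, a minimal sketch of that two-stage idea (sample at 1 fps, then drop near-duplicate frames). This is not the repo's actual code; the cosine-similarity comparison against the last kept frame and the threshold value are assumptions:

```python
import numpy as np

def reduce_frames(frame_features: np.ndarray, sim_threshold: float = 0.9) -> list[int]:
    """Given one feature vector per 1-fps frame (shape: num_frames x dim),
    keep the first frame and drop any frame whose feature is nearly
    identical to the most recently kept frame. Threshold is hypothetical."""
    keep = [0]
    for i in range(1, len(frame_features)):
        a, b = frame_features[keep[-1]], frame_features[i]
        cos_sim = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))
        if cos_sim < sim_threshold:  # frame differs enough from the last kept one
            keep.append(i)
    return keep
```

With this kind of scheme, a static shot collapses to a handful of frames while a scene change (low similarity to the previous kept frame) is always retained, which is roughly the behavior described in the paper.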

cbasavaraj commented 2 days ago

Thank you!

Some more questions on how the chat works: in the provided app.py with the chat interface in the browser, I am loading one video and asking multiple questions. Each turn takes about the same time for inference. Does that mean the visual features are recomputed every turn, and the previous turns' intermediate outputs are not reused? Also, is the text chat history (questions and answers) reused for later queries?
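
For concreteness, the kind of reuse I'm asking about would look roughly like this; `encode_video` and `generate` are placeholder names, not the repo's actual API:

```python
# Hypothetical caching pattern: encode the video once, reuse across turns.
video_features = None
history = []  # list of (question, answer) pairs

def chat(video_path, question, model):
    global video_features
    if video_features is None:  # pay the visual encoding cost only on turn 1
        video_features = model.encode_video(video_path)  # placeholder name
    # Replay the text history so later answers can depend on earlier turns.
    prompt = "".join(f"USER: {q}\nASSISTANT: {a}\n" for q, a in history)
    prompt += f"USER: {question}\nASSISTANT:"
    answer = model.generate(video_features, prompt)  # placeholder name
    history.append((question, answer))
    return answer
```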