Question Regarding Video Frame Processing

I have read your paper and found it very insightful. However, I am still unclear about how to handle video processing as described in the paper. The paper mentions extracting 8 frames from a video, but it is not clear what the next steps are. Are these 8 frames concatenated into a single image, or is there another method you use to process these frames?
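To make the question concrete, here is a minimal sketch of the two interpretations I have in mind. Everything in it is my own assumption and not taken from the paper or the repository: the helper names are hypothetical, the evenly spaced sampling is a guess at how the 8 frames are chosen, and the 224x224 frame size is only an illustrative value.

```python
from typing import List, Tuple


def sample_frame_indices(total_frames: int, num_frames: int = 8) -> List[int]:
    """Hypothetical helper: pick num_frames indices spread evenly over a video.

    Assumption: frames are sampled at the midpoint of equal-length segments.
    The paper only says that 8 frames are extracted, not how they are chosen.
    """
    if total_frames <= 0 or num_frames <= 0:
        raise ValueError("total_frames and num_frames must be positive")
    segment = total_frames / num_frames
    # Clamp to the last valid index in case of rounding at the end of the video.
    return [min(total_frames - 1, int((i + 0.5) * segment)) for i in range(num_frames)]


def concatenated_shape(num_frames: int, height: int, width: int,
                       channels: int = 3) -> Tuple[int, int, int]:
    """Interpretation A: frames tiled side by side into one wide image."""
    return (height, width * num_frames, channels)


def stacked_shape(num_frames: int, height: int, width: int,
                  channels: int = 3) -> Tuple[int, int, int, int]:
    """Interpretation B: frames kept separate along a leading time axis."""
    return (num_frames, height, width, channels)


if __name__ == "__main__":
    print(sample_frame_indices(80))       # which frames would be read
    print(concatenated_shape(8, 224, 224))  # one image fed to the encoder
    print(stacked_shape(8, 224, 224))       # eight images fed to the encoder
```

In interpretation A the encoder would see a single image per video, while in interpretation B it would see a sequence of 8 images. I am not sure which of these, if either, matches what the model actually does.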

I would appreciate it if you could provide more details or point me to any additional resources or code examples that could help clarify this.

Thank you for your time and assistance.

PKU-YuanGroup / Video-LLaVA, issue #169