bigai-nlco / VideoLLaMB

Official Repository of VideoLLaMB: Long Video Understanding with Recurrent Memory Bridges
https://videollamb.github.io/

About STREAMING CAPTION #3

Closed: jun0wanan closed this issue 4 days ago

jun0wanan commented 5 days ago

Hello,

Thank you for sharing such excellent code; it looks very solid overall! I'm particularly interested in the stream module. I'm curious how it works without needing an EOS prediction—what mechanism does it use? For instance, does it output a caption as soon as it detects a change in the video frames? I was wondering if you could explain this in more detail, as it's not fully covered in the paper.

Hope to get your reply :>!

Best wishes

patrick-tssn commented 4 days ago

Thank you for your interest! This behavior is actually a product of our SceneTilling algorithm (Section 2.1). If you are familiar with streaming video captioning (https://arxiv.org/abs/2404.01297), you'll know that the EOS in those works is labeled at the end of an action, and during inference they predict the EOS against a tunable threshold. I think those strategies are still essentially rule-based ;-). In this work, we instead use the SceneTilling algorithm to automatically detect likely transition points between video scenes and emit a caption at each transition point. I hope this explanation addresses your confusion.
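For readers landing here, the transition-point idea above can be sketched roughly as follows. This is a hypothetical, TextTiling-style illustration of segmenting a video by depth scores over adjacent-frame similarities, not the authors' exact SceneTilling implementation; the function name, the `mean + std` cutoff, and the input shape are all my assumptions.

```python
import numpy as np

def scene_boundaries(frame_feats: np.ndarray) -> list[int]:
    """Return frame indices where a scene transition likely occurs.

    frame_feats: (T, D) array of per-frame embeddings (e.g. from a
    vision encoder). Hypothetical sketch only; not the paper's exact
    SceneTilling algorithm.
    """
    # Cosine similarity between each pair of adjacent frames (length T-1).
    norms = np.linalg.norm(frame_feats, axis=1, keepdims=True)
    unit = frame_feats / np.clip(norms, 1e-8, None)
    sims = np.sum(unit[:-1] * unit[1:], axis=1)

    # Depth score: how far each similarity dips below its neighborhood.
    # Deep local valleys in similarity suggest a scene change.
    depth = np.zeros_like(sims)
    for i in range(1, len(sims) - 1):
        left = max(sims[: i + 1].max() - sims[i], 0.0)
        right = max(sims[i:].max() - sims[i], 0.0)
        depth[i] = left + right

    # Simple cutoff (assumption): flag gaps whose depth score stands
    # out by more than one standard deviation above the mean.
    thresh = depth.mean() + depth.std()
    return [i + 1 for i in range(len(depth)) if depth[i] > thresh]
```

A streaming captioner would then call the language model to emit a caption each time a new boundary index is produced, rather than predicting an EOS token per step.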

jun0wanan commented 4 days ago


Thanks!