Details of sliding qformer operation

RenShuhuai-Andy / TimeChat

[CVPR 2024] TimeChat: A Time-sensitive Multimodal Large Language Model for Long Video Understanding

https://arxiv.org/abs/2312.02051

BSD 3-Clause "New" or "Revised" License

Details of sliding qformer operation #11

Closed jihwanp closed 6 months ago

jihwanp commented 6 months ago

Hi, thanks for sharing this wonderful work.

I cannot find the details of the sliding Q-Former operation in the paper. How do the output queries that the Q-Former produces for different frames interact with each other? Which module handles this interaction?

RenShuhuai-Andy commented 6 months ago

Hi, thanks for your interest.

The sliding video Q-Former is used here: https://github.com/RenShuhuai-Andy/TimeChat/blob/master/timechat/models/timechat.py#L327-L345
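
For readers who want the general idea without opening the file, here is a rough sketch of a sliding-window pass over per-frame query outputs. It is a simplified illustration, not the code at the link above. The names `cross_attend` and `sliding_video_qformer`, the single-head attention used as a toy stand-in for the video Q-Former, and all shapes and window/stride values are assumptions made for this example. The point it tries to show: tokens from the frames inside one window are flattened into a single sequence, a shared set of video queries attends to that sequence, and the outputs of all windows are concatenated.

```python
import numpy as np


def cross_attend(queries, keys_values):
    """Single-head cross-attention. A toy stand-in for the video Q-Former (assumption)."""
    d = queries.shape[-1]
    scores = queries @ keys_values.T / np.sqrt(d)      # (num_queries, num_tokens)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ keys_values                       # (num_queries, d)


def sliding_video_qformer(frame_tokens, video_queries, window_size, stride):
    """Apply the stand-in module to each window of frames and concatenate the results.

    frame_tokens:  (T, Q, D) array, Q query outputs per frame (hypothetical shapes).
    video_queries: (Nv, D) array of video queries shared across windows.
    """
    num_frames = frame_tokens.shape[0]
    outputs = []
    for start in range(0, num_frames, stride):
        clip = frame_tokens[start:start + window_size]   # frames in this window
        clip = clip.reshape(-1, clip.shape[-1])          # (frames * Q, D): frames mix here
        outputs.append(cross_attend(video_queries, clip))
    return np.concatenate(outputs, axis=0)               # (num_windows * Nv, D)


rng = np.random.default_rng(0)
frame_tokens = rng.standard_normal((8, 4, 16))   # 8 frames, 4 tokens each, dim 16
video_queries = rng.standard_normal((2, 16))     # 2 video queries
video_tokens = sliding_video_qformer(frame_tokens, video_queries, window_size=4, stride=4)
print(video_tokens.shape)  # (4, 16): 2 windows x 2 queries
```

In this sketch, frames in different windows never see each other inside the module, so the number of output tokens grows with the number of windows rather than staying fixed for the whole video.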