RenShuhuai-Andy / TimeChat

[CVPR 2024] TimeChat: A Time-sensitive Multimodal Large Language Model for Long Video Understanding
https://arxiv.org/abs/2312.02051
BSD 3-Clause "New" or "Revised" License

Question about the output of the time-aware frame encoder #28

Closed Mingxiao-Li closed 3 months ago

Mingxiao-Li commented 4 months ago

Hi, first I would like to extend my sincere gratitude for your work in this field. This is very interesting and exciting work.

My question is about the output of the timestamp-aware frame encoder described in the paper. In Figure 2 and the second paragraph of the "Timestamp-aware frame encoder" section, the encoder's output is described as visual tokens, with a default configuration of 32 tokens per frame. However, when I run the code, the output also appears to include timestamp tokens.

I discovered this by setting a breakpoint at line 333 of timechat.py and examining the shapes of frame_hidden_state (4, 96, 47, 768), query_tokens (384, 32, 768), and timestamps_attention_mask (384, 15). These shapes suggest that the encoder's output is a sequence containing both visual and timestamp tokens (32 + 15 = 47), which seems to differ from what is stated in the paper.
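For anyone else checking this, here is a minimal sketch of the shape arithmetic I observed (illustrative names and zero tensors, not the repo's actual forward pass):

```python
import torch

batch, n_frames, dim = 4, 96, 768
n_query, n_time = 32, 15  # 32 learnable queries + 15 timestamp text tokens

# Per-frame Q-Former input: the learnable queries are concatenated with
# the embedded timestamp tokens (InstructBLIP-style instruction tokens).
query_tokens = torch.zeros(batch * n_frames, n_query, dim)     # (384, 32, 768)
time_embeds = torch.zeros(batch * n_frames, n_time, dim)       # (384, 15, 768)
qformer_input = torch.cat([query_tokens, time_embeds], dim=1)  # (384, 47, 768)

# The encoder returns a hidden state for every input position, so the
# per-frame output keeps both token types: 32 + 15 = 47.
frame_hidden_state = qformer_input.view(batch, n_frames, n_query + n_time, dim)
print(frame_hidden_state.shape)  # torch.Size([4, 96, 47, 768])
```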

Could you please confirm whether the output is supposed to include timestamp tokens, or if there might be an error in the process I used to check the shapes?

Many thanks in advance.

RenShuhuai-Andy commented 4 months ago

Hi, thanks for your interest.

Yes, you are right: our current output sequence contains both visual and timestamp tokens. Following InstructBLIP, the timestamp tokens should be removed after the encoder, but I didn't take care of it. My bad.
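Something like the following sketch should do it (assumed variable names, not tested against the repo):

```python
import torch

batch, n_frames, dim = 4, 96, 768
n_query, n_time = 32, 15

# Q-Former output with both token types, as observed at the breakpoint.
frame_hidden_state = torch.zeros(batch, n_frames, n_query + n_time, dim)

# Keep only the first n_query positions (the learnable queries) and drop
# the timestamp-token positions, mirroring how InstructBLIP discards its
# instruction tokens after the Q-Former.
visual_tokens = frame_hidden_state[:, :, :n_query, :]
print(visual_tokens.shape)  # torch.Size([4, 96, 32, 768])
```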

I'm not sure whether it has a negative impact on performance, but I will take a look at it when I have time.

tongda commented 3 months ago

I also noticed this issue.

From my observations on the dense captioning task, the predicted event time has a fair chance of shifting by one (sampled) frame. I guess the reason is that the timestamp tokens are not fused into the image embeddings, so the previous frame's timestamp can be attributed to the next frame by the LLM, as illustrated in the sketch below. Besides, I guess the positional embedding in the video Q-Former may help mitigate this issue.
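To illustrate the guess (this layout is an assumption about how per-frame outputs are flattened for the LLM, not verified against the repo): because each frame's 15 timestamp tokens trail its 32 visual tokens, frame i's timestamps sit immediately before frame i+1's visual tokens, so the LLM could attribute them to the wrong frame.

```python
n_query, n_time = 32, 15
tokens = []
for i in range(3):  # three frames, for illustration
    tokens += [f"vis{i}"] * n_query + [f"time{i}"] * n_time

# Boundary between frame 0 and frame 1: frame 0's timestamps directly
# precede frame 1's visual tokens in the flattened sequence.
print(tokens[45:50])  # ['time0', 'time0', 'vis1', 'vis1', 'vis1']
```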