RenShuhuai-Andy / TimeChat

[CVPR 2024] TimeChat: A Time-sensitive Multimodal Large Language Model for Long Video Understanding
https://arxiv.org/abs/2312.02051
BSD 3-Clause "New" or "Revised" License
267 stars, 23 forks

What is the relationship between segment and timetoken? #17

Closed (sunwhw closed 5 months ago)

sunwhw commented 5 months ago

Hi, thanks for your great work! I want to ask: what is the relationship between the original segments (contained in seg_prompt) and the time token?

sunwhw commented 5 months ago

Is ASR data used when constructing the instruction data?

RenShuhuai-Andy commented 5 months ago

Hi, thanks for your interest.

The time token is used in Vid2Seq, which uses relative timestamps. Specifically, it quantizes a video of duration $T_i$ into 100 equally spaced timestamps. Accordingly, <time_token_36> represents the time 36/100 * $T_i$ seconds.
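As a rough illustration of the relative mapping described above, here is a minimal Python sketch. The function name `time_token_to_seconds` and the `num_bins` parameter are hypothetical and are not taken from the Vid2Seq or TimeChat code; only the 36/100 * $T_i$ arithmetic comes from the explanation.

```python
def time_token_to_seconds(token_index: int, duration: float, num_bins: int = 100) -> float:
    """Map a relative time-token index to seconds.

    Hypothetical helper: assumes the video of length `duration` (seconds)
    is split into `num_bins` equally spaced timestamps, so that index k
    stands for k / num_bins * duration seconds.
    """
    if duration < 0:
        raise ValueError("duration must be non-negative")
    if not 0 <= token_index <= num_bins:
        raise ValueError("token_index must lie in [0, num_bins]")
    # Multiply before dividing to limit floating-point rounding error.
    return token_index * duration / num_bins


# For a 200-second video, <time_token_36> corresponds to 36/100 * 200 seconds.
seconds = time_token_to_seconds(36, 200.0)
```

The same token index therefore maps to different absolute times in videos of different lengths, which is what makes the representation relative.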

In contrast, we use absolute timestamps, i.e., the original segments contained in seg_prompts. The first number in seg_prompts is the duration (in seconds) of the current video. After that, each pair of numbers gives the start and end time of a segment.
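The layout described above (duration first, then start/end pairs) could be unpacked roughly as follows. This is only a sketch: it assumes the numbers have already been extracted from the prompt text into a flat list, and `parse_seg_numbers` is a hypothetical name, not a function from the TimeChat repository.

```python
from typing import List, Tuple


def parse_seg_numbers(numbers: List[float]) -> Tuple[float, List[Tuple[float, float]]]:
    """Split a flat list of numbers into (duration, [(start, end), ...]).

    Assumed layout (from the description above): the first number is the
    video duration in seconds, and each following pair of numbers is the
    start and end time of one segment, also in seconds.
    """
    if not numbers:
        raise ValueError("expected at least the video duration")
    duration, rest = numbers[0], numbers[1:]
    if len(rest) % 2 != 0:
        raise ValueError("segment boundaries must come in start/end pairs")
    # Group consecutive values two at a time: (start, end).
    segments = [(rest[i], rest[i + 1]) for i in range(0, len(rest), 2)]
    return duration, segments


# Example with made-up values: a 120-second video with two segments.
duration, segments = parse_seg_numbers([120.0, 3.0, 10.5, 40.0, 62.0])
```

Because these values are already absolute seconds, no rescaling by the video duration is needed to read them.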

The training on TimeIT-104k uses ASR data, while fine-tuning and evaluation on YouCook2, Charades, and QVHighlights do not use ASR data.

sunwhw commented 4 months ago

Oh, thanks for your clear reply!