RenShuhuai-Andy / TimeChat

[CVPR 2024] TimeChat: A Time-sensitive Multimodal Large Language Model for Long Video Understanding
https://arxiv.org/abs/2312.02051
BSD 3-Clause "New" or "Revised" License

Long video test results did not meet expectations #38

Closed ffiioonnaa closed 2 months ago

ffiioonnaa commented 3 months ago

Hi, thanks for your work! When I used the demo script to test the highlight detection and temporal grounding tasks on my custom video, I found that the output timestamps differ every time I run it. Also, when I input a 30-minute video, the output timestamps are often small, e.g., in the tens or low hundreds of seconds.

rahulkrprajapati commented 3 months ago

Hey @ffiioonnaa, I ran into the same issue. It might be due to frame sampling being capped at 96 frames, and changing the sampling rate would affect accuracy. As a workaround, I split the video into 2-5 minute chunks, ran the same prompt on each chunk, and then corrected each chunk's timestamps by adding the number of seconds that had elapsed in the previous chunks.

The accuracy and the timestamps were still not great for me, but the model does seem to perform better on long videos this way.
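The offset correction described above can be sketched as follows. This is a minimal illustration, not code from the repo: the per-chunk `(start, end)` spans would come from running the TimeChat demo on each chunk (that inference call is not shown), and the 120-second chunk length is just an assumed example value.

```python
def to_global_timestamp(chunk_index: int, chunk_length_s: float, local_ts: float) -> float:
    """Map a timestamp local to one chunk back to the full-video timeline."""
    return chunk_index * chunk_length_s + local_ts

def merge_chunk_results(chunk_results, chunk_length_s: float = 120.0):
    """Merge per-chunk predictions into full-video timestamps.

    chunk_results: one list of (start, end) spans per chunk, in chunk order,
    e.g. the spans parsed from the model's answer for each chunk.
    """
    merged = []
    for i, spans in enumerate(chunk_results):
        for start, end in spans:
            merged.append((
                to_global_timestamp(i, chunk_length_s, start),
                to_global_timestamp(i, chunk_length_s, end),
            ))
    return merged

# Example: a span found 5s into the second 120s chunk maps to 125s overall.
print(merge_chunk_results([[(10.0, 20.0)], [(5.0, 15.0)]]))
# → [(10.0, 20.0), (125.0, 135.0)]
```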

RenShuhuai-Andy commented 3 months ago

Hi, thanks for your interest.

As shown in Table 1 of our paper, the average video duration in the training data is 190 seconds, so the model performs best on videos of roughly that length. When the video is much longer (e.g., half an hour), the model's performance may deteriorate.