the generalization performance is bad when testing on custom videos.

RenShuhuai-Andy / TimeChat

[CVPR 2024] TimeChat: A Time-sensitive Multimodal Large Language Model for Long Video Understanding

https://arxiv.org/abs/2312.02051

BSD 3-Clause "New" or "Revised" License

the generalization performance is bad when testing on custom videos. #8

Closed: dragen1860 closed this issue 6 months ago

dragen1860 commented 7 months ago

Hi, thanks for publishing this work. Although it is a really insightful work, I have to say the performance is not very good when it comes to generalization. I tried some custom in-the-wild videos, and the timestamps are unreasonable and the summary text is not relevant to the video. Have you tested on your own videos? Does it perform well?

RenShuhuai-Andy commented 7 months ago

Hi, thanks for your interest.

Can you show the specific case you tested, including a screenshot of the key frame, the duration of the video, the prompt, and the model output? As shown in Table 2, the zero-shot performance of the model is indeed far from satisfactory in actual usage. However, the model is not expected to produce content that is entirely irrelevant.