RenShuhuai-Andy / TimeChat

[CVPR 2024] TimeChat: A Time-sensitive Multimodal Large Language Model for Long Video Understanding
https://arxiv.org/abs/2312.02051
BSD 3-Clause "New" or "Revised" License

the performance is very low on my own dataset. #22

Closed by onlyonewater 4 months ago

onlyonewater commented 5 months ago

Hi @RenShuhuai-Andy, I think TimeChat is a great work, but when I test it on my own dataset the performance is very low: the mIoU is only 0.076. My question prompt is: "You are given a video from a custom dataset. Please find the visual event described by a sentence in the video, determining its starting and ending times. The format should be: 'The event happens at the start time-end time'. For example, The event 'person turn a light on' happens in the 24.3 - 30.4 seconds. Now I will give you the textual sentence: {}. Please return its start time and end time.", where {} is filled with the query sentence. I set num-beams to 1 and the temperature to 1, and I do not use any quantization. Could you give me some advice on how to improve the performance?
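For reference, a minimal sketch of how such a prompt could be assembled; the template text is copied from the comment above, but the function and variable names are illustrative, not from the TimeChat codebase:

```python
# Illustrative construction of the temporal-grounding prompt quoted above.
# `sentence` is the event description to localize in the video.
PROMPT_TEMPLATE = (
    "You are given a video from a custom dataset. Please find the visual event "
    "described by a sentence in the video, determining its starting and ending times. "
    "The format should be: 'The event happens at the start time-end time'. "
    "For example, The event 'person turn a light on' happens in the 24.3 - 30.4 seconds. "
    "Now I will give you the textual sentence: {}. "
    "Please return its start time and end time."
)

def build_prompt(sentence: str) -> str:
    return PROMPT_TEMPLATE.format(sentence)

print(build_prompt("a person opens the door"))
```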

onlyonewater commented 4 months ago

I think the number of input frames may affect the performance, since most Video LLMs keep only 8 or 96 frames as visual input. That is fine when the video is short, i.e., less than 30 or 60 seconds, but if the video is longer than that, the number of frames should be increased. Do the authors have any comments on this? @RenShuhuai-Andy

RenShuhuai-Andy commented 4 months ago

Hi, thanks for your interest. Please consider providing more information to help analyze the reasons for the low performance, e.g., the duration, domain, and complexity of your videos.

As for the number of frames, yes, I believe it should be increased if the video is long. Note that Gemini 1.5 samples at 1 fps, which yields more than 96 frames once a video is longer than 96 seconds.
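To make the arithmetic concrete, a quick sketch (plain Python, not TimeChat code) of how many frames a given sampling rate implies:

```python
# Frames implied by a target sampling rate: frames = fps * duration.
def frames_needed(duration_s: float, fps: float = 1.0) -> int:
    return int(duration_s * fps)

print(frames_needed(96))   # 96 frames at 1 fps: the default 96-frame budget
print(frames_needed(120))  # 120 frames for a 120-second video at 1 fps
```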

If you want to use more frames in TimeChat, you can change max_frame_pos, n_frms, and num_frm from 96 to any number larger than 96, then directly run inference. However, this will consume more GPU memory, and I'm not sure about the performance in this setting. We will explore a stronger TimeChat with long-context capability in the future.

onlyonewater commented 4 months ago

Thanks for your responses. The videos in my dataset are about 120 seconds long on average, and they are not real-world footage, so maybe there is a domain gap between the training data and my test data? You mentioned the parameters max_frame_pos, n_frms, and num_frm; could you give some explanation of them? I think they all have the same meaning, i.e., the number of input frames.

RenShuhuai-Andy commented 4 months ago

max_frame_pos sets the number of frame position embeddings for the video Q-Former (link1 and link2), while n_frms and num_frm specify the number of input frames.

Accordingly, you can just set max_frame_pos >= n_frms = num_frm.
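For example, raising the frame budget from 96 to 128 while respecting that constraint might look like the following sketch; the values and the dict layout are hypothetical, and where these parameters actually live (config file vs. command-line arguments) depends on your TimeChat setup:

```python
# Hypothetical settings satisfying max_frame_pos >= n_frms == num_frm.
cfg = {
    "max_frame_pos": 128,  # frame position embeddings for the video Q-Former
    "n_frms": 128,         # number of frames sampled from the video
    "num_frm": 128,        # must match n_frms
}
assert cfg["max_frame_pos"] >= cfg["n_frms"] == cfg["num_frm"]
```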

onlyonewater commented 4 months ago

OK, got it, thanks!