Why is the result of temporal video grounding always a multiple of 5?

RenShuhuai-Andy / TimeChat

[CVPR 2024] TimeChat: A Time-sensitive Multimodal Large Language Model for Long Video Understanding

https://arxiv.org/abs/2312.02051

BSD 3-Clause "New" or "Revised" License

Why is the result of temporal video grounding always a multiple of 5? #33

Closed: zhengrongz closed this issue 3 months ago

zhengrongz commented 3 months ago

Hi! When I try my own videos in demo.ipynb, the temporal video grounding results are always multiples of 5, e.g. 0-5 seconds, 10-15 seconds, 0-10 seconds. I can't get results precise to the decimal point like the ones you show. I didn't change any config in demo.ipynb except video_path and the prompt. The prompt I use is "Localize the visual content described by the given textual query 'person picks up a bag' in the video, and output the start and end timestamps in seconds." My videos are 20-30 seconds long on average.

RenShuhuai-Andy commented 3 months ago

Hi, can you upload your videos? You can do so by clicking the "attach files" button.

zhengrongz commented 3 months ago

https://github.com/RenShuhuai-Andy/TimeChat/assets/63049360/13c2cf0e-39fd-44c8-a074-bfe9f7bbbbb2

OK! That's one of my videos. I want to know when certain actions happen in it, e.g. "person picks up a bag" or "person sweep the floor".

RenShuhuai-Andy commented 3 months ago

I have reviewed your case, and indeed, the output is typically 0-5 seconds.

I also examined the predictions on the Charades dataset, where some predicted timestamps are also multiples of 5.

I suspect this is because many queries in the training data have annotations like 0-5s, 5-10s, 10-15s, etc. However, I don't think this affects the overall evaluation.
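Both points above can be checked with a few lines of Python. The sketch below is only an illustration and is not taken from the TimeChat codebase: the function names `fraction_on_grid` and `temporal_iou` are hypothetical, the spans are made-up toy values, and it assumes the evaluation uses a temporal IoU between the predicted and annotated segments. The first function measures how many (start, end) pairs land exactly on multiples of 5; the second shows how much overlap a prediction snapped to a 5-second boundary can still have with a nearby ground-truth segment.

```python
from typing import Iterable, Tuple

Span = Tuple[float, float]  # (start_sec, end_sec)


def fraction_on_grid(spans: Iterable[Span], step: float = 5.0, tol: float = 1e-6) -> float:
    """Return the fraction of spans whose start AND end are multiples of `step`."""
    spans = list(spans)
    if not spans:
        return 0.0

    def on_grid(t: float) -> bool:
        # Distance from t to the nearest multiple of `step`.
        r = t % step
        return min(r, step - r) <= tol

    hits = sum(1 for start, end in spans if on_grid(start) and on_grid(end))
    return hits / len(spans)


def temporal_iou(pred: Span, gt: Span) -> float:
    """Intersection over union of two time segments, in [0, 1]."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    # Length of the smallest segment covering both spans. This equals the
    # true union whenever the spans overlap; if they do not, inter is 0
    # and the result is 0 either way.
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / union if union > 0 else 0.0


if __name__ == "__main__":
    # Toy annotations (assumed values, not real Charades data).
    toy_annotations = [(0.0, 5.0), (5.0, 10.0), (2.3, 7.9), (10.0, 15.0)]
    print(fraction_on_grid(toy_annotations))  # 3 of 4 spans are on the grid: 0.75

    # A prediction snapped to 0-5 s versus a hypothetical annotation of 1.2-5.8 s.
    print(temporal_iou((0.0, 5.0), (1.2, 5.8)))
```

Running `fraction_on_grid` over the real training annotations and over the model's predictions would show whether the two distributions are similarly skewed toward multiples of 5.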

zhengrongz commented 3 months ago

OK, I got it! Thanks for your help! At first I thought something was wrong with my setup and that it was causing weird errors in the model. One last question: did you use the same prompt I mentioned above in your test on the Charades dataset?

RenShuhuai-Andy commented 3 months ago

No, for the evaluation on Charades, I used the prompt from https://github.com/RenShuhuai-Andy/TimeChat/blob/master/prompts/tvg_description_zeroshot.txt.

I only used the prompt you mentioned above for the video you provided.