RenShuhuai-Andy / TimeChat

[CVPR 2024] TimeChat: A Time-sensitive Multimodal Large Language Model for Long Video Understanding
https://arxiv.org/abs/2312.02051
BSD 3-Clause "New" or "Revised" License

Can this model do QA tasks? #26

Closed by leexinhao 3 months ago

leexinhao commented 4 months ago

I find that the model has a hard time outputting one of the given options. My prompt is:

Detect and report the timestamp of the video frame that semantically matches the given textual query "puts the bread on the plate ". 
Choose the correct option to the following question: 
(0) 33.59402815508409, (1) 37.721428155084084, (2) 39.995228155084085, (3) 50.98892815508409, (4) 55.96002815508409, 
Best Option: (
RenShuhuai-Andy commented 4 months ago

Hi, thanks for your interest.

I think you should do some prompt engineering, like:

text_input = "In the context of a culinary event where someone 'puts the bread on the plate', determine the closest time from the following options:"\
"(A) 33.5 seconds; (B) 37.7 seconds; (C) 39.9 seconds; (D) 50.9 seconds; (E) 55.9 seconds."\
"Please select the correct option by responding with the corresponding letter (A)-(E). Avoid outputting anything other than the designated letters."


The output is:

The closest time is (B) 37.7 seconds.
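
For anyone who wants to script this, here is a minimal sketch of the same idea: build the lettered prompt from raw timestamps and map the model's reply back to an option index. The helper names `build_choice_prompt` and `parse_choice` are hypothetical and not part of the TimeChat codebase. Truncating to one decimal is an assumption made to match the option values in the prompt above, and the prompt wording is paraphrased rather than copied exactly.

```python
import math
import re
import string


def build_choice_prompt(query, timestamps):
    """Format candidate timestamps as lettered options, truncated to 0.1 s."""
    letters = string.ascii_uppercase[:len(timestamps)]
    options = "; ".join(
        f"({letter}) {math.floor(t * 10) / 10:.1f} seconds"
        for letter, t in zip(letters, timestamps)
    )
    return (
        f"In the context of an event where someone '{query.strip()}', "
        "determine the closest time from the following options: "
        f"{options}. "
        "Please select the correct option by responding with the corresponding "
        f"letter (A)-({letters[-1]}). "
        "Avoid outputting anything other than the designated letters."
    )


def parse_choice(output, timestamps):
    """Return the index of the first lettered option in the reply, or None."""
    letters = string.ascii_uppercase[:len(timestamps)]
    match = re.search(rf"\(([{letters}])\)", output)
    return letters.index(match.group(1)) if match else None


# Timestamps taken from the original question in this thread.
timestamps = [
    33.59402815508409,
    37.721428155084084,
    39.995228155084085,
    50.98892815508409,
    55.96002815508409,
]
text_input = build_choice_prompt("puts the bread on the plate ", timestamps)
choice = parse_choice("The closest time is (B) 37.7 seconds.", timestamps)
```

Parsing the letter with a regular expression tolerates replies like the one above, where the model repeats the option text instead of answering with the letter alone.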
RenShuhuai-Andy commented 3 months ago

@leexinhao Hi, we found that reducing lora_alpha from 32 to 20 during inference can be very helpful for QA tasks. See https://github.com/RenShuhuai-Andy/TimeChat/blob/master/docs/FAQ.md#3-how-to-better-instruct-the-model-to-perform-qa-or-other-specialized-tasks
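
As a rough illustration of why this setting matters, the sketch below assumes the standard LoRA formulation, in which the low-rank update is multiplied by `lora_alpha / r` before being added to the frozen weights. The rank `r` used here is a placeholder and is not taken from the TimeChat config. The helper `lora_scale` is hypothetical.

```python
def lora_scale(lora_alpha, r):
    """Scaling factor applied to the low-rank update in standard LoRA."""
    return lora_alpha / r


r = 8  # placeholder rank (assumption), not read from the TimeChat config

# Relative strength of the adapter after lowering lora_alpha from 32 to 20.
ratio = lora_scale(20, r) / lora_scale(32, r)
print(ratio)  # 0.625
```

Under that assumption, the ratio does not depend on `r`: lowering lora_alpha from 32 to 20 scales the adapter's contribution to 62.5% of its original strength, which shifts the model's behavior back toward the base LLM.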