RenShuhuai-Andy / TimeChat

[CVPR 2024] TimeChat: A Time-sensitive Multimodal Large Language Model for Long Video Understanding
https://arxiv.org/abs/2312.02051
BSD 3-Clause "New" or "Revised" License

Question about prompt #20

Closed Ironieser closed 4 months ago

Ironieser commented 5 months ago

First of all, thanks for your great work.

I ran into some problems when I tried to input some videos.

For example:

Input video: A 20s video from the COIN training set, in which a man demonstrates to an audience how to open a champagne bottle. VideoID: FOrirVpf4rY (from YouTube)

Input: Capture and describe the activity events in the given video, specifying their respective time intervals, and outputting the time intervals in the 'start - end seconds format'

Output: 0.0 - 47.8 seconds, A man is cooking and grilling food in the backyard. 47.8 - 136.9 seconds, He shows the food and rubs the food and then cuts the food up. 136.9 - 212.3 seconds, He sticks the meat in the grill and cooks it, ending with a plate of meat.

Then I tried other prompts from the paper appendix and other videos from COIN, such as 0PxSveUA9mg, a 2-minute video that teaches how to open the bottle. Unfortunately, the results did not improve.

Any suggestions or guidance you can provide would be greatly appreciated. Thank you again for your excellent work!

RenShuhuai-Andy commented 5 months ago

Hi, thanks for your interest.

I have tried the case of COIN/videos_ali/105/FOrirVpf4rY.mp4 with the timechat_7b.pth ckpt.

  1. When the prompt is "Localize a series of activity events in the video, output the start and end timestamp for each event, and describe each event with sentences. The output format of each predicted event should be like: ' start - end seconds, event description '. An specific example is : ' 90.0 - 102.0 seconds, spread margarine on two slices of white bread in the video '. ", the outputs are:
    • 1st run: 0.0 - 28.6 seconds, A person turns on a faucet and opens a door with a bottle in it. 30.4 - 64.6 seconds, The person opens the bottle and pours a drink.
    • 2nd run: 0.0 - 26.9 seconds, A person turns on the water. 27.8 - 61.5 seconds, A person opens a bottle of wine and turns on the water. 62.4 - 64.2 seconds, A person pours water on the bottle.

  2. When the prompt is "Capture and describe the activity events in the given video, specifying their respective time intervals, and outputting the time intervals in the 'start - end seconds format'", the outputs are:
    • 1st run: 0.0 - 16.9 seconds, A person is shown in a kitchen and opens up a cabinet. 16.8 - 37.3 seconds, He turns on the faucet and begins to wash off the dishes. 37.0 - 56.8 seconds, He takes a bottle of wine and puts it in the washer.
    • 2nd run: 0.0 - 12.4 seconds, A person is seen holding up a bottle of wine and turns the lid off and opens it. 12.4 - 27.3 seconds, The man then turns the bottle to open it and pauses while he opens the wine. 27.3 - 62.9 seconds, He pours the wine into a glass as it's shown on the screen and pours the wine.

In my case, the output does not contain anything completely unrelated to the video, such as the backyard grilling you mentioned.
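
For readers comparing runs like the ones above, the 'start - end seconds, description' output can be parsed into structured events. The sketch below is a minimal illustration and not part of the TimeChat code base: `parse_events` and `events_past_duration` are hypothetical helper names, and the regular expression assumes the model follows the requested format. Predicted timestamps that run past the clip length (for example 212.3 seconds on a 20s clip, as in the first report) may indicate that a different video was actually loaded.

```python
import re

# One timestamp such as "0.0" or "47".
_TS = r"\d+(?:\.\d+)?"

# Matches "<start> - <end> seconds, <description>". The lazy description stops
# right before the next timestamp pair, or at the end of the text.
_EVENT = re.compile(
    rf"({_TS})\s*-\s*({_TS})\s*seconds,\s*(.*?)"
    rf"(?=\s*{_TS}\s*-\s*{_TS}\s*seconds,|$)",
    re.S,
)


def parse_events(text):
    """Return a list of (start, end, description) tuples found in text."""
    return [(float(s), float(e), d.strip()) for s, e, d in _EVENT.findall(text)]


def events_past_duration(events, duration):
    """Return the events whose end timestamp exceeds duration (in seconds)."""
    return [ev for ev in events if ev[1] > duration]


if __name__ == "__main__":
    output = (
        "0.0 - 47.8 seconds, A man is cooking and grilling food in the backyard. "
        "47.8 - 136.9 seconds, He shows the food and rubs the food and then cuts the food up."
    )
    for ev in events_past_duration(parse_events(output), duration=20.0):
        print("ends after the clip:", ev)
```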

A few other people have also told me that the zero-shot performance of the timechat_7b.pth ckpt is not as good as expected, which is odd. I'll look into this soon.

Ironieser commented 5 months ago

Thank you for your help! I found that there may be some differences between your input video and mine, although both are FOrirVpf4rY.mp4.

Following the original annotation of FOrirVpf4rY, I use the clip sampled from [start, end], which is about 24s long, rather than the untrimmed original video. This follows previous work, because the step annotations are based on trimmed video clips. I guess that you may be testing the untrimmed 64-second video instead.

"FOrirVpf4rY": {
    "recipe_type": 105,
    "annotation": [
        {
            "id": "180",
            "segment": [
                8.0,
                14.0
            ],
            "label": "turn the suspension link"
        },
        {
            "id": "181",
            "segment": [
                21.0,
                25.0
            ],
            "label": "open the bottle carefully"
        }
    ],
    "video_url": "https://www.youtube.com/embed/FOrirVpf4rY",
    "start": 3.9482121130914445,
    "end": 29.261808339128713,
    "duration": 64.853333,
    "class": "OpenChampagneBottle",
    "subset": "training"
},

I also retried the two prompts you provided; the results are shown below.

Let me try to supply more detailed information to help you address my issue. Thank you for your help.

Video: The trimmed video clip: FOrirVpf4rY

https://github.com/RenShuhuai-Andy/TimeChat/assets/32543247/c23fc935-df4b-43c5-ab32-6e35dff6d145

video, _ = load_video(
    video_path=my_video_path,
    n_frms=32,  # I tried both 32 and 96; neither performed well.
    sampling="uniform",
    return_msg=True,
)

img_list = []
chat_state = conv_llava_llama_2.copy()
chat_state.system = "You are able to understand the visual content that the user provides. Follow the instructions carefully and explain your answers in detail."
msg = chat.upload_video_without_audio(
    video_path=args.video_path,
    conv=chat_state,
    img_list=img_list,
    n_frms=32,  # I also tried n_frms=96.
)
text_input = "Localize a series of activity events in the video, output the start and end timestamp for each event, and describe each event with sentences. The output format of each predicted event should be like: ' start - end seconds, event description '. "
# text_input = "Capture and describe the activity events in the given video, specifying their respective time intervals, and outputting the time intervals in the 'start - end seconds format' "
print(text_input)

chat.ask(text_input, chat_state)

num_beams = args.num_beams
temperature = args.temperature
llm_message = chat.answer(conv=chat_state,
                          img_list=img_list,
                          num_beams=num_beams,
                          temperature=temperature,
                          max_new_tokens=300,
                          max_length=2000)[0]

print(llm_message)
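
A note for readers tuning n_frms: uniform sampling spreads the sampled frames evenly over the clip, so the frame count determines how far apart in time consecutive samples are. The sketch below is a generic illustration in plain Python, not the actual load_video implementation from the repository, which may pick indices differently. `uniform_frame_indices` and `seconds_between_samples` are hypothetical helper names, and the 30 fps frame rate is an assumption.

```python
def uniform_frame_indices(total_frames, n_frms):
    """Pick n_frms frame indices spread evenly over [0, total_frames - 1]."""
    if total_frames <= 0 or n_frms <= 0:
        raise ValueError("total_frames and n_frms must be positive")
    if n_frms == 1:
        return [(total_frames - 1) // 2]  # a single sample: take the middle frame
    step = (total_frames - 1) / (n_frms - 1)
    # Indices repeat when n_frms exceeds total_frames.
    return [round(i * step) for i in range(n_frms)]


def seconds_between_samples(duration, n_frms):
    """Time gap in seconds between consecutive uniformly sampled frames."""
    if n_frms < 2:
        raise ValueError("need at least two frames to define a gap")
    return duration / (n_frms - 1)


if __name__ == "__main__":
    duration = 24.0  # trimmed clip length in seconds, as described in the thread
    fps = 30         # assumed frame rate, not stated in the thread
    total = int(duration * fps)
    for n in (32, 96):
        idx = uniform_frame_indices(total, n)
        gap = seconds_between_samples(duration, n)
        print(f"n_frms={n}: first={idx[0]}, last={idx[-1]}, gap={gap:.2f}s")
```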

Ironieser commented 5 months ago

Additionally, I have found the original instruction tuning data for 'FOrirVpf4rY', which is:

    {
        "video": "COIN/videos_ali/105/FOrirVpf4rY.mp4",
        "QA": [
            {
                "q": "Identify and localize a series of steps or actions occurring in the video, providing start and end timestamps and related descriptions.",
                "a": "8.0 - 14.0 seconds, turn the suspension link.  21.0 - 25.0 seconds, open the bottle carefully."
            }
        ]
    },

Unfortunately, the output is as shown below:

RenShuhuai-Andy commented 4 months ago

Hi @Ironieser, thanks for your information.

I downloaded your FOrirVpf4rY_24s.mp4 video and tested the output of the timechat_7b.pth ckpt.

The visualization of the 32 sampled frames is: [image]

The results are:

Please note that the default input video is set to examples/hotdog.mp4: [image]

Did you set the input video correctly, e.g., args.video_path = 'FOrirVpf4rY_24s.mp4'?
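
One way to rule out this kind of mix-up is to validate the paths before running inference. The snippet earlier in the thread passes my_video_path to load_video but args.video_path to upload_video_without_audio, so the two could point at different files. The sketch below is a hypothetical helper, not part of TimeChat: the name `check_video_paths` is an assumption, and the only thing taken from the thread is the default path examples/hotdog.mp4.

```python
import os

DEFAULT_VIDEO = "examples/hotdog.mp4"  # default input video mentioned in this thread


def check_video_paths(loaded_path, uploaded_path, default=DEFAULT_VIDEO):
    """Return the absolute video path after checking that both inputs agree.

    Raises ValueError if the two paths point at different files and
    FileNotFoundError if the file does not exist. Prints a warning when the
    demo default is being used.
    """
    a = os.path.abspath(loaded_path)
    b = os.path.abspath(uploaded_path)
    if a != b:
        raise ValueError(f"paths differ: {loaded_path!r} vs {uploaded_path!r}")
    if not os.path.isfile(a):
        raise FileNotFoundError(f"video not found: {loaded_path!r}")
    if a == os.path.abspath(default):
        print(f"warning: running on the default demo video {default!r}")
    return a
```

Calling it as check_video_paths(my_video_path, args.video_path) right before load_video would surface a mismatch immediately instead of producing a description of the wrong video.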

RenShuhuai-Andy commented 4 months ago

Following the original annotation of FOrirVpf4rY, I use the clip sampled from [start, end], which is about 24s long, rather than the untrimmed original video. This follows previous work, because the step annotations are based on trimmed video clips. I guess that you may be testing the untrimmed 64-second video instead.

Yes, we use the original 64-second video (for COIN). I believe the timestamp annotations are for the untrimmed video. See https://www.youtube.com/embed/FOrirVpf4rY: the timestamps below match the video content.

{
    "video": "COIN/videos_ali/105/FOrirVpf4rY.mp4",
    "QA": [
        {
            "q": "Identify and localize a series of steps or actions occurring in the video, providing start and end timestamps and related descriptions.",
            "a": "8.0 - 14.0 seconds, turn the suspension link.  21.0 - 25.0 seconds, open the bottle carefully."
        }
    ]
},
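
If the annotations do refer to the untrimmed video, as suggested above, then the ground-truth timestamps have to be shifted by the clip's start offset before they can be compared with predictions on a trimmed clip. The sketch below shows that arithmetic under this assumption. `to_clip_time` is a hypothetical helper, not part of TimeChat or COIN; the start, end, and segment values are copied from the annotation quoted earlier in the thread.

```python
def to_clip_time(segment, clip_start, clip_end):
    """Map a (start, end) segment from untrimmed-video time to trimmed-clip time.

    The segment is clipped to [clip_start, clip_end] and shifted so that the
    clip begins at 0. Returns None if the segment lies outside the clip.
    """
    seg_start, seg_end = segment
    lo = max(seg_start, clip_start)
    hi = min(seg_end, clip_end)
    if lo >= hi:
        return None
    return (round(lo - clip_start, 2), round(hi - clip_start, 2))


if __name__ == "__main__":
    clip_start = 3.9482121130914445   # "start" in the COIN annotation
    clip_end = 29.261808339128713     # "end" in the COIN annotation
    print(f"clip length: {clip_end - clip_start:.2f}s")
    for label, seg in [
        ("turn the suspension link", (8.0, 14.0)),
        ("open the bottle carefully", (21.0, 25.0)),
    ]:
        print(label, "->", to_clip_time(seg, clip_start, clip_end))
```

With these values the two steps land at roughly 4.05 - 10.05 seconds and 17.05 - 21.05 seconds in the trimmed clip, so a model evaluated on the trimmed clip against the unshifted annotations would be penalized by about 4 seconds per boundary.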