Question about prompt - Githubissues

First of all, thanks for your great work.

I ran into some problems when I tried running the model on some videos.

For example:

Input video: A 20s video from the COIN training set. A man demonstrated how to open a champagne bottle to the audience. VideoID: FOrirVpf4rY (from youtube)

Input: Capture and describe the activity events in the given video, specifying their respective time intervals, and outputting the time intervals in the 'start - end seconds format'

Output: 0.0 - 47.8 seconds, A man is cooking and grilling food in the backyard. 47.8 - 136.9 seconds, He shows the food and rubs the food and then cuts the food up. 136.9 - 212.3 seconds, He sticks the meat in the grill and cooks it, ending with a plate of meat.

Then I tried other prompts from the paper appendix and other videos from COIN, such as 0PxSveUA9mg, a 2-minute video that teaches how to open the bottle. Unfortunately, I was unable to get better results.

Any suggestions or guidance you can provide would be greatly appreciated. Thank you again for your excellent work!

Hi, thanks for your interest.

I have tried the case of COIN/videos_ali/105/FOrirVpf4rY.mp4 with the timechat_7b.pth ckpt.

When the prompt is "Localize a series of activity events in the video, output the start and end timestamp for each event, and describe each event with sentences. The output format of each predicted event should be like: ' start - end seconds, event description '. An specific example is : ' 90.0 - 102.0 seconds, spread margarine on two slices of white bread in the video '. ", the outputs are:
- 1st run: 0.0 - 28.6 seconds, A person turns on a faucet and opens a door with a bottle in it. 30.4 - 64.6 seconds, The person opens the bottle and pours a drink.
- 2nd run: 0.0 - 26.9 seconds, A person turns on the water. 27.8 - 61.5 seconds, A person opens a bottle of wine and turns on the water. 62.4 - 64.2 seconds, A person pours water on the bottle.
When the prompt is "Capture and describe the activity events in the given video, specifying their respective time intervals, and outputting the time intervals in the 'start - end seconds format'", the outputs are:
- 1st run: 0.0 - 16.9 seconds, A person is shown in a kitchen and opens up a cabinet. 16.8 - 37.3 seconds, He turns on the faucet and begins to wash off the dishes. 37.0 - 56.8 seconds, He takes a bottle of wine and puts it in the washer.
- 2nd run: 0.0 - 12.4 seconds, A person is seen holding up a bottle of wine and turns the lid off and opens it. 12.4 - 27.3 seconds, The man then turns the bottle to open it and pauses while he opens the wine. 27.3 - 62.9 seconds, He pours the wine into a glass as it's shown on the screen and pours the wine.

In my case, the output does not contain anything completely unrelated to the video, such as the backyard grilling you mentioned...

A few other people have told me that the zero-shot performance of the timechat_7b.pth ckpt is not as good as expected, which is weird. I'll be checking this out soon.

Thank you for your help! I found that there may be some differences between your input video and mine, although both are FOrirVpf4rY.mp4.

Referring to the original annotation of FOrirVpf4rY, I use the video clip sampled from [start, end], which is 24s long, rather than the untrimmed original video. This follows previous work, because the step annotations are based on trimmed video clips. But I guess that you may be testing the original 64-second video.

"FOrirVpf4rY": {
    "recipe_type": 105,
    "annotation": [
        {
            "id": "180",
            "segment": [
                8.0,
                14.0
            ],
            "label": "turn the suspension link"
        },
        {
            "id": "181",
            "segment": [
                21.0,
                25.0
            ],
            "label": "open the bottle carefully"
        }
    ],
    "video_url": "https://www.youtube.com/embed/FOrirVpf4rY",
    "start": 3.9482121130914445,
    "end": 29.261808339128713,
    "duration": 64.853333,
    "class": "OpenChampagneBottle",
    "subset": "training"
},

I also retried the two prompts you provided; the results are shown below.

Prompt 1
- Localize a series of activity events in the video, output the start and end timestamp for each event, and describe each event with sentences. The output format of each predicted event should be like: ' start - end seconds, event description '. An specific example is : ' 90.0 - 102.0 seconds, spread margarine on two slices of white bread in the video '.
- Output: 0.0 - 54.5 seconds, A man is seen speaking to the camera while holding a spatula and leads into him putting butter on the grill and cooking a meal. 54.5 - 145.6 seconds, The man continues moving around the grill and leads into him cutting up pieces and ending with him being done. 145.6 - 214.4 seconds, He is then seen presenting the final product to the camera and speaking to it.
Prompt 2
- Capture and describe the activity events in the given video, specifying their respective time intervals, and outputting the time intervals in the 'start - end seconds format'
- Output: 0.0 - 62.3 seconds, A man is seen speaking to the camera while standing near a grill with various instruments on it. 56.6 - 137.9 seconds, He then pours some oil and spices into a bowl and proceeds to speak to the camera about cooking instructions. 126.7 - 203.7 seconds, He continues cooking the meat on the grill and is shown finishing the cooking with a fork.

Let me try to supply more detailed information to help you address my issue. Thank you for your help.

Video: The trimmed video clip: FOrirVpf4rY

https://github.com/RenShuhuai-Andy/TimeChat/assets/32543247/c23fc935-df4b-43c5-ab32-6e35dff6d145

video, _ = load_video(
    video_path=my_video_path,
    n_frms=32,  # I tried 32 and 96, but neither performed well.
    sampling="uniform",
    return_msg=True,
)

img_list = []
chat_state = conv_llava_llama_2.copy()
chat_state.system =  "You are able to understand the visual content that the user provides. Follow the instructions carefully and explain your answers in detail."
msg = chat.upload_video_without_audio(
    video_path=args.video_path, 
    conv=chat_state,
    img_list=img_list, 
    n_frms=32,  # I tried to  adjust n_frms to 96 too.
)
text_input = "Localize a series of activity events in the video, output the start and end timestamp for each event, and describe each event with sentences. The output format of each predicted event should be like: ' start - end seconds, event description '. "
# text_input = "Capture and describe the activity events in the given video, specifying their respective time intervals, and outputting the time intervals in the 'start - end seconds format' "
print(text_input)

chat.ask(text_input, chat_state)

num_beams = args.num_beams
temperature = args.temperature
llm_message = chat.answer(conv=chat_state,
                          img_list=img_list,
                          num_beams=num_beams,
                          temperature=temperature,
                          max_new_tokens=300,
                          max_length=2000)[0]

print(llm_message)

Additionally, I have found the original instruction tuning data for 'FOrirVpf4rY', which is

    {
        "video": "COIN/videos_ali/105/FOrirVpf4rY.mp4",
        "QA": [
            {
                "q": "Identify and localize a series of steps or actions occurring in the video, providing start and end timestamps and related descriptions.",
                "a": "8.0 - 14.0 seconds, turn the suspension link.  21.0 - 25.0 seconds, open the bottle carefully."
            }
        ]
    },

Unfortunately, the output is as shown below:

Input : Identify and localize a series of steps or actions occurring in the video, providing start and end timestamps and related descriptions.
0.0 - 7.8 seconds, cut the sausage in half. 7.8 - 12.2 seconds, cut the pepper in half. 12.2 - 17.4 seconds, cut the tomato in half. 17.4 - 35.3 seconds, put the sauce in the oven. 35.3 - 76.1 seconds, place the sausage on the sauce. 76.1 - 105.0 seconds, sprinkle salt, pepper. 105.0 - 134.1 seconds, place the pepper on the sausage. 134.1 - 165.0 seconds, sprinkle onions on the sausage. 165.0 - 176.6 seconds, put more salt. 176.6 - 186.2 seconds, place the tomato on the sausage. 186.2 - 199.6 seconds, cover with the lid. 199.6 - 205.0 seconds, put the sauce on the sausage. 205.0 - 210.2 seconds, place the pepper on the sauce.
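One quick sanity check for outputs like the ones above: the predicted timestamps run past 200 seconds, while the trimmed clip is only about 24 seconds long. The sketch below parses the 'start - end seconds, description' format and flags events that end after the clip does. The helper names `parse_events` and `events_beyond_duration` and the one-second tolerance are hypothetical; this is not part of TimeChat.

```python
import re

# Matches a header such as "47.8 - 136.9 seconds,"
HEADER = re.compile(r"(\d+(?:\.\d+)?)\s*-\s*(\d+(?:\.\d+)?)\s*seconds,")


def parse_events(text):
    """Split model output into (start, end, description) tuples."""
    matches = list(HEADER.finditer(text))
    events = []
    for i, m in enumerate(matches):
        # The description runs until the next header or the end of the text.
        desc_end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        description = text[m.end():desc_end].strip()
        events.append((float(m.group(1)), float(m.group(2)), description))
    return events


def events_beyond_duration(events, clip_duration, tolerance=1.0):
    """Return events whose end time exceeds the clip length (plus a tolerance)."""
    return [e for e in events if e[1] > clip_duration + tolerance]


reported = (
    "0.0 - 47.8 seconds, A man is cooking and grilling food in the backyard. "
    "47.8 - 136.9 seconds, He shows the food and rubs the food and then cuts the food up. "
    "136.9 - 212.3 seconds, He sticks the meat in the grill and cooks it, "
    "ending with a plate of meat."
)
suspicious = events_beyond_duration(parse_events(reported), clip_duration=24.0)
print(len(suspicious), "event(s) end after the clip does")
```

If most events are flagged, it is worth checking that the model is actually reading the intended video file before tuning the prompt.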

Hi @Ironieser, thanks for your information.

I downloaded your FOrirVpf4rY_24s.mp4 video and tested the output of the timechat_7b.pth ckpt.

The visualization of 32 frames is:

The results are:

prompt 1
- Input: Localize a series of activity events in the video, output the start and end timestamp for each event, and describe each event with sentences. The output format of each predicted event should be like: ' start - end seconds, event description '. An specific example is : ' 90.0 - 102.0 seconds, spread margarine on two slices of white bread in the video '.
- Output: 0.0 - 24.2 seconds, A person turns on the water to a sink. 0.0 - 24.2 seconds, A person uncorks a bottle of wine.
- llm_message: [['USER', " The video contains 96 frames sampled at 0.0, 0.2, 0.5, 0.8, 1.0, 1.2, 1.5, 1.8, 2.0, 2.3, 2.5, 2.8, 3.0, 3.3, 3.5, 3.8, 4.0, 4.3, 4.6, 4.8, 5.0, 5.3, 5.6, 5.8, 6.1, 6.3, 6.6, 6.8, 7.1, 7.3, 7.6, 7.8, 8.1, 8.4, 8.6, 8.9, 9.1, 9.4, 9.6, 9.9, 10.1, 10.4, 10.6, 10.9, 11.2, 11.4, 11.6, 11.9, 12.2, 12.4, 12.7, 12.9, 13.2, 13.4, 13.7, 13.9, 14.2, 14.4, 14.7, 15.0, 15.2, 15.4, 15.7, 16.0, 16.2, 16.5, 16.7, 17.0, 17.2, 17.5, 17.8, 18.0, 18.2, 18.5, 18.8, 19.0, 19.3, 19.5, 19.8, 20.0, 20.3, 20.5, 20.8, 21.0, 21.3, 21.6, 21.8, 22.0, 22.3, 22.6, 22.8, 23.1, 23.3, 23.6, 23.8, 24.1 seconds. Localize a series of activity events in the video, output the start and end timestamp for each event, and describe each event with sentences. The output format of each predicted event should be like: ' start - end seconds, event description '. An specific example is : ' 90.0 - 102.0 seconds, spread margarine on two slices of white bread in the video '."]] ['0.0 - 24.2 seconds, A person turns on the water to a sink. 0.0 - 24.2 seconds, A person uncorks a bottle of wine.']
prompt 2
- Input: Capture and describe the activity events in the given video, specifying their respective time intervals, and outputting the time intervals in the 'start - end seconds format'
- Output: 0.0 - 13.5 seconds, A man is shown turning a bottle over a sink. 13.4 - 23.4 seconds, He turns it over on a counter and opens the bottle.
- llm_message: [['USER', " The video contains 96 frames sampled at 0.0, 0.2, 0.5, 0.8, 1.0, 1.2, 1.5, 1.8, 2.0, 2.3, 2.5, 2.8, 3.0, 3.3, 3.5, 3.8, 4.0, 4.3, 4.6, 4.8, 5.0, 5.3, 5.6, 5.8, 6.1, 6.3, 6.6, 6.8, 7.1, 7.3, 7.6, 7.8, 8.1, 8.4, 8.6, 8.9, 9.1, 9.4, 9.6, 9.9, 10.1, 10.4, 10.6, 10.9, 11.2, 11.4, 11.6, 11.9, 12.2, 12.4, 12.7, 12.9, 13.2, 13.4, 13.7, 13.9, 14.2, 14.4, 14.7, 15.0, 15.2, 15.4, 15.7, 16.0, 16.2, 16.5, 16.7, 17.0, 17.2, 17.5, 17.8, 18.0, 18.2, 18.5, 18.8, 19.0, 19.3, 19.5, 19.8, 20.0, 20.3, 20.5, 20.8, 21.0, 21.3, 21.6, 21.8, 22.0, 22.3, 22.6, 22.8, 23.1, 23.3, 23.6, 23.8, 24.1 seconds. Capture and describe the activity events in the given video, specifying their respective time intervals, and outputting the time intervals in the 'start - end seconds format'"]] ['0.0 - 13.5 seconds, A man is shown turning a bottle over a sink. 13.4 - 23.4 seconds, He turns it over on a counter and opens the bottle.']
prompt 3
- Input: Identify and localize a series of steps or actions occurring in the video, providing start and end timestamps and related descriptions.
- Output: 0.0 - 4.2 seconds, Open the bottle carefully. 4.6 - 18.3 seconds, Turn the bottle caps. 18.9 - 24.2 seconds, Pull the bottle outwards.
- llm_message: [['USER', ' The video contains 96 frames sampled at 0.0, 0.2, 0.5, 0.8, 1.0, 1.2, 1.5, 1.8, 2.0, 2.3, 2.5, 2.8, 3.0, 3.3, 3.5, 3.8, 4.0, 4.3, 4.6, 4.8, 5.0, 5.3, 5.6, 5.8, 6.1, 6.3, 6.6, 6.8, 7.1, 7.3, 7.6, 7.8, 8.1, 8.4, 8.6, 8.9, 9.1, 9.4, 9.6, 9.9, 10.1, 10.4, 10.6, 10.9, 11.2, 11.4, 11.6, 11.9, 12.2, 12.4, 12.7, 12.9, 13.2, 13.4, 13.7, 13.9, 14.2, 14.4, 14.7, 15.0, 15.2, 15.4, 15.7, 16.0, 16.2, 16.5, 16.7, 17.0, 17.2, 17.5, 17.8, 18.0, 18.2, 18.5, 18.8, 19.0, 19.3, 19.5, 19.8, 20.0, 20.3, 20.5, 20.8, 21.0, 21.3, 21.6, 21.8, 22.0, 22.3, 22.6, 22.8, 23.1, 23.3, 23.6, 23.8, 24.1 seconds. Identify and localize a series of steps or actions occurring in the video, providing start and end timestamps and related descriptions.']] ['0.0 - 4.2 seconds, Open the bottle carefully. 4.6 - 18.3 seconds, Turn the bottle caps. 18.9 - 24.2 seconds, Pull the bottle outwards.']
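The llm_message entries above list the timestamps of the sampled frames, and they span only about 24 seconds, which is a handy way to confirm which video the model actually received. For reference, here is a rough sketch of how evenly spaced frame timestamps can be computed. It is only an illustration under my own assumptions (the function name `uniform_timestamps` and the rounding are hypothetical) and is not the repository's `load_video` implementation.

```python
def uniform_timestamps(num_frames_total, fps, n_frms):
    """Timestamps (in seconds) of n_frms frames spread evenly over a video."""
    n = min(n_frms, num_frames_total)
    if n <= 1:
        return [0.0]
    # Evenly spaced frame indices from the first frame to the last one.
    step = (num_frames_total - 1) / (n - 1)
    indices = [round(i * step) for i in range(n)]
    # Convert frame indices to seconds, rounded to one decimal place.
    return [round(idx / fps, 1) for idx in indices]


# Hypothetical numbers: a 101-frame video at 10 fps, sampled at 5 frames.
print(uniform_timestamps(num_frames_total=101, fps=10.0, n_frms=5))
```

If the last timestamp in the prompt is far larger than the duration of the clip you meant to test, a different file was loaded.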

Please note that the default input video is set to examples/hotdog.mp4. Did you set the input video correctly, e.g., args.video_path = 'FOrirVpf4rY_24s.mp4'?
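In the snippet posted earlier, `load_video` receives `my_video_path` while `upload_video_without_audio` receives `args.video_path`, so the two calls may be reading different files. A small guard like the one below can catch that before inference. The helper `check_video_path` is hypothetical; only the default path examples/hotdog.mp4 comes from this thread.

```python
import os

# Default input video mentioned in this thread.
DEFAULT_VIDEO = "examples/hotdog.mp4"


def check_video_path(video_path, default=DEFAULT_VIDEO):
    """Fail early if the path is still the demo default or does not exist."""
    if os.path.normpath(video_path) == os.path.normpath(default):
        raise ValueError(
            f"video_path is still the default ({default}); pass your own video"
        )
    if not os.path.isfile(video_path):
        raise FileNotFoundError(f"video not found: {video_path}")
    return video_path
```

Calling it once, for example `video_path = check_video_path(args.video_path)`, and then passing that single variable to every loading call keeps the paths in sync.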

Referring to the original annotation of FOrirVpf4rY, I use the video clip sampled from [start, end], which is 24s long, rather than the untrimmed original video. This follows previous work, because the step annotations are based on trimmed video clips. But I guess that you may be testing the original 64-second video.

Yes, we use the original 64-second video (for COIN). I believe the timestamp annotations are for the untrimmed video: see https://www.youtube.com/embed/FOrirVpf4rY, where the timestamps below match the video content.

{
    "video": "COIN/videos_ali/105/FOrirVpf4rY.mp4",
    "QA": [
        {
            "q": "Identify and localize a series of steps or actions occurring in the video, providing start and end timestamps and related descriptions.",
            "a": "8.0 - 14.0 seconds, turn the suspension link.  21.0 - 25.0 seconds, open the bottle carefully."
        }
    ]
},
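If you do want to evaluate on the trimmed clip, the annotated segments would need to be shifted by the clip's start offset, since they refer to the untrimmed video. Below is a minimal sketch using the "start", "end", and "segment" values from the COIN annotation quoted earlier. The helper `to_clip_time` and the rounding to one decimal place are my own assumptions, not something TimeChat or COIN provides.

```python
def to_clip_time(segment, clip_start, clip_end):
    """Map a [start, end] segment from untrimmed-video time to trimmed-clip time."""
    seg_start, seg_end = segment
    # Clamp the segment to the clip, then shift so the clip starts at 0.
    start = max(seg_start, clip_start) - clip_start
    end = min(seg_end, clip_end) - clip_start
    if end <= start:
        return None  # the segment lies entirely outside the clip
    return round(start, 1), round(end, 1)


# Values copied from the COIN annotation for FOrirVpf4rY.
clip_start, clip_end = 3.9482121130914445, 29.261808339128713
segments = {
    "turn the suspension link": [8.0, 14.0],
    "open the bottle carefully": [21.0, 25.0],
}

for label, segment in segments.items():
    print(label, to_clip_time(segment, clip_start, clip_end))
```

The shifted values are what a prediction on the trimmed clip should be compared against; on the untrimmed 64-second video the original timestamps apply unchanged.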

RenShuhuai-Andy / TimeChat

Question about prompt #20
