Efficient-Large-Model / VILA

VILA - a multi-image visual language model with training, inference and evaluation recipe, deployable from cloud to edge (Jetson Orin and laptops)
Apache License 2.0

YouCook2 code to generate video clips from raw videos? #61

Open hubenjm opened 1 month ago

hubenjm commented 1 month ago

The youcook2 data repository (http://youcook2.eecs.umich.edu/download) only provides a script to download the raw videos into a folder .../youcook2/raw_videos/. However, the youcook_filtered_v3.json file contains entries like

{
        "id": "TyR6QO1pVCo_4",
        "video": "TyR6QO1pVCo_4.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "Create a compact narrative representing the video presented.\n<video>"
            },
            {
                "from": "gpt",
                "value": "pour the rice into a bowl"
            }
        ],
        "frame_count": 631,
        "fps": 29.97002997002997
}

and in data_mixtures.py, the definition of the youcook2 mixture references video files from the directory video_data_clipped.

Could you provide details on how you generated the clipped videos, or share the script used to do it? I'm guessing it was done by reading the youcookii_annotations_trainval.json file and using ffmpeg to split each raw video into its corresponding clips, but any confirmation/details would be helpful.

XueFuzhao commented 1 month ago

Yes, exactly! You can use the annotation file and ffmpeg to split the videos into smaller clips.
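
For reference, a minimal sketch (not the authors' script) of that approach. It assumes the standard youcookii_annotations_trainval.json layout ("database" -> video id -> "annotations" -> list of segments with "id" and "segment": [start_sec, end_sec]) and that clip names follow the "<video_id>_<segment_id>.mp4" pattern seen in youcook_filtered_v3.json; the paths are placeholders.

import json
import subprocess
from pathlib import Path

RAW_DIR = Path("youcook2/raw_videos")           # downloaded raw videos (placeholder path)
CLIP_DIR = Path("youcook2/video_data_clipped")  # output directory (placeholder path)
CLIP_DIR.mkdir(parents=True, exist_ok=True)

with open("youcookii_annotations_trainval.json") as f:
    database = json.load(f)["database"]

for video_id, meta in database.items():
    # The download script may nest videos in subdirectories, so search recursively.
    matches = list(RAW_DIR.rglob(f"{video_id}.*"))
    if not matches:
        continue
    src = matches[0]
    for ann in meta["annotations"]:
        start, end = ann["segment"]  # segment boundaries in seconds
        dst = CLIP_DIR / f"{video_id}_{ann['id']}.mp4"
        # -ss before -i seeks the input; -t sets the clip duration.
        # Re-encoding keeps the cut exact; "-c copy" would be faster but
        # snaps to keyframes.
        subprocess.run(
            ["ffmpeg", "-y", "-ss", str(start), "-i", str(src),
             "-t", str(end - start), str(dst)],
            check=True,
        )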

lucasjinreal commented 1 month ago

Does VILA randomly sample frames from each video and send them to the ViT?

Or does it use all 631 frames directly during training?

XueFuzhao commented 1 month ago

Hi, we uniformly sample 8 frames from each video clip.
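
In code, uniform sampling here likely amounts to picking 8 evenly spaced frame indices across the clip; the bin-center spacing below is an assumption about the exact spacing, not VILA's implementation.

def uniform_frame_indices(frame_count: int, num_frames: int = 8) -> list[int]:
    # Take the center of each of num_frames equal-length bins over [0, frame_count).
    step = frame_count / num_frames
    return [int((i + 0.5) * step) for i in range(num_frames)]

# For the 631-frame clip above:
print(uniform_frame_indices(631))  # [39, 118, 197, 276, 354, 433, 512, 591]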

lucasjinreal commented 1 month ago

@XueFuzhao Is that uniform sampling of 8 frames out of the 631 in the example above? And how are the multiple images fed into s2-siglip? Thanks for the clarification.