Efficient-Large-Model / VILA

VILA - a multi-image visual language model with training, inference and evaluation recipe, deployable from cloud to edge (Jetson Orin and laptops)
Apache License 2.0

YouCook2 code to generate video clips from raw videos? #61

Open hubenjm opened 1 month ago

hubenjm commented 1 month ago

The youcook2 data repository (http://youcook2.eecs.umich.edu/download) only provides a script to download the raw videos into a folder .../youcook2/raw_videos/. However, the youcook_filtered_v3.json file contains entries like

{
        "id": "TyR6QO1pVCo_4",
        "video": "TyR6QO1pVCo_4.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "Create a compact narrative representing the video presented.\n<video>"
            },
            {
                "from": "gpt",
                "value": "pour the rice into a bowl"
            }
        ],
        "frame_count": 631,
        "fps": 29.97002997002997
}

and in data_mixtures.py, the definition of the youcook2 mixture references video files from the directory video_data_clipped.

Could you provide details on how you generated the clipped videos, or share the script used to do it? I'm guessing it was done by reading the youcookii_annotations_trainval.json file and using ffmpeg to split each raw video into its corresponding clips, but any confirmation/details would be helpful.

XueFuzhao commented 1 month ago

Yes, exactly! You can use the annotation file and ffmpeg to split the videos into smaller clips.
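
For reference, a minimal sketch (not the authors' script) of that approach. It assumes the standard youcookii_annotations_trainval.json layout ("database" -> video id -> "annotations" -> list of segments with "id" and "segment": [start_sec, end_sec]) and that clip names follow the "<video_id>_<segment_id>.mp4" pattern seen in youcook_filtered_v3.json; the paths are placeholders.

import json
import subprocess
from pathlib import Path

RAW_DIR = Path("youcook2/raw_videos")           # downloaded raw videos (placeholder path)
CLIP_DIR = Path("youcook2/video_data_clipped")  # output directory (placeholder path)
CLIP_DIR.mkdir(parents=True, exist_ok=True)

with open("youcookii_annotations_trainval.json") as f:
    database = json.load(f)["database"]

for video_id, meta in database.items():
    # The download script may nest videos in subdirectories, so search recursively.
    matches = list(RAW_DIR.rglob(f"{video_id}.*"))
    if not matches:
        continue
    src = matches[0]
    for ann in meta["annotations"]:
        start, end = ann["segment"]  # segment boundaries in seconds
        dst = CLIP_DIR / f"{video_id}_{ann['id']}.mp4"
        # -ss before -i seeks the input; -t sets the clip duration.
        # Re-encoding keeps the cut exact; "-c copy" would be faster but
        # snaps to keyframes.
        subprocess.run(
            ["ffmpeg", "-y", "-ss", str(start), "-i", str(src),
             "-t", str(end - start), str(dst)],
            check=True,
        )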

lucasjinreal commented 1 month ago

Does VILA randomly sample frames from each video and send them to the ViT?

Or does it use all 631 frames directly during training?

XueFuzhao commented 1 month ago

Hi, we uniformly sample 8 frames from each video clip.
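
In code, uniform sampling here likely amounts to picking 8 evenly spaced frame indices across the clip; the bin-center spacing below is an assumption about the exact spacing, not VILA's implementation.

def uniform_frame_indices(frame_count: int, num_frames: int = 8) -> list[int]:
    # Take the center of each of num_frames equal-length bins over [0, frame_count).
    step = frame_count / num_frames
    return [int((i + 0.5) * step) for i in range(num_frames)]

# For the 631-frame clip above:
print(uniform_frame_indices(631))  # [39, 118, 197, 276, 354, 433, 512, 591]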

lucasjinreal commented 1 month ago

@XueFuzhao Is that uniform sampling of 8 frames out of the 631 in the example above? And how are the multiple images fed into s2-siglip? Thanks for the clarification.