Vchitect / Latte

Latte: Latent Diffusion Transformer for Video Generation.
Apache License 2.0

FVD on UCF-101 #97

Closed yizenghan closed 1 month ago

yizenghan commented 1 month ago

Hi, I'm still confused about how to evaluate FVD on UCF-101. I have a folder generated by sample/sample_ddp.py: test/0000.mp4~0015.mp4. My UCF-101 dataset is the folder downloaded from the official website, and each of its subfolders contains multiple .avi files. Training works fine with this structure.

Now if I run tools/eval_metrics.sh, filling in the paths as follows:

```bash
--real_data_path path/to/UCF-101   # subfolder/video_gxx_cxx.avi
--fake_data_path path/to/test      # generated_video.mp4
```

It fails with an error:

```
assert curr_obj_depth == root_path_depth + 1, f"Video directories should be inside the root dir. {o} is not."
AssertionError: Video directories should be inside the root dir. UCF-101/ApplyEyeMakeup/v_ApplyEyeMakeup_g01_c01.avi is not.
```

It seems that the data should not be organized like this.

maxin-cn commented 1 month ago

To measure FVD, we need to put all the videos from the real data into a single folder and convert them into video frames (the same goes for the fake data).
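Since UCF-101 ships as class subfolders, the real videos first need to be flattened into one directory before frame extraction. A minimal sketch of that step (`flatten_videos` is a hypothetical helper, not part of the repo; it assumes UCF-101 filenames are globally unique across classes, which they are):

```python
import shutil
from pathlib import Path

def flatten_videos(src_root: str, dst_dir: str, ext: str = ".avi") -> int:
    """Copy every video found under src_root's class subfolders into one flat dst_dir.

    Returns the number of videos copied.
    """
    dst = Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    count = 0
    for video in sorted(Path(src_root).rglob(f"*{ext}")):
        # UCF-101 filenames (e.g. v_YoYo_g25_c05.avi) are unique, so no collisions.
        shutil.copy2(video, dst / video.name)
        count += 1
    return count
```

After flattening, the conversion script can be pointed at the single flat folder.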

yizenghan commented 1 month ago

Thanks for replying. I'm wondering if there is a conversion code.

maxin-cn commented 1 month ago

Please try this (https://github.com/Vchitect/Latte/blob/main/tools/convert_videos_to_frames.py).

yizenghan commented 1 month ago

Thanks, I have figured out how to test FVD. However, I find that the provided pre-trained UCF-101 model yields an FVD of ~202, while the result reported in the paper is ~333. Is this normal? If so, does it suggest that 2048 samples are insufficient for a stable FVD result?

Note: I sampled 2048 videos as suggested in the paper. My seed was randomly set to 2330.

maxin-cn commented 1 month ago

I think it is normal. The model I released is probably not the exact model tested in the paper.

yizenghan commented 1 month ago

Thanks, your kind and timely response is appreciated a lot.

yizenghan commented 1 month ago

Hi, I have another small question on frame conversion and FVD testing. The paper says FVD is tested on 2,048 video clips, each comprising 16 frames. Before testing FVD, I converted the dataset videos to frames with the following command:

```bash
python tools/convert_videos_to_frames.py \
  -s path/to/UCF-101 \
  -t path/to/UCF-101-frames \
  --target_size 256
```

After this, the UCF-101-frames folder contains various sub-folders. Each sub-folder corresponds to an .avi file in the original video path, e.g. v_YoYo_g25_c05. I think this part is correct. But I find that different sub-folders contain different numbers of .jpg frame files (usually more than 16). For example, UCF-101-frames/v_YoYo_g25_c05 contains 196 .jpg files.

Thanks for your time!

maxin-cn commented 1 month ago

Do you mean why the number of frames in each video is different? This should be a characteristic of the UCF-101 dataset itself.
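For what it's worth, the `fvd2048_16f` metric evaluates 16-frame clips, so variable-length frame folders are expected: the evaluator only needs to draw a 16-frame window from each video. A rough sketch of such sampling (`sample_clip` is a hypothetical illustration; the actual sampling logic lives in the repo's metric code and may differ):

```python
import random

def sample_clip(frame_paths, clip_len=16, rng=None):
    """Pick one random contiguous window of clip_len frames from a video's frame list."""
    rng = rng or random
    if len(frame_paths) < clip_len:
        raise ValueError(f"need at least {clip_len} frames, got {len(frame_paths)}")
    # There are len - clip_len + 1 valid start positions for a contiguous window.
    start = rng.randrange(len(frame_paths) - clip_len + 1)
    return frame_paths[start:start + clip_len]
```

So a 196-frame folder like v_YoYo_g25_c05 simply offers many possible 16-frame windows.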

yizenghan commented 1 month ago

Yes, so this operation is correct. Then how can we use 16 frames from each video to test FVD? After conversion, my testing command is:

```bash
python tools/calc_metrics_for_dataset.py \
  --real_data_path path/to/UCF-101-frames \
  --fake_data_path generated_videos-frames \
  --mirror 1 --gpus 1 --resolution 256 \
  --metrics fvd2048_16f \
  --gpus 8 \
  --verbose True --use_cache 0 &
```

Am I right?

maxin-cn commented 1 month ago

I calculated FVD using one GPU. The following is an example:

```bash
python tools/calc_metrics_for_dataset.py \
  --real_data_path /path/to/real_data/images \
  --fake_data_path /path/to/fake_data/images \
  --mirror 1 --gpus 1 --resolution 256 \
  --metrics fvd2048_16f \
  --verbose 0 --use_cache 0
```

yizenghan commented 1 month ago

Thanks. Is this generated image folder the same one you used for train_with_img?

If yes, I found the total sample number is much larger than using videos only (from 13,320 videos to 2,502,480 samples). Also, I had to add `frame_data_path: "/datasets/UCF-101-frames/"` to the config YAML to avoid an error. Did I do anything wrong?

maxin-cn commented 1 month ago

There seems to be nothing wrong with this. Specific data structures can be found here.

yizenghan commented 1 month ago

Thanks a lot! So you did use the frames from the whole converted dataset to train the model. I'm trying this but ran into trouble during dataset building (it gets stuck after printing "Dataset contains 2,502,480 videos (/datasets/UCF-101/)"). That's why I suspected my image folder was wrong. Now it seems right, and I'll look into the system side. Thanks for your help.

maxin-cn commented 1 month ago

You can first debug the dataset by running `python datasets/ucf101_image_datasets.py`.
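Before debugging the dataset class itself, it can also help to sanity-check the converted frame folders directly. A small sketch (`check_frame_folders` is a hypothetical helper, not from the repo) that counts total frames and flags any folder too short to yield a 16-frame clip:

```python
from pathlib import Path

def check_frame_folders(root, min_frames=16, ext=".jpg"):
    """Return (total frame count, list of (folder, count) pairs shorter than min_frames)."""
    short = []
    total = 0
    for sub in sorted(Path(root).iterdir()):
        if not sub.is_dir():
            continue  # skip stray files at the top level
        n = sum(1 for _ in sub.glob(f"*{ext}"))
        total += n
        if n < min_frames:
            short.append((sub.name, n))
    return total, short
```

If `short` is non-empty, those videos cannot supply a full 16-frame clip and may need to be re-converted or excluded.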

yizenghan commented 1 month ago

Oh that would help a lot. THANKS!