Doubiiu / DynamiCrafter

[ECCV 2024, Oral] DynamiCrafter: Animating Open-domain Images with Video Diffusion Priors
https://doubiiu.github.io/projects/DynamiCrafter/
Apache License 2.0

Question about UCF-101/MSRVTT evaluation in paper #6

Closed · wren93 closed this issue 9 months ago

wren93 commented 9 months ago

Hi,

Thank you for sharing this great work. I'm interested in how you performed the evaluation on UCF-101 and MSR-VTT in the paper. In particular, how did you select the first frame for the model to condition on when generating each video, and how did you select the real videos for computing FVD? My understanding is that we randomly select a 16-frame clip from a test video and use it as the real video; DynamiCrafter then generates a video conditioned on the first frame of that clip, and the generated video is compared against the real one. Is this the correct understanding? Thanks in advance.

Doubiiu commented 9 months ago

Hi, thanks for your interest. That is correct. It is worth noting that we use frame stride s=3 to sample the 16-frame clips, and use each clip's first frame as the condition for image-to-video generation with DynamiCrafter. s=3 approximately reflects a typical motion speed (neither too fast nor too slow).
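For anyone reproducing this, the stride-3 clip sampling described above can be sketched as follows. This is a minimal illustration, not code from the DynamiCrafter repository; the function name and the uniform choice of start frame are assumptions:

```python
import random

def sample_clip_indices(num_frames, clip_len=16, stride=3, rng=random):
    """Pick frame indices for one clip of `clip_len` frames at the given stride.

    A clip spans (clip_len - 1) * stride + 1 source frames (46 for the
    16-frame, s=3 setting); the start frame is chosen uniformly among
    the valid positions. Illustrative sketch, not the paper's exact code.
    """
    span = (clip_len - 1) * stride + 1
    if num_frames < span:
        raise ValueError(f"video too short: {num_frames} < {span}")
    start = rng.randrange(num_frames - span + 1)
    return [start + i * stride for i in range(clip_len)]

# Example: a 120-frame test video.
indices = sample_clip_indices(120)
first_frame_idx = indices[0]  # this frame conditions the image-to-video model;
                              # the full 16-frame clip is the "real" video for FVD
```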

wren93 commented 9 months ago

Thanks for answering! May I also ask how many videos you generated for the evaluation on UCF-101 and MSR-VTT? Is it 10k and 2990, respectively?

Doubiiu commented 9 months ago

We use 2048 samples for both datasets, as mentioned in Section B.1 of the Appendix.
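If you want a reproducible 2048-video subset for your own runs, one simple approach is a fixed-seed random draw. The exact selection procedure used in the paper is not specified in this thread, so the helper name and seed below are assumptions:

```python
import random

def select_eval_subset(video_ids, n=2048, seed=0):
    """Deterministically pick n videos for FVD evaluation.

    Sketch only: the paper's actual subset is not published here;
    fixing the seed just makes your own comparisons repeatable.
    """
    if len(video_ids) < n:
        raise ValueError(f"need at least {n} videos, got {len(video_ids)}")
    rng = random.Random(seed)
    return sorted(rng.sample(list(video_ids), n))

# Example: 2048 of the 2990 MSR-VTT test videos.
subset = select_eval_subset(range(2990))
```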

wren93 commented 9 months ago

I see, thanks!

hiteshK03 commented 8 months ago

Hi, following up on the discussion above, could you share how you selected the 2048 samples for each dataset? When I compute FVD on the entire MSR-VTT test set of 2990 videos, I get a score of 328, which is higher than the reported value, so I'm wondering whether I'm doing something wrong here.

Thanks.