TianxingWu / FreeInit

[ECCV 2024] FreeInit: Bridging Initialization Gap in Video Diffusion Models
https://tianxingwu.github.io/pages/FreeInit/
MIT License

Evaluation Metrics #16

Open XuweiyiChen opened 5 months ago

XuweiyiChen commented 5 months ago

Hi, thanks for the interesting work. We extracted DINO's last-layer embeddings and compared similarity scores for the original AnimateDiff-generated videos, and found that the similarities are in fact very high. Is it possible to share how the visual consistency metric is calculated?

TianxingWu commented 5 months ago

@XuweiyiChen Thanks for your attention. As stated in our paper, for each video sample we extract the ViT-S/16 DINO features of each frame and compute the average cosine similarity between the normalized features of the first frame and those of all succeeding (N − 1) frames. The similarity score can be affected by the prompt and random seed (e.g., on UCF-101 prompts, the scores of original AnimateDiff samples range from 64.50 to 95.74, with an average of 85.24). For a fair comparison, we use the same random seeds for each method. Also, different AnimateDiff versions and motion modules exhibit very different temporal consistency patterns; our experiments are conducted on AnimateDiff v1 with the mm-sd-v14 motion module. If you get much higher similarity scores, please check whether the evaluated model versions are aligned with our experiments, and also test with different prompts and random seeds.
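For reference, here is a minimal sketch of the metric as described above, not the authors' released evaluation code. It assumes the official ViT-S/16 DINO checkpoint from torch hub and standard ImageNet preprocessing; the paper's actual frame extraction and preprocessing may differ.

```python
# Minimal sketch of the DINO temporal-consistency metric described above.
# Assumptions (not from the paper's code): torch hub DINO ViT-S/16 weights,
# 224x224 center-crop with ImageNet normalization, frames given as image paths.
import torch
import torch.nn.functional as F
from torchvision import transforms
from PIL import Image

# Official ViT-S/16 DINO model; its forward pass returns the [CLS] embedding.
model = torch.hub.load("facebookresearch/dino:main", "dino_vits16")
model.eval()

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def dino_temporal_consistency(frame_paths):
    """Average cosine similarity between the first frame's normalized DINO
    feature and those of all succeeding (N - 1) frames."""
    feats = []
    for path in frame_paths:
        img = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
        feat = model(img)                    # [CLS] embedding, shape (1, 384)
        feats.append(F.normalize(feat, dim=-1))
    feats = torch.cat(feats, dim=0)          # (N, 384), L2-normalized rows
    # Dot products of frame 0 against frames 1..N-1 equal cosine similarities,
    # since the features are already normalized.
    sims = feats[1:] @ feats[0]
    return sims.mean().item()
```

Multiplying by 100 would put the result on the same 0-100 scale as the numbers quoted above.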

XuweiyiChen commented 5 months ago

Thanks for your swift answer. Am I right in assuming that only the last-layer embedding is used for the similarity, and that no negative prompt is used?

XuweiyiChen commented 5 months ago

@TianxingWu Would it be possible to provide the evaluation function used to calculate the DINO similarity? Many thanks in advance.