XuweiyiChen opened this issue 10 months ago
@XuweiyiChen Thanks for your attention. As stated in our paper, for each video sample we extract the ViT-S/16 DINO features of each frame and compute the average cosine similarity between the normalized features of the first frame and all succeeding (N − 1) frames. The similarity score can be affected by the prompt and the random seed (e.g., on UCF-101 prompts, the scores of original AnimateDiff samples range from 64.50 to 95.74, with an average of 85.24). For a fair comparison, we use the same random seeds for each method. Also, different AnimateDiff versions and motion modules exhibit very different temporal consistency patterns. Our experiments are conducted on AnimateDiff v1 with the mm-sd-v14 motion module. If you get much higher similarity scores, please check whether the evaluated model versions match our experimental setup, and also test with different prompts and random seeds.
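For reference, here is a minimal sketch of how such a metric could be computed, assuming the CLS embedding returned by the official DINO ViT-S/16 forward pass is used, frames are resized to 224 with standard ImageNet normalization, and the score is reported as a percentage; these details are my assumptions, not confirmed by the authors.

```python
# Sketch of a DINO temporal-consistency metric (assumptions noted above).
import torch
import torchvision.transforms as T
from PIL import Image

# ViT-S/16 DINO backbone from the official torch.hub entry point.
model = torch.hub.load("facebookresearch/dino:main", "dino_vits16")
model.eval()

preprocess = T.Compose([
    T.Resize(224),
    T.CenterCrop(224),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def dino_temporal_consistency(frame_paths):
    """Average cosine similarity between the normalized DINO feature of the
    first frame and those of all succeeding (N - 1) frames."""
    feats = []
    for p in frame_paths:
        img = preprocess(Image.open(p).convert("RGB")).unsqueeze(0)
        feat = model(img)  # (1, 384) CLS embedding for ViT-S/16
        feats.append(torch.nn.functional.normalize(feat, dim=-1))
    feats = torch.cat(feats, dim=0)   # (N, 384)
    sims = feats[1:] @ feats[0]       # cosine similarity to the first frame
    return sims.mean().item() * 100   # scaled to a percentage (assumption)
```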
Thanks for your swift answer. Am I right to assume that only the last embedding is used for the similarity, and that no negative prompt is applied?
@TianxingWu Would it be possible to provide the evaluation function used to calculate the DINO similarity? Many thanks in advance.
Hi, thanks for the interesting work. We used DINO's last embedding and compared similarity scores on the original AnimateDiff-generated videos, and found the similarities are in fact very high. Is it possible to share how the visual-consistency metric is calculated?