evalcrafter / EvalCrafter

[CVPR 2024] EvalCrafter: Benchmarking and Evaluating Large Video Generation Models
http://evalcrafter.github.io

Motion quality is not representative #9

Closed zeinsh closed 1 month ago

zeinsh commented 5 months ago

I did an analysis of the motion score produced by your evaluation script.

https://github.com/evalcrafter/EvalCrafter/blob/master/eval_from_metrics.py#L61-L63

import numpy as np

# Learned weights and intercept from eval_from_metrics.py (scaled by 5)
motion_weights = np.array([-0.01641512, -0.01340959, -0.10517075]) * 5
motion_intercept = 0.1297562020899355 * 5

# Sub-metric values for PikaLab V1.0
action_score = 61.29 / 100
motion_ac_score = 42 / 100
flow_score = 1.14 / 100

# Weighted combination: note that all three weights are negative
motion_metrics = np.array([action_score, motion_ac_score, flow_score])
motion = np.dot(motion_weights, motion_metrics) + motion_intercept
motion *= 100
print(motion)
>>> 56.43220034596774

According to this script, a model with a zero action score, zero motion_ac score, and zero flow score would receive the best motion score, since all three weights are negative.
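Concretely, plugging all-zero metrics into the same formula yields a higher score than PikaLab V1.0:

# Degenerate model: every sub-metric is zero
zero_metrics = np.zeros(3)
motion = (np.dot(motion_weights, zero_metrics) + motion_intercept) * 100
print(motion)
>>> 64.87810104496775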

There must be something wrong in your procedure.

Yaofang-Liu commented 4 months ago

Hi zeinsh, you are right! Actually, we also noticed that this problem exists; it mainly arises from the human alignment process (one reason could be that we lack real videos for the prompts, so the human labels carry some bias). We also did some analysis in the paper, e.g., Finding #6, which can explain why lower optical-flow scores can get better results. So far, we don't have an ideal solution. As a stopgap measure, we recommend averaging the metrics related to motion quality, such as action_score and motion_ac_score.
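For instance, a minimal sketch of this stopgap using the PikaLab V1.0 numbers from above (a plain unweighted average, not part of the released script):

# Stopgap: average the motion-quality-related sub-metrics directly,
# so a model can no longer win by producing zero motion
action_score = 61.29 / 100
motion_ac_score = 42 / 100
stopgap_motion = 100 * (action_score + motion_ac_score) / 2
print(stopgap_motion)
>>> 51.645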

Please feel free to reach out if you have any further questions or suggestions. Many thanks!

zeinsh commented 4 months ago

Thank you Yaofang Liu! You did amazing work and collected so many models and scores, which helps in understanding video generation models in more depth.

I am also thinking about how to improve this set of metrics and make it more representative. I noticed (as also mentioned in the EMU Video paper) that a higher motion score (flow score) doesn't necessarily translate into interesting movement: it could be undesirable jitter, or it could reflect poor object consistency. I am thinking of augmenting this score by computing it on Canny edge maps or depth maps instead of raw frames.
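A rough sketch of that idea, assuming OpenCV's Canny detector and Farneback optical flow (the function name and thresholds here are my own guesses, not an EvalCrafter API):

import cv2
import numpy as np

def edge_flow_score(frames):
    """Mean optical-flow magnitude computed on Canny edge maps
    instead of raw frames (a hypothetical variant of the flow score)."""
    gray = cv2.cvtColor(frames[0], cv2.COLOR_BGR2GRAY)
    prev_edges = cv2.Canny(gray, 100, 200)
    magnitudes = []
    for frame in frames[1:]:
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        edges = cv2.Canny(gray, 100, 200)
        # Dense flow between consecutive edge maps
        flow = cv2.calcOpticalFlowFarneback(
            prev_edges, edges, None, 0.5, 3, 15, 3, 5, 1.2, 0)
        magnitudes.append(np.linalg.norm(flow, axis=2).mean())
        prev_edges = edges
    return float(np.mean(magnitudes))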

Also, in my opinion, the motion AC score is not a motion quality score; it is more related to text-video alignment.

Yaofang-Liu commented 4 months ago

Basically, we think the assessment of motion quality can be broadly categorized into two main aspects: the stability of the motion or video, and the accuracy with which the motion mimics real-world physics (do you agree? :-). A single metric may not sufficiently capture these nuances.

Your suggestion to enhance the motion scoring methodology is intriguing. It is important to note that the motion score, in its current form, is a neutral metric and does not directly indicate the quality of the motion. However, it becomes more meaningful when combined with prior knowledge or human feedback (like our released data here). For instance, motion scores vary significantly between different actions, such as running or walking. Additional metrics, such as StableVQA, which measures video stability, could be integrated and aligned with human judgment, to more accurately reflect video motion quality.
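As one hedged illustration of such an alignment step (placeholder data and an alternative fitting choice, not what EvalCrafter actually does), a non-negative least-squares fit would keep every weight >= 0, so a model could no longer improve its score by zeroing out a metric:

import numpy as np
from scipy.optimize import nnls

# Placeholder data, not real benchmark numbers: rows = models,
# columns = [action_score, motion_ac_score, flow_score, stability_score]
metrics = np.array([
    [0.61, 0.42, 0.011, 0.70],
    [0.55, 0.38, 0.020, 0.65],
    [0.48, 0.50, 0.015, 0.80],
])
human_scores = np.array([0.56, 0.50, 0.58])  # made-up human ratings

# Non-negative least squares constrains every fitted weight to be >= 0
weights, residual = nnls(metrics, human_scores)
print(weights)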

Furthermore, we recognize that the motion AC score correlates with text-video alignment. But it also indicates the capability of a video model to generate significant or minor motion, which is a crucial indicator of motion quality. Notably, models like Pika and Gen2 tend to produce videos with restrained motion, which may highlight a limitation in their motion quality.

Please let me know if you have any further questions or suggestions. Many thanks for your time and insightful ideas!

zeinsh commented 4 months ago

Thank you, Yaofang Liu, for the detailed and insightful answer!