evalcrafter / EvalCrafter

[CVPR 2024] EvalCrafter: Benchmarking and Evaluating Large Video Generation Models
http://evalcrafter.github.io

Great work for a holistic generated video metric! #1

Closed: teowu closed this issue 10 months ago

teowu commented 10 months ago

Hi Dear EvalCrafter team:

We are very happy to see that our recent work, DOVER, is included as one of the sub-category metrics for generated video evaluation.

After reading the paper, I noticed that the VQA-A and VQA-T scores (from DOVER) are pretty skewed: the former is usually near 1 in [0,1], while the latter is usually near 0. As the author of the evaluation code, I wonder whether this comes from the alignment process, in which the raw scores are normalized with statistics computed from natural videos, which differ somewhat from the statistics of generated videos.

Since quality is a rather "relative" notion, would it be better to re-calculate the mean and standard deviation from the statistics of generated videos, and use those in place of the natural-video values currently used in `t, a = (results[1] - 0.1107) / 0.07355, (results[0] + 0.08285) / 0.03774` (https://github.com/VQAssessment/DOVER/blob/master/evaluate_a_set_of_videos.py#L28)?
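The suggestion above can be sketched as follows. This is a minimal, hypothetical illustration, not code from DOVER or EvalCrafter: it assumes you have already collected the raw (pre-normalization) branch scores for a set of generated videos in a list, then z-normalizes with that set's own mean and standard deviation before applying the same sigmoid squashing DOVER uses, instead of the fixed natural-video constants.

```python
import math


def sigmoid(x: float) -> float:
    """Squash a z-score into (0, 1), as DOVER's fusion step does."""
    return 1 / (1 + math.exp(-x))


def rescale_with_set_statistics(raw_scores: list[float]) -> list[float]:
    """Z-normalize raw branch scores with the mean/std of the
    evaluated set itself (here: generated videos), rather than
    fixed constants estimated from natural videos."""
    n = len(raw_scores)
    mean = sum(raw_scores) / n
    std = (sum((s - mean) ** 2 for s in raw_scores) / n) ** 0.5
    return [sigmoid((s - mean) / std) for s in raw_scores]


# Made-up raw technical-branch outputs for five generated videos
raw_technical = [-0.02, 0.01, -0.05, 0.03, 0.00]
rescaled = rescale_with_set_statistics(raw_technical)
```

With set-relative statistics the rescaled scores spread across (0, 1) instead of clustering near one end, which addresses the skew described above; the ranking of videos is unchanged because the mapping is monotonic.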

Best,
Haoning Wu

vinthony commented 10 months ago

Cool, thanks for the information. We will refine it in a later version :)

teowu commented 10 months ago

Thank you, vinthony. That would be great.