Closed DachengLi1 closed 3 months ago
Hi @DachengLi1 ,
Thanks for your interest. We will release the automatic leaderboard later, thanks for the reminder! If I remember correctly, most of the models achieve rankings similar to the human ranking. There were a few disagreements, mostly due to a lack of physical commonsense understanding in the video-language models and a lack of more diverse, high-quality physical commonsense judgment data. I would point out that the recipe for getting the semantic adherence and physical commonsense scores from the videocon-physics model is already in the GitHub repo. The generated videos from the different models are also shared in the repo.
Personally, I would recommend performing a human evaluation of the final model to get an accurate estimate of its performance. Feel free to use the automatic evaluator for faster and cheaper model iteration. I would be curious to see the results!
Hi there! Thanks a lot for the great work! I believe the current leaderboard is based on human annotation. If I want to release a new model and use your benchmark, I will need to rely on the automatic (model-based) evaluation, so I would like to compare against other models using the automatically evaluated scores. I am curious whether you will release such a version of the leaderboard. Thanks!