Hritikbansal / videophy

Video Generation, Physical Commonsense, Semantic Adherence, VideoCon-Physics
MIT License

Results using model evaluation #1

Closed DachengLi1 closed 3 months ago

DachengLi1 commented 3 months ago

Hi there! Thanks a lot for the great work! I believe the leaderboard is currently based on human annotation. If I want to release a new model and evaluate it on your benchmark, I will need to use the automatic (model-based) evaluation, so I would like to compare against other models using automatically evaluated scores. Are you planning to release such a version of the leaderboard? Thanks!

Hritikbansal commented 3 months ago

Hi @DachengLi1 ,

Thanks for your interest. We will release the automatic leaderboard later; thanks for the reminder! If I remember correctly, most models achieve rankings similar to the human rankings. There were a few disagreements, mostly due to the video-language models' limited physical commonsense understanding and the lack of more diverse, high-quality physical commonsense judgment data. Note that the recipe for obtaining the semantic adherence and physical commonsense scores from the VideoCon-Physics model is already in the GitHub repo, and the generated videos from the different models are shared there as well.
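
For reference, here is a minimal sketch of how one might aggregate the per-video evaluator outputs into leaderboard-style numbers. It assumes the scores have been exported to a CSV with `sa` and `pc` columns in [0, 1] and uses a 0.5 pass threshold; the file layout, column names, and threshold are assumptions for illustration, not the repo's actual output format.

```python
import csv
from statistics import mean

# Hypothetical sketch: aggregate per-video VideoCon-Physics outputs into
# leaderboard-style numbers. Assumes a CSV with one row per generated video
# and columns "sa" (semantic adherence) and "pc" (physical commonsense),
# each in [0, 1]. The file name, column names, and the 0.5 threshold are
# assumptions, not the repo's actual output format.

THRESHOLD = 0.5  # assumed cut-off for counting a video as "good"

def aggregate(path: str) -> dict:
    sa_scores, pc_scores, joint = [], [], []
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            sa = float(row["sa"])
            pc = float(row["pc"])
            sa_scores.append(sa)
            pc_scores.append(pc)
            # A video counts toward the joint metric only if it clears
            # both the semantic-adherence and physical-commonsense bars.
            joint.append(int(sa >= THRESHOLD and pc >= THRESHOLD))
    return {
        "mean_sa": mean(sa_scores),
        "mean_pc": mean(pc_scores),
        "joint_rate": mean(joint),  # fraction of videos passing both checks
    }

if __name__ == "__main__":
    print(aggregate("videocon_physics_scores.csv"))
```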

Personally, I would recommend performing a human evaluation of the final model to get an accurate estimate of its performance. Feel free to use the automatic evaluator for faster and cheaper model iteration. I would be curious to see the results!