Closed DachengLi1 closed 3 months ago
Hi @DachengLi1 ,
Thanks for your interest. We will release the automatic leaderboard later, thanks for the reminder! If I remember correctly, most of the models achieve rankings similar to the human ranking. There were a few disagreements, mostly due to a lack of physical commonsense understanding in the video-language models and a lack of more diverse, high-quality physical commonsense judgment data. I would point out that the recipe for getting the semantic adherence and physical commonsense scores from the videocon-physics model is already in the GitHub repo. The generated videos from the different models are also shared in the repo.
Personally, I would recommend performing a human evaluation of the final model to get an accurate estimate of its performance. Feel free to use the automatic evaluator for faster and cheaper model iteration. I would be curious to see the results!
Hi there! Thanks a lot for the great work! I believe the current leaderboard is based on human annotation. If I want to release a new model and use your benchmark, I will need to rely on the automatic (model-based) evaluation, so I would like to compare against other models using the automatically evaluated scores. I am curious whether you will release such a version of the leaderboard. Thanks!