Closed HuanzhiMao closed 1 week ago
![changes_heatmap](https://github.com/user-attachments/assets/ac3f1994-b6c0-4e4c-ae09-823a667562fc)
![multi_turn_acc_table_heatmap](https://github.com/user-attachments/assets/8b644f07-5c4e-4b14-9320-ee26ea507414)
DIFF of the 11_9 and 10_21 versions.
Need to double-check the models whose scores changed a lot.
TODO @CharlieJCJ: After #761 result generation, publish another DIFF graph.
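For the double-check above, a minimal sketch of how the per-model change between two leaderboard versions could be computed and the largest movers flagged. This is not the script used to produce the heatmaps; the function names, the 5-point threshold, and the model names and scores are all hypothetical and only illustrate the idea.

```python
# Hypothetical score tables: model name -> overall accuracy (%), one per version.
# Model names and numbers are made up for illustration only.

def score_diff(old, new):
    """Return {model: new - old} for models present in both versions."""
    return {m: round(new[m] - old[m], 2) for m in old.keys() & new.keys()}

def flag_large_changes(diff, threshold=5.0):
    """Return (model, delta) pairs with |delta| >= threshold, largest change first."""
    flagged = [(m, d) for m, d in diff.items() if abs(d) >= threshold]
    return sorted(flagged, key=lambda item: abs(item[1]), reverse=True)

scores_10_21 = {"model-a": 50.0, "model-b": 42.5, "model-c": 61.0}
scores_11_9 = {"model-a": 58.5, "model-b": 42.0, "model-c": 54.0, "model-d": 30.0}

diff = score_diff(scores_10_21, scores_11_9)
for model, delta in flag_large_changes(diff):
    print(f"{model}: {delta:+.2f}")
```

Models that appear in only one version (here `model-d`) are skipped, since they have no baseline to compare against.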
Updated heatmaps after human review and the addition of #760 and #761:
![changes_heatmap](https://github.com/user-attachments/assets/636f019b-2955-4e5b-936f-27a35d945fcc)
![multi_turn_acc_table_heatmap](https://github.com/user-attachments/assets/c2e4a391-727f-47e7-a8db-228f1b766d74)
cc @HuanzhiMao @Fanjia-Yan @ShishirPatil
Also including the non-live and live statistics here for more visibility into how the Gemini models' scores change due to #760 and #764:
![non-live_ast_acc_table_heatmap](https://github.com/user-attachments/assets/e2c9214b-72ee-45d1-9136-fc6c08f46663)
![non-live_exec_acc_table_heatmap](https://github.com/user-attachments/assets/ef6660b0-4913-4745-997a-bfffc95b145f)
![live_acc_table_heatmap](https://github.com/user-attachments/assets/7bd4f645-9271-411e-9bc1-6e4641bdcb6a)
And @HuanzhiMao, can you update the date for the PR, since more recent PRs are now included?
cc @HuanzhiMao @Fanjia-Yan @ShishirPatil
This PR updates the leaderboard to reflect the score changes due to the following merged PRs:
- #719
- #722
- #723
- #728
- #732
- #725
- #712
- #733
- #720
- #760
- #761
- #767