ShishirPatil / gorilla

Gorilla: Training and Evaluating LLMs for Function Calls (Tool Calls)
https://gorilla.cs.berkeley.edu/
Apache License 2.0

[BFCL] Leaderboard Update, 11/17/2024 #748

Closed HuanzhiMao closed 1 week ago

HuanzhiMao commented 2 weeks ago

This PR updates the leaderboard to reflect the score changes from the following merged PRs:

  1. #719
  2. #722
  3. #723
  4. #728
  5. #732
  6. #725
  7. #712
  8. #733
  9. #720
  10. #760
  11. #761
  12. #767

CharlieJCJ commented 1 week ago

![changes_heatmap](https://github.com/user-attachments/assets/ac3f1994-b6c0-4e4c-ae09-823a667562fc)
![multi_turn_acc_table_heatmap](https://github.com/user-attachments/assets/8b644f07-5c4e-4b14-9320-ee26ea507414)

Diff between the 11_9 and 10_21 versions.

We need to double-check the models whose scores changed a lot.
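For reference, a minimal sketch of how such a diff could be computed and large changes flagged for review. This is not the script used to produce the heatmaps above; the function names, the 5-point threshold, and the placeholder model names and scores are all assumptions for illustration.

```python
from typing import Dict, List, Tuple

# model name -> category -> accuracy in percent
Scores = Dict[str, Dict[str, float]]


def score_deltas(old: Scores, new: Scores) -> Scores:
    """Per-model, per-category change (new - old) for entries present in both versions."""
    deltas: Scores = {}
    for model in sorted(old.keys() & new.keys()):
        shared = old[model].keys() & new[model].keys()
        deltas[model] = {
            cat: round(new[model][cat] - old[model][cat], 2) for cat in sorted(shared)
        }
    return deltas


def flag_large_changes(deltas: Scores, threshold: float = 5.0) -> List[Tuple[str, str, float]]:
    """Return (model, category, delta) where |delta| >= threshold, largest change first."""
    flagged = [
        (model, cat, delta)
        for model, cats in deltas.items()
        for cat, delta in cats.items()
        if abs(delta) >= threshold
    ]
    return sorted(flagged, key=lambda item: abs(item[2]), reverse=True)


# Hypothetical placeholder data, NOT real leaderboard numbers.
old_scores = {
    "model_a": {"multi_turn": 20.0, "live": 60.0},
    "model_b": {"multi_turn": 15.0, "live": 55.0},
}
new_scores = {
    "model_a": {"multi_turn": 28.5, "live": 61.0},
    "model_b": {"multi_turn": 14.0, "live": 55.5},
}

deltas = score_deltas(old_scores, new_scores)
flagged = flag_large_changes(deltas, threshold=5.0)
for model, category, delta in flagged:
    print(f"{model:10s} {category:12s} {delta:+.2f}")
```

Only models and categories present in both versions are compared, so newly added models do not show up as spurious changes.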

CharlieJCJ commented 1 week ago

TODO @CharlieJCJ: After #761 result generation, publish another DIFF graph.

CharlieJCJ commented 1 week ago

Updated heatmaps after human review and the addition of #760 and #761:

![changes_heatmap](https://github.com/user-attachments/assets/636f019b-2955-4e5b-936f-27a35d945fcc)
![multi_turn_acc_table_heatmap](https://github.com/user-attachments/assets/c2e4a391-727f-47e7-a8db-228f1b766d74)

cc @HuanzhiMao @Fanjia-Yan @ShishirPatil

CharlieJCJ commented 1 week ago

Also including the non-live and live statistics here for more visibility into how the Gemini models' scores changed due to #760 and #764:

![non-live_ast_acc_table_heatmap](https://github.com/user-attachments/assets/e2c9214b-72ee-45d1-9136-fc6c08f46663)
![non-live_exec_acc_table_heatmap](https://github.com/user-attachments/assets/ef6660b0-4913-4745-997a-bfffc95b145f)
![live_acc_table_heatmap](https://github.com/user-attachments/assets/7bd4f645-9271-411e-9bc1-6e4641bdcb6a)

CharlieJCJ commented 1 week ago

Also, @HuanzhiMao, can you update the date for the PR, since more recent PRs are now included?

CharlieJCJ commented 1 week ago

cc @HuanzhiMao @Fanjia-Yan @ShishirPatil

[Attached heatmaps: changes_heatmap, multi_turn_acc_table_heatmap, non-live_ast_acc_table_heatmap, non-live_ast_acc_table_heatmap, non-live_exec_acc_table_heatmap]