MeetKai / functionary

Chat language model that can use tools and interpret the results
MIT License
1.37k stars 107 forks

Suggestion: benchmark on toolsandbox #244

Closed: themrzmaster closed this issue 1 month ago

themrzmaster commented 1 month ago

This looks like a nice benchmark by Apple: https://arxiv.org/pdf/2408.04682. It shows a big gap between open-source and proprietary models. Based on my usage experience, I expect functionary to perform well.

jeffrey-fong commented 1 month ago

This looks like a really good evaluation benchmark set! We will look into this once they release the GitHub repository. Thank you for suggesting this!

themrzmaster commented 1 month ago

Repo is on https://github.com/apple/ToolSandbox

jeffrey-fong commented 1 month ago

@themrzmaster our team is currently working on it. We hope to share some good news soon! Thank you for letting us know.

jeffrey-fong commented 1 month ago

Our preliminary results indicate that our models score very similarly to some of the proprietary models! 🎉

We will update the README with the scores once we confirm everything is correct.

jeffreymeetkai commented 1 month ago

The results are released 🚀