themrzmaster closed this 1 month ago
This looks like a really good evaluation benchmark set! We will look into this once they release the GitHub repository. Thank you for suggesting this!
Repo is on https://github.com/apple/ToolSandbox
@themrzmaster our team is currently working on it. We hope to share some good news soon! Thank you for letting us know.
Our preliminary results indicate that our models score very close to some of the proprietary models! 🎉
We will update the README with the scores once we confirm everything is correct.
The results are released 🚀
This looks like a nice benchmark by Apple: https://arxiv.org/pdf/2408.04682. It shows a big gap between open-source and proprietary models. Based on my experience using it, I expect functionary to perform well.