THUNLP-MT / StableToolBench

A new tool learning benchmark aiming at well-balanced stability and reality, based on ToolBench.
https://zhichengg.github.io/stb.github.io/
Apache License 2.0

Could you release the reproduction data for your result #5

Open p1nksnow opened 3 months ago

p1nksnow commented 3 months ago

I'm testing the pass rate evaluation. Could you provide the reproduction data, as ToolBench does? Thanks for your reply.

zhichengg commented 2 months ago

Hi! Thank you for your interest in our work.

We are planning to publish our model inference results soon. However, OpenAI updated their gpt-4-turbo models this month, and with the new model as the evaluator, the measured performance drops systematically. We used gpt-4-turbo-preview in our experiments, but the behaviour of that model has also changed considerably. We will soon update the reported model performance using gpt-4-turbo-2024-04-09 as the evaluator. We are also training our own evaluator, based on an open-source model, to replace these closed-source models.