THUNLP-MT / StableToolBench

A new tool-learning benchmark that balances stability and realism, built on ToolBench.
https://zhichengg.github.io/stb.github.io/

Could you release the reproduction data for your result #5

Open · p1nksnow opened this issue 7 months ago

p1nksnow commented 7 months ago

I'm testing the pass-rate evaluation. Could you provide reproduction data, as ToolBench does? Thanks for your reply.

zhichengg commented 6 months ago

Hi! Thank you for your interest in our work.

We plan to publish our model inference results soon. However, OpenAI updated their gpt-4-turbo models this month, and with the new model as the evaluator, performance drops systematically. We used gpt-4-turbo-preview in our experiments, but that model's behaviour has also changed considerably. We will update the reported performance with gpt-4-turbo-2024-04-09 soon. We are also training our own evaluator on an open-source model to replace these closed-source models.
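For anyone reproducing results in the meantime, pinning the evaluator to a dated snapshot avoids the drift described above. Below is a minimal sketch using the OpenAI Python client; the judging prompt and parsing are placeholders, not StableToolBench's actual evaluator prompt, which lives in the repository's evaluation code:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def judge_pass(task: str, answer: str) -> bool:
    """Ask a pinned evaluator model whether an answer solves a task.

    The system prompt here is a placeholder for illustration only.
    """
    response = client.chat.completions.create(
        # Pin a dated snapshot rather than the floating "gpt-4-turbo" alias,
        # so evaluator behaviour does not shift when OpenAI updates the alias.
        model="gpt-4-turbo-2024-04-09",
        temperature=0,  # reduce run-to-run variance in judgments
        messages=[
            {
                "role": "system",
                "content": "You judge whether an answer solves a task. Reply 'pass' or 'fail'.",
            },
            {"role": "user", "content": f"Task: {task}\nAnswer: {answer}"},
        ],
    )
    return response.choices[0].message.content.strip().lower().startswith("pass")
```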

importpandas commented 3 months ago

Hi, thanks for your great work on StableToolBench. Is there any update on the release plan for the model inference results?

I'm building on StableToolBench to construct a benchmark with other evaluation metrics, but rerunning all the model inference is expensive. Even though the evaluation setup may change, would it be possible to release the inference results first? The inference results should remain consistent throughout the evaluation process.
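That observation suggests a simple workflow: persist each model's raw trajectories once, then recompute any metric over the cached files without rerunning inference. A hypothetical sketch follows; the directory layout, field names, and the `judge` callback are assumptions for illustration, not the repository's actual format:

```python
import json
from pathlib import Path
from typing import Callable

# Hypothetical layout: one JSON file of raw model output per query.
CACHE_DIR = Path("inference_results")


def save_result(query_id: str, trajectory: dict) -> None:
    """Write a model's raw inference output once; evaluation never mutates it."""
    CACHE_DIR.mkdir(parents=True, exist_ok=True)
    (CACHE_DIR / f"{query_id}.json").write_text(json.dumps(trajectory, indent=2))


def load_results() -> dict[str, dict]:
    """Load all cached trajectories so new metrics can be computed offline."""
    return {p.stem: json.loads(p.read_text()) for p in CACHE_DIR.glob("*.json")}


def pass_rate(judge: Callable[[dict], bool]) -> float:
    """Recompute a pass rate under a new judge, reusing the fixed trajectories."""
    results = load_results()
    if not results:
        return 0.0
    return sum(judge(r) for r in results.values()) / len(results)
```

With this split, swapping the evaluator (or adding entirely new metrics) only touches the `judge` function, while the expensive inference step runs once.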