Taeyoung-Jang opened 7 months ago
Hi, thank you for your interest in this work.
We are currently experiencing two issues that may cause the reproducibility problem: OpenAI updated the `gpt-4-turbo` models this month, and with the new model as the evaluator the performance will systematically drop. We used `gpt-4-turbo-preview` in our experiments, but the behaviour of that model has also changed a lot. We will soon update the model performance with `gpt-4-turbo-2024-04-09` and publish our model inference. We are also training our own evaluator model from an open-source model to replace these closed-source models.
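In the meantime, pinning the evaluator to a dated snapshot rather than the floating `gpt-4-turbo` alias should at least keep results comparable across your own runs. A minimal sketch with the OpenAI Python client (the `judge` helper below is illustrative, not an actual function in this repo):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Pin a dated snapshot instead of the floating "gpt-4-turbo" alias,
# so the judge does not silently change when OpenAI updates the alias.
EVALUATOR_MODEL = "gpt-4-turbo-2024-04-09"

def judge(prompt: str) -> str:
    """Send one evaluation prompt to the pinned evaluator model."""
    response = client.chat.completions.create(
        model=EVALUATOR_MODEL,
        temperature=0,  # keep the judge as deterministic as possible
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```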
Thank you for your great work!
I just ran the evaluation pipeline and checked the pass rates for `toolllama v2`, `gpt-3.5-turbo`, and `gpt-4-turbo`. However, all the pass rates are significantly lower than the scores presented in the experiments. I have confirmed that `gpt-4-turbo` is being used both on the server and during the evaluation process. Are there any considerations that should be taken into account during the inference process to obtain the reported results? I am also curious whether any specific hyperparameters are needed to achieve results similar to those in the experiments. (I would expect an error margin of up to about 5% when reproducing the experiment.)