THUNLP-MT / StableToolBench

A new tool learning benchmark balancing stability and realism, based on ToolBench.
https://zhichengg.github.io/stb.github.io/

Reproduce experimental results. #7

Open · Taeyoung-Jang opened this issue 2 months ago

Taeyoung-Jang commented 2 months ago

Thank you for your great work!

I just ran the evaluation pipeline and checked the pass rates for ToolLLaMA v2, GPT-3.5-Turbo, and GPT-4-Turbo, but all of the pass rates are significantly lower than the scores reported in the paper.

I have confirmed that GPT-4-Turbo is used both on the API simulation server and during the evaluation process. Are there any settings that need to be taken into account during inference to reproduce the reported results?
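For anyone checking the same thing, here is a minimal sanity-check sketch; the config path and field name below are assumptions for illustration and may not match the repository's actual layout:

```python
import yaml  # pip install pyyaml

# Hypothetical location of the API simulation server's config;
# adjust the path and key to match your local setup.
with open("server/config.yml") as f:
    server_cfg = yaml.safe_load(f)

# The model the server uses to simulate API responses.
print("server simulation model:", server_cfg.get("model"))
```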

Are there specific hyperparameters needed to achieve results similar to those in the paper? (I would expect an error margin of up to roughly 5% when reproducing the experiments.)
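One common source of spread is the judge's decoding settings. Below is a minimal sketch, assuming the evaluator calls the judge through the OpenAI Python client (v1); the `judge` helper and its prompt handling are illustrative, not the repository's actual code:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def judge(prompt: str) -> str:
    """Illustrative judge call: pin the exact model, use greedy
    decoding, and fix the seed to keep repeated runs close."""
    response = client.chat.completions.create(
        model="gpt-4-turbo",  # pin the judge model explicitly
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # greedy decoding
        seed=42,        # best-effort determinism
    )
    # response.model confirms which model actually served the request.
    print("served by:", response.model)
    return response.choices[0].message.content
```

Even with greedy decoding and a fixed seed, the API is only best-effort deterministic, so some run-to-run variance in pass rates is expected.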

zhichengg commented 2 months ago

Hi, thank you for your interest in this work.

We are currently experiencing two issues that may be causing the reproducibility problem: