Taeyoung-Jang opened 7 months ago
Hi, thank you for your interest in this work.
We are currently experiencing two issues that may cause the reproducibility problem: OpenAI updated the `gpt-4-turbo` models this month, and with the new model as the evaluator the performance will systematically drop. We used `gpt-4-turbo-preview` in our experiments, but the behaviour of that model has also changed a lot. We will soon update the model performance with `gpt-4-turbo-2024-04-09` and publish our model inference. We are also training our own evaluator model from an open-source model to replace these closed-source models.
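In the meantime, pinning the evaluator to a dated snapshot rather than the floating `gpt-4-turbo` alias should at least keep results comparable across your own runs. A minimal sketch with the OpenAI Python client (the `judge` helper below is illustrative, not an actual function in this repo):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Pin a dated snapshot instead of the floating "gpt-4-turbo" alias,
# so the judge does not silently change when OpenAI updates the alias.
EVALUATOR_MODEL = "gpt-4-turbo-2024-04-09"

def judge(prompt: str) -> str:
    """Send one evaluation prompt to the pinned evaluator model."""
    response = client.chat.completions.create(
        model=EVALUATOR_MODEL,
        temperature=0,  # keep the judge as deterministic as possible
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```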
Thank you for your great work!
I just ran the evaluation pipeline and checked the pass rates for `toolllama v2`, `gpt-3.5-turbo`, and `gpt-4-turbo`. However, all the pass rates are significantly lower than the scores presented in the experiments. I have confirmed that `gpt-4-turbo` is being used both on the server and during the evaluation process. Are there any considerations that should be taken into account during the inference process to obtain the reported results? I am also curious whether any specific hyperparameters are needed to achieve results similar to those in the experiments. (I would expect an error margin of up to about 5% when reproducing the experiment.)