Open p1nksnow opened 3 months ago
Hi! Thank you for your interest in our work.
We are planning to publish our model inference results soon. However, OpenAI updated their gpt-4-turbo models this month, and with the new model as the evaluator, the reported performance will drop systematically. We used gpt-4-turbo-preview in our experiments, but the behaviour of that model has also changed a lot. We will soon update the reported model performance using gpt-4-turbo-2024-04-09. We are also training our own evaluator, based on an open-source model, to replace these closed-source models.
I'm testing the pass rate evaluation. Could you provide the reproduction data, as ToolBench does? Thanks for your reply.