ShishirPatil / gorilla

Gorilla: Training and Evaluating LLMs for Function Calls (Tool Calls)
https://gorilla.cs.berkeley.edu/
Apache License 2.0

[BFCL] Mismatch between local evaluation and leaderboard numbers #773

Closed: sileix closed this issue 3 days ago

sileix commented 6 days ago

Describe the issue
Local evaluation on an A100 yields lower accuracy than the numbers reported on the leaderboard.

Gorilla repo commit #: 5a42197

We followed the instructions to set up the environment and API keys, and evaluated with sglang. We have evaluated several models. The numbers we obtained locally are consistently lower than the numbers reported on the official leaderboard.
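For reference, generation and scoring were invoked roughly as follows. This is a minimal sketch: the `bfcl` subcommands and flag names are assumptions based on the BFCL README rather than a verbatim transcript of our runs.

```python
# Minimal sketch of the evaluation pipeline (subcommands/flags assumed
# from the BFCL README at this commit, not a verbatim record of our runs).
import subprocess

MODEL = "MadeAgents/Hammer2.0-1.5b"  # one of the models under test

# Generate model responses, serving the model locally via sglang.
subprocess.run(
    ["bfcl", "generate", "--model", MODEL, "--test-category", "all",
     "--backend", "sglang", "--num-gpus", "1"],
    check=True,
)

# Score the generated responses; results land in data_overall.csv.
subprocess.run(
    ["bfcl", "evaluate", "--model", MODEL, "--test-category", "all"],
    check=True,
)
```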

Some examples:

| Model | Local Evaluation Acc | Leaderboard Acc | Diff |
| --- | --- | --- | --- |
| Hammer2.0-1.5B (FC) | 49.30% | 51.59% | -2.29% |
| Qwen2.5-1.5B-Instruct (Prompt) | 46.61% | 48.82% | -2.21% |
| Qwen2-1.5B-Instruct (Prompt) | 29.31% | 32.08% | -2.77% |
| xLAM-1b-fc-r (FC) | 24.58% | 25.14% | -0.56% |

Is this expected? What could be the potential reason for the difference? Thanks in advance.

HuanzhiMao commented 6 days ago

Hi @sileix, could you provide the detailed score breakdown for your local evaluation results (e.g., the data_overall.csv)?

sileix commented 4 days ago

Thanks for the quick response. Please see below:

```csv
Rank,Overall Acc,Model,Model Link,Cost ($ Per 1k Function Calls),Latency Mean (s),Latency Standard Deviation (s),Latency 95th Percentile (s),Non-Live AST Acc,Non-Live Simple AST,Non-Live Multiple AST,Non-Live Parallel AST,Non-Live Parallel Multiple AST,Non-Live Exec Acc,Non-Live Simple Exec,Non-Live Multiple Exec,Non-Live Parallel Exec,Non-Live Parallel Multiple Exec,Live Acc,Live Simple AST,Live Multiple AST,Live Parallel AST,Live Parallel Multiple AST,Multi Turn Acc,Multi Turn Base,Multi Turn Miss Func,Multi Turn Miss Param,Multi Turn Long Context,Multi Turn Composite,Relevance Detection,Irrelevance Detection,Organization,License
1,49.30%,Hammer2.0-1.5b (FC),https://huggingface.co/MadeAgents/Hammer2.0-1.5b,N/A,N/A,N/A,N/A,84.29%,75.17%,92.00%,88.00%,82.00%,87.07%,93.29%,92.00%,88.00%,75.00%,62.86%,69.38%,68.08%,56.25%,70.83%,1.38%,3.00%,0.50%,1.00%,1.00%,N/A,92.68%,60.38%,MadeAgents,cc-by-nc-4.0
4,46.61%,Qwen2.5-1.5B-Instruct (Prompt),https://huggingface.co/Qwen/Qwen2.5-1.5B-Instruct,N/A,N/A,N/A,N/A,75.81%,72.25%,85.50%,75.50%,70.00%,83.29%,75.14%,92.00%,86.00%,80.00%,60.06%,66.28%,59.31%,50.00%,50.00%,1.88%,2.50%,2.00%,1.50%,1.50%,N/A,73.17%,61.78%,Qwen,apache-2.0
5,29.31%,Qwen2-1.5B-Instruct (Prompt),https://huggingface.co/Qwen/Qwen2-1.5B-Instruct,N/A,N/A,N/A,N/A,53.67%,50.67%,77.50%,45.50%,41.00%,55.66%,45.64%,76.00%,56.00%,45.00%,37.85%,44.96%,36.26%,18.75%,25.00%,0.25%,0.50%,0.00%,0.50%,0.00%,N/A,78.05%,23.85%,Qwen,apache-2.0
6,24.58%,xLAM-1b-fc-r (FC),https://huggingface.co/Salesforce/xLAM-1b-fc-r,N/A,N/A,N/A,N/A,40.92%,75.17%,86.50%,1.00%,1.00%,40.55%,68.21%,88.00%,6.00%,0.00%,37.41%,61.24%,54.68%,0.00%,0.00%,0.00%,0.00%,0.00%,0.00%,0.00%,N/A,97.56%,5.02%,Salesforce,cc-by-nc-4.0
```
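In case it helps triage, here is a rough sketch of how one could line this file up against a leaderboard snapshot and see which sub-categories account for the gap. The file names and the leaderboard export are hypothetical; only the column headers and the model name come from the CSV above.

```python
# Rough sketch: compare a local data_overall.csv against a leaderboard
# snapshot per column for one model. "leaderboard_snapshot.csv" is a
# hypothetical export with the same column layout.
import pandas as pd

def pct(series: pd.Series) -> pd.Series:
    """Convert '49.30%'-style strings to floats; 'N/A' becomes NaN."""
    return pd.to_numeric(series.astype(str).str.rstrip("%"), errors="coerce")

local = pd.read_csv("data_overall.csv").set_index("Model")
board = pd.read_csv("leaderboard_snapshot.csv").set_index("Model")

model = "Hammer2.0-1.5b (FC)"
cols = ["Overall Acc", "Non-Live AST Acc", "Non-Live Exec Acc",
        "Live Acc", "Multi Turn Acc", "Irrelevance Detection"]

diff = pct(local.loc[model, cols]) - pct(board.loc[model, cols])
print(diff.sort_values())  # most negative entries show where the gap comes from
```
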
HuanzhiMao commented 4 days ago

Hi, I think the numbers are expected. Looking at the most recent leaderboard scores we have in #748, Hammer2.0-1.5B (FC) has 49.68%, while you reported 49.30%. Perhaps you were looking at an old score from before we updated the leaderboard?

sileix commented 4 days ago

@HuanzhiMao Thank you for the update. It seems the numbers on the leaderboard were updated in the last few days. After the update, all my numbers are within a 0.5% difference except for Qwen2-1.5B-Instruct (Prompt), which is still 1.73% lower. Is there any particular update that led to a global drop in the accuracy numbers on the leaderboard?

HuanzhiMao commented 4 days ago

As you can see in the PR description, that leaderboard update contains the combined effect of 12 PRs; some boost scores and some decrease them. I think #733 likely caused the biggest drop, as it introduced new evaluation metrics.

sileix commented 3 days ago

I see. Thanks again for the detailed info! @HuanzhiMao