Hi @sileix,
Could you provide the detailed score breakdown for your local evaluation result (e.g., the data_overall.csv)?
Thanks for the quick response. Please see below:
Rank,Overall Acc,Model,Model Link,Cost ($ Per 1k Function Calls),Latency Mean (s),Latency Standard Deviation (s),Latency 95th Percentile (s),Non-Live AST Acc,Non-Live Simple AST,Non-Live Multiple AST,Non-Live Parallel AST,Non-Live Parallel Multiple AST,Non-Live Exec Acc,Non-Live Simple Exec,Non-Live Multiple Exec,Non-Live Parallel Exec,Non-Live Parallel Multiple Exec,Live Acc,Live Simple AST,Live Multiple AST,Live Parallel AST,Live Parallel Multiple AST,Multi Turn Acc,Multi Turn Base,Multi Turn Miss Func,Multi Turn Miss Param,Multi Turn Long Context,Multi Turn Composite,Relevance Detection,Irrelevance Detection,Organization,License
1,49.30%,Hammer2.0-1.5b (FC),https://huggingface.co/MadeAgents/Hammer2.0-1.5b,N/A,N/A,N/A,N/A,84.29%,75.17%,92.00%,88.00%,82.00%,87.07%,93.29%,92.00%,88.00%,75.00%,62.86%,69.38%,68.08%,56.25%,70.83%,1.38%,3.00%,0.50%,1.00%,1.00%,N/A,92.68%,60.38%,MadeAgents,cc-by-nc-4.0
4,46.61%,Qwen2.5-1.5B-Instruct (Prompt),https://huggingface.co/Qwen/Qwen2.5-1.5B-Instruct,N/A,N/A,N/A,N/A,75.81%,72.25%,85.50%,75.50%,70.00%,83.29%,75.14%,92.00%,86.00%,80.00%,60.06%,66.28%,59.31%,50.00%,50.00%,1.88%,2.50%,2.00%,1.50%,1.50%,N/A,73.17%,61.78%,Qwen,apache-2.0
5,29.31%,Qwen2-1.5B-Instruct (Prompt),https://huggingface.co/Qwen/Qwen2-1.5B-Instruct,N/A,N/A,N/A,N/A,53.67%,50.67%,77.50%,45.50%,41.00%,55.66%,45.64%,76.00%,56.00%,45.00%,37.85%,44.96%,36.26%,18.75%,25.00%,0.25%,0.50%,0.00%,0.50%,0.00%,N/A,78.05%,23.85%,Qwen,apache-2.0
6,24.58%,xLAM-1b-fc-r (FC),https://huggingface.co/Salesforce/xLAM-1b-fc-r,N/A,N/A,N/A,N/A,40.92%,75.17%,86.50%,1.00%,1.00%,40.55%,68.21%,88.00%,6.00%,0.00%,37.41%,61.24%,54.68%,0.00%,0.00%,0.00%,0.00%,0.00%,0.00%,0.00%,N/A,97.56%,5.02%,Salesforce,cc-by-nc-4.0
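For convenience, the breakdown above can also be inspected programmatically. Below is a minimal sketch (pandas is assumed; the filename `data_overall.csv` and the column names are taken from the discussion and header above):

```python
# Minimal sketch: load the local score breakdown and print the headline
# accuracy columns. Assumes pandas; "data_overall.csv" is the file referenced above.
import pandas as pd

df = pd.read_csv("data_overall.csv")

# Accuracy columns are stored as strings like "49.30%"; convert them to floats.
acc_cols = ["Overall Acc", "Non-Live AST Acc", "Live Acc", "Multi Turn Acc", "Irrelevance Detection"]
for col in acc_cols:
    df[col] = df[col].str.rstrip("%").astype(float)

print(df[["Model"] + acc_cols].to_string(index=False))
```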
Hi,
I think the numbers are expected. Looking at the most recent leaderboard scores we have in #748, Hammer2.0-1.5B (FC) has 49.68% while you reported 49.30%. Maybe you are looking at an old score from before we updated the leaderboard?
@HuanzhiMao Thank you for the update. It seems the numbers on the leaderboard were updated in the last few days. After the update, all my numbers are within 0.5% of the leaderboard except for Qwen2-1.5B-Instruct (Prompt), which is still 1.73% lower. Is there any particular update that led to a global drop in accuracy numbers on the leaderboard?
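A side-by-side check like this can be scripted; here is a rough sketch (pandas is assumed, and the filenames are placeholders for the local and leaderboard copies of data_overall.csv):

```python
# Rough sketch: flag models whose local Overall Acc differs from the leaderboard
# copy by more than 0.5 percentage points. Filenames are placeholders.
import pandas as pd

def load_overall_acc(path):
    df = pd.read_csv(path)
    # "Overall Acc" is stored as a string like "49.30%"; convert to float.
    return df.set_index("Model")["Overall Acc"].str.rstrip("%").astype(float)

local = load_overall_acc("data_overall_local.csv")
board = load_overall_acc("data_overall_leaderboard.csv")

diff = (local - board).dropna()  # positive = local score higher than leaderboard
print(diff[diff.abs() > 0.5].sort_values().to_string())
```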
As you can see in the PR description, that leaderboard update contains the effect of 12 PRs; some boosted scores and some lowered them. I think #733 likely caused the biggest drop, as it introduced new evaluation metrics.
I see. Thanks again for the detailed info! @HuanzhiMao
Describe the issue
Local evaluation on an A100 yields lower accuracy than the numbers reported on the leaderboard.
ID datapoint
Gorilla repo commit #: 5a42197
We followed the instructions to set up the environment and API keys and evaluated with sglang. We have evaluated several models. The numbers we obtained locally are consistently lower than the numbers reported on the official leaderboard.
Is this expected? What could be the potential reason for the difference? Thanks in advance.