THUDM / AgentBench

A Comprehensive Benchmark to Evaluate LLMs as Agents (ICLR'24)
https://llmbench.ai
Apache License 2.0

Access to Test Sets #39

Closed by guosyjlu 11 months ago

guosyjlu commented 1 year ago

Hi, thanks for your wonderful benchmark project! I would like to know how to evaluate on the test set to obtain the leaderboard score. Is evaluation only allowed on the dev set in the current version? If so, is there any plan to give us access to test set evaluation? Thanks for any help!

zhc7 commented 1 year ago

We plan to set up a test server that accepts submissions. For now, however, we do not plan to release the full set.

olivarb commented 11 months ago

@zhc7 So currently, the script does not allow us to profile our own LLMs for comparison to the published results?

zhc7 commented 11 months ago

Thanks for your interest. Actually we've released all datasets in AgentBench v0.2. You may take a look at our updated README for more information.

1049451037 commented 11 months ago

So... how can I get a test set evaluation score without Docker? I don't have a machine that can run Docker...