guosyjlu closed this issue 11 months ago.
We plan to set up a test server that accepts submissions. For now, however, we are not planning to release the full test set.
@zhc7 So, currently, the script does not let us evaluate our own LLMs against the published results?
Thanks for your interest. We have now released all datasets in AgentBench v0.2. Please see our updated README for more information.
So, how can I get a test-set evaluation score without Docker? I don't have a machine that can run Docker.
Hi, thanks for this wonderful benchmark project! I would like to know how to evaluate on the test set to obtain the leaderboard score. Does the current version only allow evaluation on the dev set? If so, is there any plan to give us access to test-set evaluation? Thanks in advance for your help!