THUDM / AgentBench

A Comprehensive Benchmark to Evaluate LLMs as Agents (ICLR'24)
https://llmbench.ai
Apache License 2.0

How to interpret the assessment results #18

Closed — foamliu closed this issue 1 year ago

foamliu commented 1 year ago

For example, if you run the ALFWorld evaluation with Llama 2 70B, the results.json generated in the outputs directory after the evaluation looks like this:

(screenshot of results.json)

How should this result be interpreted? Is there a total of only 20 test samples? In Table 3 of the leaderboard on the homepage, GPT-4 scored 78.0 on ALFWorld. If there are only 20 samples, that score could not be produced, right?

zhc7 commented 1 year ago

You can refer to README.md. As shown in the "Dataset summary" section, you are using the dev set, while the results in the paper were measured on the test set.
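As a side note on the questioner's reasoning: with only 20 samples, a success rate can only be a multiple of 1/20 = 5%, so a score like 78.0 cannot come from that split. A minimal sketch illustrating this (the sample count of 20 is taken from the question; this is not AgentBench's actual scoring code):

```python
# With n samples, the achievable success rates are k/n for k = 0..n.
# Here n = 20, as reported for the dev-set run in the question.
n_samples = 20
possible_scores = [round(k / n_samples, 2) for k in range(n_samples + 1)]

print(0.78 in possible_scores)  # False: 78.0% is not a multiple of 5%
print(0.35 in possible_scores)  # True: 7 successes out of 20
```

This is why GPT-4's 78.0 on ALFWorld is only reachable on the larger test split used in the paper.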