For example, if you use Llama 2 70B to run the ALFWorld evaluation, the results.json generated in the outputs directory after the evaluation is as follows:
How should this result be interpreted? Are there only 20 test samples in total? In Table 3 of the Leaderboard on the homepage, GPT-4 scored 78.0 on ALFWorld. With only 20 samples, that score should not be attainable, right?
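To make the arithmetic behind this doubt concrete, here is a minimal sketch. It assumes the leaderboard number is a plain success rate (solved tasks divided by total tasks, times 100) reported exactly from a single run; the helper name `attainable` is hypothetical and not part of the benchmark code. If the score is actually an average over several runs, a rounded value, or a different metric, this check does not apply.

```python
from fractions import Fraction


def attainable(score_percent: str, n_samples: int) -> bool:
    """Return True if score_percent equals k / n_samples * 100 for some integer k.

    Assumption: the score is an exact success rate over n_samples tasks.
    """
    # Number of solved tasks implied by the score, kept exact with Fraction.
    successes = Fraction(score_percent) * n_samples / 100
    return successes.denominator == 1 and 0 <= successes <= n_samples


print(attainable("78.0", 20))  # 78% of 20 tasks would be 15.6 tasks
print(attainable("78.0", 50))  # 78% of 50 tasks would be 39 tasks
```

With 20 samples, a success rate can only move in steps of 5 points (75.0, 80.0, and so on), which is why 78.0 looks inconsistent with a 20-sample test set under this assumption.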