THUDM / AgentBench

A Comprehensive Benchmark to Evaluate LLMs as Agents (ICLR'24)
https://llmbench.ai
Apache License 2.0

[Bug/Assistance] A lot of os-std tasks are impossible #151

Open rjmoss opened 1 month ago

rjmoss commented 1 month ago

Describe the bug

A large number of the os-std tasks in 7/bootstrap.json are impossible for agents to complete, because they refer to a "given folder" that is never actually given to the agent.

For example in std-007-bootstrap-00087 the task is:

In this problem, you need to analyze the file system usage (in kilobytes) in a Linux operating system. You will be given a directory in the initialization phase. You need to find out the total number of kilobytes used by all files within that directory (including subdirectories). Your task is to calculate the total file size in kilobytes within the provided directory, ignoring the sizes of any folders. The answer should be an integer value.

At no point is the model told which directory this is: it (correctly) isn't shown the init script, so it has no way of finding out, and the task is not a reasonable one. This is a single example, but there are a very large number like it in this file. I have not gone through them all, but so far this is my list of tasks that refer to a "given folder" and then never provide it:

std-007-bootstrap-00087
std-007-bootstrap-00086
std-007-bootstrap-00084
std-007-bootstrap-00083
std-007-bootstrap-00081
std-007-bootstrap-00080
std-007-bootstrap-00078
std-007-bootstrap-00077
std-007-bootstrap-00075
std-007-bootstrap-00074
std-007-bootstrap-00073
std-007-bootstrap-00071
std-007-bootstrap-00070
std-007-bootstrap-00069
std-007-bootstrap-00068
std-007-bootstrap-00067
std-007-bootstrap-00065
std-007-bootstrap-00064
std-007-bootstrap-00062
std-007-bootstrap-00060
std-007-bootstrap-00059
std-007-bootstrap-00058
std-007-bootstrap-00056
std-007-bootstrap-00055
std-007-bootstrap-00054
std-007-bootstrap-00051
std-007-bootstrap-00050
std-007-bootstrap-00049
std-007-bootstrap-00044

This is not a definitive list. As you can see, I'm going through in (reverse) index order, and I'm sure there are more with the same problem.
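
In case it helps with triage, here is a minimal sketch of how the remaining tasks might be flagged automatically rather than by hand. It rests on assumptions I have not verified: that the file is a JSON list of objects with a `description` string field, and that a task is suspect when its description mentions a "given"/"provided" directory or folder without containing any explicit path. The two regexes and the `find_underspecified` helper are hypothetical heuristics of my own, not part of AgentBench, and the sample data is made up for illustration.

```python
import json
import re

# Assumed heuristic: phrases that promise a directory without naming one.
VAGUE_DIR = re.compile(
    r"\b(?:given|provided|specified)\s+(?:an?\s+|the\s+)?(?:directory|folder)\b",
    re.IGNORECASE,
)

# Assumed heuristic: something that looks like an explicit path,
# e.g. /var/log, ~/data, ./work. The lookbehind avoids matching "and/or".
EXPLICIT_PATH = re.compile(r"(?<![\w.])[~.]{0,2}/[\w.\-]+")


def find_underspecified(tasks):
    """Return indices of tasks that mention a 'given' directory but no path."""
    flagged = []
    for index, task in enumerate(tasks):
        # "description" is an assumed field name; adjust to the real schema.
        description = task.get("description", "")
        if VAGUE_DIR.search(description) and not EXPLICIT_PATH.search(description):
            flagged.append(index)
    return flagged


# Made-up sample entries standing in for the real file contents.
sample = json.loads("""[
  {"description": "You will be given a directory in the initialization phase. Find the total size in kilobytes of all files within the provided directory."},
  {"description": "Count the regular files in the given directory /var/log/app."},
  {"description": "How many CPUs does this machine have?"}
]""")

print(find_underspecified(sample))
```

On the real file, `sample` would be replaced by the result of `json.load` on the opened file. A check like this would only give candidates to review by hand, since a description could name the directory in some way these patterns miss.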

Question

Please could you explain to me what these tasks are supposed to be measuring? Have I misunderstood the tasks?

Thank you!