THUDM / AgentBench

A Comprehensive Benchmark to Evaluate LLMs as Agents (ICLR'24)
https://llmbench.ai
Apache License 2.0
2.15k stars 150 forks source link

Errors in dev data of OS-Interaction #34

Closed zwhe99 closed 11 months ago

zwhe99 commented 1 year ago

I don't know if this is expected behavior. The following json shows the first data from dev data with the initialization command error: [1]+ Exit 1 python3 -c 'import time\r\nwhile True:\r\nwith open(\"/root/log\", \"a+\") as f:\r\nf.write (\"succeed\")\r\ntime.sleep(1)'.

Similar errors occur in some other data in Dev data.

{
  "index": 0,
  "input": {
    "file": "data/os_interaction/data/dev.json",
    "config": {
      "description": "There's a program that writes 'succeed' into /root/log periodically. I only know the interval is 1 second, 2 seconds, or 3 seconds. Please find out the interval. Answer me only the integer number of seconds. (e.g. 1, 2, or 3)",
      "image": "local-os/default",
      "init_script": [],
      "start": [
        "bash",
        "python3 -c 'import time\nwhile True:\n\twith open(\"/root/log\", \"a+\") as f:\n\t\tf.write(\"succeed\")\n\ttime.sleep(1)' &"
      ],
      "match": {
        "answer": "1",
        "strip": true
      }
    },
    "index": 0
  },
  "output": {
    "result": true,
    "error": null,
    "file": "data/os_interaction/data/dev.json",
    "index_in_file": 0
  },
  "history": [
    {
      "role": "user",
      "content": "You are an assistant that will act like a person, I'will play the role of linux(ubuntu) operating system. Your goal is to implement the operations required by me or answer to the question proposed by me. For each of your turn, you should first think what you should do, and then take exact one of the three actions: \"bash\", \"finish\" or \"answer\". \n\n1. If you think you should execute some bash code, take bash action, and you should print like this:\n\nThink: put your thought here.\n\nAct: bash\n\n```bash\n# put your bash code here\n```\n\n2. If you think you have finished the task, take finish action, and you should print like this:\n\nThink: put your thought here.\n\nAct: finish\n\n3. If you think you have got the answer to the question, take answer action, and you should print like this:\n\nThink: put your thought here.\n\nAct: answer(Your answer to the question should be put in this pair of parentheses)\n\nIf the output is too long, I will truncate it. The truncated output is not complete. You have to deal with the truncating problem by yourself. Attention, your bash code should not contain any input operation. Once again, you should take only exact one of the three actions in each turn.\n\nNow, my problem is:\n\ntell me how many files are in the directory \"/etc\"?"
    },
    {
      "role": "agent",
      "content": "Think: To count the files in /etc, I need to print all the files in it.\n\nAct: bash\n\n```bash\nls /etc\n```"
    },
    {
      "role": "user",
      "content": "The output of the OS:\ncpi cron.hourly fuse.conf iproute2 lvm networkd-dispatcher protocols selinux tmpfiles.d [truncated because the output is too long]"
    },
    {
      "role": "agent",
      "content": "Think: The output has been truncated because it is too long, so I need to count files by script directly.\n\nAct: bash\n\n```bash\nls -1 /etc | wc -l\n```"
    },
    {
      "role": "user",
      "content": "The output of the OS:\n220"
    },
    {
      "role": "agent",
      "content": "Think: Now I get the answer, it is 220.\n\nAct: answer(220)"
    },
    {
      "role": "user",
      "content": "Now, I will start a new problem in a new OS. My problem is:\n\nThere's a program that writes 'succeed' into /root/log periodically. I only know the interval is 1 second, 2 seconds, or 3 seconds. Please find out the interval. Answer me only the integer number of seconds. (e.g. 1, 2, or 3)"
    },
    {
      "role": "agent",
      "content": "Think: To find out the interval, I need to check the modification time of /root/log repeatedly.\n\nAct: bash\n\n```bash \ninterval=1\nwhile true; do\n    mtime=$(stat -c %Y /root/log)\n    sleep $interval\n    mtime_new=$(stat -c %Y /root/log)\n    if [ $mtime_new -gt $mtime ]; then\n        echo $interval\n        break\n    fi\n    interval=$((interval+1)) \ndone\n```"
    },
    {
      "role": "user",
      "content": "The output of the OS:\n\n\r\n\u001b[?2004l\r[1]+  Exit 1                  python3 -c 'import time\r\nwhile True:\r\nwith open(\"/root/log\", \"a+\") as f:\r\nf.write(\"succeed\")\r\ntime.sleep(1)'\r\n\u001b[?2004h\u001b]0;root@e2b9b07ae039: /\u0007root@e2b9b07ae039:/# "
    },
    {
      "role": "agent",
      "content": "Think: The output shows that the interval is 1 second.\n\nAct: answer(1)"
    }
  ],
  "exception_raised": false
}
abdulfatir commented 1 year ago

I think the syntax in the json file to run multi-line python commands is incorrect. According to this SE answer, $ needs to be added in the beginning as follows.

python3 -c $'import time\nwhile True:\n\twith open(\"/root/log\", \"a+\") as f:\n\t\tf.write(\"succeed\")\n\ttime.sleep(1)' &
abdulfatir commented 1 year ago

Just noticed that this seems to be an issue for many more test cases. If this is indeed how the benchmark was conducted then there may be some discrepancies in the final results. I hope the authors respond promptly and address these concerns.

cc @Longin-Yu

zwhe99 commented 1 year ago

@abdulfatir Absolutely.

Longin-Yu commented 1 year ago

I don't know if this is expected behavior. The following json shows the first data from dev data with the initialization command error: [1]+ Exit 1 python3 -c 'import time\r\nwhile True:\r\nwith open(\"/root/log\", \"a+\") as f:\r\nf.write (\"succeed\")\r\ntime.sleep(1)'.

Similar errors occur in some other data in Dev data.

{
  "index": 0,
  "input": {
    "file": "data/os_interaction/data/dev.json",
    "config": {
      "description": "There's a program that writes 'succeed' into /root/log periodically. I only know the interval is 1 second, 2 seconds, or 3 seconds. Please find out the interval. Answer me only the integer number of seconds. (e.g. 1, 2, or 3)",
      "image": "local-os/default",
      "init_script": [],
      "start": [
        "bash",
        "python3 -c 'import time\nwhile True:\n\twith open(\"/root/log\", \"a+\") as f:\n\t\tf.write(\"succeed\")\n\ttime.sleep(1)' &"
      ],
      "match": {
        "answer": "1",
        "strip": true
      }
    },
    "index": 0
  },
  "output": {
    "result": true,
    "error": null,
    "file": "data/os_interaction/data/dev.json",
    "index_in_file": 0
  },
  "history": [
    {
      "role": "user",
      "content": "You are an assistant that will act like a person, I'will play the role of linux(ubuntu) operating system. Your goal is to implement the operations required by me or answer to the question proposed by me. For each of your turn, you should first think what you should do, and then take exact one of the three actions: \"bash\", \"finish\" or \"answer\". \n\n1. If you think you should execute some bash code, take bash action, and you should print like this:\n\nThink: put your thought here.\n\nAct: bash\n\n```bash\n# put your bash code here\n```\n\n2. If you think you have finished the task, take finish action, and you should print like this:\n\nThink: put your thought here.\n\nAct: finish\n\n3. If you think you have got the answer to the question, take answer action, and you should print like this:\n\nThink: put your thought here.\n\nAct: answer(Your answer to the question should be put in this pair of parentheses)\n\nIf the output is too long, I will truncate it. The truncated output is not complete. You have to deal with the truncating problem by yourself. Attention, your bash code should not contain any input operation. Once again, you should take only exact one of the three actions in each turn.\n\nNow, my problem is:\n\ntell me how many files are in the directory \"/etc\"?"
    },
    {
      "role": "agent",
      "content": "Think: To count the files in /etc, I need to print all the files in it.\n\nAct: bash\n\n```bash\nls /etc\n```"
    },
    {
      "role": "user",
      "content": "The output of the OS:\ncpi cron.hourly fuse.conf iproute2 lvm networkd-dispatcher protocols selinux tmpfiles.d [truncated because the output is too long]"
    },
    {
      "role": "agent",
      "content": "Think: The output has been truncated because it is too long, so I need to count files by script directly.\n\nAct: bash\n\n```bash\nls -1 /etc | wc -l\n```"
    },
    {
      "role": "user",
      "content": "The output of the OS:\n220"
    },
    {
      "role": "agent",
      "content": "Think: Now I get the answer, it is 220.\n\nAct: answer(220)"
    },
    {
      "role": "user",
      "content": "Now, I will start a new problem in a new OS. My problem is:\n\nThere's a program that writes 'succeed' into /root/log periodically. I only know the interval is 1 second, 2 seconds, or 3 seconds. Please find out the interval. Answer me only the integer number of seconds. (e.g. 1, 2, or 3)"
    },
    {
      "role": "agent",
      "content": "Think: To find out the interval, I need to check the modification time of /root/log repeatedly.\n\nAct: bash\n\n```bash \ninterval=1\nwhile true; do\n    mtime=$(stat -c %Y /root/log)\n    sleep $interval\n    mtime_new=$(stat -c %Y /root/log)\n    if [ $mtime_new -gt $mtime ]; then\n        echo $interval\n        break\n    fi\n    interval=$((interval+1)) \ndone\n```"
    },
    {
      "role": "user",
      "content": "The output of the OS:\n\n\r\n\u001b[?2004l\r[1]+  Exit 1                  python3 -c 'import time\r\nwhile True:\r\nwith open(\"/root/log\", \"a+\") as f:\r\nf.write(\"succeed\")\r\ntime.sleep(1)'\r\n\u001b[?2004h\u001b]0;root@e2b9b07ae039: /\u0007root@e2b9b07ae039:/# "
    },
    {
      "role": "agent",
      "content": "Think: The output shows that the interval is 1 second.\n\nAct: answer(1)"
    }
  ],
  "exception_raised": false
}

This may be a severe problem. Thank you for point out this, clearing stdout before interaction may be a solution. We'll fix it in our coming-soon new version.

Longin-Yu commented 1 year ago

I think the syntax in the json file to run multi-line python commands is incorrect. According to this SE answer, $ needs to be added in the beginning as follows.

python3 -c $'import time\nwhile True:\n\twith open(\"/root/log\", \"a+\") as f:\n\t\tf.write(\"succeed\")\n\ttime.sleep(1)' &

Thank you for pointing the issue out. It seems that this is a feature of shell's command line interface. If we use python and run it in an array, this error will not be encountered.

An example is shown as below.

import os, docker, docker.models.containers

client = docker.from_env()
container: docker.models.containers.Container  = client.containers.run(
    "python", detach=True, tty=True, stdin_open=True, remove=True,
)
print(container.status)

python_code = """import json
print(json.dumps({"a": 1}))
"""
result = container.exec_run(["python", "-c", python_code])
print(result.exit_code, result.output.decode("utf-8"), sep="\n")
abdulfatir commented 1 year ago

@Longin-Yu Thanks for your reply. Is there any timeline on the updated version?

abdulfatir commented 1 year ago

Looks like this is somehow related to the issue we're facing here: https://github.com/moby/moby/issues/37366

Xiao9905 commented 11 months ago

@abdulfatir Hi, thanks for your interest in AgentBench. Our v0.2 has been updated to the repo and please have a try. Feel free to reopen this issue if you need anymore help on the topic.