THUDM / AgentBench

A Comprehensive Benchmark to Evaluate LLMs as Agents (ICLR'24)
https://llmbench.ai
Apache License 2.0
2.01k stars 136 forks

[Bug/Assistance] #109

Open ibingzhaoi opened 5 months ago

ibingzhaoi commented 5 months ago

Describe the bug

Is there a problem with these images on Ubuntu? Docker starts, but every GPT4/GPT3 run fails: longinyu/agentbench-ltp, longinyu/agentbench-mind2web, longinyu/agentbench-card_game, longinyu/agentbench-alfworld

The error for cg is below: {"index": 9, "error": null, "info": null, "output": {"index": 9, "status": "task error", "result": "Traceback (most recent call last):\n File \"/root/workspace/src/server/task_worker.py\", line 108, in task_start_sample_wrapper\n result = await self.task.start_sample(index, session)\n File \"/root/workspace/src/server/tasks/card_game/task.py\", line 134, in start_sample\n await task\n File \"/root/workspace/src/server/tasks/card_game/server.py\", line 29, in start\n data = client_socket.recv(1000000).decode()\nUnicodeDecodeError: 'utf-8' codec can't decode byte 0xe4 in position 4: invalid continuation byte\n", "history": []}, "time": {"timestamp": 1706327091710, "str": "2024-01-27 03:44:51"}}

{"index": 18, "error": "START_FAILED", "info": "{\"detail\":\"Error: Worker not responding\n\"}", "output": null, "time": {"timestamp": 1706325926451, "str": "2024-01-27 03:25:26"}}

Is something misconfigured?
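For readers who hit the same UnicodeDecodeError: the traceback shows the result of `client_socket.recv(...)` being decoded as strict UTF-8, which fails as soon as the bytes are not valid UTF-8. The sketch below reproduces that failure mode and decodes lossily so the payload can be inspected. The helper name and the sample bytes are made up for illustration; what the card_game server actually received is not shown in this thread.

```python
def decode_for_diagnosis(payload: bytes) -> str:
    """Decode bytes as UTF-8; on failure, report the bad offset and decode lossily."""
    try:
        return payload.decode("utf-8")
    except UnicodeDecodeError as exc:
        # exc.start is the offset of the first byte that could not be decoded.
        bad = payload[exc.start:exc.start + 1]
        print(f"not valid UTF-8 at byte {exc.start}: {bad!r}")
        # errors="replace" substitutes U+FFFD for undecodable bytes instead of raising.
        return payload.decode("utf-8", errors="replace")


# Hypothetical payload: 0xe4 opens a multi-byte sequence, but "x" is not a continuation byte.
sample = b"abcd\xe4x"
text = decode_for_diagnosis(sample)
```

This only helps with looking at what arrived on the socket. It does not explain why the peer sent non-UTF-8 bytes.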

zhc7 commented 5 months ago

Hi, @ibingzhaoi, are you running this on a Mac? If so, the cause may be https://github.com/THUDM/AgentBench/issues/84#issuecomment-1872249318

Joe-2002 commented 4 months ago

Hello, I am running into the same problem. At the moment dbbench is the only task that runs correctly; I cannot run any of the others. The errors include "task error" for os-std and "AGENT_FAILED" for kg.

Here is the output in run.json for os:

{"index": "std-007-bootstrap-00082", "error": null, "info": null, "output": {"index": "std-007-bootstrap-00082", "status": "task error", "result": "Traceback (most recent call last):\n File \"G:\\u674e\u67ef\u8fb0\\u6c5f\u82cf\u9716\u627f\u79d1\u6280\u6709\u9650\u516c\u53f8\\u5f00\u6e90\agentbench\src\server\task_worker.py\", line 108, in task_start_sample_wrapper\n result = await self.task.start_sample(index, session)\n File \"G:\\u674e\u67ef\u8fb0\\u6c5f\u82cf\u9716\u627f\u79d1\u6280\u6709\u9650\u516c\u53f8\\u5f00\u6e90\agentbench\src\server\tasks\os_interaction\task.py\", line 362, in start_sample\n container = Container(config.image)\n File \"G:\\u674e\u67ef\u8fb0\\u6c5f\u82cf\u9716\u627f\u79d1\u6280\u6709\u9650\u516c\u53f8\\u5f00\u6e90\agentbench\src\server\tasks\os_interaction\task.py\", line 37, in __init__\n self.sock = self.client.api.exec_start(self.exec_id, socket=True)._sock\nAttributeError: 'NpipeSocket' object has no attribute '_sock'\n", "history": []}, "time": {"timestamp": 1708589987869, "str": "2024-02-22 16:19:47"}}

This is the Docker status: 2024-02-22-8605

Below is the error I get when running kg:

python -m src.start_task -a
INFO: Started server process [38924]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:5000 (Press CTRL+C to quit)
INFO: 127.0.0.1:29878 - "GET /api/list_workers HTTP/1.1" 200 OK
Traceback (most recent call last):
  File "C:\Users\78523\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Users\78523\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "G:\agentbench\src\start_task.py", line 129, in <module>
    _start_worker(key, base_port, controller_addr,
  File "G:\agentbench\src\start_task.py", line 18, in _start_worker
    subprocess.Popen(
  File "C:\Users\78523\AppData\Local\Programs\Python\Python310\lib\subprocess.py", line 971, in __init__
    self._execute_child(args, executable, preexec_fn, close_fds,
  File "C:\Users\78523\AppData\Local\Programs\Python\Python310\lib\subprocess.py", line 1456, in _execute_child
    hp, ht, pid, tid = _winapi.CreateProcess(executable, args,
FileNotFoundError: [WinError 2] The system cannot find the file specified.

@ibingzhaoi @zhc7 @cenyk1230 @Btlmd
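The kg traceback above ends in `subprocess.Popen` raising `[WinError 2]`, which Windows reports when the program to launch cannot be found. The command that `start_task.py` passes to `Popen` is not shown in this thread, so the sketch below is only a generic pre-flight check. The `launch` helper is hypothetical and not part of AgentBench.

```python
import shutil
import subprocess
import sys


def launch(command: list[str]) -> subprocess.Popen:
    """Start a child process, failing early with a readable message if the program is missing."""
    # shutil.which resolves the program through PATH (and PATHEXT on Windows).
    program = shutil.which(command[0])
    if program is None:
        raise FileNotFoundError(f"{command[0]!r} was not found on PATH")
    return subprocess.Popen([program, *command[1:]])


# Usage: the current interpreter is always resolvable, so this child exits cleanly.
exit_code = launch([sys.executable, "-c", "pass"]).wait()
```

Running the same check against the first element of the failing command would show whether the missing file is the program itself.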

zhc7 commented 4 months ago

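Hi, @Joe-2002, we can discuss this in a separate issue. The code where the error is raised is actually a system-dependent implementation, and it runs without problems on Linux. AgentFailed may be caused by unstable communication with the agent server.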
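To illustrate the system dependence: in the os traceback, the code reads `._sock` from the object returned by `exec_start(..., socket=True)`, and on that Windows machine the object is an `NpipeSocket` without that attribute. A minimal sketch of a tolerant accessor follows. `raw_socket` and both stand-in classes are hypothetical, and whether the rest of the task works with the unwrapped object on Windows is not established here.

```python
def raw_socket(stream):
    """Return stream._sock when the attribute exists, otherwise the stream itself."""
    return getattr(stream, "_sock", stream)


class WrappedStream:
    """Stand-in for an object that exposes the underlying socket as ._sock."""

    def __init__(self, sock):
        self._sock = sock


class BareStream:
    """Stand-in for an object with no ._sock attribute, like NpipeSocket in the traceback."""


inner = object()
unwrapped = raw_socket(WrappedStream(inner))  # the wrapped object
passthrough = raw_socket(BareStream())        # the stream itself, no AttributeError
```

This only avoids the AttributeError. As noted above, Linux is the platform where the task is known to run.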

qzd-1 commented 2 months ago

For the db task, AGENT_FAILED occurs often, so the over_all file is never written. How can I compute overall_cat_accuracy manually?

zhc7 commented 2 months ago

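Hi, @qzd-1, you can read the result of each successful run in runs.jsonl by hand and then compute the accuracy. "cat" stands for categorical: compute the accuracy separately for each category (SELECT, INSERT, UPDATE), then average the per-category accuracies with equal weights.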
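The procedure described above can be sketched as follows. The top-level keys `error`, `output`, and `status` match the records pasted earlier in this thread. How to read a sample's category and correctness out of a dbbench record is not shown here, so the scoring functions take plain `(category, is_correct)` pairs and that extraction is left to the reader. Treating `"task error"` as the only failed status is also an assumption.

```python
import json
from collections import defaultdict


def load_successful(path):
    """Read a JSONL file and keep records that have no error and did not end in 'task error'."""
    records = []
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            rec = json.loads(line)
            output = rec.get("output")
            if rec.get("error") is not None or output is None:
                continue  # e.g. AGENT_FAILED or START_FAILED records
            if output.get("status") == "task error":
                continue
            records.append(rec)
    return records


def category_accuracy(samples):
    """Map each category to the fraction of correct samples in it."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for category, is_correct in samples:
        totals[category] += 1
        correct[category] += int(bool(is_correct))
    return {cat: correct[cat] / totals[cat] for cat in totals}


def overall_cat_accuracy(samples):
    """Average the per-category accuracies with equal weight per category."""
    per_category = category_accuracy(samples)
    return sum(per_category.values()) / len(per_category)


# Made-up pairs for illustration, not real benchmark results.
demo = [("SELECT", True), ("SELECT", False), ("INSERT", True), ("UPDATE", False)]
score = overall_cat_accuracy(demo)
```

Because each category carries the same weight, a category with few samples affects the result as much as a large one, which is the difference from plain accuracy over all samples.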