THUDM / AgentBench

A Comprehensive Benchmark to Evaluate LLMs as Agents (ICLR'24)
https://llmbench.ai
Apache License 2.0

About the error ('{"detail":"Error: Task does not exist"}', 400, 'alfworld-std') #73

Closed XiaoShihua closed 8 months ago

XiaoShihua commented 8 months ago

Hello, I can now run the os, dbbench, and kg evaluations, but every task that relies on a Docker image fails with this error:

      File "/mnt/d/pyProject/AgentBench-main/src/assigner.py", line 425, in <module>
        Assigner(value, args.retry).start(tqdm_out=orig_stdout)
      File "/mnt/d/pyProject/AgentBench-main/src/assigner.py", line 94, in __init__
        self.task_indices[task] = self.tasks[task].get_indices()
      File "/mnt/d/pyProject/AgentBench-main/src/client/task.py", line 32, in get_indices
        raise AgentBenchException(result.text, result.status_code, self.name)
    src.typings.exception.AgentBenchException: ('{"detail":"Error: Task does not exist"}', 400, 'alfworld-std')

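While debugging, I found that the image does start successfully. Running ps aux inside the container shows a process with the following command:

    python -m src.server.task_worker alfworld-std --self http://localhost:5010/api --port 5011 --controller http://localhost:5000/api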

Since there were no logs at all, I changed the command line and ran it by hand:

    python -m src.server.task_worker alfworld-std --self http://localhost:6011/api --port 6011 --controller http://localhost:6000/api

That revealed the problem:

    Heartbeat failed: Cannot connect to host localhost:6000 ssl:default [Connect call failed ('127.0.0.1', 6000)]

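My understanding is that this is a connectivity problem between the Docker container and the local host, which may not be able to reach each other. However, I then checked and found that the container is started with the --network host option, so in principle this problem should not occur.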

I have not found a good solution so far. Could you offer any suggestions?

My environment: Windows 11 + Docker Desktop for Windows + WSL2 (Ubuntu 20)

zhc7 commented 8 months ago

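Hi, @XiaoShihua. "Task does not exist" is probably caused by the worker failing to connect to the controller. "Heartbeat failed" occurs because no controller is running on port 6000. By default the controller runs on port 5000. Could you share which commands you ran beforehand? For example, how was the controller started?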
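
To check whether anything is listening on a given port before starting a worker by hand, a quick TCP probe can help. The following is a minimal sketch, not part of AgentBench: the helper name port_is_open is hypothetical, and the ports 5000 and 6000 are simply the ones mentioned in this thread.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and name resolution failures.
        return False


if __name__ == "__main__":
    # 5000 is the default controller port; 6000 is the port tried by hand above.
    for port in (5000, 6000):
        state = "open" if port_is_open("localhost", port) else "closed"
        print(f"localhost:{port} is {state}")
```

Running this both on the host and inside the container should show whether the two actually share the same network view of localhost.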

XiaoShihua commented 8 months ago

I first ran python -m src.start_task -a to start the images. Then python -m src.assigner fails with:

    src.typings.exception.AgentBenchException: ('{"detail":"Error: Task does not exist"}', 400, 'alfworld-std')

To track down the problem, I entered the container, looked at ps aux, and ran this by hand:

    python -m src.server.task_worker alfworld-std --self http://localhost:6011/api --port 6011 --controller http://localhost:6000/api

It reported:

    Heartbeat failed: Cannot connect to host localhost:6000 ssl:default [Connect call failed ('127.0.0.1', 6000)]

I also tried port 5000:

    python -m src.server.task_worker alfworld-std --self http://localhost:5011/api --port 5011 --controller http://localhost:5000/api

It still fails:

    Heartbeat failed: Cannot connect to host localhost:5000 ssl:default [Connect call failed ('127.0.0.1', 5000)]

XiaoShihua commented 8 months ago

My start_task.yaml is configured as follows:

definition:
  import: tasks/task_assembly.yaml

start:
  os-std: 5         
  alfworld-std: 5

My default.yaml is configured as follows:

import: definition.yaml

concurrency:
  task:
    os-std: 5
    alfworld-std: 5
  agent:
    minimax: 5
assignments: # List[Assignment] | Assignment
  - agent: # "task": List[str] | str ,  "agent": List[str] | str
      - minimax
    task:
      - os-std
      - alfworld-std

output: "outputs/{TIMESTAMP}"

zhc7 commented 8 months ago

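Were the start_task and assigner commands both run inside WSL? If assigner and start_task are not run in the same place, problems may occur. The -a flag of start_task automatically starts a controller. If you bypass start_task and run the worker directly inside Docker, there is no controller at that point, so the connection fails. If you want to start a controller manually, run python -m src.server.task_controller; it listens on port 5000 by default, and the port can be changed with the -p flag. Also, python -m src.start_task -a should normally print some output once it has run.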

XiaoShihua commented 8 months ago

Yes, both were run inside WSL. Running python -m src.start_task -a produces the following output:

    INFO: Started server process [15648]
    INFO: Waiting for application startup.
    INFO: Application startup complete.
    INFO: Uvicorn running on http://0.0.0.0:5003 (Press CTRL+C to quit)
    INFO: 127.0.0.1:58960 - "POST /api/receive_heartbeat HTTP/1.1" 200 OK
    INFO: Started server process [15650]
    INFO: Waiting for application startup.
    INFO: Application startup complete.
    INFO: Started server process [15649]
    INFO: Waiting for application startup.
    INFO: Uvicorn running on http://0.0.0.0:5005 (Press CTRL+C to quit)
    INFO: Application startup complete.
    INFO: 127.0.0.1:58966 - "POST /api/receive_heartbeat HTTP/1.1" 200 OK
    INFO: Uvicorn running on http://0.0.0.0:5004 (Press CTRL+C to quit)
    INFO: 127.0.0.1:58968 - "POST /api/receive_heartbeat HTTP/1.1" 200 OK
    <src.server.tasks.os_interaction.task.OSInteraction object at 0x7f3e25755460>
    INFO: Started server process [15646]
    INFO: Waiting for application startup.
    INFO: Application startup complete.
    INFO: Uvicorn running on http://0.0.0.0:5001 (Press CTRL+C to quit)
    INFO: 127.0.0.1:58978 - "POST /api/receive_heartbeat HTTP/1.1" 200 OK
    INFO: Started server process [1]
    INFO: Waiting for application startup.
    INFO: Application startup complete.
    INFO: Uvicorn running on http://0.0.0.0:5009 (Press CTRL+C to quit)
    INFO: Started server process [1]
    INFO: Waiting for application startup.
    INFO: Application startup complete.
    INFO: Uvicorn running on http://0.0.0.0:5007 (Press CTRL+C to quit)
    INFO: Started server process [1]
    INFO: Waiting for application startup.
    INFO: Application startup complete.
    INFO: Uvicorn running on http://0.0.0.0:5006 (Press CTRL+C to quit)
    <src.server.tasks.os_interaction.task.OSInteraction object at 0x7faaa2a8a7f0>
    INFO: Started server process [15647]
    INFO: Waiting for application startup.
    INFO: Application startup complete.
    INFO: Uvicorn running on http://0.0.0.0:5002 (Press CTRL+C to quit)
    INFO: 127.0.0.1:59030 - "POST /api/receive_heartbeat HTTP/1.1" 200 OK
    INFO: Started server process [1]
    INFO: Waiting for application startup.
    INFO: Application startup complete.
    INFO: Uvicorn running on http://0.0.0.0:5008 (Press CTRL+C to quit)
    INFO: Started server process [1]
    INFO: Waiting for application startup.
    INFO: Application startup complete.
    INFO: Uvicorn running on http://0.0.0.0:5010 (Press CTRL+C to quit)
    INFO: 127.0.0.1:45202 - "POST /api/receive_heartbeat HTTP/1.1" 200 OK
    INFO: 127.0.0.1:45202 - "POST /api/receive_heartbeat HTTP/1.1" 200 OK
    INFO: 127.0.0.1:42476 - "POST /api/receive_heartbeat HTTP/1.1" 200 OK

In src.server.task_controller I located the place where the error is raised and added a print statement:

    async def get_indices(self, name: str):
        async with self.tasks_lock:
            if name not in self.tasks:
                print("1111", name, self.tasks)
                raise HTTPException(400, "Error: Task does not exist")
            return self.tasks[name].indices

This produced the following output:

1111 alfworld-std {'os-std': <__main__.TaskData object at 0x7f82763b6340>}
INFO:     127.0.0.1:42428 - "GET /api/get_indices?name=alfworld-std HTTP/1.1" 400 Bad Request

It looks like the alfworld-std task was never added to self.tasks, which is why the error is raised.
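
The behaviour seen in the debug output can be reproduced with a toy model. This is only a sketch under assumptions: ToyController, TaskNotFound, and demo are hypothetical names, the idea that a heartbeat is what registers a task is inferred from the /api/receive_heartbeat log lines, and none of this is AgentBench's actual controller code.

```python
import asyncio
from typing import Dict, List


class TaskNotFound(Exception):
    """Stand-in for the HTTPException(400, ...) raised in get_indices."""


class ToyController:
    """Hypothetical, simplified task registry (not AgentBench's class)."""

    def __init__(self) -> None:
        self.tasks: Dict[str, List[int]] = {}
        self.tasks_lock = asyncio.Lock()

    async def receive_heartbeat(self, name: str, indices: List[int]) -> None:
        # Assumption: a worker announcing itself is what makes a task known.
        async with self.tasks_lock:
            self.tasks[name] = indices

    async def get_indices(self, name: str) -> List[int]:
        async with self.tasks_lock:
            if name not in self.tasks:
                raise TaskNotFound("Error: Task does not exist")
            return self.tasks[name]


async def demo() -> List[str]:
    controller = ToyController()
    # Only the os-std worker reaches the controller in this scenario.
    await controller.receive_heartbeat("os-std", [0, 1, 2])
    results = []
    for name in ("os-std", "alfworld-std"):
        try:
            await controller.get_indices(name)
            results.append(f"{name}: registered")
        except TaskNotFound as err:
            results.append(f"{name}: {err}")
    return results


if __name__ == "__main__":
    print("\n".join(asyncio.run(demo())))
```

In this model the error is a symptom of a worker that never reached the controller, so the lookup code itself is not at fault.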

zhc7 commented 8 months ago

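Judging from the output, the startup process is completely normal. In principle, a worker connects to the controller automatically after it starts, and the controller then records the worker. One possibility is that assigner started too quickly, before the alfworld-std worker had finished starting. At that point the controller and the worker were not yet connected, which caused the error.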
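
If the race described here is the cause, one workaround is to wait until the controller reports the task before launching assigner. The sketch below polls the get_indices endpoint that appears in the controller log earlier in this thread. The helper names wait_for_task and http_status, the retry counts, and the assumption that a 200 response means the worker is registered are all illustrative, not part of AgentBench.

```python
import time
import urllib.error
import urllib.parse
import urllib.request
from typing import Callable


def http_status(url: str, timeout: float = 5.0) -> int:
    """Return the HTTP status code of a GET request, or 0 if unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code  # e.g. 400 while the task is not registered yet
    except (urllib.error.URLError, OSError):
        return 0  # controller not reachable at all


def wait_for_task(
    controller: str,
    task: str,
    attempts: int = 30,
    delay: float = 2.0,
    get_status: Callable[[str], int] = http_status,
) -> bool:
    """Poll get_indices until the controller knows the task, or give up."""
    url = f"{controller}/get_indices?name={urllib.parse.quote(task)}"
    for _ in range(attempts):
        if get_status(url) == 200:
            return True
        time.sleep(delay)
    return False


if __name__ == "__main__":
    ready = wait_for_task("http://localhost:5000/api", "alfworld-std")
    print("ready" if ready else "alfworld-std never registered")
```

If this reports that the task never registered even after a long wait, the problem is more likely connectivity between the worker and the controller than startup timing.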

XiaoShihua commented 8 months ago

It seems Docker Desktop for Windows was the cause. After installing Docker directly inside WSL, everything runs successfully.

zhc7 commented 8 months ago

If you run into any further problems, feel free to open a new issue!