Closed XiaoShihua closed 8 months ago
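Hi, @XiaoShihua. "Task does not exist" is probably caused by the worker not being correctly connected to the controller. The heartbeat failed because no controller is running on port 6000; by default the controller runs on port 5000. Could you share which commands you ran before this? For example, how was the controller started?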
I first ran python -m src.start_task -a to start the Docker images. Then python -m src.assigner reported:
src.typings.exception.AgentBenchException: ('{"detail":"Error: Task does not exist"}', 400, 'alfworld-std')
To track down the problem, I entered the container, checked ps aux, and manually ran:
python -m src.server.task_worker alfworld-std --self http://localhost:6011/api --port 6011 --controller http://localhost:6000/api
which failed with:
Heartbeat failed: Cannot connect to host localhost:6000 ssl:default [Connect call failed ('127.0.0.1', 6000)]
I also tried port 5000:
python -m src.server.task_worker alfworld-std --self http://localhost:5011/api --port 5011 --controller http://localhost:5000/api
and it still failed with:
Heartbeat failed: Cannot connect to host localhost:5000 ssl:default [Connect call failed ('127.0.0.1', 5000)]
My start_task.yaml file is configured as follows:
definition:
  import: tasks/task_assembly.yaml
start:
  os-std: 5
  alfworld-std: 5
My default.yaml is configured as follows:
import: definition.yaml
concurrency:
  task:
    os-std: 5
    alfworld-std: 5
  agent:
    minimax: 5
assignments: # List[Assignment] | Assignment
  - agent: # "task": List[str] | str , "agent": List[str] | str
      - minimax
    task:
      - os-std
      - alfworld-std
output: "outputs/{TIMESTAMP}"
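Are the start_task and assigner commands both run inside WSL? If assigner and start_task are not run in the same place, problems may occur. The -a flag of start_task automatically starts a controller. If you bypass start_task and run the worker directly inside Docker, no controller exists at that point, so the connection fails. To start a controller manually, run python -m src.server.task_controller; it listens on port 5000 by default, which can be changed with the -p flag. Also, python -m src.start_task -a should normally print some output once it has run.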
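One quick way to see whether anything is listening on the controller port before starting a worker by hand is a plain TCP connect test. The sketch below is only an illustration built on Python's standard library; the helper name `port_is_open` is made up here and is not part of AgentBench, and the host and port are simply the defaults mentioned in this thread.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to (host, port) succeeds."""
    try:
        # create_connection raises OSError (for example ConnectionRefusedError)
        # when nothing is listening or the host cannot be reached.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # 5000 is the controller's default port according to this thread.
    print("controller reachable:", port_is_open("localhost", 5000))
```

Running the same check both in WSL and inside the container may help show whether the two actually share a network namespace.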
Yes, everything is run inside WSL. Starting python -m src.start_task -a produces the following output:
INFO: Started server process [15648]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:5003 (Press CTRL+C to quit)
INFO: 127.0.0.1:58960 - "POST /api/receive_heartbeat HTTP/1.1" 200 OK
INFO: Started server process [15650]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Started server process [15649]
INFO: Waiting for application startup.
INFO: Uvicorn running on http://0.0.0.0:5005 (Press CTRL+C to quit)
INFO: Application startup complete.
INFO: 127.0.0.1:58966 - "POST /api/receive_heartbeat HTTP/1.1" 200 OK
INFO: Uvicorn running on http://0.0.0.0:5004 (Press CTRL+C to quit)
INFO: 127.0.0.1:58968 - "POST /api/receive_heartbeat HTTP/1.1" 200 OK
<src.server.tasks.os_interaction.task.OSInteraction object at 0x7f3e25755460>
INFO: Started server process [15646]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:5001 (Press CTRL+C to quit)
INFO: 127.0.0.1:58978 - "POST /api/receive_heartbeat HTTP/1.1" 200 OK
INFO: Started server process [1]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:5009 (Press CTRL+C to quit)
INFO: Started server process [1]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:5007 (Press CTRL+C to quit)
INFO: Started server process [1]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:5006 (Press CTRL+C to quit)
<src.server.tasks.os_interaction.task.OSInteraction object at 0x7faaa2a8a7f0>
INFO: Started server process [15647]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:5002 (Press CTRL+C to quit)
INFO: 127.0.0.1:59030 - "POST /api/receive_heartbeat HTTP/1.1" 200 OK
INFO: Started server process [1]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:5008 (Press CTRL+C to quit)
INFO: Started server process [1]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:5010 (Press CTRL+C to quit)
INFO: 127.0.0.1:45202 - "POST /api/receive_heartbeat HTTP/1.1" 200 OK
INFO: 127.0.0.1:45202 - "POST /api/receive_heartbeat HTTP/1.1" 200 OK
INFO: 127.0.0.1:42476 - "POST /api/receive_heartbeat HTTP/1.1" 200 OK
I located where the error is raised in src.server.task_controller and added a print statement:
async def get_indices(self, name: str):
    async with self.tasks_lock:
        if name not in self.tasks:
            print("1111", name, self.tasks)
            raise HTTPException(400, "Error: Task does not exist")
        return self.tasks[name].indices
which prints the following:
1111 alfworld-std {'os-std': <__main__.TaskData object at 0x7f82763b6340>}
INFO: 127.0.0.1:42428 - "GET /api/get_indices?name=alfworld-std HTTP/1.1" 400 Bad Request
It looks as if the alfworld-std task was never added to self.tasks, hence the error.
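Judging from the output, the startup process is completely normal. In principle, a worker connects to the controller automatically after it starts, and the controller then records that worker. One possibility is that the assigner was started too quickly, before the alfworld-std worker had finished starting; at that point the controller and the worker were not yet connected, which caused the error.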
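If a startup race of this kind is the cause, one workaround is to wait until the controller knows the task before launching the assigner. The following is a minimal sketch and not part of AgentBench: it polls the `GET /api/get_indices?name=...` endpoint that appears in the controller log until it stops returning an error. The function names `http_status` and `wait_for_task`, the timeout values, and the injectable `fetch_status` callable are assumptions made for illustration.

```python
import time
import urllib.error
import urllib.parse
import urllib.request
from typing import Callable


def http_status(url: str) -> int:
    """Return the HTTP status code of a GET request, or 0 if unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        return exc.code  # e.g. 400 for "Task does not exist"
    except (urllib.error.URLError, OSError):
        return 0  # controller not reachable yet


def wait_for_task(
    controller: str,
    name: str,
    timeout: float = 120.0,
    interval: float = 2.0,
    fetch_status: Callable[[str], int] = http_status,
) -> bool:
    """Poll the controller until `name` is registered or `timeout` elapses."""
    query = urllib.parse.urlencode({"name": name})
    url = f"{controller}/get_indices?{query}"
    deadline = time.monotonic() + timeout
    while True:
        if fetch_status(url) == 200:
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

A call such as `wait_for_task("http://localhost:5000/api", "alfworld-std")` could be run after `python -m src.start_task -a`, starting `python -m src.assigner` only once it returns `True`. If it times out, the worker most likely never registered, which points away from a simple timing issue.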
It seems to be caused by Docker Desktop for Windows. After installing Docker directly inside WSL, everything runs successfully.
If you run into any further problems, feel free to open a new issue!
Hello, I can now run the os, dbbench, and kg evaluations, but every task that relies on a Docker image fails with this error:
  File "/mnt/d/pyProject/AgentBench-main/src/assigner.py", line 425, in <module>
    Assigner(value, args.retry).start(tqdm_out=orig_stdout)
  File "/mnt/d/pyProject/AgentBench-main/src/assigner.py", line 94, in __init__
    self.task_indices[task] = self.tasks[task].get_indices()
  File "/mnt/d/pyProject/AgentBench-main/src/client/task.py", line 32, in get_indices
    raise AgentBenchException(result.text, result.status_code, self.name)
src.typings.exception.AgentBenchException: ('{"detail":"Error: Task does not exist"}', 400, 'alfworld-std')
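On further investigation, I found that the image had started successfully. Inside the container, ps aux showed a process with the following command: python -m src.server.task_worker alfworld-std --self http://localhost:5010/api --port 5011 --controller http://localhost:5000/api
Since there were no logs at all, I modified the command and ran it manually: python -m src.server.task_worker alfworld-std --self http://localhost:6011/api --port 6011 --controller http://localhost:6000/api
which revealed the problem: Heartbeat failed: Cannot connect to host localhost:6000 ssl:default [Connect call failed ('127.0.0.1', 6000)]
My understanding is that this is a connectivity problem between the Docker container and the host IP; the two may not be able to reach each other. However, I then checked and found that the container is started with the --network host option, so in principle this problem should not occur.
I have not found a good solution so far. Could you offer any suggestions?
My environment: Windows 11 + Docker Desktop for Windows + WSL2 (Ubuntu 20)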