FederatedAI / AnsibleFATE

Apache License 2.0

Nodes cannot communicate after deploying FATE 2.0.0 with AnsibleFATE #27

Closed xinushio closed 1 week ago

xinushio commented 4 months ago

Each node passes the single-party test, but the two-party test fails:

(venv) app@VM_0_1_centos:/home/guo$ flow test toy -gid 9999 -hid 10000
{
    "code": 1002,
    "data": {
        "model_id": "202402280731033092380",
        "model_version": "0"
    },
    "job_id": "202402280731033092380",
    "message": "Traceback (most recent call last):\n File \"/data/projects/fate/fate_flow/python/fate_flow/scheduler/scheduler.py\", line 376, in create_all_job\n raise Exception(\"create job failed\", response)\nException: ('create job failed', {'guest': {'9999': {'code': 0, 'message': 'success'}}, 'host': {'10000': {'code': 104, 'message': 'Federated schedule error, <_InactiveRpcError of RPC that terminated with:\n\tstatus = StatusCode.UNKNOWN\n\tdetails = \"\"\n\tdebug_error_string = \"UNKNOWN:Error received from peer {grpc_message:\"\", grpc_status:2, created_time:\"2024-02-28T07:31:03.581646839+00:00\"}\"\n>'}}})\n"
}
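For anyone triaging a similar response, here is a minimal sketch of how to see which party rejected the job. It assumes the per-party result has the role / party ID / `code` shape shown in the exception above, with `code` 0 meaning success. The helper name `failed_parties` is hypothetical and is not part of FATE Flow, and the `message` text is truncated here for brevity.

```python
from typing import Any, Dict, List, Tuple


def failed_parties(
    response: Dict[str, Dict[str, Dict[str, Any]]]
) -> List[Tuple[str, str, Any]]:
    """Return (role, party_id, code) for every party whose code is not 0."""
    failures = []
    for role, parties in response.items():
        for party_id, result in parties.items():
            code = result.get("code")
            if code != 0:  # assumption: 0 means success, as in the guest entry
                failures.append((role, party_id, code))
    return failures


# Per-party result copied from the exception in this issue
# (the host message is truncated for readability).
response = {
    "guest": {"9999": {"code": 0, "message": "success"}},
    "host": {"10000": {"code": 104, "message": "Federated schedule error, ..."}},
}

print(failed_parties(response))  # [('host', '10000', 104)]
```

In this issue only the host side (party 10000) fails, which points to the guest-to-host link rather than to the guest's local services.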

burstlink commented 3 months ago

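I ran into this as well. In my case the cause was that I passed -k="host|exchange" at deploy time but never went through the certificate creation process. After reinstalling without -k, it worked.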

robbie228 commented 1 week ago

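Judging from the error, the route between the parties is probably unreachable. A routing misconfiguration or a certificate problem can cause this. Could you describe the exact deployment steps?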
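As a first check for the "route unreachable" case, here is a minimal sketch that tests whether one party can open a plain TCP connection to the other party's exchange port. The function name `can_connect` is hypothetical, and the peer address and port must come from your own route table or deployment config. A successful connection only rules out basic network problems; it does not validate certificates or the routing entries themselves.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and resolution failures.
        return False
```

Run it on the guest machine against the host's address, for example `can_connect("<host-ip>", <port>)`, and then in the opposite direction. If it returns False, look at firewalls and the route table first; if it returns True, the certificate setup mentioned earlier in this thread is the next thing to check.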

xinushio commented 1 week ago

It has been quite a while, but I believe I followed this guide: https://github.com/FederatedAI/FATE/blob/v2.0.0/deploy/cluster-deploy/allinone/fate-allinone_deployment_guide.zh.md