FederatedAI / FATE

An Industrial Grade Federated Learning Framework
Apache License 2.0

Cluster deployment when federated learning nodes sit on networks that cannot reach each other, and only one intermediate node can reach every node #4634

Closed desertfoxfj closed 2 months ago

desertfoxfj commented 1 year ago

Describe the bug In our project scenario, the federated learning nodes are on different networks and cannot reach each other directly; only one intermediate node can communicate with every node.

As I understand it, the intermediate node should host the exchange, with the party information for the different networks configured in /data/projects/fate/eggroll/conf/route_table.json, and passwordless SSH set up to each node. Each party node is deployed with the single-node AllinOne method; its /data/projects/fate/eggroll/conf/route_table.json is then modified so that the default route points to the deployed exchange, and rollsite is restarted after the change.

My question: the official AllinOne deployment guide requires configuring both host and guest. In a single-node deployment, should host and guest use the same configuration, or should only one of them be kept?

dylan-fan commented 1 year ago

Configuring just one is fine.

desertfoxfj commented 1 year ago

I set up a cluster and ran into a problem. Topology:

- 192.168.113.207 exchange
- 192.168.113.208 party_id 10000
- 192.168.113.209 party_id 9999

route_table.json on the exchange (192.168.113.207):

```json
{
  "route_table": {
    "10000": {
      "default": [
        { "port": 9370, "ip": "192.168.113.208" }
      ]
    },
    "9999": {
      "default": [
        { "port": 9370, "ip": "192.168.113.209" }
      ]
    }
  },
  "permission": {
    "default_allow": true
  }
}
```
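As a sanity check on the exchange-side table above, it can be loaded and resolved with a few lines of Python (a hypothetical helper for illustration, not part of FATE or eggroll):

```python
import json

# The route_table.json content as deployed on the exchange (192.168.113.207).
ROUTE_TABLE = """
{
  "route_table": {
    "10000": {"default": [{"port": 9370, "ip": "192.168.113.208"}]},
    "9999":  {"default": [{"port": 9370, "ip": "192.168.113.209"}]}
  },
  "permission": {"default_allow": true}
}
"""

def resolve(table: dict, party_id: str) -> tuple:
    """Return (ip, port) for a party, falling back to a 'default' entry if present."""
    routes = table["route_table"]
    entry = routes.get(party_id, routes.get("default"))
    if entry is None:
        raise KeyError(f"no route for party {party_id} and no default route")
    target = entry["default"][0]
    return target["ip"], target["port"]

table = json.loads(ROUTE_TABLE)
print(resolve(table, "10000"))  # ('192.168.113.208', 9370)
print(resolve(table, "9999"))   # ('192.168.113.209', 9370)
```

Running this against the actual file catches malformed JSON and missing party entries before rollsite is restarted.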

setup.conf for the single-node install on 192.168.113.208 (party_id 10000):

```
# host party id
host_id="10000"

# host ip
host_ip="192.168.113.208"
```

route_table.json on 192.168.113.208 after the single-node install, with the default route pointing at the exchange:

```json
{
  "route_table": {
    "default": {
      "default": [
        { "port": 9370, "ip": "192.168.113.207" }
      ]
    }
  },
  "permission": {
    "default_allow": true
  }
}
```

setup.conf for the single-node install on 192.168.113.209 (party_id 9999):

```
# host party id
host_id="9999"

# host ip
host_ip="192.168.113.209"
```

route_table.json on 192.168.113.209 after the single-node install, with the default route pointing at the exchange:

```json
{
  "route_table": {
    "default": {
      "default": [
        { "port": 9370, "ip": "192.168.113.207" }
      ]
    }
  },
  "permission": {
    "default_allow": true
  }
}
```

Single-party toy tests pass on both 192.168.113.208 and 192.168.113.209:

```
flow test toy -gid 10000 -hid 10000
flow test toy -gid 9999 -hid 9999
```

The two-party toy test on 192.168.113.209 fails:

```
(venv) app@fate3:/data/projects/fate/eggroll$ flow test toy -gid 9999 -hid 10000
{
    "jobId": "202302221247060213060",
    "retcode": 103,
    "retmsg": "Traceback (most recent call last):\n File \"/data/projects/fate/fateflow/python/fate_flow/scheduler/dag_scheduler.py\", line 142, in submit\n raise Exception(\"create job failed\", response)\nException: ('create job failed', {'guest': {9999: {'retcode': <RetCode.FEDERATED_ERROR: 104>, 'retmsg': 'Federated schedule error, <_InactiveRpcError of RPC that terminated with:\n\tstatus = StatusCode.DEADLINE_EXCEEDED\n\tdetails = \"Deadline Exceeded\"\n\tdebug_error_string = \"{\"created\":\"@1677070122.088949519\",\"description\":\"Deadline Exceeded\",\"file\":\"src/core/ext/filters/deadline/deadline_filter.cc\",\"file_line\":81,\"grpc_status\":4}\"\n>'}}, 'host': {10000: {'retcode': <RetCode.FEDERATED_ERROR: 104>, 'retmsg': 'Federated schedule error, <_InactiveRpcError of RPC that terminated with:\n\tstatus = StatusCode.DEADLINE_EXCEEDED\n\tdetails = \"Deadline Exceeded\"\n\tdebug_error_string = \"{\"created\":\"@1677070218.100983527\",\"description\":\"Deadline Exceeded\",\"file\":\"src/core/ext/filters/deadline/deadline_filter.cc\",\"file_line\":81,\"grpc_status\":4}\"\n>'}}})\n"
}
```
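DEADLINE_EXCEEDED generally means the gRPC call never got a response from a live endpoint before the 30 s timeout. Before digging into FATE itself, it can help to confirm raw TCP reachability of each rollsite port (9370) from each node; a small stand-alone checker (a hypothetical helper, not part of FATE):

```python
import socket

def port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Try a plain TCP connect; True if something is listening and reachable."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

if __name__ == "__main__":
    # From each party node the exchange's rollsite must be reachable,
    # and from the exchange each party's rollsite must be reachable.
    for host in ("192.168.113.207", "192.168.113.208", "192.168.113.209"):
        print(host, "9370 reachable:", port_open(host, 9370))
```

If any required direction reports unreachable, the problem is network/firewall rather than FATE configuration.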

rollsite error on 192.168.113.207 (`tail -f /data/projects/fate/eggroll/logs/eggroll/rollsite.jvm.log`):

```
[ERROR][67646][2023-02-22 12:48:10,508][grpc-server-9370-664,pid:7881,tid:686][c.w.e.r.EggSiteServicer:144] - [UNARYCALL][SERVER] onError. rsKey=__rsk#######, metadata={"task":{"taskId":"202302221247060213060","model":{"name":"headers","dataKey":"{\"User-Agent\": \"fateflow/1.10.0\", \"service\": \"fateflow\", \"src_fate_ver\": \"1.10.0\", \"src_party_id\": \"9999\", \"dest_party_id\": \"9999\", \"src_role\": \"guest\"}"}},"src":{"name":"202302221247060213060","partyId":"9999","role":"fateflow","callback":{"ip":"192.168.113.209","port":9360,"hostname":""}},"dst":{"name":"202302221247060213060","partyId":"9999","role":"fateflow"},"command":{"name":"/v1/party/202302221247060213060/guest/9999/create"},"operator":"POST","seq":"0","ack":"0","conf":{"overallTimeout":"30000","completionWaitTimeout":"0","packetIntervalTimeout":"0","maxRetries":0},"ext":"","version":""}
io.grpc.StatusRuntimeException: CANCELLED: [Roll Site Error TransInfo] location msg=CANCELLED: io.grpc.Context was cancelled without error
stack info=io.grpc.StatusRuntimeException: CANCELLED: io.grpc.Context was cancelled without error
    at io.grpc.stub.ClientCalls.toStatusRuntimeException(ClientCalls.java:262)
    at io.grpc.stub.ClientCalls.getUnchecked(ClientCalls.java:243)
    at io.grpc.stub.ClientCalls.blockingUnaryCall(ClientCalls.java:156)
    at com.webank.ai.eggroll.api.networking.proxy.DataTransferServiceGrpc$DataTransferServiceBlockingStub.unaryCall(DataTransferServiceGrpc.java:348)
    at com.webank.eggroll.rollsite.EggSiteServicer.unaryCall(EggSiteServicer.scala:138)
    at com.webank.ai.eggroll.api.networking.proxy.DataTransferServiceGrpc$MethodHandlers.invoke(DataTransferServiceGrpc.java:406)
    at io.grpc.stub.ServerCalls$UnaryServerCallHandler$UnaryServerCallListener.onHalfClose(ServerCalls.java:180)
    at io.grpc.PartialForwardingServerCallListener.onHalfClose(PartialForwardingServerCallListener.java:35)
    at io.grpc.ForwardingServerCallListener.onHalfClose(ForwardingServerCallListener.java:23)
    at io.grpc.ForwardingServerCallListener$SimpleForwardingServerCallListener.onHalfClose(ForwardingServerCallListener.java:40)
    at io.grpc.Contexts$ContextualizedServerCallListener.onHalfClose(Contexts.java:86)
    at io.grpc.internal.ServerCallImpl$ServerStreamListenerImpl.halfClosed(ServerCallImpl.java:331)
    at io.grpc.internal.ServerImpl$JumpToApplicationThreadServerStreamListener$1HalfClosed.runInContext(ServerImpl.java:814)
    at io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)
    at io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:123)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

exception trans path: 127.0.0.1(10001)
    at io.grpc.Status.asRuntimeException(Status.java:524) ~[grpc-api-1.31.2.jar:1.31.2]
    at com.webank.eggroll.rollsite.TransferExceptionUtils$.throwableToException(TransferExceptionUtils.scala:43) ~[eggroll-roll-site-2.4.8.jar:?]
    at com.webank.eggroll.rollsite.EggSiteServicer.unaryCall(EggSiteServicer.scala:154) [eggroll-roll-site-2.4.8.jar:?]
    at com.webank.ai.eggroll.api.networking.proxy.DataTransferServiceGrpc$MethodHandlers.invoke(DataTransferServiceGrpc.java:406) [eggroll-core-2.4.8.jar:?]
    at io.grpc.stub.ServerCalls$UnaryServerCallHandler$UnaryServerCallListener.onHalfClose(ServerCalls.java:180) [grpc-stub-1.31.2.jar:1.31.2]
    at io.grpc.PartialForwardingServerCallListener.onHalfClose(PartialForwardingServerCallListener.java:35) [grpc-api-1.31.2.jar:1.31.2]
    at io.grpc.ForwardingServerCallListener.onHalfClose(ForwardingServerCallListener.java:23) [grpc-api-1.31.2.jar:1.31.2]
    at io.grpc.ForwardingServerCallListener$SimpleForwardingServerCallListener.onHalfClose(ForwardingServerCallListener.java:40) [grpc-api-1.31.2.jar:1.31.2]
    at io.grpc.Contexts$ContextualizedServerCallListener.onHalfClose(Contexts.java:86) [grpc-api-1.31.2.jar:1.31.2]
    at io.grpc.internal.ServerCallImpl$ServerStreamListenerImpl.halfClosed(ServerCallImpl.java:331) [grpc-core-1.31.2.jar:1.31.2]
    at io.grpc.internal.ServerImpl$JumpToApplicationThreadServerStreamListener$1HalfClosed.runInContext(ServerImpl.java:814) [grpc-core-1.31.2.jar:1.31.2]
    at io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37) [grpc-core-1.31.2.jar:1.31.2]
    at io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:123) [grpc-core-1.31.2.jar:1.31.2]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_261]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_261]
    at java.lang.Thread.run(Thread.java:748) [?:1.8.0_261]
```

The same error appears in the error log (`tail -f /data/projects/fate/eggroll/logs/eggroll/rollsite.jvm.err.log`):

```
[ERROR][389887][2023-02-22 12:53:32,749][grpc-server-9370-306,pid:7881,tid:328][c.w.e.r.EggSiteServicer:144] - [UNARYCALL][SERVER] onError. rsKey=__rsk#######, metadata={"task":{"taskId":"202302221247060213060","model":{"name":"headers","dataKey":"{\"User-Agent\": \"fateflow/1.10.0\", \"service\": \"fateflow\", \"src_fate_ver\": \"1.10.0\", \"src_party_id\": \"9999\", \"dest_party_id\": \"10000\", \"src_role\": \"guest\"}"}},"src":{"name":"202302221247060213060","partyId":"9999","role":"fateflow","callback":{"ip":"192.168.113.209","port":9360,"hostname":""}},"dst":{"name":"202302221247060213060","partyId":"10000","role":"fateflow"},"command":{"name":"/v1/party/202302221247060213060/host/10000/status/failed"},"operator":"POST","seq":"0","ack":"0","conf":{"overallTimeout":"30000","completionWaitTimeout":"0","packetIntervalTimeout":"0","maxRetries":0},"ext":"","version":""}
io.grpc.StatusRuntimeException: CANCELLED: [Roll Site Error TransInfo] location msg=CANCELLED: io.grpc.Context was cancelled without error
stack info=io.grpc.StatusRuntimeException: CANCELLED: io.grpc.Context was cancelled without error
    at io.grpc.stub.ClientCalls.toStatusRuntimeException(ClientCalls.java:262)
    at io.grpc.stub.ClientCalls.getUnchecked(ClientCalls.java:243)
    at io.grpc.stub.ClientCalls.blockingUnaryCall(ClientCalls.java:156)
    at com.webank.ai.eggroll.api.networking.proxy.DataTransferServiceGrpc$DataTransferServiceBlockingStub.unaryCall(DataTransferServiceGrpc.java:348)
    at com.webank.eggroll.rollsite.EggSiteServicer.unaryCall(EggSiteServicer.scala:138)
    at com.webank.ai.eggroll.api.networking.proxy.DataTransferServiceGrpc$MethodHandlers.invoke(DataTransferServiceGrpc.java:406)
    at io.grpc.stub.ServerCalls$UnaryServerCallHandler$UnaryServerCallListener.onHalfClose(ServerCalls.java:180)
    at io.grpc.PartialForwardingServerCallListener.onHalfClose(PartialForwardingServerCallListener.java:35)
    at io.grpc.ForwardingServerCallListener.onHalfClose(ForwardingServerCallListener.java:23)
    at io.grpc.ForwardingServerCallListener$SimpleForwardingServerCallListener.onHalfClose(ForwardingServerCallListener.java:40)
    at io.grpc.Contexts$ContextualizedServerCallListener.onHalfClose(Contexts.java:86)
    at io.grpc.internal.ServerCallImpl$ServerStreamListenerImpl.halfClosed(ServerCallImpl.java:331)
    at io.grpc.internal.ServerImpl$JumpToApplicationThreadServerStreamListener$1HalfClosed.runInContext(ServerImpl.java:814)
    at io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)
    at io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:123)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

exception trans path: 127.0.0.1(10001)
    at io.grpc.Status.asRuntimeException(Status.java:524) ~[grpc-api-1.31.2.jar:1.31.2]
    at com.webank.eggroll.rollsite.TransferExceptionUtils$.throwableToException(TransferExceptionUtils.scala:43) ~[eggroll-roll-site-2.4.8.jar:?]
    at com.webank.eggroll.rollsite.EggSiteServicer.unaryCall(EggSiteServicer.scala:154) [eggroll-roll-site-2.4.8.jar:?]
    at com.webank.ai.eggroll.api.networking.proxy.DataTransferServiceGrpc$MethodHandlers.invoke(DataTransferServiceGrpc.java:406) [eggroll-core-2.4.8.jar:?]
    at io.grpc.stub.ServerCalls$UnaryServerCallHandler$UnaryServerCallListener.onHalfClose(ServerCalls.java:180) [grpc-stub-1.31.2.jar:1.31.2]
    at io.grpc.PartialForwardingServerCallListener.onHalfClose(PartialForwardingServerCallListener.java:35) [grpc-api-1.31.2.jar:1.31.2]
    at io.grpc.ForwardingServerCallListener.onHalfClose(ForwardingServerCallListener.java:23) [grpc-api-1.31.2.jar:1.31.2]
    at io.grpc.ForwardingServerCallListener$SimpleForwardingServerCallListener.onHalfClose(ForwardingServerCallListener.java:40) [grpc-api-1.31.2.jar:1.31.2]
    at io.grpc.Contexts$ContextualizedServerCallListener.onHalfClose(Contexts.java:86) [grpc-api-1.31.2.jar:1.31.2]
    at io.grpc.internal.ServerCallImpl$ServerStreamListenerImpl.halfClosed(ServerCallImpl.java:331) [grpc-core-1.31.2.jar:1.31.2]
    at io.grpc.internal.ServerImpl$JumpToApplicationThreadServerStreamListener$1HalfClosed.runInContext(ServerImpl.java:814) [grpc-core-1.31.2.jar:1.31.2]
    at io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37) [grpc-core-1.31.2.jar:1.31.2]
    at io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:123) [grpc-core-1.31.2.jar:1.31.2]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_261]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_261]
    at java.lang.Thread.run(Thread.java:748) [?:1.8.0_261]
```
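With the configuration posted above, a message from party 9999 to party 10000 should take two hops: 9999's rollsite forwards everything to the exchange via its `default` route, and the exchange forwards to 10000's rollsite via its `10000` entry. A simplified sketch of that chained lookup (a hypothetical model for reasoning about the tables, not eggroll's actual routing code):

```python
# Each node picks the route for the destination party id, falling back to
# its "default" entry (which on a party node points at the exchange).
PARTY_9999_TABLE = {"default": {"default": [{"ip": "192.168.113.207", "port": 9370}]}}
EXCHANGE_TABLE = {
    "10000": {"default": [{"ip": "192.168.113.208", "port": 9370}]},
    "9999":  {"default": [{"ip": "192.168.113.209", "port": 9370}]},
}

def next_hop(route_table: dict, dst_party: str) -> tuple:
    """Resolve the next hop for a destination party, with default fallback."""
    entry = route_table.get(dst_party) or route_table["default"]
    target = entry["default"][0]
    return target["ip"], target["port"]

hop1 = next_hop(PARTY_9999_TABLE, "10000")  # party 9999 -> exchange
hop2 = next_hop(EXCHANGE_TABLE, "10000")    # exchange -> party 10000's rollsite
print(hop1, hop2)
```

If either hop in practice resolves to something else, such as the `127.0.0.1(10001)` seen in the trans path above, the route_table.json on that node (and whether rollsite was actually restarted after editing it) is worth re-checking.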

github-actions[bot] commented 2 months ago

This issue has been marked as stale because it has been open for 365 days with no activity. If this issue is still relevant or if there is new information, please feel free to update or reopen it.

github-actions[bot] commented 2 months ago

This issue was closed because it has been inactive for 1 day since being marked as stale. If this issue is still relevant or if there is new information, please feel free to update or reopen it.