Closed desertfoxfj closed 2 months ago
The same issue seems to occur with the docker deploy at version 1.9.0.
It was solved by using version 1.9.2.
FYI for anyone facing the same issue.
This issue has been marked as stale because it has been open for 365 days with no activity. If this issue is still relevant or if there is new information, please feel free to update or reopen it.
This issue was closed because it has been inactive for 1 day since being marked as stale. If this issue is still relevant or if there is new information, please feel free to update or reopen it.
Describe the bug
KubeFATE 1.9.0, two-host deployment.
The parties.conf configuration is as follows:
#!/bin/bash

user=root
dir=/data/projects/fate
party_list=(10000 9999)
party_ip_list=(192.168.113.171 192.168.113.172)
serving_ip_list=(192.168.113.171 192.168.113.172)

# Engines:
# Computing : Eggroll, Spark, Spark_local
computing=Eggroll
# Federation: Eggroll(computing: Eggroll), Pulsar/RabbitMQ(computing: Spark/Spark_local)
federation=Eggroll
# Storage: Eggroll(computing: Eggroll), HDFS(computing: Spark), LocalFS(computing: Spark_local)
storage=Eggroll
# Algorithm: Basic, NN
algorithm=Basic
# Device: IPCL, CPU
device=CPU

# spark and eggroll
compute_core=8

# default
exchangeip=

# modify if you are going to use an external db
mysql_ip=mysql
mysql_user=fate
mysql_password=fate_dev
mysql_db=fate_flow

name_node=hdfs://namenode:9000

# Define fateboard login information
fateboard_username=admin
fateboard_password=admin

# Define serving admin login information
serving_admin_username=admin
serving_admin_password=admin
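For anyone comparing their own setup, here is a minimal sketch of how the party-to-host mapping in the configuration above could be read back and sanity-checked. It is not part of KubeFATE: the helper names `parse_bash_array` and `party_hosts` are hypothetical, and the parser assumes each array is written on one line as `name=(a b c)`.

```python
import re

# Excerpt of the parties.conf values quoted in this issue.
SAMPLE_CONF = """\
party_list=(10000 9999)
party_ip_list=(192.168.113.171 192.168.113.172)
serving_ip_list=(192.168.113.171 192.168.113.172)
"""


def parse_bash_array(conf_text, name):
    """Return the items of a one-line bash array assignment, e.g. name=(a b c).

    Hypothetical helper: it does not handle quoting or multi-line arrays.
    """
    # (?<!\S) requires the name to start the text or follow whitespace.
    pattern = rf"(?<!\S){re.escape(name)}=\(([^)]*)\)"
    match = re.search(pattern, conf_text)
    if match is None:
        raise KeyError(name)
    return match.group(1).split()


def party_hosts(conf_text):
    """Pair each party ID with the host IP at the same position."""
    ids = parse_bash_array(conf_text, "party_list")
    ips = parse_bash_array(conf_text, "party_ip_list")
    if len(ids) != len(ips):
        raise ValueError("party_list and party_ip_list differ in length")
    return dict(zip(ids, ips))


if __name__ == "__main__":
    for party_id, ip in party_hosts(SAMPLE_CONF).items():
        print(f"party {party_id} -> {ip}")
```

With the values from this issue, party 10000 maps to 192.168.113.171 and party 9999 to 192.168.113.172, so those are the only host addresses the two parties are configured to use for each other.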
After the deployment finished, running the toy_example verification program failed with the following error:
root@ai171:~# docker exec -it confs-10000_client_1 bash
root@598d664db519:/data/projects/fate# flow test toy --guest-party-id 10000 --host-party-id 9999
{ "jobId": "202211151621252611600", "retcode": 103, "retmsg": "Traceback (most recent call last):\n File \"/data/projects/fate/fateflow/python/fate_flow/scheduler/dag_scheduler.py\", line 142, in submit\n raise Exception(\"create job failed\", response)\nException: ('create job failed', {'guest': {10000: {'data': {'components': {'secure_add_example_0': {'need_run': True}}}, 'retcode': 0, 'retmsg': 'success'}}, 'host': {9999: {'retcode': <RetCode.FEDERATED_ERROR: 104>, 'retmsg': 'Federated schedule error, <_InactiveRpcError of RPC that terminated with:\n\tstatus = StatusCode.UNAVAILABLE\n\tdetails = \"UNAVAILABLE: \n[Roll Site Error TransInfo] \n location msg=UNAVAILABLE: io exception \n stack info=io.grpc.StatusRuntimeException: UNAVAILABLE: io exception\n\tat io.grpc.stub.ClientCalls.toStatusRuntimeException(ClientCalls.java:262)\n\tat io.grpc.stub.ClientCalls.getUnchecked(ClientCalls.java:243)\n\tat io.grpc.stub.ClientCalls.blockingUnaryCall(ClientCalls.java:156)\n\tat com.webank.ai.eggroll.api.networking.proxy.DataTransferServiceGrpc$DataTransferServiceBlockingStub.unaryCall(DataTransferServiceGrpc.java:348)\n\tat com.webank.eggroll.rollsite.EggSiteServicer.unaryCall(EggSiteServicer.scala:138)\n\tat com.webank.ai.eggroll.api.networking.proxy.DataTransferServiceGrpc$MethodHandlers.invoke(DataTransferServiceGrpc.java:406)\n\tat io.grpc.stub.ServerCalls$UnaryServerCallHandler$UnaryServerCallListener.onHalfClose(ServerCalls.java:180)\n\tat io.grpc.PartialForwardingServerCallListener.onHalfClose(PartialForwardingServerCallListener.java:35)\n\tat io.grpc.ForwardingServerCallListener.onHalfClose(ForwardingServerCallListener.java:23)\n\tat io.grpc.ForwardingServerCallListener$SimpleForwardingServerCallListener.onHalfClose(ForwardingServerCallListener.java:40)\n\tat
io.grpc.Contexts$ContextualizedServerCallListener.onHalfClose(Contexts.java:86)\n\tat io.grpc.internal.ServerCallImpl$ServerStreamListenerImpl.halfClosed(ServerCallImpl.java:331)\n\tat io.grpc.internal.ServerImpl$JumpToApplicationThreadServerStreamListener$1HalfClosed.runInContext(ServerImpl.java:814)\n\tat io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)\n\tat io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:123)\n\tat java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)\n\tat java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)\n\tat java.lang.Thread.run(Thread.java:750)\nCaused by: io.grpc.netty.shaded.io.netty.channel.AbstractChannel$AnnotatedConnectException: finishConnect(..) failed: No route to host: fateflow/192.167.0.100:9360\nCaused by: java.net.ConnectException: finishConnect(..) failed: No route to host\n\tat io.grpc.netty.shaded.io.netty.channel.unix.Errors.throwConnectException(Errors.java:124)\n\tat io.grpc.netty.shaded.io.netty.channel.unix.Socket.finishConnect(Socket.java:243)\n\tat io.grpc.netty.shaded.io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.doFinishConnect(AbstractEpollChannel.java:672)\n\tat io.grpc.netty.shaded.io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.finishConnect(AbstractEpollChannel.java:649)\n\tat io.grpc.netty.shaded.io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.epollOutReady(AbstractEpollChannel.java:529)\n\tat io.grpc.netty.shaded.io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:465)\n\tat io.grpc.netty.shaded.io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:378)\n\tat io.grpc.netty.shaded.io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)\n\tat io.grpc.netty.shaded.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)\n\tat 
io.grpc.netty.shaded.io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)\n\tat java.lang.Thread.run(Thread.java:750)\n \n\nexception trans path: rollsite(9999) --> rollsite(10000)\"\n\tdebug_error_string = \"{\"created\":\"@1668529292.149849671\",\"description\":\"Error received from peer ipv4:192.167.0.7:9370\",\"file\":\"src/core/lib/surface/call.cc\",\"file_line\":952,\"grpc_message\":\"UNAVAILABLE: \\n[Roll Site Error TransInfo] \\n location msg=UNAVAILABLE: io exception \\n stack info=io.grpc.StatusRuntimeException: UNAVAILABLE: io exception\\n\\tat io.grpc.stub.ClientCalls.toStatusRuntimeException(ClientCalls.java:262)\\n\\tat io.grpc.stub.ClientCalls.getUnchecked(ClientCalls.java:243)\\n\\tat io.grpc.stub.ClientCalls.blockingUnaryCall(ClientCalls.java:156)\\n\\tat com.webank.ai.eggroll.api.networking.proxy.DataTransferServiceGrpc$DataTransferServiceBlockingStub.unaryCall(DataTransferServiceGrpc.java:348)\\n\\tat com.webank.eggroll.rollsite.EggSiteServicer.unaryCall(EggSiteServicer.scala:138)\\n\\tat com.webank.ai.eggroll.api.networking.proxy.DataTransferServiceGrpc$MethodHandlers.invoke(DataTransferServiceGrpc.java:406)\\n\\tat io.grpc.stub.ServerCalls$UnaryServerCallHandler$UnaryServerCallListener.onHalfClose(ServerCalls.java:180)\\n\\tat io.grpc.PartialForwardingServerCallListener.onHalfClose(PartialForwardingServerCallListener.java:35)\\n\\tat io.grpc.ForwardingServerCallListener.onHalfClose(ForwardingServerCallListener.java:23)\\n\\tat io.grpc.ForwardingServerCallListener$SimpleForwardingServerCallListener.onHalfClose(ForwardingServerCallListener.java:40)\\n\\tat io.grpc.Contexts$ContextualizedServerCallListener.onHalfClose(Contexts.java:86)\\n\\tat io.grpc.internal.ServerCallImpl$ServerStreamListenerImpl.halfClosed(ServerCallImpl.java:331)\\n\\tat io.grpc.internal.ServerImpl$JumpToApplicationThreadServerStreamListener$1HalfClosed.runInContext(ServerImpl.java:814)\\n\\tat 
io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)\\n\\tat io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:123)\\n\\tat java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)\\n\\tat java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)\\n\\tat java.lang.Thread.run(Thread.java:750)\\nCaused by: io.grpc.netty.shaded.io.netty.channel.AbstractChannel$AnnotatedConnectException: finishConnect(..) failed: No route to host: fateflow/192.167.0.100:9360\\nCaused by: java.net.ConnectException: finishConnect(..) failed: No route to host\\n\\tat io.grpc.netty.shaded.io.netty.channel.unix.Errors.throwConnectException(Errors.java:124)\\n\\tat io.grpc.netty.shaded.io.netty.channel.unix.Socket.finishConnect(Socket.java:243)\\n\\tat io.grpc.netty.shaded.io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.doFinishConnect(AbstractEpollChannel.java:672)\\n\\tat io.grpc.netty.shaded.io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.finishConnect(AbstractEpollChannel.java:649)\\n\\tat io.grpc.netty.shaded.io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.epollOutReady(AbstractEpollChannel.java:529)\\n\\tat io.grpc.netty.shaded.io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:465)\\n\\tat io.grpc.netty.shaded.io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:378)\\n\\tat io.grpc.netty.shaded.io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)\\n\\tat io.grpc.netty.shaded.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)\\n\\tat io.grpc.netty.shaded.io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)\\n\\tat java.lang.Thread.run(Thread.java:750)\\n \\n\\nexception trans path: rollsite(9999) --> rollsite(10000)\",\"grpc_status\":14}\"\n>'}}})\n" }
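The trace above is long, so a small sketch for pulling out the parts that matter may help when reading similar reports. It assumes the `retmsg` has already been JSON-decoded so that `\n` sequences are real newlines; `SAMPLE_RETMSG` is a shortened excerpt of the message above, and `summarize_retmsg` is a hypothetical helper, not a FATE API.

```python
import re

# Shortened excerpt of the retmsg above, with most stack frames omitted.
SAMPLE_RETMSG = (
    "stack info=io.grpc.StatusRuntimeException: UNAVAILABLE: io exception\n"
    "\tat io.grpc.stub.ClientCalls.toStatusRuntimeException(ClientCalls.java:262)\n"
    "Caused by: io.grpc.netty.shaded.io.netty.channel."
    "AbstractChannel$AnnotatedConnectException: finishConnect(..) failed: "
    "No route to host: fateflow/192.167.0.100:9360\n"
    "Caused by: java.net.ConnectException: finishConnect(..) failed: "
    "No route to host\n"
    "\tat io.grpc.netty.shaded.io.netty.channel.unix.Errors."
    "throwConnectException(Errors.java:124)\n"
    "exception trans path: rollsite(9999) --> rollsite(10000)\n"
)


def summarize_retmsg(retmsg):
    """Extract the 'Caused by' lines, the unreachable target, and the hop path."""
    causes = re.findall(r"Caused by: ([^\n]+)", retmsg)
    target = re.search(
        r"No route to host: (\S+?)/(\d+\.\d+\.\d+\.\d+):(\d+)", retmsg
    )
    path = re.search(r'exception trans path: ([^\n"]+)', retmsg)
    return {
        "causes": causes,
        # (hostname, ip, port) of the address that could not be reached
        "target": target.groups() if target else None,
        "path": path.group(1).strip() if path else None,
    }


if __name__ == "__main__":
    summary = summarize_retmsg(SAMPLE_RETMSG)
    print("unreachable:", summary["target"])
    print("path:", summary["path"])
```

For this report the extracted target is `fateflow` at 192.167.0.100, port 9360, on the path rollsite(9999) --> rollsite(10000). That address is not one of the 192.168.113.x hosts listed in `party_ip_list`, which may be worth checking when comparing a failing deployment against its configuration.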