Open desertfoxfj opened 1 year ago
confs-$party_id/confs/nginx/route_table.yaml
文件之前部署过1.3.1版本 仔细看了一下,9999服务器的federatedai/fateflow-nn:1.9.0-release一直启动失败,报错信息如下 ModuleNotFoundError: No module named 'federatedml.components'
https://github.com/FederatedAI/KubeFATE/tree/master/docker-deploy#deleting-the-cluster
根据这个文档删除掉之前部署的一些痕迹,然后重新部署一下fate集群再试一次。 也要检查volume和network是否已经清理干净。 以下两个命令可供参考:
docker volume rm $(docker volume ls -q | grep 9999)
docker network rm confs-9999_fate-network
What deployment mode you are use?
What KubeFATE and FATE version you are using? kubefate1.9.0
MUST Please state the KubeFATE and FATE version you found the issue kubefate1.9.0
What OS you are using for docker-compse or Kubernetes? Please also clear the version of OS. Ubuntu 20.04.4 LTS
Desktop (please complete the following information): Ubuntu 20.04.4 LTS
To Reproduce 双机部署toy_example验证报错
parties.conf配置信息如下:
!/bin/bash
user=root dir=/data/projects/fate party_list=(10000 9999) party_ip_list=(192.168.113.171 192.168.113.172) serving_ip_list=(192.168.113.171 192.168.113.172)
Engines:
Computing : Eggroll, Spark, Spark_local
computing=Eggroll
Federation: Eggroll(computing: Eggroll), Pulsar/RabbitMQ(computing: Spark/Spark_local)
federation=Eggroll
Storage: Eggroll(computing: Eggroll), HDFS(computing: Spark), LocalFS(computing: Spark_local)
storage=Eggroll
Algorithm: Basic, NN
algorithm=Basic
Device: IPCL, CPU
device=CPU
spark and eggroll
compute_core=8
default
exchangeip=
modify if you are going to use an external db
mysql_ip=mysql mysql_user=fate mysql_password=fate_dev mysql_db=fate_flow
name_node=hdfs://namenode:9000
Define fateboard login information
fateboard_username=admin fateboard_password=admin
Define serving admin login information
serving_admin_username=admin serving_admin_password=admin
What happen? 运行toy_example验证报错 root@ai171:~# docker exec -it confs-10000_client_1 bash root@598d664db519:/data/projects/fate# flow test toy --guest-party-id 10000 --host-party-id 9999 { "jobId": "202211151621252611600", "retcode": 103, "retmsg": "Traceback (most recent call last):\n File \"/data/projects/fate/fateflow/python/fate_flow/scheduler/dag_scheduler.py\", line 142, in submit\n raise Exception(\"create job failed\", response)\nException: ('create job failed', {'guest': {10000: {'data': {'components': {'secure_add_example_0': {'need_run': True}}}, 'retcode': 0, 'retmsg': 'success'}}, 'host': {9999: {'retcode': <RetCode.FEDERATED_ERROR: 104>, 'retmsg': 'Federated schedule error, <_InactiveRpcError of RPC that terminated with:\n\tstatus = StatusCode.UNAVAILABLE\n\tdetails = \"UNAVAILABLE: \n[Roll Site Error TransInfo] \n location msg=UNAVAILABLE: io exception \n stack info=io.grpc.StatusRuntimeException: UNAVAILABLE: io exception\n\tat io.grpc.stub.ClientCalls.toStatusRuntimeException(ClientCalls.java:262)\n\tat io.grpc.stub.ClientCalls.getUnchecked(ClientCalls.java:243)\n\tat io.grpc.stub.ClientCalls.blockingUnaryCall(ClientCalls.java:156)\n\tat com.webank.ai.eggroll.api.networking.proxy.DataTransferServiceGrpc$DataTransferServiceBlockingStub.unaryCall(DataTransferServiceGrpc.java:348)\n\tat com.webank.eggroll.rollsite.EggSiteServicer.unaryCall(EggSiteServicer.scala:138)\n\tat com.webank.ai.eggroll.api.networking.proxy.DataTransferServiceGrpc$MethodHandlers.invoke(DataTransferServiceGrpc.java:406)\n\tat io.grpc.stub.ServerCalls$UnaryServerCallHandler$UnaryServerCallListener.onHalfClose(ServerCalls.java:180)\n\tat io.grpc.PartialForwardingServerCallListener.onHalfClose(PartialForwardingServerCallListener.java:35)\n\tat io.grpc.ForwardingServerCallListener.onHalfClose(ForwardingServerCallListener.java:23)\n\tat io.grpc.ForwardingServerCallListener$SimpleForwardingServerCallListener.onHalfClose(ForwardingServerCallListener.java:40)\n\tat io.grpc.Contexts$ContextualizedServerCallListener.onHalfClose(Contexts.java:86)\n\tat io.grpc.internal.ServerCallImpl$ServerStreamListenerImpl.halfClosed(ServerCallImpl.java:331)\n\tat io.grpc.internal.ServerImpl$JumpToApplicationThreadServerStreamListener$1HalfClosed.runInContext(ServerImpl.java:814)\n\tat io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)\n\tat io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:123)\n\tat java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)\n\tat java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)\n\tat java.lang.Thread.run(Thread.java:750)\nCaused by: io.grpc.netty.shaded.io.netty.channel.AbstractChannel$AnnotatedConnectException: finishConnect(..) failed: No route to host: fateflow/192.167.0.100:9360\nCaused by: java.net.ConnectException: finishConnect(..) failed: No route to host\n\tat io.grpc.netty.shaded.io.netty.channel.unix.Errors.throwConnectException(Errors.java:124)\n\tat io.grpc.netty.shaded.io.netty.channel.unix.Socket.finishConnect(Socket.java:243)\n\tat io.grpc.netty.shaded.io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.doFinishConnect(AbstractEpollChannel.java:672)\n\tat io.grpc.netty.shaded.io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.finishConnect(AbstractEpollChannel.java:649)\n\tat io.grpc.netty.shaded.io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.epollOutReady(AbstractEpollChannel.java:529)\n\tat io.grpc.netty.shaded.io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:465)\n\tat io.grpc.netty.shaded.io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:378)\n\tat io.grpc.netty.shaded.io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)\n\tat io.grpc.netty.shaded.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)\n\tat io.grpc.netty.shaded.io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)\n\tat java.lang.Thread.run(Thread.java:750)\n \n\nexception trans path: rollsite(9999) --> rollsite(10000)\"\n\tdebug_error_string = \"{\"created\":\"@1668529292.149849671\",\"description\":\"Error received from peer ipv4:192.167.0.7:9370\",\"file\":\"src/core/lib/surface/call.cc\",\"file_line\":952,\"grpc_message\":\"UNAVAILABLE: \\n[Roll Site Error TransInfo] \\n location msg=UNAVAILABLE: io exception \\n stack info=io.grpc.StatusRuntimeException: UNAVAILABLE: io exception\\n\\tat io.grpc.stub.ClientCalls.toStatusRuntimeException(ClientCalls.java:262)\\n\\tat io.grpc.stub.ClientCalls.getUnchecked(ClientCalls.java:243)\\n\\tat io.grpc.stub.ClientCalls.blockingUnaryCall(ClientCalls.java:156)\\n\\tat com.webank.ai.eggroll.api.networking.proxy.DataTransferServiceGrpc$DataTransferServiceBlockingStub.unaryCall(DataTransferServiceGrpc.java:348)\\n\\tat com.webank.eggroll.rollsite.EggSiteServicer.unaryCall(EggSiteServicer.scala:138)\\n\\tat com.webank.ai.eggroll.api.networking.proxy.DataTransferServiceGrpc$MethodHandlers.invoke(DataTransferServiceGrpc.java:406)\\n\\tat io.grpc.stub.ServerCalls$UnaryServerCallHandler$UnaryServerCallListener.onHalfClose(ServerCalls.java:180)\\n\\tat io.grpc.PartialForwardingServerCallListener.onHalfClose(PartialForwardingServerCallListener.java:35)\\n\\tat io.grpc.ForwardingServerCallListener.onHalfClose(ForwardingServerCallListener.java:23)\\n\\tat io.grpc.ForwardingServerCallListener$SimpleForwardingServerCallListener.onHalfClose(ForwardingServerCallListener.java:40)\\n\\tat io.grpc.Contexts$ContextualizedServerCallListener.onHalfClose(Contexts.java:86)\\n\\tat io.grpc.internal.ServerCallImpl$ServerStreamListenerImpl.halfClosed(ServerCallImpl.java:331)\\n\\tat io.grpc.internal.ServerImpl$JumpToApplicationThreadServerStreamListener$1HalfClosed.runInContext(ServerImpl.java:814)\\n\\tat io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)\\n\\tat io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:123)\\n\\tat java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)\\n\\tat java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)\\n\\tat java.lang.Thread.run(Thread.java:750)\\nCaused by: io.grpc.netty.shaded.io.netty.channel.AbstractChannel$AnnotatedConnectException: finishConnect(..) failed: No route to host: fateflow/192.167.0.100:9360\\nCaused by: java.net.ConnectException: finishConnect(..) failed: No route to host\\n\\tat io.grpc.netty.shaded.io.netty.channel.unix.Errors.throwConnectException(Errors.java:124)\\n\\tat io.grpc.netty.shaded.io.netty.channel.unix.Socket.finishConnect(Socket.java:243)\\n\\tat io.grpc.netty.shaded.io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.doFinishConnect(AbstractEpollChannel.java:672)\\n\\tat io.grpc.netty.shaded.io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.finishConnect(AbstractEpollChannel.java:649)\\n\\tat io.grpc.netty.shaded.io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.epollOutReady(AbstractEpollChannel.java:529)\\n\\tat io.grpc.netty.shaded.io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:465)\\n\\tat io.grpc.netty.shaded.io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:378)\\n\\tat io.grpc.netty.shaded.io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)\\n\\tat io.grpc.netty.shaded.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)\\n\\tat io.grpc.netty.shaded.io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)\\n\\tat java.lang.Thread.run(Thread.java:750)\\n \\n\\nexception trans path: rollsite(9999) --> rollsite(10000)\",\"grpc_status\":14}\"\n>'}}})\n" }
Screenshots
If applicable, add screenshots to help explain your problem.
Additional context
Add any other context about the problem here.