FederatedAI / KubeFATE

Manage federated learning workload using cloud native technologies.
Apache License 2.0

KubeFATE 1.9.0 two-host deployment: toy_example verification fails #801

Open desertfoxfj opened 1 year ago

desertfoxfj commented 1 year ago

What deployment mode are you using?

  1. docker-compose

What KubeFATE and FATE versions are you using? KubeFATE 1.9.0

MUST: Please state the KubeFATE and FATE versions in which you found the issue. KubeFATE 1.9.0

What OS are you using for docker-compose or Kubernetes? Please also state the OS version. Ubuntu 20.04.4 LTS

Desktop (please complete the following information): Ubuntu 20.04.4 LTS

To Reproduce: Deploy on two hosts and run the toy_example verification; it fails with an error.

The parties.conf configuration is as follows:

#!/bin/bash

user=root
dir=/data/projects/fate
party_list=(10000 9999)
party_ip_list=(192.168.113.171 192.168.113.172)
serving_ip_list=(192.168.113.171 192.168.113.172)

# Engines:
# Computing : Eggroll, Spark, Spark_local
computing=Eggroll
# Federation: Eggroll(computing: Eggroll), Pulsar/RabbitMQ(computing: Spark/Spark_local)
federation=Eggroll
# Storage: Eggroll(computing: Eggroll), HDFS(computing: Spark), LocalFS(computing: Spark_local)
storage=Eggroll
# Algorithm: Basic, NN
algorithm=Basic
# Device: IPCL, CPU
device=CPU

# spark and eggroll
compute_core=8

# default
exchangeip=

# modify if you are going to use an external db
mysql_ip=mysql
mysql_user=fate
mysql_password=fate_dev
mysql_db=fate_flow

name_node=hdfs://namenode:9000

# Define fateboard login information
fateboard_username=admin
fateboard_password=admin

# Define serving admin login information
serving_admin_username=admin
serving_admin_password=admin
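Before deploying, it can help to confirm that the configuration pairs each party with the intended host. The sketch below assumes that `party_list` and `party_ip_list` are matched by position, and that the file is plain bash as shown above. `check_parties_conf` is a hypothetical helper, not part of KubeFATE.

```bash
#!/bin/bash
# Hypothetical helper (not part of KubeFATE): print the party -> host mapping
# declared in a parties.conf file, and fail if the two arrays differ in length.
# Assumption: party_list and party_ip_list are paired by index.
check_parties_conf() {
  local conf_file="$1"
  (
    # Source inside a subshell so the variables do not leak into the caller.
    # Note: sourcing executes the file, so only use this on a file you trust.
    source "$conf_file" || exit 1
    if [ "${#party_list[@]}" -ne "${#party_ip_list[@]}" ]; then
      echo "party_list has ${#party_list[@]} entries, party_ip_list has ${#party_ip_list[@]}" >&2
      exit 1
    fi
    for i in "${!party_list[@]}"; do
      echo "party ${party_list[$i]} -> ${party_ip_list[$i]}"
    done
  )
}
```

Running `check_parties_conf ./parties.conf` against the configuration above should list party 10000 on 192.168.113.171 and party 9999 on 192.168.113.172.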

What happened? Running the toy_example verification fails:

root@ai171:~# docker exec -it confs-10000_client_1 bash
root@598d664db519:/data/projects/fate# flow test toy --guest-party-id 10000 --host-party-id 9999
{ "jobId": "202211151621252611600", "retcode": 103, "retmsg": "Traceback (most recent call last):\n File \"/data/projects/fate/fateflow/python/fate_flow/scheduler/dag_scheduler.py\", line 142, in submit\n raise Exception(\"create job failed\", response)\nException: ('create job failed', {'guest': {10000: {'data': {'components': {'secure_add_example_0': {'need_run': True}}}, 'retcode': 0, 'retmsg': 'success'}}, 'host': {9999: {'retcode': <RetCode.FEDERATED_ERROR: 104>, 'retmsg': 'Federated schedule error, <_InactiveRpcError of RPC that terminated with:\n\tstatus = StatusCode.UNAVAILABLE\n\tdetails = \"UNAVAILABLE: \n[Roll Site Error TransInfo] \n location msg=UNAVAILABLE: io exception \n stack info=io.grpc.StatusRuntimeException: UNAVAILABLE: io exception\n\tat io.grpc.stub.ClientCalls.toStatusRuntimeException(ClientCalls.java:262)\n\tat io.grpc.stub.ClientCalls.getUnchecked(ClientCalls.java:243)\n\tat io.grpc.stub.ClientCalls.blockingUnaryCall(ClientCalls.java:156)\n\tat com.webank.ai.eggroll.api.networking.proxy.DataTransferServiceGrpc$DataTransferServiceBlockingStub.unaryCall(DataTransferServiceGrpc.java:348)\n\tat com.webank.eggroll.rollsite.EggSiteServicer.unaryCall(EggSiteServicer.scala:138)\n\tat com.webank.ai.eggroll.api.networking.proxy.DataTransferServiceGrpc$MethodHandlers.invoke(DataTransferServiceGrpc.java:406)\n\tat io.grpc.stub.ServerCalls$UnaryServerCallHandler$UnaryServerCallListener.onHalfClose(ServerCalls.java:180)\n\tat io.grpc.PartialForwardingServerCallListener.onHalfClose(PartialForwardingServerCallListener.java:35)\n\tat io.grpc.ForwardingServerCallListener.onHalfClose(ForwardingServerCallListener.java:23)\n\tat io.grpc.ForwardingServerCallListener$SimpleForwardingServerCallListener.onHalfClose(ForwardingServerCallListener.java:40)\n\tat 
io.grpc.Contexts$ContextualizedServerCallListener.onHalfClose(Contexts.java:86)\n\tat io.grpc.internal.ServerCallImpl$ServerStreamListenerImpl.halfClosed(ServerCallImpl.java:331)\n\tat io.grpc.internal.ServerImpl$JumpToApplicationThreadServerStreamListener$1HalfClosed.runInContext(ServerImpl.java:814)\n\tat io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)\n\tat io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:123)\n\tat java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)\n\tat java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)\n\tat java.lang.Thread.run(Thread.java:750)\nCaused by: io.grpc.netty.shaded.io.netty.channel.AbstractChannel$AnnotatedConnectException: finishConnect(..) failed: No route to host: fateflow/192.167.0.100:9360\nCaused by: java.net.ConnectException: finishConnect(..) failed: No route to host\n\tat io.grpc.netty.shaded.io.netty.channel.unix.Errors.throwConnectException(Errors.java:124)\n\tat io.grpc.netty.shaded.io.netty.channel.unix.Socket.finishConnect(Socket.java:243)\n\tat io.grpc.netty.shaded.io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.doFinishConnect(AbstractEpollChannel.java:672)\n\tat io.grpc.netty.shaded.io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.finishConnect(AbstractEpollChannel.java:649)\n\tat io.grpc.netty.shaded.io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.epollOutReady(AbstractEpollChannel.java:529)\n\tat io.grpc.netty.shaded.io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:465)\n\tat io.grpc.netty.shaded.io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:378)\n\tat io.grpc.netty.shaded.io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)\n\tat io.grpc.netty.shaded.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)\n\tat 
io.grpc.netty.shaded.io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)\n\tat java.lang.Thread.run(Thread.java:750)\n \n\nexception trans path: rollsite(9999) --> rollsite(10000)\"\n\tdebug_error_string = \"{\"created\":\"@1668529292.149849671\",\"description\":\"Error received from peer ipv4:192.167.0.7:9370\",\"file\":\"src/core/lib/surface/call.cc\",\"file_line\":952,\"grpc_message\":\"UNAVAILABLE: \\n[Roll Site Error TransInfo] \\n location msg=UNAVAILABLE: io exception \\n stack info=io.grpc.StatusRuntimeException: UNAVAILABLE: io exception\\n\\tat io.grpc.stub.ClientCalls.toStatusRuntimeException(ClientCalls.java:262)\\n\\tat io.grpc.stub.ClientCalls.getUnchecked(ClientCalls.java:243)\\n\\tat io.grpc.stub.ClientCalls.blockingUnaryCall(ClientCalls.java:156)\\n\\tat com.webank.ai.eggroll.api.networking.proxy.DataTransferServiceGrpc$DataTransferServiceBlockingStub.unaryCall(DataTransferServiceGrpc.java:348)\\n\\tat com.webank.eggroll.rollsite.EggSiteServicer.unaryCall(EggSiteServicer.scala:138)\\n\\tat com.webank.ai.eggroll.api.networking.proxy.DataTransferServiceGrpc$MethodHandlers.invoke(DataTransferServiceGrpc.java:406)\\n\\tat io.grpc.stub.ServerCalls$UnaryServerCallHandler$UnaryServerCallListener.onHalfClose(ServerCalls.java:180)\\n\\tat io.grpc.PartialForwardingServerCallListener.onHalfClose(PartialForwardingServerCallListener.java:35)\\n\\tat io.grpc.ForwardingServerCallListener.onHalfClose(ForwardingServerCallListener.java:23)\\n\\tat io.grpc.ForwardingServerCallListener$SimpleForwardingServerCallListener.onHalfClose(ForwardingServerCallListener.java:40)\\n\\tat io.grpc.Contexts$ContextualizedServerCallListener.onHalfClose(Contexts.java:86)\\n\\tat io.grpc.internal.ServerCallImpl$ServerStreamListenerImpl.halfClosed(ServerCallImpl.java:331)\\n\\tat io.grpc.internal.ServerImpl$JumpToApplicationThreadServerStreamListener$1HalfClosed.runInContext(ServerImpl.java:814)\\n\\tat 
io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)\\n\\tat io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:123)\\n\\tat java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)\\n\\tat java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)\\n\\tat java.lang.Thread.run(Thread.java:750)\\nCaused by: io.grpc.netty.shaded.io.netty.channel.AbstractChannel$AnnotatedConnectException: finishConnect(..) failed: No route to host: fateflow/192.167.0.100:9360\\nCaused by: java.net.ConnectException: finishConnect(..) failed: No route to host\\n\\tat io.grpc.netty.shaded.io.netty.channel.unix.Errors.throwConnectException(Errors.java:124)\\n\\tat io.grpc.netty.shaded.io.netty.channel.unix.Socket.finishConnect(Socket.java:243)\\n\\tat io.grpc.netty.shaded.io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.doFinishConnect(AbstractEpollChannel.java:672)\\n\\tat io.grpc.netty.shaded.io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.finishConnect(AbstractEpollChannel.java:649)\\n\\tat io.grpc.netty.shaded.io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.epollOutReady(AbstractEpollChannel.java:529)\\n\\tat io.grpc.netty.shaded.io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:465)\\n\\tat io.grpc.netty.shaded.io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:378)\\n\\tat io.grpc.netty.shaded.io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)\\n\\tat io.grpc.netty.shaded.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)\\n\\tat io.grpc.netty.shaded.io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)\\n\\tat java.lang.Thread.run(Thread.java:750)\\n \\n\\nexception trans path: rollsite(9999) --> rollsite(10000)\",\"grpc_status\":14}\"\n>'}}})\n" }
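The useful part of this long response is the endpoint that could not be reached: `fateflow/192.167.0.100:9360`, reported on the path `rollsite(9999) --> rollsite(10000)`. That address is not one of the host IPs in parties.conf, so it is presumably a container-network address for the `fateflow` service. As a minimal sketch, the hypothetical filter below pulls such endpoints out of the error text so they are easier to spot. It assumes the message format shown above.

```bash
#!/bin/bash
# Hypothetical helper: read error text on stdin and print each distinct
# endpoint that was reported as "No route to host: <endpoint>".
# The match stops at a backslash, double quote, or space, because the
# response embeds the Java stack trace as an escaped JSON string.
unreachable_endpoints() {
  grep -o 'No route to host: [^\\" ]*' \
    | sed 's/^No route to host: //' \
    | sort -u
}
```

For example, from inside the client container: `flow test toy --guest-party-id 10000 --host-party-id 9999 2>&1 | unreachable_endpoints`.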


owlet42 commented 1 year ago

  1. Have you deployed any other version before?
  2. Please paste the eggroll confs-$party_id/confs/nginx/route_table.yaml file.

desertfoxfj commented 1 year ago

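I previously deployed version 1.3.1. Looking more closely, the federatedai/fateflow-nn:1.9.0-release container on the 9999 server keeps failing to start, with the following error: ModuleNotFoundError: No module named 'federatedml.components'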
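A container that keeps failing to start is easy to miss among the others. As a rough sketch, the hypothetical filter below lists containers whose status does not begin with `Up`. It expects `NAME|STATUS` lines on stdin. The `|` separator is a choice made for this example.

```bash
#!/bin/bash
# Hypothetical helper: read "NAME|STATUS" lines on stdin and print the names
# of containers whose status does not start with "Up" (for example
# "Exited (1) ..." or "Restarting (1) ...").
not_running() {
  awk -F '|' '$2 !~ /^Up/ { print $1 }'
}
```

On each host this could be fed with `docker ps -a --format '{{.Names}}|{{.Status}}' | not_running`, followed by `docker logs <name>` on any container it prints to see the startup error.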

JingChen23 commented 1 year ago

https://github.com/FederatedAI/KubeFATE/tree/master/docker-deploy#deleting-the-cluster

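Follow this document to remove the traces left by the previous deployment, then redeploy the FATE cluster and try again. Also check that the volumes and networks have been cleaned up. The following two commands may be useful for reference: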

docker volume rm $(docker volume ls -q | grep 9999)
docker network rm confs-9999_fate-network
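Note that `grep 9999` matches any volume whose name contains `9999` anywhere. A slightly stricter variant is sketched below. It assumes the resources carry a `confs-<party_id>_` prefix, as the network name `confs-9999_fate-network` suggests. `party_resources` is a hypothetical helper.

```bash
#!/bin/bash
# Hypothetical helper: read resource names on stdin and keep only those that
# start with "confs-<party_id>_".
# Assumption: volumes and networks are named with that prefix.
party_resources() {
  local party_id="$1"
  # "|| true" keeps the pipeline from failing when nothing matches.
  grep -E "^confs-${party_id}_" || true
}
```

Possible usage: run `docker volume ls -q | party_resources 9999` first to review the list, then pipe the same output to `xargs -r docker volume rm` once it looks right (`-r` is the GNU xargs flag that skips the command when the input is empty).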