FederatedAI / FATE

An Industrial Grade Federated Learning Framework
Apache License 2.0

Error when running hetero_logistic_regression example in cluster #680

Closed ijnmklpo closed 4 months ago

ijnmklpo commented 5 years ago

Hello, I deployed a FATE cluster on two machines, and FATEBoard shows that the two machines can connect to each other. I then tried to run the hetero_logistic_regression example. Uploading breast_a.csv and breast_b.csv succeeded (though I could not find a way to view the result and only got a success message). But when I submitted the training job, the following error occurred:

[app@vm_0_1_centos federatedml-1.0-examples]$ python /data/projects/FATE/fate_flow/fate_flow_client.py -f submit_job -d hetero_logistic_regression/test_hetero_lr_train_job_dsl.json -c hetero_logistic_regression/test_hetero_lr_train_job_conf.json
{
    "data": null,
    "jobId": null,
    "meta": null,
    "retcode": 100,
    "retmsg": "rpc request error: <_Rendezvous of RPC that terminated with:\n\tstatus = StatusCode.INTERNAL\n\tdetails = \"xxx.xxx.xxx.xxx:9370: java.util.concurrent.TimeoutException: [UNARYCALL][SERVER] unary call server error: overall process time exceeds timeout: 60000, metadata: {\"task\":{\"taskId\":\"201910160517383614267\"},\"src\":{\"name\":\"201910160517383614267\",\"partyId\":\"10000\",\"role\":\"fateflow\",\"callback\":{\"ip\":\"0.0.0.0\",\"port\":9360}},\"dst\":{\"name\":\"201910160517383614267\",\"partyId\":\"9999\",\"role\":\"fateflow\"},\"command\":{\"name\":\"fateflow\"},\"operator\":\"POST\",\"conf\":{\"overallTimeout\":\"60000\"}}, lastPacketTimestamp: 1571217458461, loopEndTimestamp: 1571217518745\n\tat com.webank.ai.fate.networking.proxy.grpc.service.DataTransferPipedServerImpl.unaryCall(DataTransferPipedServerImpl.java:245)\n\tat com.webank.ai.fate.api.networking.proxy.DataTransferServiceGrpc$MethodHandlers.invoke(DataTransferServiceGrpc.java:346)\n\tat io.grpc.stub.ServerCalls$UnaryServerCallHandler$UnaryServerCallListener.onHalfClose(ServerCalls.java:171)\n\tat io.grpc.internal.ServerCallImpl$ServerStreamListenerImpl.halfClosed(ServerCallImpl.java:283)\n\tat io.grpc.internal.ServerImpl$JumpToApplicationThreadServerStreamListener$1HalfClosed.runInContext(ServerImpl.java:710)\n\tat io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)\n\tat io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:123)\n\tat java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)\n\tat java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)\n\tat java.lang.Thread.run(Thread.java:748)\n\"\n\tdebug_error_string = \"{\"created\":\"@1571217518.751205745\",\"description\":\"Error received from peer\",\"file\":\"src/core/lib/surface/call.cc\",\"file_line\":1017,\"grpc_message\":\"xxx.xxx.xxx.xxx:9370: java.util.concurrent.TimeoutException: [UNARYCALL][SERVER] unary call server error: overall process time 
exceeds timeout: 60000, metadata: {\"task\":{\"taskId\":\"201910160517383614267\"},\"src\":{\"name\":\"201910160517383614267\",\"partyId\":\"10000\",\"role\":\"fateflow\",\"callback\":{\"ip\":\"0.0.0.0\",\"port\":9360}},\"dst\":{\"name\":\"201910160517383614267\",\"partyId\":\"9999\",\"role\":\"fateflow\"},\"command\":{\"name\":\"fateflow\"},\"operator\":\"POST\",\"conf\":{\"overallTimeout\":\"60000\"}}, lastPacketTimestamp: 1571217458461, loopEndTimestamp: 1571217518745\\n\\tat com.webank.ai.fate.networking.proxy.grpc.service.DataTransferPipedServerImpl.unaryCall(DataTransferPipedServerImpl.java:245)\\n\\tat com.webank.ai.fate.api.networking.proxy.DataTransferServiceGrpc$MethodHandlers.invoke(DataTransferServiceGrpc.java:346)\\n\\tat io.grpc.stub.ServerCalls$UnaryServerCallHandler$UnaryServerCallListener.onHalfClose(ServerCalls.java:171)\\n\\tat io.grpc.internal.ServerCallImpl$ServerStreamListenerImpl.halfClosed(ServerCallImpl.java:283)\\n\\tat io.grpc.internal.ServerImpl$JumpToApplicationThreadServerStreamListener$1HalfClosed.runInContext(ServerImpl.java:710)\\n\\tat io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)\\n\\tat io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:123)\\n\\tat java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)\\n\\tat java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)\\n\\tat java.lang.Thread.run(Thread.java:748)\\n\",\"grpc_status\":13}\"\n>"
}
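
The retmsg above carries both the configured timeout and two timestamps, so the elapsed time can be checked directly. Below is a minimal sketch, assuming the field names appear exactly as in this log; the helper name `timeout_summary` is hypothetical and not part of FATE.

```python
import re

def timeout_summary(retmsg: str) -> dict:
    """Pull the timeout and the two timestamps (all in ms) out of a retmsg string.

    Assumes the message contains 'exceeds timeout: N', 'lastPacketTimestamp: N'
    and 'loopEndTimestamp: N', as in the log pasted above.
    """
    def first_int(pattern: str) -> int:
        # Take the first match only; the log repeats the same text twice.
        return int(re.search(pattern, retmsg).group(1))

    timeout = first_int(r"exceeds timeout: (\d+)")
    last_packet = first_int(r"lastPacketTimestamp: (\d+)")
    loop_end = first_int(r"loopEndTimestamp: (\d+)")
    return {
        "timeout_ms": timeout,
        "elapsed_ms": loop_end - last_packet,
        "exceeded": (loop_end - last_packet) > timeout,
    }

# Fragment copied from the error above.
sample = (
    "overall process time exceeds timeout: 60000, metadata: {...}, "
    "lastPacketTimestamp: 1571217458461, loopEndTimestamp: 1571217518745"
)
summary = timeout_summary(sample)
print(summary)
```

For this log the elapsed time is 60284 ms against a 60000 ms limit, which suggests the proxy on port 9370 waited the full timeout for party 9999 without getting a reply, rather than failing partway through.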
4Details commented 5 years ago

The following steps may help. Please run 'netstat -nltp' to check whether TCP ports 50001 and 50002 are open. If they are not, try the following:
'''
cd data/projects/fate/egg
sh service.sh stop
cd ../fate/python
sh service.sh start
sh service.sh stop
cd ../fate/egg
sh service.sh start
'''
and then submit your job again.
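
As a complement to 'netstat -nltp', the same port check can be scripted. This is a minimal sketch using only the Python standard library; the host `127.0.0.1` and the helper names `port_is_open` and `check_ports` are assumptions for illustration. Note that it tests whether a TCP connection can be made, which is slightly different from listing listening sockets as netstat does.

```python
import socket

def port_is_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to (host, port) succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unreachable hosts.
        return False

def check_ports(host: str, ports) -> dict:
    """Map each port to whether it accepted a connection."""
    return {port: port_is_open(host, port) for port in ports}

if __name__ == "__main__":
    # Ports taken from the comment above; the host is an assumption.
    for port, ok in check_ports("127.0.0.1", [50001, 50002]).items():
        print(f"port {port}: {'open' if ok else 'closed or unreachable'}")
```

Running it from the other machine, with that machine's view of the host address, also shows whether the ports are reachable across the network and not only locally.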

ijnmklpo commented 5 years ago

Thank you for your help, but the ports appear to be open: [image]

I also tried your commands, but they did not help.

ijnmklpo commented 5 years ago

I have another question: when I run the homo LR example, it fails immediately and returns a message like this: [image]

I cannot find anything about an execute server in the deployment guide. Could any expert help? Thanks so much. T_T