FederatedAI / FATE

An Industrial Grade Federated Learning Framework

Submitting a job returns an error #942

Closed: cometta closed this issue 2 months ago

cometta commented 4 years ago

We are running FATE 1.2. After successfully running one round of a job across distributed machines, we re-submitted a similar job (from the given examples). The command just hangs, and after a while we get this response:

{ "retcode": 100, "retmsg": "rpc request error: <_Rendezvous of RPC that terminated with:\n\tstatus = StatusCode.INTERNAL\n\tdetails = \"0.0.0.0:9370: java.util.concurrent.TimeoutException: [UNARYCALL][SERVER] unary call server error: overall process time exceeds timeout: 60000, metadata: {\"task\":{\"taskId\":\"202001170655528598743\"},\"src\":{\"name\":\"202001170655528598743\",\"partyId\":\"10000\",\"role\":\"fateflow\",\"callback\":{\"ip\":\"0.0.0.0\",\"port\":9360}},\"dst\":{\"name\":\"202001170655528598743\",\"partyId\":\"9999\",\"role\":\"fateflow\"},\"command\":{\"name\":\"fateflow\"},\"operator\":\"POST\",\"conf\":{\"overallTimeout\":\"60000\"}}, lastPacketTimestamp: 1579244152874, loopEndTimestamp: 1579244212882\n\tat com.webank.ai.fate.networking.proxy.grpc.service.DataTransferPipedServerImpl.unaryCall(DataTransferPipedServerImpl.java:245)\n\tat com.webank.ai.fate.api.networking.proxy.DataTransferServiceGrpc$MethodHandlers.invoke(DataTransferServiceGrpc.java:346)\n\tat io.grpc.stub.ServerCalls$UnaryServerCallHandler$UnaryServerCallListener.onHalfClose(ServerCalls.java:171)\n\tat io.grpc.internal.ServerCallImpl$ServerStreamListenerImpl.halfClosed(ServerCallImpl.java:283)\n\tat io.grpc.internal.ServerImpl$JumpToApplicationThreadServerStreamListener$1HalfClosed.runInContext(ServerImpl.java:710)\n\tat io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)\n\tat io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:123)\n\tat java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)\n\tat java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)\n\tat java.lang.Thread.run(Thread.java:748)\n\"\n\tdebug_error_string = \"{\"created\":\"@1579244212.885988745\",\"description\":\"Error received from peer\",\"file\":\"src/core/lib/surface/call.cc\",\"file_line\":1017,\"grpc_message\":\"0.0.0.0:9370: java.util.concurrent.TimeoutException: [UNARYCALL][SERVER] unary call server error: overall process time exceeds timeout: 60000, metadata: {\"task\":{\"taskId\":\"202001170655528598743\"},\"src\":{\"name\":\"202001170655528598743\",\"partyId\":\"10000\",\"role\":\"fateflow\",\"callback\":{\"ip\":\"0.0.0.0\",\"port\":9360}},\"dst\":{\"name\":\"202001170655528598743\",\"partyId\":\"9999\",\"role\":\"fateflow\"},\"command\":{\"name\":\"fateflow\"},\"operator\":\"POST\",\"conf\":{\"overallTimeout\":\"60000\"}}, lastPacketTimestamp: 1579244152874, loopEndTimestamp: 1579244212882\n\tat com.webank.ai.fate.networking.proxy.grpc.service.DataTransferPipedServerImpl.unaryCall(DataTransferPipedServerImpl.java:245)\n\tat com.webank.ai.fate.api.networking.proxy.DataTransferServiceGrpc$MethodHandlers.invoke(DataTransferServiceGrpc.java:346)\n\tat io.grpc.stub.ServerCalls$UnaryServerCallHandler$UnaryServerCallListener.onHalfClose(ServerCalls.java:171)\n\tat io.grpc.internal.ServerCallImpl$ServerStreamListenerImpl.halfClosed(ServerCallImpl.java:283)\n\tat io.grpc.internal.ServerImpl$JumpToApplicationThreadServerStreamListener$1HalfClosed.runInContext(ServerImpl.java:710)\n\tat io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)\n\tat io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:123)\n\tat java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)\n\tat java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)\n\tat java.lang.Thread.run(Thread.java:748)\n\",\"grpc_status\":13}\"\n>" }

On the FATE Board, the progress stays at zero. We have to press 'Kill' to terminate the job, then restart the 'proxy' service on each machine and re-run the submit command; the job then executes successfully. Is there an issue with the 'proxy' service? This happens quite often, there is no error in the proxy's log, and the problem is only resolved by a restart. Any idea why this is happening? We are using docker-compose.
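For what it's worth, the workaround can be scripted. A rough sketch, assuming the compose service is named "proxy" and that the compose file lives in the remote user's working directory (both assumptions; adjust to your setup):

import subprocess

# Restart the proxy service on each party's machine over SSH.
# "proxy" is an assumed docker-compose service name; the hosts are placeholders.
def restart_proxy(host: str) -> None:
    subprocess.run(["ssh", host, "docker-compose", "restart", "proxy"], check=True)

for host in ("party-10000.example.com", "party-9999.example.com"):
    restart_proxy(host)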

jarviszeng-zjc commented 4 years ago

Can you provide the proxy logs? The path is logs/fate-proxy.log. Please also check whether any other large data jobs are running. Thanks.
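While gathering them, a quick scan for the failure signatures quoted in this thread can help; a small sketch, assuming the log path above:

# Pull lines matching the timeout signatures from the proxy log.
# Path is from the comment above; the patterns are taken from this thread.
patterns = ("TimeoutException", "Transport failed", "Operation timed out")

with open("logs/fate-proxy.log", encoding="utf-8", errors="replace") as f:
    for line in f:
        if any(p in line for p in patterns):
            print(line.rstrip())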

cometta commented 4 years ago

@zengjice @dylan-fan After further testing, here is the outcome I can share. If all my workers (arbiter, guest, host, ...) run on different machines on the LAN, there is no issue. When the workers run on Azure with external IP addresses, the RPC throws errors from time to time (after a few jobs have been submitted successfully). I get the error below, and new jobs can no longer be submitted because of the RPC error.

Feb 11, 2020 12:34:51 PM io.grpc.netty.shaded.io.grpc.netty.NettyServerTransport notifyTerminated
INFO: Transport failed
java.io.IOException: Operation timed out
    at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
    at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
    at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
    at sun.nio.ch.IOUtil.read(IOUtil.java:192)
    at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)
    at io.grpc.netty.shaded.io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:288)
    at io.grpc.netty.shaded.io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:1128)
    at io.grpc.netty.shaded.io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:347)
    at io.grpc.netty.shaded.io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:148)
    at io.grpc.netty.shaded.io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:644)
    at io.grpc.netty.shaded.io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:579)
    at io.grpc.netty.shaded.io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:496)
    at io.grpc.netty.shaded.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:458)
    at io.grpc.netty.shaded.io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:897)
    at io.grpc.netty.shaded.io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
    at java.lang.Thread.run(Thread.java:748)

When the log shows the error above, new jobs can no longer be submitted.
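One plausible explanation for the LAN-vs-Azure difference (a hypothesis, not confirmed here) is that cloud load balancers and NAT gateways silently drop idle TCP connections, so a long-lived gRPC connection between proxies dies and the next socket read fails with "Operation timed out". gRPC can mitigate this with HTTP/2 keepalive pings; an illustration of the knobs involved, using Python's grpc package (the FATE proxy itself is Java, where the equivalents live on NettyChannelBuilder/NettyServerBuilder; the target address is a placeholder):

import grpc

# Illustration only: channel-level keepalive pings keep an idle connection
# warm across NATs/load balancers, or surface a dead one quickly.
channel = grpc.insecure_channel(
    "proxy.example.com:9370",  # placeholder target
    options=[
        ("grpc.keepalive_time_ms", 30000),          # ping every 30 s when idle
        ("grpc.keepalive_timeout_ms", 10000),       # fail if no ack within 10 s
        ("grpc.keepalive_permit_without_calls", 1), # ping even with no active RPC
    ],
)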

cometta commented 4 years ago

When I submit a job, I get the error below. This only happens from time to time.

[ERROR] 2020-02-12T08:21:22,231 [grpcServiceExecutor-187] [DataTransferPipedServerImpl:243] - [UNARYCALL][SERVER] unary call server error: overall process time exceeds timeout: 60000, metadata: {"task":{"taskId":"2020021208202220896228"},"src":{"name":"2020021208202220896228","partyId":"10000","role":"fateflow","callback":{"ip":"0.0.0.0","port":9360}},"dst":{"name":"2020021208202220896228","partyId":"9999","role":"fateflow"},"command":{"name":"fateflow"},"operator":"POST","conf":{"overallTimeout":"60000"}}, lastPacketTimestamp: 1581495622222, loopEndTimestamp: 1581495682230
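The two timestamps in this line line up with the configured limit: loopEndTimestamp minus lastPacketTimestamp is 1581495682230 - 1581495622222 = 60008 ms, just past the 60000 ms overallTimeout in the metadata, so the proxy gave up exactly when it said it would. As a sanity check:

# Values copied from the log line above; the elapsed time slightly exceeds
# the 60000 ms overallTimeout, which is why the call was failed.
last_packet, loop_end, overall_timeout = 1581495622222, 1581495682230, 60000
assert loop_end - last_packet >= overall_timeout  # 60008 ms elapsed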

cometta commented 4 years ago

I created a PR https://github.com/FederatedAI/FATE/pull/993