Can you provide the proxy logs? The path is logs/fate-proxy.log. Also, please check whether any other large data jobs are running. Thanks.
@zengjice @dylan-fan after further testing, here is the outcome I can share. If all my workers (arbiter, guest, host, ...) are running on different machines on the LAN, there is no issue. When the workers are running on Azure with external IP addresses, the RPC throws errors from time to time (after a few jobs have been submitted successfully). I get the error below, and new jobs can no longer be submitted because of the RPC error.
Feb 11, 2020 12:34:51 PM io.grpc.netty.shaded.io.grpc.netty.NettyServerTransport notifyTerminated
INFO: Transport failed
java.io.IOException: Operation timed out
at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
at sun.nio.ch.IOUtil.read(IOUtil.java:192)
at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)
at io.grpc.netty.shaded.io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:288)
at io.grpc.netty.shaded.io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:1128)
at io.grpc.netty.shaded.io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:347)
at io.grpc.netty.shaded.io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:148)
at io.grpc.netty.shaded.io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:644)
at io.grpc.netty.shaded.io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:579)
at io.grpc.netty.shaded.io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:496)
at io.grpc.netty.shaded.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:458)
at io.grpc.netty.shaded.io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:897)
at io.grpc.netty.shaded.io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
at java.lang.Thread.run(Thread.java:748)
When the log shows the above error, no new job can be submitted.
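My guess (not verified against the FATE code) is that Azure's load balancer drops idle TCP connections after a few minutes, so the long-idle gRPC transport between the proxies dies silently and the next job trips over the dead connection. In plain grpc-java, which the shaded Netty classes in the trace suggest the proxy uses, the usual mitigation is HTTP/2 keepalives. A minimal sketch, with placeholder hosts/ports and without assuming anything about FATE's actual builder code:

```java
import java.util.concurrent.TimeUnit;

import io.grpc.ManagedChannel;
import io.grpc.Server;
import io.grpc.netty.shaded.io.grpc.netty.NettyChannelBuilder;
import io.grpc.netty.shaded.io.grpc.netty.NettyServerBuilder;

// Illustrative only: grpc-java keepalive options that stop NAT/LB devices
// (e.g. Azure's idle-connection timeout) from silently dropping the TCP
// connection between jobs. Host and port values are placeholders.
public class KeepaliveSketch {
    public static void main(String[] args) {
        // Client side: send an HTTP/2 PING every 60s, even with no RPC in flight,
        // and drop the connection if the ack does not come back within 20s.
        ManagedChannel channel = NettyChannelBuilder
                .forAddress("peer-proxy.example.com", 9370) // placeholder address
                .usePlaintext()
                .keepAliveTime(60, TimeUnit.SECONDS)
                .keepAliveTimeout(20, TimeUnit.SECONDS)
                .keepAliveWithoutCalls(true)
                .build();

        // Server side: permit pings that frequent, otherwise the server answers
        // with GOAWAY (too_many_pings) and closes the connection itself.
        Server server = NettyServerBuilder
                .forPort(9370) // placeholder port; never started in this sketch
                .permitKeepAliveTime(30, TimeUnit.SECONDS)
                .permitKeepAliveWithoutCalls(true)
                .build();

        System.out.println("built " + channel + " and " + server);
        channel.shutdownNow();
    }
}
```

With keepalives like these on both ends, a dropped connection is detected and re-established promptly instead of surfacing later as "Operation timed out" on the next job.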
When I submit a job, I get the error below. This only happens from time to time.
[ERROR] 2020-02-12T08:21:22,231 [grpcServiceExecutor-187] [DataTransferPipedServerImpl:243] - [UNARYCALL][SERVER] unary call server error: overall process time exceeds timeout: 60000, metadata: {"task":{"taskId":"2020021208202220896228"},"src":{"name":"2020021208202220896228","partyId":"10000","role":"fateflow","callback":{"ip":"0.0.0.0","port":9360}},"dst":{"name":"2020021208202220896228","partyId":"9999","role":"fateflow"},"command":{"name":"fateflow"},"operator":"POST","conf":{"overallTimeout":"60000"}}, lastPacketTimestamp: 1581495622222, loopEndTimestamp: 1581495682230
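For what it's worth, the timestamps in that line match the 60 s overallTimeout exactly: loopEndTimestamp − lastPacketTimestamp = 1581495682230 − 1581495622222 ≈ 60008 ms. In other words, the proxy waited the full minute for the next packet from the remote party and then gave up, which is consistent with the connection having been silently dropped rather than the remote side merely being slow.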
I created a PR https://github.com/FederatedAI/FATE/pull/993
We are running FATE 1.2. After running one round of jobs successfully on distributed machines, we re-submit a similar job (from the given examples). The command just hangs and, after a while, we get a response.
On the Fateboard, the progress stays at zero. We have to press 'kill' to terminate the job, restart the 'proxy' service on each of the machines, and re-run the submit command; after that the job executes successfully. Is there an issue with the 'proxy' service? This happens quite often, there is no error in the proxy's log, and the issue is only overcome by a restart. Any idea why this is happening? We are using docker-compose.
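One thing that might help narrow this down (a hedged sketch, not part of FATE): when the submit command hangs, a plain TCP probe against the proxy port tells you whether the listener is still accepting connections at all, or whether only the established gRPC transport has gone bad. The host and port below are placeholders (9370 is a commonly used FATE proxy port; use whatever your route table says).

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

// Quick reachability probe for a stuck proxy: if the TCP connect fails or
// times out, the proxy's listener (or the network path) is gone and a
// container restart is the only option; if it connects, the hang is higher
// up, e.g. a dead gRPC transport that was never re-established.
public class ProxyProbe {
    public static void main(String[] args) {
        String host = args.length > 0 ? args[0] : "127.0.0.1";          // placeholder host
        int port = args.length > 1 ? Integer.parseInt(args[1]) : 9370;  // placeholder port
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), 5_000);   // 5 s connect timeout
            System.out.println("TCP connect to " + host + ":" + port + " succeeded");
        } catch (IOException e) {
            System.out.println("TCP connect to " + host + ":" + port + " failed: " + e);
        }
    }
}
```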