Angel-ML / PyTorch-On-Angel

PyTorch On Angel arms PyTorch with a powerful Parameter Server, which enables PyTorch to train very big models.

send stop command to Master failed #6

Closed beyondliyang closed 3 years ago

beyondliyang commented 4 years ago
  1. The Spark on Angel LR example runs successfully.
  2. The PyTorch on Angel DeepFM example does not run successfully.

At runtime, one Spark process and two Angel PS processes are started. One of the Angel PS processes exits before the Spark process does, and the run log then reports "send stop command to Master failed".

Please reply as soon as possible!

2019-09-05 15:13:06 ERROR AngelClient:480 - send stop command to Master failed
com.google.protobuf.ServiceException: java.util.concurrent.ExecutionException: java.io.IOException: Error connecting to /13.190.232.43:21029
    at com.tencent.angel.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:317)
    at com.sun.proxy.$Proxy25.stop(Unknown Source)
    at com.tencent.angel.client.AngelClient.stop(AngelClient.java:477)
    at com.tencent.angel.client.AngelPSClient.stopPS(AngelPSClient.java:181)
    at com.tencent.angel.spark.context.AngelPSContext$.doStop(AngelPSContext.scala:441)
    at com.tencent.angel.spark.context.AngelPSContext$$anon$2.run(AngelPSContext.scala:323)
    at org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:54)
Caused by: java.util.concurrent.ExecutionException: java.io.IOException: Error connecting to /10.110.132.43:21029
    at com.tencent.angel.ipc.CallFuture.get(CallFuture.java:121)
    at com.tencent.angel.ipc.NettyTransceiver.call(NettyTransceiver.java:297)
    at com.tencent.angel.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:294)
    ... 6 more
Caused by: java.io.IOException: Error connecting to /10.110.132.43:21029
    at com.tencent.angel.ipc.NettyTransceiver.getChannel(NettyTransceiver.java:149)
    at com.tencent.angel.ipc.NettyTransceiver.transceive(NettyTransceiver.java:338)
    at com.tencent.angel.ipc.NettyTransceiver.call(NettyTransceiver.java:292)
    ... 7 more
Caused by: io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: /10.110.132.43:21029
    at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
    at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
    at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:323)
    at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:340)
    at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:633)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:580)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:497)
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:459)
    at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
    at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.net.ConnectException: Connection refused
    ... 11 more
2019-09-05 15:13:06 INFO YarnClientImpl:395 - Killed application application_1567663579362_0002
End of LogType:stdout