Angel-ML / angel

A Flexible and Powerful Parameter Server for large-scale machine learning

Testing cluster/Word2vecExample.scala on node2vec random-walk output: OOM #1196

Open luoyeguigen001 opened 3 years ago

luoyeguigen001 commented 3 years ago

I am testing cluster/Word2vecExample.scala, using data generated by a random walk over ogbn-100M; the dataset has roughly 110 million nodes. Walk parameters: epoch=1, step=10, i.e. one walk per node, each of length 10.

Using the code on the master branch, I tested cluster/Word2vecExample.scala under spark-on-angel.

The launch script was as follows:

source ./bin/spark-on-angel-env.sh
$SPARK_HOME/bin/spark-submit \
  --master yarn-cluster \
  --conf spark.ps.instances=6 \
  --conf spark.ps.cores=4 \
  --conf spark.ps.jars=$SONA_ANGEL_JARS \
  --conf spark.ps.memory=10g \
  --jars $SONA_SPARK_JARS \
  --driver-memory 20g \
  --num-executors 6 \
  --verbose \
  --executor-cores 4 \
  --executor-memory 15g \
  --conf spark.default.parallelism=100 \
  --class com.tencent.angel.spark.examples.cluster.Word2vecExample \
  --conf spark.hadoop.fs.viewfs.inner.cache=false \
  lib/spark-on-angel-examples-3.2.0.jar \
  input:$input output:$output embedding:32 negative:3 epoch:1 saveModelInterval:1 stepSize:0.01 batchSize:50 psPartitionNum:500 dataPartitionNum:120 remapping:false window:5 saveContextEmbedding:true

Parameters shown in the driver log (including the max id):

param embedding = 32
param negative = 3
param epoch = 1
param saveModelInterval = 1
param stepSize = 0.01
param batchSize = 50
param psPartitionNum = 500
param dataPartitionNum = 120
param remapping = false
param window = 5
param saveContextEmbedding = true
dataPartitionNum=120 numDocs=111059956 minWordId=0 maxWordId=111059956 numTokens=1110599560 maxLength=10
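For context, given maxWordId ≈ 1.11e8, embedding:32 and saveContextEmbedding:true, a rough back-of-the-envelope estimate of the raw model size the parameter servers must hold (a sketch only, assuming float32 weights and one input plus one context vector per node; the numbers are taken from the log above):

```shell
# Rough PS-side model size estimate (assumptions: 4-byte floats,
# two vectors per node because saveContextEmbedding:true).
NODES=111059956   # maxWordId from the driver log
DIM=32            # embedding:32
BYTES=$(( NODES * DIM * 2 * 4 ))
echo "raw model size: $(( BYTES / 1024 / 1024 / 1024 )) GiB"
# prints: raw model size: 26 GiB
```

Against 6 PS instances at 10 GB each (60 GB total), roughly 26 GiB of raw weights leaves limited headroom once JVM, Netty buffers, and RPC handling overhead are added, which is consistent with PS-side OOM.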

But OOM errors keep occurring:

21/11/24 17:14:47 INFO RunningContext: =====================Server running context start=======================
21/11/24 17:14:47 INFO RunningContext: state = IDLE
21/11/24 17:14:47 INFO RunningContext: totalRunningRPCCounter = 1
21/11/24 17:14:47 INFO RunningContext: infligtingRPCCounter = 0
21/11/24 17:14:47 INFO RunningContext: oomCounter = 0
21/11/24 17:14:47 INFO RunningContext: maxRunningRPCCounter = 10000
21/11/24 17:14:47 INFO RunningContext: generalRunningRPCCounter = 6000
21/11/24 17:14:47 INFO RunningContext: lastOOMRunningRPCCounter = 0
21/11/24 17:14:47 INFO RunningContext: totalRPCCounter = 110
21/11/24 17:14:47 INFO RunningContext: total=1092
21/11/24 17:14:47 INFO RunningContext: normal=1075
21/11/24 17:14:47 INFO RunningContext: network=0
21/11/24 17:14:47 INFO RunningContext: channelInUseCounter=0
21/11/24 17:14:47 INFO RunningContext: oom=0
21/11/24 17:14:47 INFO RunningContext: unknown=0
21/11/24 17:14:47 INFO RunningContext: =====================Server running context end =======================
21/11/24 17:14:47 INFO RunningContext: =====================Server running context start=======================
21/11/24 17:14:47 INFO RunningContext: state = IDLE
21/11/24 17:14:47 INFO RunningContext: totalRunningRPCCounter = 2
21/11/24 17:14:47 INFO RunningContext: infligtingRPCCounter = 0
21/11/24 17:14:47 INFO RunningContext: oomCounter = 0
21/11/24 17:14:47 INFO RunningContext: maxRunningRPCCounter = 10000
21/11/24 17:14:47 INFO RunningContext: generalRunningRPCCounter = 6000

Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "heartbeatThread"
Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "Thread-3"
Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "RPCWorker-thread-6"
Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "heartbeatThread"
Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "nioEventLoopGroup-29-5"
Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "heartbeatThread"
Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "client-heartbeat"
Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "dispatcher-event-loop-55"
Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "dispatcher-event-loop-48"
Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "RPCResponser-thread-5"
Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "RPCResponser-thread-9"
Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "RPCResponser-thread-8"
Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "RPCResponser-thread-10"
Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "dispatcher-event-loop-0"

ouyangwen-it commented 3 years ago

Take a look at the PS GC log page. Do you see frequent full GC entries?
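A quick way to check, once the gc.log is downloaded from the Angel web UI, is to count full-GC events (a generic sketch, assuming a standard JVM GC log; the here-doc stands in for the real file):

```shell
# Count full-GC lines; in practice replace the here-doc with the
# downloaded PS gc.log, e.g.:  grep -c "Full GC" gc.log
grep -c "Full GC" <<'EOF'
Full GC (Allocation Failure)
GC pause (young)
Full GC (Ergonomics)
EOF
# prints: 2
```

A count that climbs rapidly right before the OOM exceptions indicates the PS heap is under pressure rather than a network problem.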

luoyeguigen001 commented 3 years ago

21/11/24 17:14:47 INFO RunningContext: =====================Server running context start=======================
21/11/24 17:14:47 INFO RunningContext: state = IDLE
21/11/24 17:14:47 INFO RunningContext: totalRunningRPCCounter = 1
21/11/24 17:14:47 INFO RunningContext: infligtingRPCCounter = 0
21/11/24 17:14:47 INFO RunningContext: oomCounter = 0
21/11/24 17:14:47 INFO RunningContext: maxRunningRPCCounter = 10000
21/11/24 17:14:47 INFO RunningContext: generalRunningRPCCounter = 6000
21/11/24 17:14:47 INFO RunningContext: lastOOMRunningRPCCounter = 0
21/11/24 17:14:47 INFO RunningContext: totalRPCCounter = 110
21/11/24 17:14:47 INFO RunningContext: total=1092
21/11/24 17:14:47 INFO RunningContext: normal=1075
21/11/24 17:14:47 INFO RunningContext: network=0
21/11/24 17:14:47 INFO RunningContext: channelInUseCounter=0
21/11/24 17:14:47 INFO RunningContext: oom=0
21/11/24 17:14:47 INFO RunningContext: unknown=0
21/11/24 17:14:47 INFO RunningContext: =====================Server running context end =======================

I could not find any full GC messages anywhere else.

ouyangwen-it commented 3 years ago

> I could not find any full GC messages anywhere else.

Not that page. That is the sys.log page; there is a gc.log page at the same level.

ouyangwen-it commented 3 years ago

Can it run successfully now if you increase the PS memory?

luoyeguigen001 commented 3 years ago

No, it did not succeed.

I adjusted some of the parameters:

source ./bin/spark-on-angel-env.sh
$SPARK_HOME/bin/spark-submit \
  --master yarn-cluster \
  --conf spark.ps.instances=8 \
  --conf spark.ps.cores=4 \
  --conf spark.ps.jars=$SONA_ANGEL_JARS \
  --conf spark.ps.memory=10g \
  --jars $SONA_SPARK_JARS \
  --driver-memory 30g \
  --num-executors 8 \
  --verbose \
  --executor-cores 4 \
  --executor-memory 15g \
  --conf spark.default.parallelism=5000 \
  --class com.tencent.angel.spark.examples.cluster.Word2vecExample \
  --conf spark.network.timeout=1200000 \
  --queue xxx \
  --conf spark.hadoop.fs.viewfs.inner.cache=false \
  lib/spark-on-angel-examples-3.2.0.jar \
  angel.netty.matrixtransfer.max.message.size 10737418240 \
  input:$input output:$output embedding:32 negative:3 epoch:1 saveModelInterval:1 stepSize:0.01 batchSize:200 psPartitionNum:5000 dataPartitionNum:120 remapping:false window:5 saveContextEmbedding:true

It threw the following error:

21/11/25 16:29:50 ERROR ParameterServer: ERROR IN CONTACTING RM.
com.google.protobuf.ServiceException: java.util.concurrent.ExecutionException: java.io.IOException: NettyTransceiver closed
    at com.tencent.angel.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:317)
    at com.sun.proxy.$Proxy34.psReport(Unknown Source)
    at com.tencent.angel.ps.client.MasterClient.psReport(MasterClient.java:191)
    at com.tencent.angel.ps.ParameterServer.heartbeat(ParameterServer.java:585)
    at com.tencent.angel.ps.ParameterServer.lambda$startHeartbeat$3(ParameterServer.java:497)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.util.concurrent.ExecutionException: java.io.IOException: NettyTransceiver closed
    at com.tencent.angel.ipc.CallFuture.get(CallFuture.java:121)
    at com.tencent.angel.ipc.NettyTransceiver.call(NettyTransceiver.java:297)
    at com.tencent.angel.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:294)
    ... 5 more
Caused by: java.io.IOException: NettyTransceiver closed
    at com.tencent.angel.ipc.NettyTransceiver.disconnect(NettyTransceiver.java:216)
    at com.tencent.angel.ipc.NettyTransceiver.access$300(NettyTransceiver.java:57)
    at com.tencent.angel.ipc.NettyTransceiver$MLClientMLHandler.channelInactive(NettyTransceiver.java:446)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:245)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:231)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:224)
    at io.netty.channel.ChannelInboundHandlerAdapter.channelInactive(ChannelInboundHandlerAdapter.java:75)
    at io.netty.handler.timeout.IdleStateHandler.channelInactive(IdleStateHandler.java:277)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:245)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:231)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:224)
    at io.netty.channel.ChannelInboundHandlerAdapter.channelInactive(ChannelInboundHandlerAdapter.java:75)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:245)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:231)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:224)
    at io.netty.handler.codec.ByteToMessageDecoder.channelInputClosed(ByteToMessageDecoder.java:377)
    at io.netty.handler.codec.ByteToMessageDecoder.channelInactive(ByteToMessageDecoder.java:342)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:245)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:231)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:224)
    at io.netty.channel.DefaultChannelPipeline$HeadContext.channelInactive(DefaultChannelPipeline.java:1354)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:245)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:231)
    at io.netty.channel.DefaultChannelPipeline.fireChannelInactive(DefaultChannelPipeline.java:917)
    at io.netty.channel.AbstractChannel$AbstractUnsafe$8.run(AbstractChannel.java:822)
    at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163)
    at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:403)
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:463)
    at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
    at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
    ... 1 more

ouyangwen-it commented 3 years ago

angel.netty.matrixtransfer.max.message.size

angel.netty.matrixtransfer.max.message.size is not used that way. Pass it like this: --conf spark.hadoop.angel.netty.matrixtransfer.max.message.size=
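In other words, Angel configuration keys need the spark.hadoop. prefix so that spark-submit forwards them to Angel as a Spark conf; appended after the jar they are treated as application arguments and ignored. A sketch of the difference (the 1 GiB value is the one used later in this thread, not a recommendation):

```shell
# Correct: forwarded to Angel via the spark.hadoop. prefix
$SPARK_HOME/bin/spark-submit \
  --conf spark.hadoop.angel.netty.matrixtransfer.max.message.size=1073741824 \
  ... \
  lib/spark-on-angel-examples-3.2.0.jar \
  input:$input output:$output

# Incorrect (from the earlier script): placed after the jar, where it
# becomes an argument to Word2vecExample and has no effect
#   lib/spark-on-angel-examples-3.2.0.jar \
#   angel.netty.matrixtransfer.max.message.size 10737418240 \
```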

ouyangwen-it commented 3 years ago

Also, reduce the number of PS partitions to 80 and the batch size to 50, then resubmit.

luoyeguigen001 commented 3 years ago

Just ran it. It failed again, but with different errors:

21/11/25 18:22:54 WARN KafkaClient: callback,kafka send metrics failed, org.apache.kafka.common.errors.TimeoutException: Expiring 1 record(s) for bethune_jobmetrics_test-68:280336 ms has passed since batch creation
21/11/25 18:22:54 ERROR MatrixTransportClient: Request Request{hashCode=7229, header=RequestHeader{clientId=0, token=0, userRequestId=91, seqId=7212, methodId=7, matrixId=1, partId=28, handleElemNum=0}, context=com.tencent.angel.ps.server.data.request.RequestContext@3a58bb9} to PS ParameterServer_4 not return result over 120000 milliseconds
21/11/25 18:22:54 ERROR MatrixTransportClient: Request Request{hashCode=7237, header=RequestHeader{clientId=0, token=0, userRequestId=91, seqId=7213, methodId=7, matrixId=1, partId=36, handleElemNum=0}, context=com.tencent.angel.ps.server.data.request.RequestContext@7c13af} to PS ParameterServer_4 not return result over 120000 milliseconds
21/11/25 18:22:54 ERROR MatrixTransportClient: Request Request{hashCode=7245, header=RequestHeader{clientId=0, token=0, userRequestId=91, seqId=7214, methodId=7, matrixId=1, partId=44, handleElemNum=0}, context=com.tencent.angel.ps.server.data.request.RequestContext@6176256c} to PS ParameterServer_4 not return result over 120000 milliseconds
21/11/25 18:22:54 ERROR MatrixTransportClient: Request Request{hashCode=7253, header=RequestHeader{clientId=0, token=0, userRequestId=91, seqId=7215, methodId=7, matrixId=1, partId=52, handleElemNum=0}, context=com.tencent.angel.ps.server.data.request.RequestContext@1b052319} to PS ParameterServer_4 not return result over 120000 milliseconds
21/11/25 18:22:54 ERROR MatrixTransportClient: Request Request{hashCode=7261, header=RequestHeader{clientId=0, token=0, userRequestId=91, seqId=7216, methodId=7, matrixId=1, partId=60, handleElemNum=0}, context=com.tencent.angel.ps.server.data.request.RequestContext@34528b83} to PS ParameterServer_4 not return result over 120000 milliseconds
21/11/25 18:22:54 ERROR MatrixTransportClient: Request Request{hashCode=7206, header=RequestHeader{clientId=0, token=0, userRequestId=91, seqId=7217, methodId=7, matrixId=1, partId=5, handleElemNum=0}, context=com.tencent.angel.ps.server.data.request.RequestContext@545307cd} to PS ParameterServer_5 not return result over 120000 milliseconds
21/11/25 18:22:54 ERROR MatrixTransportClient: Request Request{hashCode=7214, header=RequestHeader{clientId=0, token=0, userRequestId=91, seqId=7218, methodId=7, matrixId=1, partId=13, handleElemNum=0}, context=com.tencent.angel.ps.server.data.request.RequestContext@45e056e} to PS ParameterServer_5 not return result over 120000 milliseconds
21/11/25 18:22:54 ERROR MatrixTransportClient: Request Request{hashCode=7238, header=RequestHeader{clientId=0, token=0, userRequestId=91, seqId=7221, methodId=7, matrixId=1, partId=37, handleElemNum=0}, context=com.tencent.angel.ps.server.data.request.RequestContext@5a7e68b0} to PS ParameterServer_5 not return result over 120000 milliseconds
21/11/25 18:22:54 ERROR MatrixTransportClient: Request Request{hashCode=7246, header=RequestHeader{clientId=0, token=0, userRequestId=91, seqId=7222, methodId=7, matrixId=1, partId=45, handleElemNum=0}, context=com.tencent.angel.ps.server.data.request.RequestContext@2e7f73e4} to PS ParameterServer_5 not return result over 120000 milliseconds

21/11/25 19:05:19 INFO RetryInvocationHandler: Exception while invoking allocate of class ApplicationMasterProtocolPBClientImpl over rm1. Trying to fail over immediately.
java.io.IOException: Failed on local exception: java.io.IOException: java.io.IOException: Broken pipe; Host Details : local host is: "xxxxx"; destination host is: "xxxx":xxx;
    at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:772)
    at org.apache.hadoop.ipc.Client.call(Client.java:1519)
    at org.apache.hadoop.ipc.Client.call(Client.java:1450)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:230)
    at com.sun.proxy.$Proxy22.allocate(Unknown Source)
    at org.apache.hadoop.yarn.api.impl.pb.client.ApplicationMasterProtocolPBClientImpl.allocate(ApplicationMasterProtocolPBClientImpl.java:77)
    at sun.reflect.GeneratedMethodAccessor28.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:257)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:104)
    at com.sun.proxy.$Proxy23.allocate(Unknown Source)
    at org.apache.hadoop.yarn.client.api.impl.AMRMClientImpl.allocate(AMRMClientImpl.java:295)
    at org.apache.spark.deploy.yarn.YarnAllocator.allocateResources(YarnAllocator.scala:249)
    at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:661)
Caused by: java.io.IOException: java.io.IOException: Broken pipe
    at org.apache.hadoop.ipc.Client$Connection$1.run(Client.java:727)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1796)
    at org.apache.hadoop.ipc.Client$Connection.handleSaslConnectionFailure(Client.java:690)
    at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:778)
    at org.apache.hadoop.ipc.Client$Connection.access$3000(Client.java:397)
    at org.apache.hadoop.ipc.Client.getConnection(Client.java:1571)
    at org.apache.hadoop.ipc.Client.call(Client.java:1489)
    ... 13 more
Caused by: java.io.IOException: Broken pipe
    at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
    at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47)
    at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93)
    at sun.nio.ch.IOUtil.write(IOUtil.java:65)
    at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:471)
    at org.apache.hadoop.net.SocketOutputStream$Writer.performIO(SocketOutputStream.java:63)
    at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142)
    at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:159)
    at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:117)
    at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
    at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140)
    at java.io.DataOutputStream.flush(DataOutputStream.java:123)
    at org.apache.hadoop.security.SaslRpcClient.sendSaslMessage(SaslRpcClient.java:463)
    at org.apache.hadoop.security.SaslRpcClient.saslConnect(SaslRpcClient.java:362)
    at org.apache.hadoop.ipc.Client$Connection.setupSaslConnection(Client.java:597)
    at org.apache.hadoop.ipc.Client$Connection.access$2000(Client.java:397)
    at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:770)
    at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:766)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1796)
    at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:765)
    ... 16 more

luoyeguigen001 commented 3 years ago

The launch script that produced the errors above was:

source ./bin/spark-on-angel-env.sh
$SPARK_HOME/bin/spark-submit \
  --master yarn-cluster \
  --conf spark.ps.instances=8 \
  --conf spark.ps.cores=4 \
  --conf spark.ps.jars=$SONA_ANGEL_JARS \
  --conf spark.ps.memory=10g \
  --jars $SONA_SPARK_JARS \
  --driver-memory 30g \
  --num-executors 8 \
  --verbose \
  --executor-cores 4 \
  --executor-memory 15g \
  --conf spark.default.parallelism=5000 \
  --conf spark.hadoop.angel.netty.matrixtransfer.max.message.size=1073741824 \
  --class com.tencent.angel.spark.examples.cluster.Word2vecExample \
  --conf spark.network.timeout=1200000 \
  --conf spark.hadoop.fs.viewfs.inner.cache=false \
  lib/spark-on-angel-examples-3.2.0.jar \
  input:$input output:$output embedding:32 negative:3 epoch:1 saveModelInterval:1 stepSize:0.01 batchSize:200 psPartitionNum:80 dataPartitionNum:50 remapping:false window:5 saveContextEmbedding:true

ouyangwen-it commented 3 years ago

> (quoting the launch script above)

You did not reduce the batch size? It is still batchSize:200.

luoyeguigen001 commented 3 years ago

I did reduce it. The test script:

source ./bin/spark-on-angel-env.sh
$SPARK_HOME/bin/spark-submit \
  --master yarn-cluster \
  --conf spark.ps.instances=8 \
  --conf spark.ps.cores=4 \
  --conf spark.ps.jars=$SONA_ANGEL_JARS \
  --conf spark.ps.memory=12g \
  --jars $SONA_SPARK_JARS \
  --driver-memory 40g \
  --num-executors 8 \
  --verbose \
  --executor-cores 4 \
  --executor-memory 15g \
  --conf spark.hadoop.angel.netty.matrixtransfer.max.message.size=1073741824 \
  --class com.tencent.angel.spark.examples.cluster.Word2vecExample \
  --conf spark.network.timeout=1200000 \
  --queue root.offline.hdp.test1.product \
  lib/spark-on-angel-examples-3.2.0.jar \
  input:$input output:$output embedding:32 negative:3 epoch:1 saveModelInterval:1 stepSize:0.01 batchSize:50 psPartitionNum:80 dataPartitionNum:120 remapping:false window:5 saveContextEmbedding:true

But it still threw the following error:

ERROR MatrixTransportClient: Request Request{hashCode=8803, header=RequestHeader{clientId=0, token=0, userRequestId=111, seqId=8833, methodId=7, matrixId=1, partId=2, handleElemNum=0}, context=com.tencent.angel.ps.server.data.request.RequestContext@61e96c0b} to PS ParameterServer_2 not return result over 120000 milliseconds

21/11/26 08:56:18 ERROR NettyTransceiver: Missing previous call info, serail: 3732, channel: [id: ]
21/11/26 08:56:18 ERROR AngelClient: AngelClient 1 send heartbeat to Master failed
com.google.protobuf.ServiceException: java.util.concurrent.ExecutionException: java.lang.RuntimeException: Missing previous call info
    at com.tencent.angel.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:317)
    at com.sun.proxy.$Proxy32.keepAlive(Unknown Source)
    at com.tencent.angel.client.AngelClient.lambda$startHeartbeat$0(AngelClient.java:218)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.util.concurrent.ExecutionException: java.lang.RuntimeException: Missing previous call info
    at com.tencent.angel.ipc.CallFuture.get(CallFuture.java:121)
    at com.tencent.angel.ipc.NettyTransceiver.call(NettyTransceiver.java:297)
    at com.tencent.angel.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:294)
    ... 3 more
Caused by: java.lang.RuntimeException: Missing previous call info
    at com.tencent.angel.ipc.NettyTransceiver$MLClientMLHandler.channelRead(NettyTransceiver.java:469)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
    at io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:286)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
    at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:102)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
    at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:310)
    at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:284)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
    at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1359)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
    at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:935)
    at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:138)
    at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:645)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:580)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:497)
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:459)
    at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
    at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
    ... 1 more

Could this exception be caused by the call timing out?

ouyangwen-it commented 3 years ago

Take a look at the Angel master log for error messages; it looks like it timed out and died.

luoyeguigen001 commented 3 years ago

Do you mean this one? ERROR MatrixTransportClient: Request Request{hashCode=8803, header=RequestHeader{clientId=0, token=0, userRequestId=111, seqId=8833, methodId=7, matrixId=1, partId=2, handleElemNum=0}, context=com.tencent.angel.ps.server.data.request.RequestContext@61e96c0b} to PS ParameterServer_2 not return result over 120000 milliseconds. I pasted it above.

ouyangwen-it commented 3 years ago

Also take a look at a Spark executor log to see how long pulling one batch takes.

luoyeguigen001 commented 3 years ago

Is it this log below?

21/11/26 09:11:51 INFO TaskSetManager: Finished task 45.0 in stage 0.0 (TID 45) in 17699 ms on (executor 7) (42/56)
21/11/26 09:11:51 INFO TaskSetManager: Finished task 33.0 in stage 0.0 (TID 33) in 22538 ms on (43/56)
21/11/26 09:11:52 INFO TaskSetManager: Finished task 42.0 in stage 0.0 (TID 42) in 22319 ms on (44/56)
21/11/26 09:11:52 INFO TaskSetManager: Finished task 47.0 in stage 0.0 (TID 47) in 17658 ms on (45/56)
21/11/26 09:11:52 INFO TaskSetManager: Finished task 52.0 in stage 0.0 (TID 52) in 16114 ms on (46/56)
21/11/26 09:11:53 INFO TaskSetManager: Finished task 49.0 in stage 0.0 (TID 49) in 17474 ms on (executor 7) (47/56)
21/11/26 09:11:53 INFO TaskSetManager: Finished task 53.0 in stage 0.0 (TID 53) in 16199 ms on (executor 7) (48/56)
21/11/26 09:11:53 INFO TaskSetManager: Finished task 50.0 in stage 0.0 (TID 50) in 17486 ms on (executor 4) (49/56)
21/11/26 09:11:54 INFO TaskSetManager: Finished task 44.0 in stage 0.0 (TID 44) in 21329 ms on (executor 6) (50/56)
21/11/26 09:11:55 INFO TaskSetManager: Finished task 46.0 in stage 0.0 (TID 46) in 21976 ms on (executor 6) (51/56)

ouyangwen-it commented 3 years ago

> (quoting the TaskSetManager log above)

No, that is the Spark driver log. Check a Spark executor log (any one will do).

luoyeguigen001 commented 3 years ago

I added some logging and found it never even reaches training; it throws during model.randomInitialize, while initializing the embeddings.