Qihoo360 / hbox

AI on Hadoop
Apache License 2.0
1.73k stars 385 forks source link

大任务运行超过半小时后,分布式程序异常退出 #65

Closed sydpz closed 5 months ago

sydpz commented 5 years ago

当前尝试把训练数据从7天增加至30天,并运行xlearning时,发现该程序在运行至99%时会异常退出。 想问下这个问题如何定位

出错时屏幕打印的错误日志为:

19/02/15 17:13:28 WARN Client: Connecting to ResourceManager failed, try again later
java.lang.reflect.UndeclaredThrowableException
        at com.sun.proxy.$Proxy23.fetchApplicationMessages(Unknown Source)
        at net.qihoo.xlearning.client.Client.waitCompleted(Client.java:732)
        at net.qihoo.xlearning.client.Client.submitAndMonitor(Client.java:693)
        at net.qihoo.xlearning.client.Client.main(Client.java:761)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
Caused by: java.io.EOFException: End of File Exception between local host is: "XXX"; destination host is: "XXXXX":34590; : java.io.EOFException; For more details see:  http://wiki.apache.org/hadoop/EOFException
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
        at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:792)
        at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:765)
        at org.apache.hadoop.ipc.Client.call(Client.java:1479)
        at org.apache.hadoop.ipc.Client.call(Client.java:1412)
        at org.apache.hadoop.ipc.WritableRpcEngine$Invoker.invoke(WritableRpcEngine.java:243)
        ... 10 more
Caused by: java.io.EOFException
        at java.io.DataInputStream.readInt(DataInputStream.java:392)
        at org.apache.hadoop.ipc.Client$Connection.receiveRpcResponse(Client.java:1084)
        at org.apache.hadoop.ipc.Client$Connection.run(Client.java:979)

通过 yarn logs -applicationId 看到貌似是一个心跳出错导致的。

retManager$InvalidToken): appattempt_1548050152824_0877_000001 not found in AMRMTokenSecretManager.
19/02/15 17:13:27 ERROR AMRMClientAsyncImpl: Exception on heartbeat
org.apache.hadoop.security.token.SecretManager$InvalidToken: appattempt_1548050152824_0877_000001 not found in AMRMTokenSecretManager.
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
        at org.apache.hadoop.yarn.ipc.RPCUtil.instantiateException(RPCUtil.java:53)
        at org.apache.hadoop.yarn.ipc.RPCUtil.unwrapAndThrowException(RPCUtil.java:104)
        at org.apache.hadoop.yarn.api.impl.pb.client.ApplicationMasterProtocolPBClientImpl.allocate(ApplicationMasterProtocolPBClientImpl.java:79)
        at sun.reflect.GeneratedMethodAccessor4.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:256)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:104)
        at com.sun.proxy.$Proxy12.allocate(Unknown Source)
        at org.apache.hadoop.yarn.client.api.impl.AMRMClientImpl.allocate(AMRMClientImpl.java:277)
        at org.apache.hadoop.yarn.client.api.async.impl.AMRMClientAsyncImpl$HeartbeatThread.run(AMRMClientAsyncImpl.java:224)
Caused by: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken): appattempt_1548050152824_0877_000001 not found in AMRMTokenSecretManager.
        at org.apache.hadoop.ipc.Client.call(Client.java:1471)
        at org.apache.hadoop.ipc.Client.call(Client.java:1408)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:230)
        at com.sun.proxy.$Proxy11.allocate(Unknown Source)
        at org.apache.hadoop.yarn.api.impl.pb.client.ApplicationMasterProtocolPBClientImpl.allocate(ApplicationMa
jiayuhan-it commented 5 months ago

如有需要请重提issue