apache / incubator-heron

Apache Heron (Incubating) is a realtime, distributed, fault-tolerant stream processing engine from Twitter
https://heron.apache.org/
Apache License 2.0

Cannot submit to YARN on some CentOS machines. Heron bug? Please check my comments #3500

Open dttlgotv opened 4 years ago

dttlgotv commented 4 years ago

Issue details:

  1. The setup is the same on every machine: same Heron version (built from last month's source), same Hadoop version (3.2.1), almost the same Hadoop config, and the same Heron topology.

  2. Submitting to YARN always works on a Mac, sometimes fails on a YARN cluster of three lab CentOS machines, and always fails on another company CentOS machine.

This issue has blocked me for several days, and I have had to switch to another cluster.

My suspicion:

  1. Heron's protobuf (version 3.6.1) is not compatible with Hadoop's protobuf (version 2.5.0). Currently I just add protobuf 3.6.1 to the external path when submitting to YARN.

Please help me check the error below; the other logs do not seem to contain any hint.

The error is below:

[2020-03-25 10:36:38 +0800] [信息] org.apache.heron.packing.roundrobin.RoundRobinPacking: Pack internal: container CPU hint: 2.000, RAM hint: ByteAmount{1.0 GB (1073741824 bytes)}, disk hint: ByteAmount{-1 bytes}.
[2020-03-25 10:36:38 +0800] [信息] org.apache.heron.packing.roundrobin.RoundRobinPacking: Pack internal finalized: container#1 CPU: 2.000000, RAM: ByteAmount{1.0 GB (1073741824 bytes)}, disk: ByteAmount{13.0 GB (13958643712 bytes)}.
[2020-03-25 10:36:38 +0800] [信息] org.apache.heron.packing.roundrobin.RoundRobinPacking: Initalizing RoundRobinPacking. CPU default: 1.000000, RAM default: ByteAmount{1.0 GB (1073741824 bytes)}, DISK default: ByteAmount{1.0 GB (1073741824 bytes)}, RAM padding: ByteAmount{2.0 GB (2147483648 bytes)}.
[2020-03-25 10:36:38 +0800] [警告] org.apache.heron.packing.roundrobin.RoundRobinPacking: Container#1 (max RAM: ByteAmount{1.0 GB (1073741824 bytes)}) is now hosting instances that take up to ByteAmount{0 bytes} RAM. The container may not have enough resource to accommodate internal processes which take up to ByteAmount{2.0 GB (2147483648 bytes)} RAM.
[2020-03-25 10:36:38 +0800] [信息] org.apache.heron.packing.roundrobin.RoundRobinPacking: Pack internal: container CPU hint: 2.000, RAM hint: ByteAmount{1.0 GB (1073741824 bytes)}, disk hint: ByteAmount{-1 bytes}.
[2020-03-25 10:36:38 +0800] [信息] org.apache.heron.packing.roundrobin.RoundRobinPacking: Pack internal finalized: container#1 CPU: 2.000000, RAM: ByteAmount{1.0 GB (1073741824 bytes)}, disk: ByteAmount{13.0 GB (13958643712 bytes)}.
[2020-03-25 10:36:38 +0800] [信息] org.apache.heron.scheduler.yarn.YarnLauncher: Initializing topology: Test3Topology, core: /root/.heron/dist/heron-core.tar.gz
[2020-03-25 10:36:38 +0800] [信息] org.apache.heron.statemgr.zookeeper.curator.CuratorStateManager: Created node for path: /heron/topologies/Test3Topology
[2020-03-25 10:36:38 +0800] [信息] org.apache.heron.statemgr.zookeeper.curator.CuratorStateManager: Created node for path: /heron/packingplans/Test3Topology
[2020-03-25 10:36:38 +0800] [信息] org.apache.heron.statemgr.zookeeper.curator.CuratorStateManager: Created node for path: /heron/executionstate/Test3Topology
[2020-03-25 10:36:38 +0800] [严重] org.apache.reef.runtime.yarn.YarnClasspathProvider: YarnConfiguration.YARN_APPLICATION_CLASSPATH is empty. This indicates a broken cluster configuration.
2020-03-25 10:36:38,705 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
[2020-03-25 10:36:39 +0800] [信息] org.apache.reef.util.REEFVersion: REEF Version: 0.14.0
[2020-03-25 10:36:39 +0800] [信息] org.apache.heron.scheduler.yarn.ReefClientSideHandlers: Initializing REEF client handlers for Heron, topology: Test3Topology
[INFO] RMProxy - Connecting to ResourceManager at guoxinghua1/127.0.0.1:8032
[2020-03-25 10:36:51 +0800] [警告] org.apache.reef.runtime.common.files.JobJarMaker: Failed to delete [/tmp/reef-job-1836122165165029413]
2020-03-25 10:36:54,247 INFO sasl.SaslDataTransferClient: SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false
2020-03-25 10:36:54,666 INFO sasl.SaslDataTransferClient: SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false
2020-03-25 10:36:54,988 INFO sasl.SaslDataTransferClient: SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false
2020-03-25 10:36:55,149 INFO conf.Configuration: resource-types.xml not found
2020-03-25 10:36:55,149 INFO resource.ResourceUtils: Unable to find 'resource-types.xml'.
[2020-03-25 10:36:55 +0800] [信息] org.apache.reef.runtime.yarn.client.YarnSubmissionHelper: Submitting REEF Application to YARN. ID: application_1585102108714_0002
2020-03-25 10:36:55,210 INFO impl.YarnClientImpl: Submitted application application_1585102108714_0002
[2020-03-25 10:36:59 +0800] [信息] org.apache.heron.scheduler.yarn.ReefClientSideHandlers: Topology Test3Topology is running, jobId Test3Topology.
[2020-03-25 10:36:59 +0800] [信息] org.apache.heron.statemgr.zookeeper.curator.CuratorStateManager: Closing the CuratorClient to: 127.0.0.1:2181
2020-03-25 10:36:59,098 INFO imps.CuratorFrameworkImpl: backgroundOperationsLoop exiting
2020-03-25 10:36:59,104 INFO zookeeper.ZooKeeper: Session: 0x1000030d5e70002 closed
[2020-03-25 10:36:59 +0800] [信息] org.apache.heron.statemgr.zookeeper.curator.CuratorStateManager: Closing the tunnel processes
2020-03-25 10:36:59,104 INFO zookeeper.ClientCnxn: EventThread shut down for session: 0x1000030d5e70002
[2020-03-25 10:37:04 +0800] [警告] org.apache.reef.runtime.common.client.RuntimeErrorProtoHandler: socket://127.0.0.1:52988 Runtime Error: com.google.protobuf.Descriptors$Descriptor.getOneofs()Ljava/util/List;
[2020-03-25 10:37:04 +0800] [严重] org.apache.heron.scheduler.yarn.ReefClientSideHandlers: Failed to start topology: Test3Topology
[2020-03-25 10:37:04 +0800] [警告] org.apache.reef.runtime.common.client.RuntimeErrorProtoHandler: socket://127.0.0.1:52990 Runtime Error: Thread main threw an uncaught exception.
[2020-03-25 10:37:04 +0800] [严重] org.apache.heron.scheduler.yarn.ReefClientSideHandlers: Failed to start topology: Test3Topology
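
The `Runtime Error: com.google.protobuf.Descriptors$Descriptor.getOneofs()Ljava/util/List;` text in the log above has the shape of a JVM method signature, which is what a `NoSuchMethodError` message typically carries. That would fit the protobuf suspicion: `oneof` support, and with it `getOneofs()`, arrived in protobuf-java after 2.5.0 as far as I know, so a 2.5.0 jar winning on the classpath could produce exactly this. The following is only a diagnostic sketch, not part of Heron or REEF; the class name `ProtobufOriginCheck` and its helpers are hypothetical. It reports which jar a class was loaded from and whether a no-argument method exists on it.

```java
import java.lang.reflect.Method;
import java.security.CodeSource;

// Hypothetical diagnostic helper (not part of Heron, REEF, or Hadoop).
final class ProtobufOriginCheck {

    /** Returns the location a class was loaded from, or "not found" if it is not on the classpath. */
    static String origin(String className) {
        try {
            Class<?> cls = Class.forName(className);
            CodeSource src = cls.getProtectionDomain().getCodeSource();
            // Classes that ship with the JDK itself have no code source.
            return src == null ? "bootstrap/JDK" : String.valueOf(src.getLocation());
        } catch (ClassNotFoundException | LinkageError e) {
            return "not found";
        }
    }

    /** Returns true if the class exposes a public no-argument method with the given name. */
    static boolean hasNoArgMethod(String className, String methodName) {
        try {
            Method m = Class.forName(className).getMethod(methodName);
            return m != null;
        } catch (ReflectiveOperationException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Class and method names are taken from the error message in the log.
        String cls = "com.google.protobuf.Descriptors$Descriptor";
        System.out.println("loaded from: " + origin(cls));
        System.out.println("has getOneofs(): " + hasNoArgMethod(cls, "getOneofs"));
    }
}
```

To be meaningful, this would have to run with the same classpath as the failing process (the REEF driver on the YARN node, not only the submitting client). If it prints a Hadoop-provided protobuf jar and `false`, the older protobuf is shadowing the 3.6.1 one on that machine.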

dttlgotv commented 4 years ago

Detailed information:

[2020-03-27 17:27:15 +0800] [信息] org.apache.heron.statemgr.zookeeper.curator.CuratorStateManager: Closing the tunnel processes
[2020-03-27 17:27:15 +0800] [详细] org.apache.heron.scheduler.SubmitterMain: Topology Test3Topology submitted successfully
2020-03-27 17:27:15,384 INFO zookeeper.ClientCnxn: EventThread shut down for session: 0x100000272cf0001
[2020-03-27 17:27:18 +0800] [非常详细] org.apache.reef.wake.remote.transport.netty.AbstractNettyEventListener: Channel active. key: /127.0.0.1:45024
[2020-03-27 17:27:18 +0800] [较详细] org.apache.reef.wake.remote.transport.netty.AbstractNettyEventListener: Add connected channel ref: org.apache.reef.wake.remote.transport.netty.LinkReference@69c97c5e
[2020-03-27 17:27:18 +0800] [非常详细] org.apache.reef.wake.remote.transport.netty.AbstractNettyEventListener: MessageEvent: local: /127.0.0.1:17758 remote: /127.0.0.1:45024 :: [B@2ccbef03
[2020-03-27 17:27:18 +0800] [非常详细] org.apache.reef.wake.remote.impl.OrderedRemoteReceiverStage: org.apache.reef.wake.remote.impl.TransportEvent@47a815d3
[2020-03-27 17:27:18 +0800] [较详细] org.apache.reef.wake.remote.impl.OrderedPushEventHandler: org.apache.reef.wake.remote.impl.TransportEvent@47a815d3 RemoteEvent localAddr=/127.0.0.1:17758 remoteAddr=/127.0.0.1:45024 seq=0 event=[B@8d13f64
[2020-03-27 17:27:18 +0800] [较详细] org.apache.reef.wake.remote.impl.OrderedPushEventHandler: Value length is 2,854
[2020-03-27 17:27:18 +0800] [较详细] org.apache.reef.wake.remote.impl.OrderedPullEventHandler: org.apache.reef.wake.remote.impl.OrderedEventStream@1bd3daf4
[2020-03-27 17:27:18 +0800] [较详细] org.apache.reef.wake.remote.impl.HandlerContainer: RemoteManager: REEF_CLIENT value: RemoteEvent localAddr=/127.0.0.1:17758 remoteAddr=/127.0.0.1:45024 seq=0 event=[B@8d13f64
[2020-03-27 17:27:18 +0800] [较详细] org.apache.reef.wake.remote.impl.HandlerContainer: Message handler: class org.apache.reef.proto.ReefServiceProtos$RuntimeErrorProto
[2020-03-27 17:27:18 +0800] [警告] org.apache.reef.runtime.common.client.RuntimeErrorProtoHandler: socket://127.0.0.1:45024 Runtime Error: com.google.protobuf.Descriptors$Descriptor.getOneofs()Ljava/util/List;
[2020-03-27 17:27:18 +0800] [较详细] org.apache.reef.wake.remote.transport.netty.AbstractNettyEventListener: Channel closed: [id: 0x49441df1, L:/127.0.0.1:17758 ! R:/127.0.0.1:45024]. Link ref found and removed: true
[2020-03-27 17:27:18 +0800] [严重] org.apache.heron.scheduler.yarn.ReefClientSideHandlers: Failed to start topology: Test3Topology
[2020-03-27 17:27:18 +0800] [较详细] org.apache.reef.wake.remote.impl.OrderedEventStream: Event is null
[2020-03-27 17:27:18 +0800] [非常详细] org.apache.reef.wake.remote.transport.netty.AbstractNettyEventListener: Channel active. key: /127.0.0.1:45026
[2020-03-27 17:27:18 +0800] [较详细] org.apache.reef.wake.remote.transport.netty.AbstractNettyEventListener: Add connected channel ref: org.apache.reef.wake.remote.transport.netty.LinkReference@62170841
[2020-03-27 17:27:18 +0800] [非常详细] org.apache.reef.wake.remote.transport.netty.AbstractNettyEventListener: MessageEvent: local: /127.0.0.1:17758 remote: /127.0.0.1:45026 :: [B@27158a67
[2020-03-27 17:27:18 +0800] [非常详细] org.apache.reef.wake.remote.impl.OrderedRemoteReceiverStage: org.apache.reef.wake.remote.impl.TransportEvent@3470935f
[2020-03-27 17:27:18 +0800] [较详细] org.apache.reef.wake.remote.impl.OrderedPushEventHandler: org.apache.reef.wake.remote.impl.TransportEvent@3470935f RemoteEvent localAddr=/127.0.0.1:17758 remoteAddr=/127.0.0.1:45026 seq=0 event=[B@637f050c
[2020-03-27 17:27:18 +0800] [较详细] org.apache.reef.wake.remote.impl.OrderedPushEventHandler: Value length is 3,362
[2020-03-27 17:27:18 +0800] [较详细] org.apache.reef.wake.remote.impl.OrderedPullEventHandler: org.apache.reef.wake.remote.impl.OrderedEventStream@12082d42
[2020-03-27 17:27:18 +0800] [较详细] org.apache.reef.wake.remote.impl.HandlerContainer: RemoteManager: REEF_CLIENT value: RemoteEvent localAddr=/127.0.0.1:17758 remoteAddr=/127.0.0.1:45026 seq=0 event=[B@637f050c
[2020-03-27 17:27:18 +0800] [较详细] org.apache.reef.wake.remote.impl.HandlerContainer: Message handler: class org.apache.reef.proto.ReefServiceProtos$RuntimeErrorProto
[2020-03-27 17:27:18 +0800] [警告] org.apache.reef.runtime.common.client.RuntimeErrorProtoHandler: socket://127.0.0.1:45026 Runtime Error: Thread main threw an uncaught exception.
[2020-03-27 17:27:18 +0800] [严重] org.apache.heron.scheduler.yarn.ReefClientSideHandlers: Failed to start topology: Test3Topology
[2020-03-27 17:27:18 +0800] [较详细] org.apache.reef.wake.remote.transport.netty.AbstractNettyEventListener: Channel closed: [id: 0xbd64de15, L:/127.0.0.1:17758 ! R:/127.0.0.1:45026]. Link ref found and removed: true
[2020-03-27 17:27:18 +0800] [较详细] org.apache.reef.wake.remote.transport.netty.AbstractNettyEventListener: Channel closed: [id: 0xdb655984, L:/127.0.0.1:17758 ! R:/127.0.0.1:45008]. Link ref found and removed: true
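
The level tags in these logs are the Chinese labels that `java.util.logging` prints under a zh_CN locale. For readers who want only the warnings and errors from such a verbose trace, the sketch below maps the labels to their English level names and keeps the WARNING and SEVERE lines. The mapping is my assumption about the JDK's zh_CN resource bundle, and `LocalizedLogFilter` is a hypothetical helper, not something Heron or REEF provides.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for reading the localized logs in this thread.
final class LocalizedLogFilter {

    // Assumed mapping of zh_CN java.util.logging labels to English level names.
    static final Map<String, String> LEVELS = new LinkedHashMap<>();
    static {
        LEVELS.put("严重", "SEVERE");
        LEVELS.put("警告", "WARNING");
        LEVELS.put("信息", "INFO");
        LEVELS.put("详细", "FINE");
        LEVELS.put("较详细", "FINER");
        LEVELS.put("非常详细", "FINEST");
    }

    // Matches the level tag right after the bracketed timestamp, e.g. "] [警告] ".
    private static final Pattern TAG = Pattern.compile("\\] \\[([^\\]]+)\\] ");

    /** Replaces a known localized level tag with its English name; other lines are returned unchanged. */
    static String translate(String line) {
        Matcher m = TAG.matcher(line);
        if (!m.find()) {
            return line;
        }
        String english = LEVELS.get(m.group(1));
        if (english == null) {
            return line;
        }
        return line.substring(0, m.start(1)) + english + line.substring(m.end(1));
    }

    /** Returns the translated lines whose level is WARNING or SEVERE, in their original order. */
    static List<String> problems(List<String> lines) {
        List<String> out = new ArrayList<>();
        for (String line : lines) {
            String t = translate(line);
            if (t.contains("] [SEVERE] ") || t.contains("] [WARNING] ")) {
                out.add(t);
            }
        }
        return out;
    }
}
```

Applied to the trace above, this would leave only the two `RuntimeErrorProtoHandler` warnings and the two `Failed to start topology` errors. Lines written by Hadoop's own logger (`2020-03-27 17:27:15,384 INFO ...`) use a different format and pass through `translate` unchanged.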

dttlgotv commented 4 years ago

Is this a Heron bug? Could the message with value length 2,854 be causing the error?

[2020-03-27 17:27:18 +0800] [较详细] org.apache.reef.wake.remote.impl.OrderedPushEventHandler: Value length is 2,854
[2020-03-27 17:27:18 +0800] [较详细] org.apache.reef.wake.remote.impl.OrderedPullEventHandler: org.apache.reef.wake.remote.impl.OrderedEventStream@1bd3daf4
[2020-03-27 17:27:18 +0800] [较详细] org.apache.reef.wake.remote.impl.HandlerContainer: RemoteManager: REEF_CLIENT value: RemoteEvent localAddr=/127.0.0.1:17758 remoteAddr=/127.0.0.1:45024 seq=0 event=[B@8d13f64
[2020-03-27 17:27:18 +0800] [较详细] org.apache.reef.wake.remote.impl.HandlerContainer: Message handler: class org.apache.reef.proto.ReefServiceProtos$RuntimeErrorProto
[2020-03-27 17:27:18 +0800] [警告] org.apache.reef.runtime.common.client.RuntimeErrorProtoHandler: socket://127.0.0.1:45024 Runtime Error: com.google.protobuf.Descriptors$Descriptor.getOneofs()Ljava/util/List;
[2020-03-27 17:27:18 +0800] [较详细] org.apache.reef.wake.remote.transport.netty.AbstractNettyEventListener: Channel closed: [id: 0x49441df1, L:/127.0.0.1:17758 ! R:/127.0.0.1:45024]. Link ref found and removed: true
[2020-03-27 17:27:18 +0800] [严重] org.apache.heron.scheduler.yarn.ReefClientSideHandlers: Failed to start topology: Test3Topology
[2020-03-27 17:27:18 +0800] [较详细] org.apache.