apache / incubator-heron

Apache Heron (Incubating) is a realtime, distributed, fault-tolerant stream processing engine from Twitter
https://heron.apache.org/
Apache License 2.0

topology provider causes healthmgr to restart at the launch #2368

Open huijunw opened 7 years ago

huijunw commented 7 years ago

The TopologyProvider throws an exception that is not caught, which causes the healthmgr to restart. I think this behavior could be improved.

The statemgrclient inside TopologyProvider tries to fetch topology information from the statemgr, but the topology info is not yet available at that time. After several attempts (10 in the log), the topology info is refreshed in the statemgr, and the healthmgr stops restarting.
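One possible way to avoid the restart loop is to treat "topology not available yet" as an expected condition and poll for it instead of dereferencing a null result. The sketch below is only an illustration under assumptions: `NullSafeFetcher` and `fetchWithRetry` are hypothetical names, not Heron classes, and the state manager is simulated by a `Supplier` that returns null until the data shows up.

```java
import java.util.Optional;
import java.util.function.Supplier;

/** Hypothetical helper, not part of Heron: polls a source that may return null. */
public final class NullSafeFetcher {
  private NullSafeFetcher() { }

  /**
   * Calls fetch up to maxAttempts times, sleeping delayMillis between tries.
   * Returns Optional.empty() instead of throwing if the value never appears.
   */
  public static <T> Optional<T> fetchWithRetry(
      Supplier<T> fetch, int maxAttempts, long delayMillis) throws InterruptedException {
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
      T value = fetch.get();
      if (value != null) {
        return Optional.of(value);
      }
      if (attempt < maxAttempts) {
        Thread.sleep(delayMillis); // wait before asking again
      }
    }
    return Optional.empty();
  }

  public static void main(String[] args) throws InterruptedException {
    // Simulated state manager (assumption): null for the first two calls, then a name.
    int[] calls = {0};
    Supplier<String> stateManager = () -> ++calls[0] < 3 ? null : "dhalionTopo";

    Optional<String> topology = fetchWithRetry(stateManager, 5, 10L);
    System.out.println(topology.orElse("<not available yet>"));
  }
}
```

With this shape, a caller such as a sensor could skip one measurement cycle when the result is empty, rather than failing with a NullPointerException.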

healthmgr log:

[2017-09-29 22:58:40 +0000] [INFO] com.twitter.heron.healthmgr.HealthManager: Starting Health Manager  
[2017-09-29 22:58:40 +0000] [INFO] com.microsoft.dhalion.policy.PoliciesExecutor: Executing Policy: AutoRestartBackpressureContainerPolicy  
[2017-09-29 22:58:40 +0000] [INFO] com.twitter.heron.healthmgr.common.TopologyProvider: Fetching topology from state manager: dhalionTopo  
[2017-09-29 22:58:40 +0000] [WARNING] com.twitter.heron.spi.statemgr.SchedulerStateManagerAdaptor: Exception processing future: java.lang.RuntimeException: Failed to fetch data from path: xxxxxxxxxx 
[2017-09-29 22:58:40 +0000] [STDERR] stderr: Exception in thread "main"   
[2017-09-29 22:58:40 +0000] [STDERR] stderr: java.util.concurrent.ExecutionException: java.lang.NullPointerException  
[2017-09-29 22:58:40 +0000] [STDERR] stderr:    at java.util.concurrent.FutureTask.report(FutureTask.java:122)  
[2017-09-29 22:58:40 +0000] [STDERR] stderr:    at java.util.concurrent.FutureTask.get(FutureTask.java:192)  
[2017-09-29 22:58:40 +0000] [STDERR] stderr:    at com.twitter.heron.healthmgr.HealthManager.main(HealthManager.java:218)  
[2017-09-29 22:58:40 +0000] [STDERR] stderr: Caused by: java.lang.NullPointerException  
[2017-09-29 22:58:40 +0000] [STDERR] stderr:    at com.twitter.heron.healthmgr.common.TopologyProvider.fetchLatestTopology(TopologyProvider.java:67)  
[2017-09-29 22:58:40 +0000] [STDERR] stderr:    at com.twitter.heron.healthmgr.common.TopologyProvider.get(TopologyProvider.java:60)  
[2017-09-29 22:58:40 +0000] [STDERR] stderr:    at com.twitter.heron.healthmgr.common.TopologyProvider.getBoltNames(TopologyProvider.java:88)  
[2017-09-29 22:58:40 +0000] [STDERR] stderr:    at com.twitter.heron.healthmgr.sensors.BackPressureSensor.get(BackPressureSensor.java:66)  
[2017-09-29 22:58:40 +0000] [STDERR] stderr:    at com.twitter.heron.healthmgr.detectors.BackPressureDetector.detect(BackPressureDetector.java:59)  
[2017-09-29 22:58:40 +0000] [STDERR] stderr:    at com.microsoft.dhalion.policy.HealthPolicyImpl.lambda$executeDetectors$3(HealthPolicyImpl.java:97)  
[2017-09-29 22:58:40 +0000] [STDERR] stderr:    at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193)  
[2017-09-29 22:58:40 +0000] [STDERR] stderr:    at java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1374)  
[2017-09-29 22:58:40 +0000] [STDERR] stderr:    at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481)  
[2017-09-29 22:58:40 +0000] [STDERR] stderr:    at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471)  
[2017-09-29 22:58:40 +0000] [STDERR] stderr:    at java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708)  
[2017-09-29 22:58:40 +0000] [STDERR] stderr:    at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)  
[2017-09-29 22:58:40 +0000] [STDERR] stderr:    at java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:499)  
[2017-09-29 22:58:40 +0000] [STDERR] stderr:    at com.microsoft.dhalion.policy.HealthPolicyImpl.executeDetectors(HealthPolicyImpl.java:99)  
[2017-09-29 22:58:40 +0000] [STDERR] stderr:    at com.microsoft.dhalion.policy.PoliciesExecutor.lambda$start$1(PoliciesExecutor.java:52)  
[2017-09-29 22:58:40 +0000] [STDERR] stderr:    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)  
[2017-09-29 22:58:40 +0000] [STDERR] stderr:    at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)  
[2017-09-29 22:58:40 +0000] [STDERR] stderr:    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)  
[2017-09-29 22:58:40 +0000] [STDERR] stderr:    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)  
[2017-09-29 22:58:40 +0000] [STDERR] stderr:    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)  
[2017-09-29 22:58:40 +0000] [STDERR] stderr:    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)  
[2017-09-29 22:58:40 +0000] [STDERR] stderr:    at java.lang.Thread.run(Thread.java:748)  
[2017-09-29 22:58:51 +0000] [INFO] com.twitter.heron.healthmgr.HealthManager: Logging setup done.  
[2017-09-29 22:58:51 +0000] [INFO] com.twitter.heron.healthmgr.HealthManager: Static Heron config loaded successfully   
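
The trace above shows the NullPointerException surfacing through `FutureTask.get` in `HealthManager.main`, wrapped in an `ExecutionException`. With a periodic task on a `ScheduledExecutorService`, an exception thrown by one run suppresses later runs and is rethrown from `get()`, so an uncaught failure in a detector can end the whole process. A minimal sketch of one mitigation, assuming the policy run can simply be skipped and retried on the next tick (`GuardedPolicyLoop` and `guarded` are hypothetical names, not Heron or Dhalion code):

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical demo, not Heron or Dhalion code. */
public final class GuardedPolicyLoop {
  private GuardedPolicyLoop() { }

  /** Wraps a task so a RuntimeException is logged and counted rather than propagated. */
  public static Runnable guarded(Runnable task, AtomicInteger failures) {
    return () -> {
      try {
        task.run();
      } catch (RuntimeException e) {
        failures.incrementAndGet();
        System.err.println("Policy run failed, will retry on next tick: " + e);
      }
    };
  }

  public static void main(String[] args) throws Exception {
    ScheduledExecutorService executor = Executors.newSingleThreadScheduledExecutor();
    AtomicInteger runs = new AtomicInteger();
    AtomicInteger failures = new AtomicInteger();

    // Simulated policy (assumption): fails on its first two runs, then succeeds.
    Runnable policy = () -> {
      if (runs.incrementAndGet() <= 2) {
        throw new NullPointerException("topology not available");
      }
    };

    ScheduledFuture<?> future =
        executor.scheduleWithFixedDelay(guarded(policy, failures), 0, 20, TimeUnit.MILLISECONDS);
    Thread.sleep(200);     // let a few ticks happen
    future.cancel(false);  // stop scheduling further runs
    executor.shutdown();
    System.out.println("runs=" + runs.get() + " failures=" + failures.get());
  }
}
```

Because the wrapper swallows the exception, the schedule keeps running and the main thread never sees an `ExecutionException`. Whether swallowing is the right policy for the healthmgr is a design decision; the sketch only shows the mechanism.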

heron-executor log:

[2017-09-29 22:58:15 +0000] [INFO]: heron-healthmgr (pid=19572) exited with status 256. command=['xxxxxx/bin/java', '-Xmx1024M', '-XX:+PrintCommandLineFlags', '-verbosegc', '-XX:xxxxxxx', '-Xloggc:log-files/gc.healthmgr.log', '-Djava.net.preferIPv4Stack=true', '-cp', './heron-core/lib/scheduler/*:./heron-core/lib/packing/*:./heron-core/lib/statemgr/*:./heron-core/lib/healthmgr/heron-healthmgr.jar', 'com.twitter.heron.healthmgr.HealthManager',xxxxxxxx]

ashvina commented 7 years ago

I am investigating this issue now.