The TopologyProvider throws an exception that is not caught, which causes the healthmgr to restart. I think this behavior could be improved.
The statemgrclient inside TopologyProvider tries to fetch the topology information from the statemgr, but the topology info is not yet available at that time. After several attempts (10 in the log), the topology info is refreshed in the statemgr and the healthmgr stops restarting.
healthmgr log:
[2017-09-29 22:58:40 +0000] [INFO] com.twitter.heron.healthmgr.HealthManager: Starting Health Manager
[2017-09-29 22:58:40 +0000] [INFO] com.microsoft.dhalion.policy.PoliciesExecutor: Executing Policy: AutoRestartBackpressureContainerPolicy
[2017-09-29 22:58:40 +0000] [INFO] com.twitter.heron.healthmgr.common.TopologyProvider: Fetching topology from state manager: dhalionTopo
[2017-09-29 22:58:40 +0000] [WARNING] com.twitter.heron.spi.statemgr.SchedulerStateManagerAdaptor: Exception processing future: java.lang.RuntimeException: Failed to fetch data from path: xxxxxxxxxx
[2017-09-29 22:58:40 +0000] [STDERR] stderr: Exception in thread "main"
[2017-09-29 22:58:40 +0000] [STDERR] stderr: java.util.concurrent.ExecutionException: java.lang.NullPointerException
[2017-09-29 22:58:40 +0000] [STDERR] stderr: at java.util.concurrent.FutureTask.report(FutureTask.java:122)
[2017-09-29 22:58:40 +0000] [STDERR] stderr: at java.util.concurrent.FutureTask.get(FutureTask.java:192)
[2017-09-29 22:58:40 +0000] [STDERR] stderr: at com.twitter.heron.healthmgr.HealthManager.main(HealthManager.java:218)
[2017-09-29 22:58:40 +0000] [STDERR] stderr: Caused by: java.lang.NullPointerException
[2017-09-29 22:58:40 +0000] [STDERR] stderr: at com.twitter.heron.healthmgr.common.TopologyProvider.fetchLatestTopology(TopologyProvider.java:67)
[2017-09-29 22:58:40 +0000] [STDERR] stderr: at com.twitter.heron.healthmgr.common.TopologyProvider.get(TopologyProvider.java:60)
[2017-09-29 22:58:40 +0000] [STDERR] stderr: at com.twitter.heron.healthmgr.common.TopologyProvider.getBoltNames(TopologyProvider.java:88)
[2017-09-29 22:58:40 +0000] [STDERR] stderr: at com.twitter.heron.healthmgr.sensors.BackPressureSensor.get(BackPressureSensor.java:66)
[2017-09-29 22:58:40 +0000] [STDERR] stderr: at com.twitter.heron.healthmgr.detectors.BackPressureDetector.detect(BackPressureDetector.java:59)
[2017-09-29 22:58:40 +0000] [STDERR] stderr: at com.microsoft.dhalion.policy.HealthPolicyImpl.lambda$executeDetectors$3(HealthPolicyImpl.java:97)
[2017-09-29 22:58:40 +0000] [STDERR] stderr: at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193)
[2017-09-29 22:58:40 +0000] [STDERR] stderr: at java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1374)
[2017-09-29 22:58:40 +0000] [STDERR] stderr: at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481)
[2017-09-29 22:58:40 +0000] [STDERR] stderr: at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471)
[2017-09-29 22:58:40 +0000] [STDERR] stderr: at java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708)
[2017-09-29 22:58:40 +0000] [STDERR] stderr: at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
[2017-09-29 22:58:40 +0000] [STDERR] stderr: at java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:499)
[2017-09-29 22:58:40 +0000] [STDERR] stderr: at com.microsoft.dhalion.policy.HealthPolicyImpl.executeDetectors(HealthPolicyImpl.java:99)
[2017-09-29 22:58:40 +0000] [STDERR] stderr: at com.microsoft.dhalion.policy.PoliciesExecutor.lambda$start$1(PoliciesExecutor.java:52)
[2017-09-29 22:58:40 +0000] [STDERR] stderr: at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
[2017-09-29 22:58:40 +0000] [STDERR] stderr: at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
[2017-09-29 22:58:40 +0000] [STDERR] stderr: at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
[2017-09-29 22:58:40 +0000] [STDERR] stderr: at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
[2017-09-29 22:58:40 +0000] [STDERR] stderr: at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
[2017-09-29 22:58:40 +0000] [STDERR] stderr: at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
[2017-09-29 22:58:40 +0000] [STDERR] stderr: at java.lang.Thread.run(Thread.java:748)
[2017-09-29 22:58:51 +0000] [INFO] com.twitter.heron.healthmgr.HealthManager: Logging setup done.
[2017-09-29 22:58:51 +0000] [INFO] com.twitter.heron.healthmgr.HealthManager: Static Heron config loaded successfully
heron-executor log:
[2017-09-29 22:58:15 +0000] [INFO]: heron-healthmgr (pid=19572) exited with status 256. command=['xxxxxx/bin/java', '-Xmx1024M', '-XX:+PrintCommandLineFlags', '-verbosegc', '-XX:xxxxxxx', '-Xloggc:log-files/gc.healthmgr.log', '-Djava.net.preferIPv4Stack=true', '-cp', './heron-core/lib/scheduler/*:./heron-core/lib/packing/*:./heron-core/lib/statemgr/*:./heron-core/lib/healthmgr/heron-healthmgr.jar', 'com.twitter.heron.healthmgr.HealthManager',xxxxxxxx]
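One possible direction for the improvement, as a minimal sketch only: wrap the topology fetch in a bounded retry that treats a null result or a runtime exception as "not available yet" instead of letting the NullPointerException reach HealthManager.main. The class name TopologyFetchRetry, the method fetchWithRetry, and its parameters are hypothetical and not part of Heron; the Supplier stands in for whatever call TopologyProvider makes to the state manager.

```java
import java.util.Optional;
import java.util.function.Supplier;

// Hypothetical helper (not part of Heron): retries a fetch that may return
// null or throw while the state manager has not published the data yet.
final class TopologyFetchRetry {

  private TopologyFetchRetry() {
  }

  /**
   * Calls fetcher up to maxAttempts times, sleeping delayMillis between
   * attempts. Returns the first non-null result, or Optional.empty() if
   * every attempt failed or the thread was interrupted.
   */
  static <T> Optional<T> fetchWithRetry(Supplier<T> fetcher, int maxAttempts, long delayMillis) {
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
      try {
        T result = fetcher.get();
        if (result != null) {
          return Optional.of(result);
        }
      } catch (RuntimeException e) {
        // Assumption: a failed fetch means "not ready yet", so fall through and retry.
      }
      if (attempt < maxAttempts) {
        try {
          Thread.sleep(delayMillis);
        } catch (InterruptedException ie) {
          // Restore the interrupt flag and give up instead of retrying further.
          Thread.currentThread().interrupt();
          return Optional.empty();
        }
      }
    }
    return Optional.empty();
  }
}
```

With something like this, a caller such as getBoltNames could skip the current policy cycle when the result is empty, rather than crashing the process and relying on heron-executor to restart it.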