EnMasseProject / enmasse

EnMasse - Self-service messaging on Kubernetes and OpenShift
https://enmasseproject.github.io
Apache License 2.0

Slow Broker start-up suspected causing test failure in multi-node environment #2979

Open k-wall opened 5 years ago

k-wall commented 5 years ago

Jenkins test runs of an EnMasse 0.28 derivative were failing. A common theme is queue addresses in the standard address space failing to report ready, with the broker reported as outstanding.

The event logs show that the Broker's liveness probe was failing, and for this reason the Broker was being restarted. Looking at the Broker pod log, I see that the Broker failed to reach the point where Jolokia was bound before the liveness probe kicked in at 2 minutes.

I notice that there are two significant areas of slowness prior to that:

Between these two log statements:

2019-07-08T22:28:03.104Z INFO  [PluginContextListener] Initialized artemis-plugin plugin
2019-07-08T22:29:20.080Z INFO  [ConfigManager] Configuration will be discovered via system properties

And before the ProxyWhiteList statement:

2019-07-08T22:29:27.912Z INFO  [JolokiaConfiguredAgentServlet] Jolokia overridden property: [key=policyLocation, value=file:/var/run/artemis/split-1//broker//etc/jolokia-access.xml]
2019-07-08T22:29:30.597Z INFO  [RBACMBeanInvoker] Using MBean [hawtio:type=security,area=jmx,rank=0,name=HawtioDummyJMXSecurity] for role based access control
2019-07-08T22:29:52.987Z INFO  [ProxyWhitelist] Initial proxy whitelist: [localhost, 127.0.0.1, 10.131.0.52, broker.l5sys5hykn-56f58cf888-w7vcs]

And finally a shutdown without reaching ready:

2019-07-08T22:29:59.388Z INFO [server] AMQ221002: Apache ActiveMQ Artemis Message Broker version 2.7.0.redhat-00056 [6757ac3d-a1cc-11e9-b033-0a580a830034] stopped, uptime 2 minutes
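
To make the slow spots easier to see, here is a minimal sketch that computes the gap between consecutive log timestamps and reports the ones above a threshold. The timestamps and logger names are copied from the excerpts above, with the message text shortened. The function names and the 10 second threshold are my own choices for illustration, not anything from EnMasse or Artemis.

```python
from datetime import datetime

# Timestamps and logger names copied from the broker pod log quoted above.
# Message text is shortened; only the leading timestamp is parsed.
LOG_LINES = [
    "2019-07-08T22:28:03.104Z INFO  [PluginContextListener] Initialized artemis-plugin plugin",
    "2019-07-08T22:29:20.080Z INFO  [ConfigManager] Configuration will be discovered via system properties",
    "2019-07-08T22:29:27.912Z INFO  [JolokiaConfiguredAgentServlet] Jolokia overridden property",
    "2019-07-08T22:29:30.597Z INFO  [RBACMBeanInvoker] Using MBean for role based access control",
    "2019-07-08T22:29:52.987Z INFO  [ProxyWhitelist] Initial proxy whitelist",
    "2019-07-08T22:29:59.388Z INFO [server] AMQ221002: stopped, uptime 2 minutes",
]


def parse_timestamp(line):
    """Parse the leading ISO-8601 UTC timestamp of a log line."""
    stamp = line.split(" ", 1)[0]
    return datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%S.%fZ")


def find_gaps(lines, threshold_seconds=10.0):
    """Return (gap_seconds, previous_line, current_line) for each slow step.

    The threshold is an arbitrary value chosen for illustration.
    """
    gaps = []
    for previous, current in zip(lines, lines[1:]):
        delta = parse_timestamp(current) - parse_timestamp(previous)
        if delta.total_seconds() >= threshold_seconds:
            gaps.append((delta.total_seconds(), previous, current))
    return gaps


for seconds, previous, current in find_gaps(LOG_LINES):
    print(f"{seconds:.3f}s between:")
    print(f"  {previous}")
    print(f"  {current}")
```

With the lines above, this flags two gaps: roughly 77 seconds before the ConfigManager statement and roughly 22 seconds before the ProxyWhitelist statement. Together they account for most of the 2 minutes of uptime.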

k-wall commented 5 years ago

This issue could be indicative of a slow environment or a cluster with issues. Indeed, we noticed:

Warning CheckLimitsForResolvConf 2h (x19 over 2h) kubelet, internal-node1 Resolv.conf file '/etc/resolv.conf' contains search line consisting of more than 3 domains!

That warning suggests a slow reverse DNS problem. I notice that Hawtio 2.5.0 added a system property, hawtio.localAddressProbing=false, that can be used to turn off the probing of local addresses. However, that feature is not available in the version of Hawtio used by Artemis 2.9.0.
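
One way to check the reverse DNS theory would be to time a reverse lookup from inside the affected pod. The sketch below is only an approximation of that idea: it uses Python's standard `socket.gethostbyaddr`, not the Java code path that Hawtio itself runs, and the helper name `time_reverse_lookup` is hypothetical.

```python
import socket
import time


def time_reverse_lookup(address):
    """Time a reverse DNS lookup for one IP address.

    Returns (hostname_or_None, elapsed_seconds). A failed lookup is
    reported as None so that slow failures can still be measured.
    """
    start = time.perf_counter()
    try:
        hostname, _aliases, _addresses = socket.gethostbyaddr(address)
    except OSError:
        # socket.herror and socket.gaierror are both subclasses of OSError.
        hostname = None
    elapsed = time.perf_counter() - start
    return hostname, elapsed


def report(addresses):
    """Print one line per address with the lookup result and duration."""
    for address in addresses:
        hostname, elapsed = time_reverse_lookup(address)
        print(f"{address}: {hostname!r} in {elapsed:.3f}s")
```

For example, calling `report(["127.0.0.1", "10.131.0.52"])` on the node or in the pod would show whether resolving the addresses from the proxy whitelist takes seconds rather than milliseconds. Lookups that each take many seconds would be consistent with the 22 second delay before the ProxyWhitelist statement, though this alone would not prove the cause.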