eclipse / kapua

Eclipse Public License 2.0
222 stars 160 forks source link

Broker denies new connections #3143

Open pintify opened 3 years ago

pintify commented 3 years ago

Describe the bug Kura devices are unable to open new connections with the broker (some of the connections remains while they are not restarted).

The performance of the machine is on average values. Restarting Kapua restores the functionality and wipes the issue.

To Reproduce It is produced every several hours after restarting the running instance of Kapua but it has not been reproduced in a different device.

Version of Kapua 1.2.4

Type of deployment [ ] Local Vagrant deployment [x] Docker [ ] Openshift (in its variants) [ ] Others

Main component affected [ ] Console (in case of console please report info on which browser you encountered the problem) [ ] REST API [x] Message Broker [ ] - Others

Additional context

Already discussed in gitter.

Logs from broker:

10:25:49.168 [ActiveMQ BrokerService[message-broker] Task-1496] ERROR o.a.a.broker.TransportConnector - Could not accept connection from null : {}
java.lang.IllegalStateException: Timer already cancelled.
    at java.util.Timer.sched(Timer.java:397)
    at java.util.Timer.schedule(Timer.java:248)
    at org.apache.activemq.transport.mqtt.MQTTInactivityMonitor.startConnectChecker(MQTTInactivityMonitor.java:239)
    at org.apache.activemq.transport.mqtt.MQTTTransportFilter.start(MQTTTransportFilter.java:155)
    at org.apache.activemq.transport.mqtt.MQTTInactivityMonitor.start(MQTTInactivityMonitor.java:148)
    at org.apache.activemq.transport.TransportFilter.start(TransportFilter.java:64)
    at org.apache.activemq.broker.TransportConnection.start(TransportConnection.java:1071)
    at org.apache.activemq.broker.TransportConnector$1$1.run(TransportConnector.java:218)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

Logs from Kura:

2020-11-17T10:45:50,180 [qtp2089810402-96] INFO  o.e.k.c.d.t.m.MqttDataTransport - Creating a new client instance
2020-11-17T10:45:50,180 [qtp2089810402-96] INFO  o.e.k.c.d.t.m.MqttDataTransport - Using memory persistence for in-flight messages
2020-11-17T10:45:50,181 [qtp2089810402-96] INFO  o.e.k.c.d.t.m.MqttDataTransport - # ------------------------------------------------------------
2020-11-17T10:45:50,181 [qtp2089810402-96] INFO  o.e.k.c.d.t.m.MqttDataTransport - #  Connection Properties
2020-11-17T10:45:50,181 [qtp2089810402-96] INFO  o.e.k.c.d.t.m.MqttDataTransport - #  broker    = ssl://mybroker:8883
2020-11-17T10:45:50,181 [qtp2089810402-96] INFO  o.e.k.c.d.t.m.MqttDataTransport - #  clientId  = Device
2020-11-17T10:45:50,181 [qtp2089810402-96] INFO  o.e.k.c.d.t.m.MqttDataTransport - #  username  = username
2020-11-17T10:45:50,181 [qtp2089810402-96] INFO  o.e.k.c.d.t.m.MqttDataTransport - #  password  = XXXXXXXXXXXXXX
2020-11-17T10:45:50,181 [qtp2089810402-96] INFO  o.e.k.c.d.t.m.MqttDataTransport - #  keepAlive = 30
2020-11-17T10:45:50,181 [qtp2089810402-96] INFO  o.e.k.c.d.t.m.MqttDataTransport - #  timeout   = 20
2020-11-17T10:45:50,182 [qtp2089810402-96] INFO  o.e.k.c.d.t.m.MqttDataTransport - #  cleanSession    = true
2020-11-17T10:45:50,182 [qtp2089810402-96] INFO  o.e.k.c.d.t.m.MqttDataTransport - #  MQTT version    = 3.1.1
2020-11-17T10:45:50,182 [qtp2089810402-96] INFO  o.e.k.c.d.t.m.MqttDataTransport - #  willDestination = $EDC/Account/Device/MQTT/LWT
2020-11-17T10:45:50,182 [qtp2089810402-96] INFO  o.e.k.c.d.t.m.MqttDataTransport - #  willMessage     = 
2020-11-17T10:45:50,182 [qtp2089810402-96] INFO  o.e.k.c.d.t.m.MqttDataTransport - #
2020-11-17T10:45:50,182 [qtp2089810402-96] INFO  o.e.k.c.d.t.m.MqttDataTransport - #  Connecting...
2020-11-17T10:45:50,182 [pool-25-thread-1] INFO  o.e.k.c.s.r.LogStatusRunnable - Notification LED fast blinking
2020-11-17T10:45:50,262 [MQTT Con: Morla] INFO  o.e.k.c.s.SSLSocketFactoryWrapper - SSL Endpoint Identification enabled.
2020-11-17T10:46:10,334 [qtp2089810402-96] WARN  o.e.k.c.d.t.m.MqttDataTransport - xxxxx  Connect failed. Forcing disconnect. xxxxx
2020-11-17T10:46:10,334 [qtp2089810402-96] INFO  o.e.k.c.d.t.m.MqttDataTransport - Closing client...
2020-11-17T10:46:10,345 [qtp2089810402-96] INFO  o.e.k.c.d.t.m.MqttDataTransport - Closed
2020-11-17T10:46:10,345 [pool-25-thread-1] INFO  o.e.k.c.s.r.LogStatusRunnable - Notification LED slow blinking
2020-11-17T10:46:10,345 [qtp2089810402-96] WARN  o.e.k.w.s.GwtNetworkServiceImpl - Error connecting
org.eclipse.kura.KuraConnectException: "Connection failed. Cannot connect"
    at org.eclipse.kura.core.data.transport.mqtt.MqttDataTransport.connect(MqttDataTransport.java:337)
    at org.eclipse.kura.core.data.DataServiceImpl.connect(DataServiceImpl.java:489)
    at org.eclipse.kura.web.server.GwtStatusServiceImpl.connectDataService(GwtStatusServiceImpl.java:103)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
Caused by: MqttException (0) - java.net.SocketTimeoutException: Read timed out
    at org.eclipse.paho.client.mqttv3.internal.ExceptionHelper.createMqttException(ExceptionHelper.java:38)
    at org.eclipse.paho.client.mqttv3.internal.ClientComms$ConnectBG.run(ClientComms.java:736)
    ... 1 more
Caused by: java.net.SocketTimeoutException: Read timed out
    at java.net.SocketInputStream.socketRead0(Native Method)
    at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
    at java.net.SocketInputStream.read(SocketInputStream.java:171)
    at java.net.SocketInputStream.read(SocketInputStream.java:141)
pintify commented 3 years ago

It has just been reduced significantly the number of data entering Kapua in order to solve the issue.

pintify commented 3 years ago

Reducing the streaming of data has solved the issue apparently. What would be required to increase again the quantity of data. The resources used in the system doesn't seem to be saturated during the issue.

pintify commented 3 years ago

The service failed soon after the previous comment. Besides, it has been detected that some new devices are not detected correctly. Even after a restart, the new devices are only classified as Device Connections. In order to get them detected, the platform must be restarted again.

pintify commented 3 years ago

After some weeks with a periodic failure each day and daily reboots of the platform, we made a clean reinstallation removing all database data (from both H2 and ElasticSearch) and the issue seems to be solved now.

Regarding the new deployment, the pending data from the devices seems to be uploading slowly and some devices required some hours to be properly connected, possibly because of the messages accumulation. Could this be the cause of the original problem? In that case, how could it be solved?