bertmelis / espMqttClient

MQTT 3.1.1 client library for the Espressif devices ESP8266 and ESP32 on the Arduino framework.
https://www.emelis.net/espMqttClient/
MIT License
100 stars 21 forks source link

Reconnect does not work if the broker is offline for a longer time (>1,5h) #102

Closed tbnobody closed 1 year ago

tbnobody commented 1 year ago

I currently face the problem, that if the broker goes offline the library tries to reconnect for about 1,5h (a second test took about 5 hours until it stopped) and then it stops reconnecting. It seems that the disconnect event is missing. I've implemented it as in your previous examples using a timer in the disconnect event. (Please see code samples below)

If I stop the machine with the broker I get the following output (as expected):

21:40:47.157 > Disconnected from MQTT.
21:40:47.159 > Disconnect reason:TCP_DISCONNECTED
21:40:48.509 > Fetch inverter: 116480145268
21:40:48.530 > Request SystemConfigPara
21:40:49.159 > Connecting to MQTT...
21:40:52.174 > Disconnected from MQTT.
21:40:52.176 > Disconnect reason:TCP_DISCONNECTED
21:40:54.176 > Connecting to MQTT...
21:40:57.186 > Disconnected from MQTT.
21:40:57.189 > Disconnect reason:TCP_DISCONNECTED
21:40:59.188 > Connecting to MQTT...
21:41:02.202 > Disconnected from MQTT.
21:41:02.204 > Disconnect reason:TCP_DISCONNECTED

This repeats again and again... If I start the broker quick enough (lets say after 10minutes) everything works fine again. For testing purposes I just let the broker offline and got the following behavior:

22:15:49.367 > Connecting to MQTT...
22:15:52.377 > Disconnected from MQTT.
22:15:52.398 > Disconnect reason:TCP_DISCONNECTED
22:15:54.380 > Connecting to MQTT...
22:15:56.580 > Request SystemConfigPara
22:15:57.390 > Disconnected from MQTT.
22:15:57.393 > Disconnect reason:TCP_DISCONNECTED
22:15:59.392 > Connecting to MQTT...
22:16:02.402 > Disconnected from MQTT.
22:16:02.405 > Disconnect reason:TCP_DISCONNECTED
22:16:04.404 > Connecting to MQTT...

Here the output just stops... no more reconnect attempts. The connection state returns false: https://github.com/tbnobody/OpenDTU/blob/9b0d2ff25f48fd39155894351d4069fba5900efa/src/MqttSettings.cpp#L150

Which platform, esp8266 or esp32? ESP32 Do you use TLS or not? NO TLS Do you use an IDE (Arduino, Platformio...)? PlatformIO Which version of the Arduino framework? espressif32@6.3.1

Example code Connect: https://github.com/tbnobody/OpenDTU/blob/9b0d2ff25f48fd39155894351d4069fba5900efa/src/MqttSettings.cpp#L127 Disconnect event: https://github.com/tbnobody/OpenDTU/blob/9b0d2ff25f48fd39155894351d4069fba5900efa/src/MqttSettings.cpp#L52-L81

Other Remarks Not sure why you removed the example with the timer in the disconnect event. I could also add a workaround to check wether the connection state is disconnected and the last connect attempt is older then e.g. 1 minute. I am just not sure what should be the right behavior.

bertmelis commented 1 year ago

I don't know what could cause this. I have limited internet access at the moment so can't do much now.

One thing you can do to improve your code though is to do something with the return value of the connect-call. It returns true is successful of false if not. I don't think this wil solve the issue though.

I'll have a look when I'm back at my computer but I'm very busy with something else now.

bertmelis commented 1 year ago

Looking back at the connect-code, it can be improved though:

tbnobody commented 1 year ago

I'll have a look when I'm back at my computer but I'm very busy with something else now.

Not a problem. Take your time!

I will also try to generate a sample code with much less application code around to prevent any other issues in my code.

proddy commented 1 year ago

I can also help out here. @MichaelDvP ported our mqtt code for the EMS-ESP project from asyncmqttclient to espMqttClient and so far it's been great. We're still stress testing but when it goes into the next stable build of our project it will be used by thousands of end users, so a good source of user testing :)

MichaelDvP commented 1 year ago

@tbnobody You reconnect (line 79) only if you get the onDisconnect callback, but you only call connect() without checkiing if it connects. if connect() fails you ran into this situation. I've mainly used the code from examples and poll connection state in the loop every 2 seconds. Try to trigger the reconnect also in line 127 to

if (!mqttClient->connect()) {
    mqttReconnectTimer.once( 2, +[](MqttSettingsClass* instance) { instance->performConnect(); }, this);
}

But i think polling the state is better.

tbnobody commented 1 year ago

Thank you for this hint! Will doublecheck later!

My current assumption was, that in case of a failure I will receive a onDisconnect event. If this is not the case, then indeed the polling method (or checking the return value of the connect method) is required.

MichaelDvP commented 1 year ago

The connect() tries to send out a connect packet, if the packet is created it returns true, state goes to "connectingTCP1" and you'll get the the onDisconnect if the connAck is missing. But if connect packet creation failed state is still "disconnected" and connect() returns false. I think this happens to you. See line103-131

bertmelis commented 1 year ago

That would indicate a memory issue though. And in that case leaks are the first thing to wacht out for. (I valgrinded this lib but not each and every scenario)

tbnobody commented 1 year ago

I've just added some debug output in case that the connect method returns false. Now I am waiting... I am courious if this is the cause.

bertmelis commented 1 year ago

The underlying WiFiClient doesn't cleanup immediately after failing/closing so memory could become an issue. Eventually it recovers though.

tbnobody commented 1 year ago

Just a small response of my latest tests... It took ~8h before the reconnect stopped working. The connect() method did NOT return false on the last attempt. My next step would be two write a smaller example and try to reproduce it there.

bertmelis commented 1 year ago
  • the return value has to be initialized to false and set to true only when successful (bug)
  • state has to be set before resuming the task (just-in-case-improvement)

Could you test with these changes? I can't prepare a PR or branch myself. Still away from computer.

tbnobody commented 1 year ago

I have just applied this patch:

diff --git a/src/MqttClient.cpp b/src/MqttClient.cpp
index bae592e..256fc2f 100644
--- a/src/MqttClient.cpp
+++ b/src/MqttClient.cpp
@@ -101,7 +101,7 @@ bool MqttClient::disconnected() const {
 }

 bool MqttClient::connect() {
-  bool result = true;
+  bool result = false;
   if (_state == State::disconnected) {
     EMC_SEMAPHORE_TAKE();
     if (_addPacketFront(_cleanSession,
@@ -114,17 +114,17 @@ bool MqttClient::connect() {
                         _willPayloadLength,
                         (uint16_t)(_keepAlive / 1000),  // 32b to 16b doesn't overflow because it comes from 16b orignally
                         _clientId)) {
+      result = true;
+      _state = State::connectingTcp1;
       #if defined(ARDUINO_ARCH_ESP32)
       if (_useInternalTask == espMqttClientTypes::UseInternalTask::YES) {
         vTaskResume(_taskHandle);
       }
       #endif
-      _state = State::connectingTcp1;
     } else {
       EMC_SEMAPHORE_GIVE();
       emc_log_e("Could not create CONNECT packet");
       _onError(0, Error::OUT_OF_MEMORY);
-      result = false;
     }
     EMC_SEMAPHORE_GIVE();
   }

Now waiting for something to happen (or not) :-)

tbnobody commented 1 year ago

The reconnect is working since ~24h with this patch. IMHO, the return value does not have any effect on this issue. It should be related to the reordering of the vTaskResume and _state = State::connectingTcp1; But in any case, it seems to solve the problem!

MichaelDvP commented 1 year ago

That explains why we don't see this issue, we use UseInternalTask::NO (to save memory), so the state is always set before the loop checks.

bertmelis commented 1 year ago

The reconnect is working since ~24h with this patch. IMHO, the return value does not have any effect on this issue. It should be related to the reordering of the vTaskResume and _state = State::connectingTcp1; But in any case, it seems to solve the problem!

You're welcome to create a PR. 😉

tbnobody commented 1 year ago

~40hours working without problems... I will create two PR's as the change of the return value is theoretical a breaking change as it changes the behavior of the connect method when it is called again while the state is still not disconnected

bertmelis commented 1 year ago

You're right about the breaking change. It was unintended though.

bertmelis commented 1 year ago

I will merge both and create a new release as soon as I'm back at my computer.

bertmelis commented 1 year ago

Fixed in https://github.com/bertmelis/espMqttClient/releases/tag/v1.4.4

Please reopen if the fix doesn't hold.

GnomiBerlin commented 1 year ago

Hi, I had the same observation due to an instable WiFi connection to an ESP32-CAM module. http Server was always reachable after WiFi reconnects, but the MQTT stops after some time.

Then I changed to this 1.4.4. version, but still have this problem. In the http page I added two links for ESP.restart() and for mqttClient.disconnect(). Both are working. But after some time, when MQTT is dead, the mqttClient.disconnect() is not helping.

I want not to open this issue as it is may be another reason, but MQTT is not stable.

Environment:

Dependency Graph
|-- espMqttClient @ 1.4.4
|-- AsyncElegantOTA @ 2.2.7
|-- AsyncTCP-esphome @ 2.0.0
|-- ESPAsyncWebServer-esphome @ 3.0.0
|-- SPIFFS @ 2.0.0
|-- AsyncTCP @ 1.1.1
|-- Ticker @ 2.0.0
|-- WiFi @ 2.0.0

Code snippets:

void onMqttDisconnect(espMqttClientTypes::DisconnectReason reason) {
  #ifdef DEBUG
  Serial.printf("Disconnected from MQTT: %u.\n", static_cast<uint8_t>(reason));
  #endif
  if (WiFi.isConnected()) {
    reconnectMqtt = true;
    lastReconnect = millis();
  }
}
// Reconnect MQTT  via Webinterface using disconnect() and auto-reconnect from async MQTT
  httpserver.on("/MQTTrestart", HTTP_GET, [](AsyncWebServerRequest *request)
    {
    String htmlBody = htmlHead_pre_b + htmlHead + "<h1>MQTT restarting ...</h1>";
    request->send(200, "text/html", htmlBody);
    delay(500);
    if (!mqttClient.disconnect()) {
    #ifdef DEBUG
    Serial.println("Disconnecting failed.");
    #endif
    } else {
    #ifdef DEBUG
    Serial.println("Disconnection done.");
    #endif
    } 
    });

Debugs: After ESP32 restart:

Connecting to MQTT...
Connection done.
Connected to MQTT broker: 192.168.xxx.xxx, port: 1883
************************************************
Session present: 0
Last Will -online- setting- QoS 1/retain false, packetId: 1
Subscribing at QoS 1, packetId: 2
Subscribing at QoS 1, packetId: 3
Subscribing at QoS 1, packetId: 4
Subscribing at QoS 1, packetId: 5
Subscribing at QoS 1, packetId: 6
Subscribing at QoS 1, packetId: 7
Subscribing at QoS 1, packetId: 8
Subscribing at QoS 1, packetId: 9

MQTT Publish succesful: 10
Publish acknowledged.
  packetId: 1
Subscribe acknowledged.
  packetId: 2
  qos: 1
Subscribe acknowledged.
  packetId: 3
  qos: 1
Subscribe acknowledged.
  packetId: 4
  qos: 1
Subscribe acknowledged.
  packetId: 5
  qos: 1
Subscribe acknowledged.
  packetId: 6
  qos: 1
Subscribe acknowledged.
  packetId: 7
  qos: 1
Subscribe acknowledged.
  packetId: 8
  qos: 1
Subscribe acknowledged.
  packetId: 9
  qos: 1
Publish acknowledged.
  packetId: 10

After mqttClient.disconnect():

Disconnection done.
Disconnected from MQTT: 0.
Connecting to MQTT...
Connection done.
Connected to MQTT broker: 192.168.xxx.xxx, port: 1883
************************************************
Session present: 0
Last Will -online- setting- QoS 1/retain false, packetId: 11
Subscribing at QoS 1, packetId: 12
Subscribing at QoS 1, packetId: 13
Subscribing at QoS 1, packetId: 14
Subscribing at QoS 1, packetId: 15
Subscribing at QoS 1, packetId: 16
Subscribing at QoS 1, packetId: 17
Subscribing at QoS 1, packetId: 18
Subscribing at QoS 1, packetId: 19

MQTT Publish succesful: 20
Publish acknowledged.
  packetId: 11
Subscribe acknowledged.
  packetId: 12
  qos: 1
Subscribe acknowledged.
  packetId: 13
  qos: 1
Subscribe acknowledged.
  packetId: 14
  qos: 1
Subscribe acknowledged.
  packetId: 15
  qos: 1
Subscribe acknowledged.
  packetId: 16
  qos: 1
Subscribe acknowledged.
  packetId: 17
  qos: 1
Subscribe acknowledged.
  packetId: 18
  qos: 1
Subscribe acknowledged.
  packetId: 19
  qos: 1
Publish acknowledged.
  packetId: 20