ct-Open-Source / Basecamp

An Arduino library to ease the use of the ESP32 in IoT projects
GNU General Public License v3.0

Frequent MQTT posts, stopped MQTT connection, and task watchdog crashes on CPU0 #76

Open tobiasisenberg opened 5 years ago

tobiasisenberg commented 5 years ago

Hi all,

After about a year of tinkering, I think I have now figured out a major issue, not just with Basecamp but with the ESP and MQTT in general. If not fixed, this issue repeatedly crashes the ESP32 and prevents 24/7 use. Here are my observations (and the fix):

I run my ESP32 with Basecamp continuously, and it crashed at random times (2 to 4 times a week). It did not reboot but got stuck in a never-ending watchdog feedback loop that could only be stopped by power-cycling the device. Specifically, this is about the task watchdog, which complains about a blocked CPU0. This is strange because user programs run on CPU1, as far as I understand. I tried everything that is usually recommended in such situations, such as adding delays in the loop function; see https://github.com/espressif/arduino-esp32/issues/922 for more information. These changes, however, did not fix the problem.

At some point I suspected that posting to MQTT while the MQTT connection was down was the problem, and indeed, by preventing such postings I could fix some of the crashes on one device. However, a second ESP32 to which I applied the same fix still crashed.

A few days ago I upgraded to the new ESP32 board libraries (1.0.1) and then saw in my debug log that the MQTT connection was frequently being reset. I am not sure whether the crashes disappeared. They can take hours to days to occur, and the new libraries changed the watchdog behavior, so they may or may not be gone. I did not test for long, though, because this observation led me to the actual problem: posting several messages to MQTT in rapid succession (e.g., because several sensor readings needed to be communicated to MQTT).

So I ultimately fixed the issue by implementing the following local mqttPublish function, which I call instead of calling iot.mqtt.publish directly:

// Publish only while MQTT is connected, and pause briefly after each message.
// Returns whatever iot.mqtt.publish returns, or 0 if MQTT was offline and nothing was sent.
uint16_t mqttPublish(const char* topic, uint8_t qos, bool retain, const char* payload = nullptr, size_t length = 0, bool dup = false, uint16_t message_id = 0) {
  if (!iot.mqtt.connected()) {
    DEBUG_PRINTLN("Tried to publish >" + String(payload) + "< to >" + String(topic) + "<. But MQTT is offline. Message lost.");
    return 0;
  }
  uint16_t packetId = iot.mqtt.publish(topic, qos, retain, payload, length, dup, message_id);
  delay(100); // it is essential to wait a bit after each message
  return packetId;
}

Adding this 0.1 s delay (I am not sure what the minimum delay would be) ultimately solved the problem for me; I have not seen a crash since. The watchdog issue on CPU0 now also makes sense: the too-frequent MQTT calls caused parallel WiFi activity on CPU0, which probably blocked it and led to the problems.
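Since the minimum safe delay is unknown, it may help to keep the gap between messages in one adjustable place. The following is only a sketch of that idea, not part of Basecamp: PublishPacer, remainingMs, and markSent are hypothetical names, and the helper works purely on timestamps that the caller supplies (for example from millis()).

```cpp
#include <cstdint>

// Hypothetical helper (not part of Basecamp): remembers when the last message
// was published and reports how long to wait before the next one may go out.
class PublishPacer {
 public:
  explicit PublishPacer(uint32_t minGapMs) : minGapMs_(minGapMs) {}

  // Milliseconds still to wait at time nowMs before publishing (0 = send now).
  uint32_t remainingMs(uint32_t nowMs) const {
    if (!sentBefore_) return 0;              // nothing sent yet, no need to wait
    uint32_t elapsed = nowMs - lastSentMs_;  // unsigned math survives timer rollover
    return elapsed >= minGapMs_ ? 0 : minGapMs_ - elapsed;
  }

  // Record that a message was published at time nowMs.
  void markSent(uint32_t nowMs) {
    lastSentMs_ = nowMs;
    sentBefore_ = true;
  }

 private:
  uint32_t minGapMs_;
  uint32_t lastSentMs_ = 0;
  bool sentBefore_ = false;
};
```

On the ESP32 this could be used as delay(pacer.remainingMs(millis())) before the publish call and pacer.markSent(millis()) right after it. Note that this only spaces consecutive messages; unlike the mqttPublish function above, it adds no pause after the last message.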

So if some of you have similar issues, I would recommend using such a function as well. This also helps with devices that only do a few things and then sleep: without the delay, they can cause issues too if multiple messages are sent to MQTT. Depending on the use case, I sometimes use a different mqttPublish function that, if MQTT is not connected, first waits a bit and tries again, but the gist is the same. Also note that you should not call this local function from your MQTT onConnect callback; that can lead to crashes as well.
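The retry variant mentioned above is not shown in the post, so here is one possible shape for it, written as a sketch under assumptions. The name mqttPublishWithRetry, the attempt count, and both delay values are made up for illustration, and waitMs is a stand-in for Arduino's delay() so the code also compiles off-device. The client is passed in as a template parameter and only needs connected() and publish(...) as used with iot.mqtt above.

```cpp
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <thread>

// Stand-in for Arduino's delay(); on the ESP32, call delay(ms) here instead.
inline void waitMs(uint32_t ms) {
  std::this_thread::sleep_for(std::chrono::milliseconds(ms));
}

// Hypothetical retry variant: if the client is offline, wait and check again
// up to maxAttempts times before giving up. Returns the value from publish(),
// or 0 if the client never came online and the message was dropped.
template <typename Client>
uint16_t mqttPublishWithRetry(Client& mqtt, const char* topic, uint8_t qos,
                              bool retain, const char* payload, size_t length = 0,
                              uint8_t maxAttempts = 5,       // assumed value
                              uint32_t retryDelayMs = 500,   // assumed value
                              uint32_t postDelayMs = 100) {  // pause from the post
  for (uint8_t attempt = 0; attempt < maxAttempts; ++attempt) {
    if (mqtt.connected()) {
      uint16_t packetId = mqtt.publish(topic, qos, retain, payload, length);
      waitMs(postDelayMs);  // same pause after each message as in mqttPublish
      return packetId;
    }
    if (attempt + 1 < maxAttempts) {
      waitMs(retryDelayMs);  // offline: give the connection time to come back
    }
  }
  return 0;  // still offline after all attempts
}
```

Because this function blocks while it waits, the warning about the onConnect callback applies here too. One option is to set a flag in the callback and do the actual publishing from loop().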