1technophile / OpenMQTTGateway

MQTT gateway for ESP8266 or ESP32 with bidirectional 433mhz/315mhz/868mhz, Infrared communications, BLE, Bluetooth, beacons detection, mi flora, mi jia, LYWSD02, LYWSD03MMC, Mi Scale, TPMS, BBQ thermometer compatibility & LoRa.
https://docs.openmqttgateway.com
GNU General Public License v3.0
3.62k stars 795 forks

OpenMQTT Gateway stuck in 'offline' and 'unavailable' state in Home Assistant [With SOLUTION] #2026

Closed puterboy closed 2 months ago

puterboy commented 2 months ago

This may be more of a HA bug but wanted to describe the issue and post a solution in case anyone else encounters it...

After reflashing my esp32 several times with OMG_lilygo_rtl_433_ESP, the Gateway seemed to get stuck in an "offline" status in HA even though it was otherwise working perfectly fine.

Symptoms included:

HOWEVER, all the non-Gateway intrinsic MQTT sensors continued to work perfectly -- i.e., the Gateway was working and publishing MQTT messages as usual to my Mosquitto broker (both for the Gateway and for the sensors it monitors); only the Gateway sensors themselves showed up as offline in HA.

This occurred even though:

I was finally able to make it show up online again in HA by manually publishing an 'online' message to the topic home/OMG_lilygo_rtl_433_ESP/LWT using MQTT Explorer (one could also use mosquitto_pub or the MQTT publish service in HA).

But I wanted to post my experience here in case anyone else has the problem, since it puzzled me for several hours...

puterboy commented 2 months ago

Weird... Now every time I reboot the esp32 device, it seems to crash and reboot itself every 15-30 seconds, about 5-10 times, until it finally stabilizes. It reboots before publishing any messages except for a solitary 'offline' under the topic "LWT".

Then, when it finally reboots and stabilizes, it remains in the offline state until I manually publish an 'online' message to the LWT topic.

Nothing really changed in the code, my network or HA other than me recompiling and flashing the esp32 gateway a bunch of times...

puterboy commented 2 months ago

The problem seems to be caused by the Mutex semaphore I wrapped around MQTT. Wrapping it around mqtt->publish in pubMQTT works fine -- and indeed fixes the problem of corrupted discovery config messages.

void pubMQTT(const char* topic, const char* payload, bool retainFlag) {
  if (SYSConfig.XtoMQTT && !SYSConfig.offline) {
#ifdef ESP32
    if (xSemaphoreTake(xMqttMutex, pdMS_TO_TICKS(QueueSemaphoreTimeOutTask)) == pdFALSE) {
      Log.error(F("xMqttMutex not taken" CR));
      return;
    }
#endif
    if (mqtt && mqtt->connected()) {
      SendReceiveIndicatorON();
      Log.trace(F("[ OMG->MQTT ] topic: %s msg: %s " CR), topic, payload);
      mqtt->publish(topic, payload, 0, retainFlag);
    } else {
      Log.warning(F("MQTT not connected, aborting the publication" CR));
    }
#ifdef ESP32
    xSemaphoreGive(xMqttMutex);
#endif
  } else {
    Log.notice(F("[ OMG->MQTT deactivated or offline] topic: %s msg: %s " CR), topic, payload);
  }
}
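The take/give pairing above can also be expressed with a scoped guard, so that every exit path releases the lock without a matching give call. The following is only a minimal sketch that runs off-device: std::timed_mutex stands in for the FreeRTOS xMqttMutex, and publishGuarded, mqttMutex and outbox are hypothetical names that do not exist in OpenMQTTGateway.

```cpp
#include <cassert>
#include <chrono>
#include <mutex>
#include <string>
#include <vector>

// Hypothetical stand-ins: mqttMutex plays the role of xMqttMutex and
// outbox plays the role of the connection to the broker.
std::timed_mutex mqttMutex;
std::vector<std::string> outbox;

// Returns true if the message was handed to the "client".
// The timeout mirrors the bounded wait used with xSemaphoreTake above.
bool publishGuarded(const std::string& topic, const std::string& payload,
                    bool connected,
                    std::chrono::milliseconds timeout = std::chrono::milliseconds(50)) {
  // This constructor calls try_lock_for(timeout) on the mutex.
  std::unique_lock<std::timed_mutex> guard(mqttMutex, timeout);
  if (!guard.owns_lock()) {
    return false;  // same situation as the "xMqttMutex not taken" branch
  }
  if (!connected) {
    return false;  // guard's destructor releases the lock on this path too
  }
  outbox.push_back(topic + " " + payload);
  return true;     // lock released when guard goes out of scope
}
```

Calling publishGuarded("some/topic", "online", true) appends one entry to outbox, and the mutex is free again afterwards whichever branch returned.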

However, if I then also similarly wrap mqtt->loop() as follows:

#ifdef ESP32
    if (xSemaphoreTake(xMqttMutex, pdMS_TO_TICKS(QueueSemaphoreTimeOutTask)) == pdTRUE) {
      mqtt->loop();
      xSemaphoreGive(xMqttMutex);
    } else {
      Log.error(F("xMqttMutex not taken" CR));
    }
#else
    mqtt->loop();
#endif

Then it frequently, but not always, crashes repeatedly on boot as follows:

N: ************** Setup OpenMQTTGateway end **************
N: Reconfiguring MQTT client...
N: Connected to broker

assert failed: vTaskPriorityDisinheritAfterTimeout tasks.c:4922 (pxTCB != pxCurrentTCB[xPortGetCoreID()])

Backtrace: 0x40083c6d:0x3ffb2480 0x4008dd1d:0x3ffb24a0 0x40093ac9:0x3ffb24c0 0x400906b7:0x3ffb25f0 0x4008ee9a:0x3ffb2610 0x400d70e5:0x3ffb2650 0x400d7205:0x3ffb2680 0x400e3757:0x3ffb2700 0x400ee9e2:0x3ffb2720 0x400ee9f5:0x3ffb2740 0x400ef626:0x3ffb2760 0x400e9d2f:0x3ffb27d0 0x40137a41:0x3ffb2810

Not sure what is causing this, though presumably there is some race condition that sometimes allows it not to crash :) Once it gets through this, it connects to MQTT and is as rock-stable as it is without the mutex.

Any ideas what could be causing this error? If anything, I would have thought that adding a Mutex semaphore would increase stability rather than decrease it...

Removing the Mutex around mqtt->loop fixes the problem -- but it's not clear why wrapping it causes the crash...

1technophile commented 2 months ago

You can add this to your environment to decode the back trace: monitor_filters = esp32_exception_decoder

puterboy commented 2 months ago

I think I figured out the problem. Basically, the mqtt object has a property that executes handle_autodiscovery(), which in turn calls pubMQTT. So the main thread calls mqtt->loop while holding the xMqttMutex token, and mqtt->loop then calls pubMQTT, which also wants to take the same token, creating an irresolvable conflict.

So, I think I will stick to just protecting pubMQTT which is necessary to avoid discovery topic message corruption.
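The conflict described above can be reproduced off-device with a small sketch. It is only an illustration under stated assumptions: FakeClient, PlainLock, runWithPlainLock and runWithRecursiveLock are hypothetical names, the flag-based PlainLock merely models a non-recursive mutex (on the ESP32 the nested take waits out its timeout and ends in the assert shown earlier rather than failing cleanly), and the recursive variant is an alternative that this thread does not adopt.

```cpp
#include <cassert>
#include <chrono>
#include <functional>
#include <mutex>
#include <string>
#include <vector>

// Hypothetical stand-in for the MQTT client: loop() may fire a callback
// (here: discovery) that publishes while loop() is still running.
struct FakeClient {
  std::function<void()> onConnect;
  std::vector<std::string> sent;  // messages that reached the "broker"
  void loop() { if (onConnect) onConnect(); }
  void publish(const std::string& topic) { sent.push_back(topic); }
};

// Non-recursive lock modelled as a flag, so a nested take from the same
// thread fails cleanly instead of being undefined behaviour.
struct PlainLock {
  bool held = false;
  bool take() { if (held) return false; held = true; return true; }
  void give() { assert(held); held = false; }
};

// loop() and publish share one non-recursive lock: the nested take fails.
int runWithPlainLock() {
  FakeClient client;
  PlainLock lock;
  auto pub = [&](const std::string& topic) {
    if (!lock.take()) return;  // already held by loop(): message is dropped
    client.publish(topic);
    lock.give();
  };
  client.onConnect = [&] { pub("discovery/config"); };
  if (lock.take()) { client.loop(); lock.give(); }
  return static_cast<int>(client.sent.size());
}

// Same call chain with a recursive mutex: the owning thread may re-take it.
int runWithRecursiveLock() {
  using Lock = std::recursive_timed_mutex;
  FakeClient client;
  Lock lock;
  auto pub = [&](const std::string& topic) {
    std::unique_lock<Lock> guard(lock, std::chrono::milliseconds(10));
    if (!guard.owns_lock()) return;
    client.publish(topic);
  };
  client.onConnect = [&] { pub("discovery/config"); };
  std::unique_lock<Lock> guard(lock, std::chrono::milliseconds(10));
  if (guard.owns_lock()) client.loop();
  return static_cast<int>(client.sent.size());
}
```

In this model runWithPlainLock() delivers no message because the publish path cannot take a lock its own caller holds, while runWithRecursiveLock() delivers the discovery message. FreeRTOS offers recursive mutexes through xSemaphoreCreateRecursiveMutex, xSemaphoreTakeRecursive and xSemaphoreGiveRecursive, but protecting only pubMQTT, as chosen here, avoids the nesting altogether.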

And I will consider this closed :)

puterboy commented 2 months ago

Please see PR to make this fix: https://github.com/1technophile/OpenMQTTGateway/pull/2034

puterboy commented 2 months ago

Completed