1technophile / OpenMQTTGateway

MQTT gateway for ESP8266 or ESP32 with bidirectional 433mhz/315mhz/868mhz, Infrared communications, BLE, Bluetooth, beacons detection, mi flora, mi jia, LYWSD02, LYWSD03MMC, Mi Scale, TPMS, BBQ thermometer compatibility & LoRa.
https://docs.openmqttgateway.com
GNU General Public License v3.0
3.62k stars 797 forks source link

CORRUPTION in display of SSID and construction of discovery config strings #2012

Closed puterboy closed 2 months ago

puterboy commented 3 months ago

Describe the bug I just upgraded to the latest dev branch and am finding two potentially related errors: Note: The problem is not due to my discovery_prefix patches since I reverted and still see the problems Note: I also noted it on 2 different devices (lilygo-rtl-433 and esp32dev-ble)

  1. In the WebUI, under "Configure WiFi" the WiFi network name appears as gibberish characters -- this persist even if I re-name the network. Note that the stored name must be working since it attaches to the correct network. Screenshot 2024-08-16 165215

  2. Separately, but perhaps related, looking at the discovery topic config lines (via mosquitto_sub), there is gibberish and over-writing in the value of the 'stat_t' key. For example:

    {"stat_t":"+/+/RTL_433toMQTT/LaCrosse-TX141THBv2/1/212","0�Chome/OMG_lilygo_rtl_433_ESP/RTL_433toMQTT/LaCrosse-TX141THBv2/1/212{"model":"LaCrosse-TX141THBv2","id":212,"channel":1,"battery_ok":1,"temperature_C":-18.3,"humidity":78,"test":"No","mic":"CRC","protocol":"LaCrosse TX141-Bv2, TX141TH-Bv2, TX141-Bv3, TX141W, TX145wsdth, (TFA, ORIA) sensor","rssi":-67,"duration":143000}dev_cla":"humidity","unit_of_meas":"%","name":"H

    It as if it the components constructing the json string are pointing to the wrong elements of memory since the resulting json string looks like a combination of a valid discovery topic config keys & values PLUSsome elements of a published data topic PLUS some elements of the config string of another entity for the device -- all mashed together.

Note that it should look something like:

{"stat_t":"+/+/RTL_433toMQTT/LaCrosse-TX141THBv2/1/212","dev_cla":"battery","unit_of_meas":"%","name":"battery","uniq_id":"LaCrosse-TX141THBv2-1-212-battery_ok","val_tpl":"{{ float(value_json.battery_ok) * 99 + 1 | is_defined }}","state_class":"measurement","device":{"ids":["LaCrosse-TX141THBv2-1-212"],"cns":[["mac","LaCrosse-TX141THBv2-1-212"]],"mdl":"LaCrosse-TX141THBv2","name":"LaCrosse-TX141THBv2-1-212","via_device":"OMG_lilygo_rtl_433_ESP"}}

To Reproduce Compile latest dev branch

Expected behavior See above

Screenshots See above

Environment (please complete the following information):

puterboy commented 3 months ago

Interestingly, on looking closer some but not all discovery config messages are corrupted. It's hard to find a pattern of which ones are vs. aren't. It's not device specific.

1technophile commented 3 months ago

Note: I also noted it on 2 different devices (lilygo-rtl-433 and esp32dev-ble)

Both with development version?

puterboy commented 3 months ago

Yes both development

puterboy commented 3 months ago

Let me summarize: Regarding corruption of Wifi Network under the WebUI, the following ALL SHOW CORRUPTION of the display:

(Note my ssid itself has no weird characters -- it is of form myssid-2.4)

Regarding corruption of the discovery config topic (only tested on lilygo-rtl-433 device)

SO seems like potentially 2 separate bugs...

  1. The corruption of the WiFi Network display goes back to at least V1.7.0
  2. The corruption of the discovery topic seems to have happened after v175

The above is all reproducible going back and forth between versions.

puterboy commented 3 months ago

Note the above PR fixes ones of the bugs reported here (I initially thought they were related since they both involved corruption) but now I realize they are separate.

I still haven't figured out why in there is sporadic corruption of the discovery JSON strings in the development branch (but v175 is fine).

puterboy commented 3 months ago

Interestingly, whereas before it would generate discovery topics for temperature, humidity, battery and rssi, now it generates all the humidity ones, some of the rssi ones, a couple of the temperature ones, and none of the battery ones..

What could be causing this???? .

1technophile commented 3 months ago

Memory/concurrency issues due to the number of devices being processed. Try to play with the stack size for RTL_433, it may reduce/remove those.

puterboy commented 3 months ago

Any specific suggestions of what variables to try?

puterboy commented 3 months ago

OK - I was able to fix the problem of corrupted discovery data by reverting one of the recent changes to ZgatewayRTL_433.ino

    //RFrtl_433_ESPdata["origin"] = (char*)topic.c_str();
    //handleJsonEnqueue(RFrtl_433_ESPdata);
    pub(topic.c_str(), RFrtl_433_ESPdata);

Changing it back to:

    RFrtl_433_ESPdata["origin"] = (char*)topic.c_str();
    handleJsonEnqueue(RFrtl_433_ESPdata);

eliminated the corruption of the discovery strings...

HOWEVER reverting this causes the spurious spikes noted in to return https://github.com/1technophile/OpenMQTTGateway/issues/2014

Said another way:

Not sure how I can resolve one bug without causing the other to appear and vice-versa**

Note I can understand why perhaps using pub directly without queueing it could lead to some conflict that would cause corruption of the discovery data. However, it's unclear to me why reverting to json enqueueuing the discovery data would lead to sporadic corruption of the actual MQTT topic messages -- unless perhaps the discovery messages are too long or consume too much of the queue before being cleared (though I would have thought that would only be an issue after restart since discovery messages are supposedly not republished if unchanged).

Would appreciate some help and insight in debugging this further :) Thanks!

puterboy commented 3 months ago

One other thing that may be concerning (but seems to work ok now) is the newly added lines:

char deviceKeyParameter[25];
        memcpy(deviceKeyParameter, &pdevice->uniqueId[strlen(pdevice->uniqueId) - strlen(parameters[i][0])], strlen(parameters[i][0]));
        deviceKeyParameter[strlen(parameters[i][0])] = '\0';
        Log.trace(F("deviceKeyParameter: %s" CR), deviceKeyParameter);

If this is meant to copy the uniqueID, I am concerned that 25 chars may not be enough as I already have some entities with UniqueID names longer than 25 chars, such as Ambientweather-F007TH-2-178-temperature_C (assuming the definition of uniqueId is the same as what is used in HA core_entity.registry). Of course if uniqueID has a different meaning here and is just a short numerical or hex string then disregard this concern.

puterboy commented 3 months ago

The above PR fixes the second of the 2 issues initially listed in this bug report. However, I still can't figure out why the handleJsonEnqueue method causes sporadic corruption in the MQTT topic messages. I even tried add a Mutex to the emptyQueue() routine thinking maybe something was being added to the queue while it was being popped and that didn't help.

Rather than bypassing the queue by just publishing directly, it would be helpful to figure out why it is failing...

puterboy commented 2 months ago

Fixed as per https://github.com/1technophile/OpenMQTTGateway/pull/2034 and https://github.com/1technophile/OpenMQTTGateway/pull/2014