1technophile / OpenMQTTGateway

MQTT gateway for ESP8266 or ESP32 with bidirectional 433mhz/315mhz/868mhz, Infrared communications, BLE, Bluetooth, beacons detection, mi flora, mi jia, LYWSD02, LYWSD03MMC, Mi Scale, TPMS, BBQ thermometer compatibility & LoRa.
https://docs.openmqttgateway.com
GNU General Public License v3.0
3.59k stars 790 forks source link

Upgraded Lilygo Lora T3 v1.6.1 to OMG 1.7 from 1.6 and it is repeatedly rebooting #1836

Closed vincenpt closed 5 months ago

vincenpt commented 10 months ago

Before submitting a problem please check the troubleshooting section https://docs.openmqttgateway.com/upload/troubleshoot.html

Describe the bug I have several Lilygo LORA t3 v1.6.1 boards that are listening on the 433 mhz frequency. Each of these boards had v1.6 of OMG installed on them and was working fine.

Today I saw there was an update available for OMG to 1.7 so I used the built in WebUI to initiate an upgrade to v1.7. Immediately after the upgrade I noticed the boards were repeatedly rebooting.

The screen reports:

I took one of the malfunctioning devices, unplugged it, let it sit for a few minutes and plugged it back in and it exhibited the same behavior.

I then connected it via USB to my laptop and reflashed it with 1.7 via Chrome (checking the checkbox to erase the flash). I was able to connect to the self hosted wifi connection & setup my Wifi & MQTT settings, but once I saved the settings the device started the same boot loop again.

During one of the boot cycles I was able to grab the console log via the web interface:

N: OpenMQTTGateway URL: http://192.168.1.215/
N: ZwebUI setup done
N: RF Config not found using default
N: Enable RTL_433 Receiver: 433.92Mhz
N: ZgatewayRTL_433 setup done 
N: OpenMQTTGateway modules: ["LilyGo_SSD1306","WebUI","rtl_433"]
N: ************** Setup OpenMQTTGateway end **************
W: MQTT connection...
N: Connected to broker
N: Send on /RFtoMQTT msg {"active":3,"frequency":433.92,"rssithreshold":-82,"rssi":-86,"avgrssi":0,"count":10,"ookthreshold":15}
N: Update check, free heap: 104572N: Update check done, free heap: 64828N: Send on /SYStoMQTT msg {"uptime":18,"version":"v1.7.0","disc":true,"ohdisc":false,"env":"lilygo-rtl_433","freemem":107044,"mqttp":"1883","mqtts":false,"msgprc":0,"msgblck":0,"maxq":0,"minmem":102248,"tempc":53.89,"freestck":5980,"eth":false,"rssi":-62,"SSID":"1qw0pou","BSSID":"80:2A:A8:47:79:98","ip":"192.168.1.215","mac":"64:B7:08:89:FE:08","modules":["LilyGo_SSD1306","WebUI","rtl_433"]}
N: Send on /SSD1306toMQTT msg {"onstate":true,"brightness":50,"display-flip":true,"idlelogo":true,"log-oled":false,"json-oled":true}
N: Send on /WebUItoMQTT msg {"displayMetric":true,"webUISecure":true,"displayQueue":0}
N: Send on /RLStoMQTT msg {"latest_version":"v1.7.0","title":"OpenMQTTGateway","release_url":"https://github.com/1technophile/OpenMQTTGateway/releases/tag/v1.7.0","installed_version":"v1.7.0","entity_picture":"https://github.com/1technophile/OpenMQTTGateway/raw/development/docs/img/Openmqttgateway_logo_mini_margins.png","release_summary":""}
N: Send on /RTL_433toMQTT/Acurite-Tower/A/7575 msg {"model":"Acurite-Tower","id":7575,"channel":"A","battery_ok":1,"temperature_C":19.4,"humidity":38,"mic":"CHECKSUM","protocol":"Acurite 592TXR Temp/Humidity, 592TX Temp, 5n1 Weather Station, 6045 Lightning, 899 Rain, 3N1, Atlas","rssi":-76,"duration":250000}
N: Send on /RTL_433toMQTT/Acurite-Tower/C/15652 msg {"model":"Acurite-Tower","id":15652,"channel":"C","battery_ok":0,"temperature_C":19.6,"humidity":44,"mic":"CHECKSUM","protocol":"Acurite 592TXR Temp/Humidity, 592TX Temp, 5n1 Weather Station, 6045 Lightning, 899 Rain, 3N1, Atlas","rssi":-48,"duration":127000}
N: Send on /RTL_433toMQTT/Acurite-Tower/A/6190 msg {"model":"Acurite-Tower","id":6190,"channel":"A","battery_ok":1,"temperature_C":18.6,"humidity":48,"mic":"CHECKSUM","protocol":"Acurite 592TXR Temp/Humidity, 592TX Temp, 5n1 Weather Station, 6045 Lightning, 899 Rain, 3N1, Atlas","rssi":-80,"duration":211000}

I am not sure if it is related, but my MQTT base topic is set to 433_direct/raw/, which I know is non-standard.

To Reproduce Steps to reproduce the behavior:

  1. Install OMG 1.7 on a Lilygo LORA T3 v1.6.1
  2. Configure Wifi & MQTT server settings
  3. Once the device restarts from the settings change, it will reboot after receiving messages.

Expected behavior I expect the device to not repeatedly reboot.

Screenshots If applicable, add screenshots to help explain your problem. Video link showing the issue - https://photos.app.goo.gl/hJ7EdvSELMk9ZLGU7

Environment (please complete the following information):

Additional context Add any other context about the problem here.

1technophile commented 10 months ago

Thanks for such a detailled description. Could you share the message from the serial console just before the reboot please?

pvince commented 10 months ago

Accidentally used my work account above, same person here. Took me a little bit to figure out the baud rate (I think it is 115200 ? )

I used PuTTY and the line wrapping seems a bit funky in places, but this is what is spit out right before the reboot:

N: Send on /RTL_433toMQTT/Acurite-Tower/A/7575 msg {"model":"Acurite-Tower","id":7575,"channel":"A","battery_ok":1,"temperature                                                                                                                                   _C":19,"humidity":38,"mic":"CHECKSUM","protocol":"Acurite 592TXR Temp/Humidity, 592TX Temp, 5n1 Weather Station, 6045 Lightning, 899 Rain, 3N1, Atlas","rssi":-55,"duration":122000}
                                                                                                                                                                                    N: Sen                                                                                                                                   d on /RTL_433toMQTT/Acurite-Tower/A/8623 msg {"model":"Acurite-Tower","id":8623,"channel":"A","battery_ok":1,"temperature_C":7.5,"humidity":41,"mic":"CHECKSUM","protocol":"Acurite 592TXR                                                                                                                                    Temp/Humidity, 592TX Temp, 5n1 Weather Station, 6045 Lightning, 899 Rain, 3N1, Atlas","rssi":-64,"duration":122000}
                                                                                                                    N: Send on /RTL_433toMQTT/Acurite-Tower/A/5812 msg {"model":"Acurite-T                                                                                                                                   ower","id":5812,"channel":"A","battery_ok":1,"temperature_C":8.400001,"humidity":66,"mic":"CHECKSUM","protocol":"Acurite 592TXR Temp/Humidity, 592TX Temp, 5n1 Weather Station, 6045 Light                                                                                                                                   ning, 899 Rain, 3N1, Atlas","rssi":-51,"duration":278000}
                                                         N: Send on /RTL_433toMQTT/Acurite-Tower/A/6190 msg {"model":"Acurite-Tower","id":6190,"channel":"A","battery_ok":1,"temperature_C                                                                                                                                   ":18.3,"humidity":48,"mic":"CHECKSUM","protocol":"Acurite 592TXR Temp/Humidity, 592TX Temp, 5n1 Weather Station, 6045 Lightning, 899 Rain, 3N1, Atlas","rssi":-62,"duration":126000}
                                                                                                                                                                                    N: Sen                                                                                                                                   d on /RTL_433toMQTT/Acurite-Tower/A/6248 msg {"model":"Acurite-Tower","id":6248,"channel":"A","battery_ok":0,"temperature_C":16.9,"humidity":42,"mic":"CHECKSUM","protocol":"Acurite 592TX                                                                                                                                   R Temp/Humidity, 592TX Temp, 5n1 Weather Station, 6045 Lightning, 899 Rain, 3N1, Atlas","rssi":-73,"duration":134000}
                                                                                                                     N: Send on /RTL_433toMQTT/Acurite-Tower/C/2056 msg {"model":"Acurite-                                                                                                                                   Tower","id":2056,"channel":"C","battery_ok":1,"temperature_C":18.3,"humidity":47,"mic":"CHECKSUM","protocol":"Acurite 592TXR Temp/Humidity, 592TX Temp, 5n1 Weather Station, 6045 Lightnin                                                                                                                                   g, 899 Rain, 3N1, Atlas","rssi":-80,"duration":379000}
                                                      N: Send on /RTL_433toMQTT/Acurite-Tower/C/15652 msg {"model":"Acurite-Tower","id":15652,"channel":"C","battery_ok":0,"temperature_C"                                                                                                                                   :18.8,"humidity":45,"mic":"CHECKSUM","protocol":"Acurite 592TXR Temp/Humidity, 592TX Temp, 5n1 Weather Station, 6045 Lightning, 899 Rain, 3N1, Atlas","rssi":-50,"duration":117996}
                                                                                                                                                                                   N: Send                                                                                                                                    on /RTL_433toMQTT/Acurite-Tower/A/13612 msg {"model":"Acurite-Tower","id":13612,"channel":"A","battery_ok":0,"temperature_C":17.5,"humidity":45,"mic":"CHECKSUM","protocol":"Acurite 592T                                                                                                                                   XR Temp/Humidity, 592TX Temp, 5n1 Weather Station, 6045 Lightning, 899 Rain, 3N1, Atlas","rssi":-81,"duration":173000}
                                                                                                                      N: Send on /RTL_433toMQTT/Acurite-Tower/A/7575 msg {"model":"Acurite                                                                                                                                   -Tower","id":7575,"channel":"A","battery_ok":1,"temperature_C":19,"humidity":38,"mic":"CHECKSUM","protocol":"Acurite 592TXR Temp/Humidity, 592TX Temp, 5n1 Weather Station, 6045 Lightning                                                                                                                                   , 899 Rain, 3N1, Atlas","rssi":-56,"duration":228000}
                                                     Guru Meditation Error: Core  1 panic'ed (Unhandled debug exception).
Debug exception reason: Stack canary watchpoint triggered (rtl_433_Decoder)
Core  1 register dump:
PC      : 0x400d7910  PS      : 0x00060836  A0      : 0x800d7cd6  A1      : 0x3ffe0bc0
A2      : 0x3ffe0bf8  A3      : 0x3ffe1380  A4      : 0x00060820  A5      : 0x00060823
A6      : 0x00060823  A7      : 0x00000001  A8      : 0x800d7910  A9      : 0x3ffe0b90
A10     : 0x3ffe0c10  A11     : 0x3ffe0bf8  A12     : 0x00000300  A13     : 0x3ffe1380
A14     : 0x007bfb48  A15     : 0x003fffff  SAR     : 0x0000001b  EXCCAUSE: 0x00000001
EXCVADDR: 0x00000000  LBEG    : 0x40089de0  LEND    : 0x40089df6  LCOUNT  : 0x00000000

Backtrace: 0x400d790d:0x3ffe0bc0 0x400d7cd3:0x3ffe0bf0 0x400d7e72:0x3ffe1270 0x400d7ea2:0x3ffe1290 0x400d7ed9:0x3ffe12b0 0x400de58d:0x3ffe12d0 0x40127dd5:0x3ffe16d0 0x401d7662:0x3ffe1710                                                                                                                                    0x4011abdb:0x3ffe1730 0x401293a7:0x3ffe3190 0x4012a205:0x3ffe31b0 0x40128130:0x3ffe31f0 0x400fd80f:0x3ffe3220

ELF file SHA256: e59a68d516c7b5a8

Rebooting...
ets Jun  8 2016 00:22:57

rst:0xc (SW_CPU_RESET),boot:0x13 (SPI_FAST_FLASH_BOOT)
configsip: 188777542, SPIWP:0xee
clk_drv:0x00,q_drv:0x00,d_drv:0x00,cs0_drv:0x00,hd_drv:0x00,wp_drv:0x00
mode:DIO, clock div:2
load:0x3fff0030,len:1184
load:0x40078000,len:13160
load:0x40080400,len:3036
entry 0x400805e4
N:
   ************* WELCOME TO OpenMQTTGateway **************
                                                          [   169][E][Preferences.cpp:50] begin(): nvs_open failed: NOT_FOUND
N: SYS config not found
                       N: SSD1306 config initialised
                                                    [   171][E][Preferences.cpp:50] begin(): nvs_open failed: NOT_FOUND
N: No SSD1306 config to load
                            N: Setup SSD1306 Display end
                                                        N: OpenMQTTGateway Version: v1.7.0

...
1technophile commented 10 months ago

Could you try with the standard base topic home/

pvince commented 10 months ago

Still crashed, it seems to happen at kinda random intervals. The first crash happened very shortly after the settings got saved & it rebooted, the second crash took ~1 minute before it crashed:

N: Send on /RTL_433toMQTT/Acurite-Tower/C/7425 msg {"model":"Acurite-Tower","id":7425,"channel":"C","battery_ok":0,"temperature_C":20.6,"humidity":55,"mic":"CHECKSUM","protocol":"Acurite 592TXR Temp/Humidity, 592TX Temp, 5n1 Weather Station, 6045 Lightning, 899 Rain, 3N1, Atlas","rssi":-72,"duration":124001}
                                                                                                                                                         E: [ WebUI ] webUIQueue full, discarding signal RTL_433toMQTT
                                                            N: Send on /RTL_433toMQTT/Acurite-Tower/A/5476 msg {"model":"Acurite-Tower","id":5476,"channel":"A","battery_ok":1,"temperature_C":17.6,"humidity":43,"mic":"CHECKSUM","protocol":"Acurite 592TXR Temp/Humidity, 592TX Temp, 5n1 Weather Station, 6045 Lightning, 899 Rain, 3N1, Atlas","rssi":-65,"duration":118000}
                                                             E: [ WebUI ] webUIQueue full, discarding signal RTL_433toMQTT
                                                                                                                          Guru Meditation Error: Core  1 panic'ed (Unhandled debug exception).
Debug exception reason: Stack canary watchpoint triggered (rtl_433_Decoder)
Core  1 register dump:
PC      : 0x400d7910  PS      : 0x00060836  A0      : 0x800d7cd6  A1      : 0x3ffe0940
A2      : 0x3ffe0978  A3      : 0x3ffe1100  A4      : 0x00060820  A5      : 0x00060823
A6      : 0x00060823  A7      : 0x00000001  A8      : 0x800d7910  A9      : 0x3ffe0910
A10     : 0x3ffe0990  A11     : 0x3ffe0978  A12     : 0x00000300  A13     : 0x3ffe1100
A14     : 0x007bfb48  A15     : 0x003fffff  SAR     : 0x00000018  EXCCAUSE: 0x00000001
EXCVADDR: 0x00000000  LBEG    : 0x40089de0  LEND    : 0x40089df6  LCOUNT  : 0x00000000

Backtrace: 0x400d790d:0x3ffe0940 0x400d7cd3:0x3ffe0970 0x400d7e72:0x3ffe0ff0 0x400d7ea2:0x3ffe1010 0x400d7ed9:0x3ffe1030 0x400de58d:0x3ffe1050 0x40127dd5:0x3ffe1450 0x401d7662:0x3ffe1490 0x4011abdb:0x3ffe14b0 0x401293a7:0x3ffe2f10 0x4012a205:0x3ffe2f30 0x40128130:0x3ffe2f70 0x400fd80f:0x3ffe2fa0

ELF file SHA256: e59a68d516c7b5a8

Rebooting...
ets Jun  8 2016 00:22:57

rst:0xc (SW_CPU_RESET),boot:0x13 (SPI_FAST_FLASH_BOOT)
configsip: 188777542, SPIWP:0xee
clk_drv:0x00,q_drv:0x00,d_drv:0x00,cs0_drv:0x00,hd_drv:0x00,wp_drv:0x00
mode:DIO, clock div:2
load:0x3fff0030,len:1184
load:0x40078000,len:13160
load:0x40080400,len:3036
entry 0x400805e4
N:
   ************* WELCOME TO OpenMQTTGateway **************
                                                          [   172][E][Preferences.cpp:50] begin(): nvs_open failed: NOT_FOUND
N: SYS config not found
                       N: SSD1306 config initialised
                                                    [   174][E][Preferences.cpp:50] begin(): nvs_open failed: NOT_FOUND
N: No SSD1306 config to load
                            N: Setup SSD1306 Display end
                                                        N: OpenMQTTGateway Version: v1.7.0
                                                                                          {
  "mqtt_server": "192.168.1.127",
  "mqtt_port": "1883",
  "mqtt_user": "mqtt_hass",
  "mqtt_pass": "REDACTED",
  "mqtt_topic": "home/",
  "gateway_name": "OMG_lilygo_rtl_433_ESP_2",
  "mqtt_broker_secure": false,
  "mqtt_broker_cert": "",
  "mqtt_ss_index": 0,
  "ota_server_cert": "",
  "ota_pass": "REDACTED"
}N: Config loaded from flash
                            *wm:[2] Added Parameter: server
*wm:[2] Added Parameter: port
*wm:[2] Added Parameter: user
*wm:[2] Added Parameter: pass
*wm:[2] Added Parameter: secure
*wm:[2] Added Parameter: cert
*wm:[2] Added Parameter: name
*wm:[2] Added Parameter: topic
*wm:[2] Added Parameter: ota
N: Attempting Wifi connection with saved AP: 0
                                              [  4450][E][Preferences.cpp:50] begin(): nvs_open failed: NOT_FOUND
N: No WebUI config to load
                          N: OpenMQTTGateway URL: http://192.168.1.215/
                                                                       N: ZwebUI setup done
                                                                                           [  

Other background, not sure how relevant any of it is. I have ~25 Acurite temperature // humidity sensors. Each sensor sends out a reading roughly once every 15 seconds. Also I am in a somewhat dense suburban area that picks up a lot of stray noise from neighbors devices.

1technophile commented 10 months ago

Could you try this firmware please? lilygo-rtl433-firmware.zip

Erase the credentials of the board with the WebUI Connect to the WiFi Manager portal Click Update Select the file above unzipped Update

And share if you have the same behavior.

pvince commented 10 months ago

Updated: image

Still crashed:

                                                                                                                            N: Send on /RTL_433toMQTT/Acurite-Tower/A/15411 msg {"model":"Acurite-Tower","id":15411,"channel":"A","battery_ok":1,"temperature_C":4.4,"humidity":40,"mic":"CHECKSUM","protocol":"Acurite 592TXR Temp/Humidity, 592TX Temp, 5n1 Weather Station, 6045 Lightning, 899 Rain, 3N1, Atlas","rssi":-70,"duration":117996}
                                                                                                                              E: [ WebUI ] webUIQueue full, discarding signal RTL_433toMQTT
                                 N: Send on /RTL_433toMQTT/Acurite-Tower/A/7575 msg {"model":"Acurite-Tower","id":7575,"channel":"A","battery_ok":1,"temperature_C":19.1,"humidity":38,"mic":"CHECKSUM","protocol":"Acurite 592TXR Temp/Humidity, 592TX Temp, 5n1 Weather Station, 6045 Lightning, 899 Rain, 3N1, Atlas","rssi":-49,"duration":180000}
                                  Guru Meditation Error: Core  1 panic'ed (Unhandled debug exception).
Debug exception reason: Stack canary watchpoint triggered (rtl_433_Decoder)
Core  1 register dump:
PC      : 0x400d7914  PS      : 0x00060836  A0      : 0x800d7cda  A1      : 0x3ffe0a80
A2      : 0x3ffe0ab8  A3      : 0x3ffe1240  A4      : 0x00060820  A5      : 0x00060823
A6      : 0x00060823  A7      : 0x00000001  A8      : 0x800d7914  A9      : 0x3ffe0a50
A10     : 0x3ffe0ad0  A11     : 0x3ffe0ab8  A12     : 0x00000300  A13     : 0x3ffe1240
A14     : 0x007bfb48  A15     : 0x003fffff  SAR     : 0x0000001a  EXCCAUSE: 0x00000001
EXCVADDR: 0x00000000  LBEG    : 0x40089de0  LEND    : 0x40089df6  LCOUNT  : 0x00000000

Backtrace: 0x400d7911:0x3ffe0a80 0x400d7cd7:0x3ffe0ab0 0x400d7e76:0x3ffe1130 0x400d7ea6:0x3ffe1150 0x400d7edd:0x3ffe1170 0x400de591:0x3ffe1190 0x40127dd9:0x3ffe1590 0x401d7666:0x3ffe15d0 0x4011abdf:0x3ffe15f0 0x401293ab:0x3ffe3050 0x4012a209:0x3ffe3070 0x40128134:0x3ffe30b0 0x400fd813:0x3ffe30e0

ELF file SHA256: fcb3515c7d3ccedf

Rebooting...
ets Jun  8 2016 00:22:57

rst:0xc (SW_CPU_RESET),boot:0x13 (SPI_FAST_FLASH_BOOT)
configsip: 188777542, SPIWP:0xee
clk_drv:0x00,q_drv:0x00,d_drv:0x00,cs0_drv:0x00,hd_drv:0x00,wp_drv:0x00
mode:DIO, clock div:2
load:0x3fff0030,len:1184
load:0x40078000,len:13160
load:0x40080400,len:3036
entry 0x400805e4
N:
   ************* WELCOME TO OpenMQTTGateway **************
                                                          [   172][E][Preferences.cpp:50] begin(): nvs_open failed: NOT_FOUND
N: SYS config not found
                       N: SSD1306 config initialised
                                                    [   174][E][Preferences.cpp:50] begin(): nvs_open failed: NOT_FOUND
N: No SSD1306 config to load
                            N: Setup SSD1306 Display end
                                                        N: OpenMQTTGateway Version: version_tag
                                                                                               {
  "mqtt_server": "192.168.1.127",
  "mqtt_port": "1883",
  "mqtt_user": "mqtt_hass",
  "mqtt_pass": "REDACTED",
  "mqtt_topic": "home/",
  "gateway_name": "OMG_lilygo_rtl_433_ESP_2",
  "mqtt_broker_secure": false,
  "mqtt_broker_cert": "",
  "mqtt_ss_index": 0,
  "ota_server_cert": "",
  "ota_pass": "REDACTED"
}N: Config loaded from flash
                            *wm:[2] Added Parameter: server
*wm:[2] Added Parameter: port
*wm:[2] Added Parameter: user
*wm:[2] Added Parameter: pass
*wm:[2] Added Parameter: secure
*wm:[2] Added Parameter: cert
*wm:[2] Added Parameter: name
*wm:[2] Added Parameter: topic
*wm:[2] Added Parameter: ota
N: Attempting Wifi connection with saved AP: 0
                                              N: Attempting Wifi connection with saved AP: 1
                                                                                            E (5224) wifi:sta is connecting, return error
[  2428][E][WiFiSTA.cpp:317] begin(): connect failed! 0x3007
N: Attempting Wifi connection with saved AP: 2
                                              E (6233) wifi:sta is connecting, return error
[  3437][E][WiFiSTA.cpp:317] begin(): connect failed! 0x3007
[  6471][E][Preferences.cpp:50] begin(): nvs_open failed: NOT_FOUND
N: No WebUI config to load
                          N: OpenMQTTGateway URL: http://192.168.1.215/
                                                                       N: ZwebUI setup done
pvince commented 10 months ago

One more small update here, I reverted to 1.6.0 and it seems to be running fine again. (At least, it has run for ~5 minutes without any crashes, which is longer than it ever ran on 1.7.0)

I am still completely willing to help out w/ testing any other firmware you have, just wanted to add another data point that downgrading via the captive portal got me running again.

NorthernMan54 commented 10 months ago

Humm Debug exception reason: Stack canary watchpoint triggered (rtl_433_Decoder)

The rtl_433_Decoder stack size is set here

https://github.com/NorthernMan54/rtl_433_ESP/blob/3fea1cf678212ea5ef70e38f625bc8505f73bb28/src/signalDecoder.cpp#L38

Will need to do some regression testing with 1.7 to see what happened to trigger the stack overflow

pvince commented 10 months ago

Would it help if I provided a log of almost all MQTT messages sent from my various OMG 433 receivers?

Short background: I have 4x Lilygo devices flashed with the OMG 433 firmware. I have a service that is logging all OMG 433 MQTT messages for the past week to a database, and I can probably create a CSV of topic, message

Long background: I also have ~25 acurite temperature // humidity sensors all over the house & outside. Each sensor sends a message about ~15 seconds. Each message is picked up by typically at least 2-3 of the receivers, if not all of them.

This created a massive torrent of MQTT messages, so I wrote a small service ( https://github.com/pvince/omg-acurite-filter ) that filters out bad readings, throttles the messages, and forwards them to Home Assistant.

In debugging that service I have found it useful to have a log of all messages the service received so I could replay problem scenarios, I should be able to go back to before I upgraded and dump out an hour or so of typical data.

1technophile commented 10 months ago

Could you share the logs with TRACE log level please? You can change it in the WebUI.

NorthernMan54 commented 10 months ago

Not sure what I would do with that data, as the library does not have the ability to inject those message. But thank you for the offer.

Between OMG 1.6 and 1.7 the only differences within the rtl_433_ESP library are - https://github.com/NorthernMan54/rtl_433_ESP/compare/v0.3.0...v0.3.1

Looking back those changes, there is nothing obvious that would trigger the canary as no changes were made to the rtl_433_Decoder function.

If you request a status message from rtl_433_ESP, it should respond with the stack HWM.

OMG 1.6 rtl_433 status request

home/OMG_4C7525A8D5DC/commands/MQTTtoRTL_433 {"status":1}

OMG 1.7 Status Request https://docs.openmqttgateway.com/use/rf.html#rtl-433-device-decoders

With 1.6 response is on the topic - home/OMG_4C7525A8D5DC/RTL_433toMQTT/status

{"model":"status","protocol":"rtl_433_ESP status message","modulation":"OOK","RTLRssi":-99,"RTLAVGRssi":-101,"RTLRssiThresh":-92,"signalRssi":-89,"RTLOOKThresh":15,"train":1,"RTLCnt":958117,"totalSignals":12,"signalRatio":0,"ignoredSignals":11,"unparsedSignals":0,"StackHWM":3132,"RTL_HWM":244,"DCD_HWM":1224,"freeMem":117492,"_enabledReceiver":1,"receiveMode":0}

In that status message, DCD_HWM is the stack that had the canary issue. Watching that over time with 1.7 versus 1.6 would be very interesting.

PS - That is from my test Lilygo board, uptime is 2471102 ( last power flicker here )

{"uptime":2471102,"version":"North-LilyGo-v1.5.0-70-g8624836b[localDev]","discovery":false,"ohdiscovery":false,"env":"North-LilyGo","freemem":117524,"mqttport":"1883","mqttsecure":false,"minfreemem":25520,"tempc":51.11111,"freestack":3132,"rssi":-20,"SSID":"The_Beach","BSSID":"20:C9:D0:1A:CA:85","ip":"192.168.1.69","mac":"4C:75:25:A8:D5:DC","actRec":3,"mhz":433.92,"RTLRssiThresh":-92,"RTLRssi":-95,"RTLAVGRssi":-101,"RTLCnt":958220,"RTLOOKThresh":15,"modules":["LILYGO_OLED","WebUI","BME280","rtl_433"]}

pvince commented 10 months ago

esp-web-tools-logs.txt

Attached the logs from the latest session after enabling 'TRACE' level logging.

Note: This is using the firmware provided by @1technophile last night. I can redo this w/ the official 1.7.0 firmware if desired.

1technophile commented 10 months ago

Thanks, no need to redo it, the difference was only to decrease the minimum memory trigger for reboot. Could you try to deactivate discovery please ?

pvince commented 10 months ago

esp-web-tools-logs (2).txt

Turned off discovery, still crashing. I think the log level reset itself back to whatever the default is for a period of time, but I think I managed to re-enable the logging back to TRACE for at least some of this log.

AndersHoglund commented 10 months ago

I have the same problem. How do you revert to v1.6 when it constantly crashes and reboots? I tried to supply the URL: https://github.com/1technophile/OpenMQTTGateway/releases/download/v1.6.0/lilygo-rtl_433-firmware.bin in WebUI and hit "Start Upgrade", but my liligo crashes and reboots before upgrade is finished (or even starts). Tried clearing all credentials and only re-add needed WiFi config. Also tried turning of RF. Same result, constant reboots. USB? It shows up as a "USB-Enhanced-SERIAL CH9102" device in Windows. How to up/down-grade in that case?

1technophile commented 10 months ago

I have the same problem. How do you revert to v1.6 when it constantly crashes and reboots?

After erasing the credentials, from the config portal access point you can update the firmware by tapping "Update". You can download the v1.6.0 binary from https://github.com/1technophile/OpenMQTTGateway/releases/tag/v1.6.0

1technophile commented 10 months ago

Can one of you test this firmware ?

lilygo-rtl-433-test-firmware.zip

pvince commented 10 months ago

I have the same problem. How do you revert to v1.6 when it constantly crashes and reboots? I tried to supply the URL: https://github.com/1technophile/OpenMQTTGateway/releases/download/v1.6.0/lilygo-rtl_433-firmware.bin in WebUI and hit "Start Upgrade", but my liligo crashes and reboots before upgrade is finished (or even starts). Tried clearing all credentials and only re-add needed WiFi config. Also tried turning of RF. Same result, constant reboots. USB? It shows up as a "USB-Enhanced-SERIAL CH9102" device in Windows. How to up/down-grade in that case?

This is probably not the most efficient method, but this is how I did it:

  1. Use the OMG Website Upload (with the clear memory checked) to install the 1.7.0 firmware again. This reset my device back to the unconfigured state.
  2. Download the 1.6.0 firmware to my phone
  3. Use my phone to connect to the OMG device
  4. Disable my phone's data to force it to try and use the captive portal's wifi rather than bypassing it.
  5. Open up a browser on my phone to http://192.168.4.1
  6. Click the 'upgrade firmware button'
  7. Browse for the downloaded firmware (should be a bin file, if it was a zip, extract the zip)
  8. Upload the firmware and after the device restarts, you are back to 1.6.0

Currently working away from my desk, I will need to try out the latest firmware later this evening.

pvince commented 10 months ago

@1technophile Uploaded the latest test firmware, still crashing. Logs below (Turned on trace logging & disabled discovery) esp-web-tools-logs.txt

1technophile commented 10 months ago

Not sure how familiar you are with PlatformIO ? It would be interesting to have the decoding of the exception thanks to it. This could be done by adding monitor_filters = esp32_exception_decoder to the environment, for your board it would be:

[env:lilygo-rtl_433]
platform = ${com.esp32_platform}
board = ttgo-lora32-v21
; ~/.platformio/packages/framework-arduinoespressif32/variants/.../pins_arduino.h
board_build.partitions = min_spiffs.csv
monitor_filters = esp32_exception_decoder
lib_deps =
  ${com-esp32.lib_deps}
  ${libraries.ssd1306}
  ${libraries.rtl_433_ESP}
build_flags =
  ${com-esp32.build_flags}
; *** OpenMQTTGateway Config ***
  ;'-UZmqttDiscovery'          ; disables MQTT Discovery
  '-Drtl_433_Decoder_Stack=30000'
  '-DvalueAsATopic=true'    ; MQTT topic includes model and device
  '-DGateway_Name="OMG_lilygo_rtl_433_ESP"'
; *** OpenMQTTGateway Modules ***
  '-DZgatewayRTL_433="rtl_433"'
  '-DZradioSX127x="SX127x"'
; *** ssd1306 Display Options ***
  '-DZdisplaySSD1306="LilyGo_SSD1306"'

You can upload this environment, activate the serial monitor, and the exception should be decoded automatically.

1technophile commented 10 months ago

I found maybe something, could you try these three firmware and let me know if one or several works for you firmware-lilygo-test.zip

Thanks for your help. I don't have the issue with my RTL_433 boards, so I rely on your tests here.

AndersHoglund commented 10 months ago

Thanks guys, my Lilygo RTL 433 are now back on 1.6 and working. Funny thing is that it still prompts for the 1.7 update in Home Assistant, should that version not be withdrawn asap?

pvince commented 10 months ago

Trying out a different serial console with these logs (hterm)

image

Just saw the comment about platform.io, I am familiar with it, but only on a very light level. I will check that out later tonight

1technophile commented 10 months ago

Thanks for these tests, I will propose other tests later on. No need to setup platformio unless you want to try it.

1technophile commented 10 months ago

Here are 2 other firmware for testing: firmware_dynamic-tests.zip

oscarb commented 10 months ago

I think I have a related issue. I updated my DollaTek LoRa32 V2.1 433Mhz ESP32 to 1.7.0 from 1.6.0 yesterday using the update functionality provided via MQTT to Home Assistant. Updating itself went fine.

However, after updating I've noticed that pressing buttons on my 433 MHz remotes now makes my LilyGO restart. My 433 MHz temperature sensors work fine, but my NEXA remotes and motion sensors seems to restart the LilyGO.

I updated the log level but no relevant message makes it to the console in the WebUI or on the hardware display before it restarts itself.

Not sure if it helps but the remotes have this information:

NEXA Model: PET-910 Frequency: 433.92MHz Date code: 1709

While these devices wouldn't automatically translate into useful entities in Home Assistant, I could still make use of the MQTT events from OMG to create automations.

Is there anything I can do to help troubleshoot this further...? I have only flashed the ESP32 using the OMG website before, how would I go about flashing the firmwares you uploaded here...?

image

1technophile commented 10 months ago

Is there anything I can do to help troubleshoot this further...? I have only flashed the ESP32 using the OMG website before, how would I go about flashing the firmwares you uploaded here...?

You need to erase the credentials with the WebUI, and from the onboarding portal, click update and upload the file firmware to test.

pvince commented 10 months ago

Here are 2 other firmware for testing: firmware_dynamic-tests.zip

Firmware tests:

1technophile commented 10 months ago

Thanks, I see that the stack memory is decreasing before each crash, could you try this one, it has DEBUG activated for RTL_433 and more stack memory firmware-rtl-debug.zip

pvince commented 10 months ago
1technophile commented 10 months ago

I'm not sure the log comes from the good binary : *wm:[2] [OTA] Update file: lilygo-rtl-433-test-firmware.bin

We should see in the logs the stack size rtl_433_Callback stack1:

Giving it again in case this is on my side. firmware-rtl_debug2.zip

1technophile commented 10 months ago

In the meantime we investigate the root cause I went back to the previous publication method in the development branch: https://github.com/1technophile/OpenMQTTGateway/pull/1843 So that from tomorrow you can pick up the v1.7.X development here https://docs.openmqttgateway.com/dev/

pvince commented 10 months ago

Woops... I must have fat fingered the wrong binary when uploading from my phone.

I think I got it right this time, no crash, but I am assuming because of the debug flag there is a ton of output

I will upload firmware-rtl_debug2.zip next

1technophile commented 10 months ago

Great to know that this is stable, let me push you a new binary without the RTL trace

1technophile commented 10 months ago

Here is it firmware-25k-stack.zip

pvince commented 10 months ago
1technophile commented 10 months ago

Thanks, this is very useful, we may end up tuning the stack to 15k and we should be good

NorthernMan54 commented 10 months ago

@pvince We are attempting to recreate the issue, and believe it is related to the number of signals your are receiving. Can you share some statistics around the number of messages being received ? Am thinking your are receiving a significant number of messages per minute or interval that is triggering the issue.

This message E: [ WebUI ] webUIQueue full, discarding signal RTL_433toMQTT indicates that this queue is full, which is indicative of a high message volume. ( Queue size is 5, and it reads messages off the queue every 3 seconds ). If messages consistently arrive more frequently than 1 every 3 seconds, this queue can't keep up, and it drops messages from the WebUI and OLED display.

pvince commented 10 months ago

I have about 23 Acurite-Tower devices, each device sends out a signal about every 15 seconds. I also have a Maverick BBQ probe that sits in my oven, and it sends out signals very frequently (not sure exact frequency, at least as often as the Acurite sensors).

I also live in a suburban area and I see a good number of random neighbors devices throughout the day.

In a worst case scenario, that comes out to about 96 messages a minute. In reality the sensors are spread around my house and yard, and I have 4x OMG 433 receivers spread around the house to ensure I receive all the signals. From the service I wrote, I know that those 4x combined receivers are publishing about 180 / messages a minute. Assuming even coverage where each receiver is splitting the evenly (unlikely) we get an average of about 45 messages / minute per OMG device.

So I would say the worst case is somewhere between 50 to 90 messages / minute (I know this is a wide range)

I am out of the house now, but I can provide more accurate numbers when I get home.

AndersHoglund commented 10 months ago

In contrast to pvince I only have 2 sensors, and still my one and only Liliygo rtl 433 crashed and rebooted with 1.7.

Typical log entries with v1.6. ..... N: Send on /SYStoMQTT msg {"uptime":90372,"version":"v1.6.0","discovery":false,"ohdiscovery":false,"env":"lilygo-rtl_433","freemem":114692,"mqttport":"1883","mqttsecure":false,"minfreemem":37952,"tempc":51.11111,"freestack":3132,"rssi":-58,"SSID":"NETatONCE-47C2","BSSID":"44:D4:37:82:47:C2","ip":"192.168.26.241","mac":"D4:D4:DA:9E:71:EC","actRec":3,"mhz":433.92,"RTLRssiThresh":-90,"RTLRssi":-100,"RTLAVGRssi":-99,"RTLCnt":6667,"RTLOOKThresh":15,"modules":["LilyGo_SSD1306","WebUI","rtl_433"]} N: Send on /SSD1306toMQTT msg {"onstate":true,"brightness":50,"display-flip":true,"idlelogo":true,"log-oled":false,"json-oled":true} N: Send on /WebUItoMQTT msg {"displayMetric":true,"webUISecure":true,"displayQueue":1} N: Send on /RTL_433toMQTT/Oregon-THGR122N/2/203 msg {"model":"Oregon-THGR122N","id":203,"channel":2,"battery_ok":1,"temperature_C":1.5,"humidity":80,"protocol":"Oregon Scientific Weather Sensor","rssi":-74,"duration":385996} N: Send on /RTL_433toMQTT/Oregon-THGR122N/1/37 msg {"model":"Oregon-THGR122N","id":37,"channel":1,"battery_ok":1,"temperature_C":2.3,"humidity":87,"protocol":"Oregon Scientific Weather Sensor","rssi":-67,"duration":385996} N: Send on /RTL_433toMQTT/Oregon-THGR122N/2/203 msg {"model":"Oregon-THGR122N","id":203,"channel":2,"battery_ok":1,"temperature_C":1.5,"humidity":80,"protocol":"Oregon Scientific Weather Sensor","rssi":-75,"duration":385996} N: Send on /RTL_433toMQTT/Oregon-THGR122N/1/37 msg {"model":"Oregon-THGR122N","id":37,"channel":1,"battery_ok":1,"temperature_C":2.3,"humidity":87,"protocol":"Oregon Scientific Weather Sensor","rssi":-66,"duration":385993} N: Send on /RTL_433toMQTT/Oregon-THGR122N/2/203 msg {"model":"Oregon-THGR122N","id":203,"channel":2,"battery_ok":1,"temperature_C":1.5,"humidity":80,"protocol":"Oregon Scientific Weather Sensor","rssi":-75,"duration":382001} N: Send on /RTL_433toMQTT/Oregon-THGR122N/1/37 msg {"model":"Oregon-THGR122N","id":37,"channel":1,"battery_ok":1,"temperature_C":2.3,"humidity":87,"protocol":"Oregon Scientific Weather Sensor","rssi":-82,"duration":383996} N: Send on /SYStoMQTT msg {"uptime":90492,"version":"v1.6.0","discovery":false,"ohdiscovery":false,"env":"lilygo-rtl_433","freemem":114916,"mqttport":"1883","mqttsecure":false,"minfreemem":37952,"tempc":51.11111,"freestack":3132,"rssi":-60,"SSID":"NETatONCE-47C2","BSSID":"44:D4:37:82:47:C2","ip":"192.168.26.241","mac":"D4:D4:DA:9E:71:EC","actRec":3,"mhz":433.92,"RTLRssiThresh":-91,"RTLRssi":-104,"RTLAVGRssi":-100,"RTLCnt":6677,"RTLOOKThresh":15,"modules":["LilyGo_SSD1306","WebUI","rtl_433"]} N: Send on /SSD1306toMQTT msg {"onstate":true,"brightness":50,"display-flip":true,"idlelogo":true,"log-oled":false,"json-oled":true} N: Send on /WebUItoMQTT msg {"displayMetric":true,"webUISecure":true,"displayQueue":1} .....

NorthernMan54 commented 10 months ago

To attempt to recreate the issue, I used two of my cc1101 based OMG devices to send a signal fake signals, and could not recreate the issue. They injected about 1 message a second into the environment.

image
oscarb commented 10 months ago

Here is it firmware-25k-stack.zip

I tried this firmware and the good news is that my LilyGO no longer restarts when receiving button presses and motion sensor events. However, the MQTT messages sent from OMG now look completely different.

Before a button press on my remote would look something like this

Message 440 received on home/OMG_lilygo_rtl_433_ESP/RTL_433toMQTT/Nexa-Security/3/27478810 at 19:48:
{
    "model": "Nexa-Security",
    "id": 27478810,
    "channel": 3,
    "state": "ON",
    "unit": 3,
    "group": 0,
    "protocol": "Nexa",
    "rssi": -53,
    "duration": 412996
}

Now, with the new firmware they look like this:

Message 37 received on home/OMG_lilygo_rtl_433_ESP/RTL_433toMQTT/Nexa-Security/3/27478810 at 18:36:
{
    "model": "Proove-Security",
    "y": 27478810,
    "d": 3,
    "l": "e",
    "N": 3,
    "t": 0,
    "p": "l",
    "ove / Nexa / KlikAanKlikUit Wireless Switch": -78,
    " Nexa / KlikAanKlikUit Wireless Switch": 436000
}

A little bit more difficult to interpret but I guess I could guess what translates to what 😅 Would you like me to open a new issue or keep using this one?

1technophile commented 10 months ago

Indeed :-) Could you try the development version and see if you have the same issue: Https://docs.openmqttgateway.com/dev

pvince commented 10 months ago

I will try the development version, but I am seeing the same thing that @oscarb reported with regard to garbled MQTT data messages.

As previously mentioned, I have a program that is logging all MQTT data received to a database. It considers an MQTT message an 'unknown type' if the received message:

  1. Can not be parsed as JSON
  2. Is parsed as JSON, but does not have an id property.

I typically get some 'unknown' messages throughout the day from neighbors devices, mostly from RTL_433 devices that lack IDs, eg: Generic-Motion.

However, since running the 25k-stack firmware, I am noticing a much higher rate of 'unknown' devices, the data is as follows: Date Total Received Messages Messages w/ Unknown Type % Unknown
2023-12-08 73719 24 0.03%
2023-12-09 232855 88 0.04%
2023-12-10 234941 81 0.03%
2023-12-11 214222 87 0.04%
2023-12-12 232671 207 0.09%
2023-12-13 233935 85 0.04%
2023-12-14* 225264 56 0.02%
2023-12-15 162897 274 0.17%
2023-12-16 146474 63 0.04%
2023-12-17 192341 52 0.03%
2023-12-18 170428 106 0.06%
2023-12-19** 152475 1436 0.94%
2023-12-20 97497 92 0.09%

* = Initial upgrade date ** = Upgrade to 25k stack

Dates are in UTC, didn't realize until after I built the table.

Example bad data I am receiving:

{
    "message": "{\"model\":\"Acurite-Tower\",\"id\":15652,\"channel\":\"C\",\"battery_ok\":0,\"temperature_C\":20.7,\"humidity\":50,\"mic\":\"CHECKSUM\",\"protocol\":\"Acurite 592TXR Temp/Humidity, 592TX Temp, 5n1 Weather Station, 6045 Lightning, 899 Rain, 3N1, Atlas\",\"rssi\":-72,\"duration\":124996}1�\u0003\u0000K433_direct/raw/OMG_lilygo_rtl_433_ESP_2/RTL_433toMQTT/Acurite-Tower/C/15652{\"model\":\"Acurite-Tower\",\"id\":15652,\"channel\":\"C\",\"battery_ok\"",
    "topic": "433_direct/raw/OMG_lilygo_rtl_433_ESP_2/RTL_433toMQTT/Acurite-Tower/C/15652"
}

Everything in the message text is the MQTT message my service received. This one had invalid JSON, and failed parsing.

{
    "data": {
        "": 1,
        " Temp/Humidity, 592TX Temp, 5n1 Weather Station, 6045 Lightning, 899 Rain, 3N1, Atlas": "p/Humidity, 592TX Temp, 5n1 Weather Station, 6045 Lightning, 899 Rain, 3N1, Atlas",
        "A": 2247,
        "as\",\"rssi\":-72,\"duration\":298997}": -72,
        "attery_ok": "k",
        "e 592TXR Temp/Humidity, 592TX Temp, 5n1 Weather Station, 6045 Lightning, 899 Rain, 3N1, Atlas": 877.823,
        "ic": 4.3112,
        "model": "Acurite-Tower",
        "otocol": 22.5,
        "r": 49,
        "rssi\":-72,\"duration\":298997}": 298997,
        "temperature_C": 0,
        "y, 592TX Temp, 5n1 Weather Station, 6045 Lightning, 899 Rain, 3N1, Atlas": "Temp, 5n1 Weather Station, 6045 Lightning, 899 Rain, 3N1, Atlas"
    },
    "message": "{\"model\":\"Acurite-Tower\",\"r\":49,\"A\":2247,\"attery_ok\":\"k\",\"temperature_C\":0,\"\":1,\"ic\":4.3112,\"otocol\":22.5,\"e 592TXR Temp/Humidity, 592TX Temp, 5n1 Weather Station, 6045 Lightning, 899 Rain, 3N1, Atlas\":877.823,\" Temp/Humidity, 592TX Temp, 5n1 Weather Station, 6045 Lightning, 899 Rain, 3N1, Atlas\":\"p/Humidity, 592TX Temp, 5n1 Weather Station, 6045 Lightning, 899 Rain, 3N1, Atlas\",\"y, 592TX Temp, 5n1 Weather Station, 6045 Lightning, 899 Rain, 3N1, Atlas\":\"Temp, 5n1 Weather Station, 6045 Lightning, 899 Rain, 3N1, Atlas\",\"as\\\",\\\"rssi\\\":-72,\\\"duration\\\":298997}\":-72,\"rssi\\\":-72,\\\"duration\\\":298997}\":298997}",
    "topic": "433_direct/raw/OMG_lilygo_rtl_433_ESP_2/RTL_433toMQTT/Acurite-5n1/A/2247"
}

This one looks like the JSON got really munged up, but was still parseable as JSON.

I will try the in development version and see if this issue continues or clears up.

**Edit** Went to upgrade my board that was running the 25k-stack firmware, and it was locked up on the black background logo boot screen. I rebooted it & flashed it to the development version via the WebUI. I will post with any updates.

pvince commented 10 months ago

Number of bad messages has definitely increased, even with the dev build:

Timespan Total Received Messages Messages w/ Unknown Type % Unknown
14 hours 106141 1291 1.21%

If needed, I can provide information about specific messages.

oscarb commented 10 months ago

Indeed :-) Could you try the development version and see if you have the same issue: Https://docs.openmqttgateway.com/dev

For some reason, my LilyGO no longer appears in the list of devices when trying "Option 1: Upload from the web"... Tried both from macOS with the ch34xser drivers installed as well as from a windows computer, Chrome and Edge and different USB cables... Could it be due to me having flashed it with OpenMQTTGateway already...? And if so, how do I go about completely resetting the device?

its-grandpa commented 10 months ago

This is an excellent software, I've been trying it for a few weeks, with lilygo board. In my environment, the most stable version is 1.5 The 1.7 didn't start, the 1.6 restarted because of the GT-WT02 sensor, which is not on me, I don't know if it exists at all. I just uploaded the dev version (SHA:f2d323). Unfortunately, this also restarts due to the GT-WT02 sensor, error message example: (temp. too cold)

N: Send on /RTL_433toMQTT/GT-WT02/4/144 msg {"model":"GT-WT02","id":144,"channel":4,"battery_ok":1,"temperature_C":-152.5,"humidity":41,"button":1,"mic":"CHECKSUM","protocol":"Globaltronics GT-WT-02 Sensor","rssi":-78,"duration":884000}
CORRUPT HEAP: Bad head at 0x3ffe9ff8. Expected 0xabba1234 got 0xbaad5678

assert failed: multi_heap_free multi_heap_poisoning.c:253 (head != NULL)

I agree with pvince ,number of bad messages has increased. Unfortunately, restarts happen too often.

1technophile commented 9 months ago

@pvince

Number of bad messages has definitely increased, even with the dev build:

Could you share examples of bad messages with the current development version please ?

1technophile commented 9 months ago

@oscarb

For some reason, my LilyGO no longer appears in the list of devices when trying "Option 1: Upload from the web"

What was the name of your environment (should be into SYStoMQTT) ?

oscarb commented 9 months ago

What was the name of your environment (should be into SYStoMQTT) ?

lilygo-rtl_433