1technophile / OpenMQTTGateway

MQTT gateway for ESP8266 or ESP32 with bidirectional 433mhz/315mhz/868mhz, Infrared communications, BLE, Bluetooth, beacons detection, mi flora, mi jia, LYWSD02, LYWSD03MMC, Mi Scale, TPMS, BBQ thermometer compatibility & LoRa.
https://docs.openmqttgateway.com
GNU General Public License v3.0
3.6k stars 792 forks source link

Debugging instability of Lilygo_rtl_433_ESP #2043

Closed puterboy closed 3 weeks ago

puterboy commented 2 months ago

I was doing some analysis of 'uptimes' by looking through old backups of home_assitant_v2.db and extracting times between reboots of the Lilygo_rtl_433_ESP device (running on a Lilygo LORA esp32).

I noticed that the device typically reboots every 10-60 hours looking back over the past 3 months.

Note:

Any suggestions on possible causes and how to investigate?

puterboy commented 2 months ago

Attached is a graph of free mem over the past 24 hours. There were 2 reboots at about 5:36 AM and 5:16 PM as you can see by the temporary positive memory spikes with the reboot: image

And here is a graph showing memory usage and uptime before I upgraded and started playing with the code: image image

Really hard to see an obvious memory problem unless suddenly out-of-the-blue it got swamped with simultaneous messages -- note that based on the "quantum"-like pattern in memory usage, it seems like a single message uses up about 656 bytes of memory.

1technophile commented 2 months ago

Idealy a serial monitor with exception decoder activated would help along with the logs associated with the reboot. I know that it requires to have a computer connected but for now this is the best solution I have

puterboy commented 2 months ago

I ordered a spare Lilygo LORA, so I can hook that up to a serial monitor when it arrives. Also, how do activate "exception decoder"?

1technophile commented 2 months ago

By adding this line into your environment:

monitor_filters = esp32_exception_decoder
piechade commented 1 month ago

I ordered a spare Lilygo LORA, so I can hook that up to a serial monitor when it arrives. Also, how do activate "exception decoder"?

Any news about the issue, my device arrived today I have maybe the same issue. My 5 in 1 bresser is detected as 6 in 1 what's also wrong. Not sure if this is the problem or there is general something wrong.

puterboy commented 1 month ago

By adding this line into your environment:

monitor_filters = esp32_exception_decoder

OK - I got my new board and it too reboots -- even more often, every few hours actually!!

I will recompile with monitor_filters = esp32_exception_decoder @1technophile Am I correct in assuming that the output goes to the serial monitor (via the USB connector on the ESP32 board)?

1technophile commented 1 month ago

Am I correct in assuming that the output goes to the serial monitor (via the USB connector on the ESP32 board)

This is correct

Forge36 commented 1 month ago

Is yours fully rebooting? I'm finding the restart connects to WiFi, no 433 radio or mqtt (some sort of limited web panel) a second restart works.

puterboy commented 1 month ago

Reboot automatically and fully every number of hours - highly variable - seems to reboot even more on my new one...

puterboy commented 1 month ago

OK - I have been monitoring serial messages with the 'monitor_filters' command for the past day and have seen the following:

  1. The error message around a crash is:
    
    Guru Meditation Error: Core  1 panic'ed (Unhandled debug exception).
    Debug exception reason: Stack canary watchpoint triggered (rtl_433_Decoder)
    Core  1 register dump:
    PC      : 0x4008df3b  PS      : 0x00060a36  A0      : 0x4008dd4a  A1      : 0x3ffe12f0
    A2      : 0x00000000  A3      : 0x4008abb1  A4      : 0x00000000  A5      : 0x3ffe13c0
    A6      : 0x0000014c  A7      : 0x0000000c  A8      : 0x00000009  A9      : 0x00000024
    A10     : 0x0000005a  A11     : 0x0000002a  A12     : 0x40084f96  A13     : 0x3f442100
    A14     : 0x00ff0000  A15     : 0xff000000  SAR     : 0x00000018  EXCCAUSE: 0x00000001
    EXCVADDR: 0x00000000  LBEG    : 0x4008abb1  LEND    : 0x4008abc1  LCOUNT  : 0xfffffffc

Backtrace: 0x4008df38:0x3ffe12f0 0x4008dd47:0x3ffe16c0 0x400d6360:0x3ffe1780 0x400d6635:0x3ffe1870 0x400e7ec5:0x3ffe1890 0x401dab3d:0x3ffe18b0 0x401dc849:0x3ffe18e0 0x400f11ab:0x3ffe1900 0x400d67fe:0x3ffe1930 0x400d6a9e:0x3ffe1980 0x400e230e:0x3ffe19a0 0x4012bfc5:0x3ffe1de0 0x401dc58a:0x3ffe1e20 0x401200af:0x3ffe1e40 0x40120142:0x3ffe3890 0x4012d597:0x3ffe38c0 0x4012dc15:0x3ffe38e0 0x4012c300:0x3ffe3970 0x4010121f:0x3ffe39a0


2. A couple of the crashes were missing the "Debug exception reason" line so not sure what caused them -- could it be that the error was the same but it wasn't caught by the debugger?

3. I also see occasional error messages of form:

bitbuffer_add_bit: Warning: row count limit (50 rows) reached


But those messages don't seem to be temporally related to the actual crashes.

Is this helpful????
puterboy commented 1 month ago

Some additional observations: 1 From googling it seems like this is due to stack overflow (which was also mentioned in https://github.com/1technophile/OpenMQTTGateway/issues/1655) but not clear what the root cause of this stack overflow is

  1. There doesn't seem to be any memory hole per say, as the free memory as reported every 2 minutes never seems to dip below about 100K -- and one crash even occurred just after such a memory report and there was 110K free
  2. This bug may be similar to https://github.com/1technophile/OpenMQTTGateway/issues/1836 where some posters mention a similar Stack canary watchpoint triggered (rtl_433_Decoder) stack trace (though that bug seems to have been fixed for at least some users by removing the json mqtt stack which seemingly is unrelated to rtl_433_Decoder plus I think my earlier PR fixed the root cause of the MQTT stack issue which was due to running out of memory and in this case free memory seems to be fine)
NorthernMan54 commented 1 month ago

The rtl_433_Decoder stack size is not externally visible, try increasing this by about 1000

rtl_433_Decoder

puterboy commented 1 month ago

OK increased rtl_433_Decoder_Stack (under OOK_MODULATION) from 10000 to 11000. Will watch and see...

Any idea what the root cause of this is? Or is the stack truly (marginally) not big enough so that every once in a while it overflows and crashes? Also, is there any simple way to add a log 'warning' when stack is full or near full to help future debugging?

NorthernMan54 commented 1 month ago

@puterboy I had tuned the stack size based on my real world results in a attempt to balance free memory versus not using too many resources. Within the OMG environment we have multiple modules sharing resources so if one gets greedy, the others suffer.

NorthernMan54 commented 1 month ago

@puterboy PS If you have a value that works for you, pls submit a PR or issue against rtl_433_ESP, and I will try to include in the next release

1technophile commented 1 month ago

Also, is there any simple way to add a log 'warning' when stack is full or near full to help future debugging?

Yes, you can have the stack available at the task level by using this uxTaskGetStackHighWaterMark(NULL) and if you want to use this function outside of the task, you replace the NULL with the task Handle rtl_433_DecoderHandle. My advice, add this to the stateRFMeasures and follow the value with MQTT explorer.

A PR will be welcome to add this indicator.

puterboy commented 1 month ago

Sure. I added the code to SYStoMQTT based on your original comment. That seemed to make sense since stack memory is similar to freemem. I also added a sensor for it in ZmqttDiscovery.ino but that is probably overkill... Inclusion is conditional based on #ifdef ZgatewayRTL_433

I see you edited your comment and are now suggesting stateRFMeasurees -- would you prefer it there? (It's obviously easy to change :) )

1technophile commented 1 month ago

I see you edited your comment and are now suggesting stateRFMeasurees -- would you prefer it there?

Yes please, this way it would be with the other RF infos

puterboy commented 1 month ago

Stack overflow is definitely at least the primary issue here. I increased the stack size from 10K to 11K and after a few hours the high water mark dipped below 1000 which would have caused a crash if size were still 10K.

I will continue to monitor to see if increasing by 1K is sufficient.

Is there any way to dynamically change the stack size if high water mark dips too low? Say redeclare function??? I say that because every situation may be different. Other people may have more or different devices resulting in different stack consumption.

puterboy commented 1 month ago

I see you edited your comment and are now suggesting stateRFMeasurees -- would you prefer it there?

Yes please, this way it would be with the other RF infos

I assume that if in this section, for consistency, no need for sensor discovery logic to be added.

1technophile commented 1 month ago

I assume that if in this section, for consistency, no need for sensor discovery logic to be added.

Indeed, as we do with BT we don't autodiscover the modules task free stack

1technophile commented 1 month ago

Is there any way to dynamically change the stack size if high water mark dips too low? Say redeclare function???

We could delete it and recreate it with a different size but I'm not sure if this approach is relevant.

puterboy commented 1 month ago

Comparing the two Lilygo rtl433's I have, I noticed that the one upstairs resets more often than the one downstairs -- and correspondingly, the one upstairs is receiving signals from more 433MHz devices.

Presumably, the more devices transmitting, the more the queue grows, leading to more stack consumption if multiple signals are received in close approximation... Assuming that is correct, it would seem that there is no "right" value for the stack size as some people may live in an isolated area with only a few of their own 433MHz devices and no neighbors while others may live in a city with many of their own devices plus neighbor ones.

If true, then it would seem that one should implement a method to dynamically resize the queue if it dips too low. One would presumably need to copy over the stack so that nothing is lost before setting up the task again...

Otherwise, the stack size seems quite arbitrary and one either needs to set it so large that nobody fails in which case memory is wasted or set it an "average" user's size in which case some people will get crashes.

Perhaps it would be helpful to get @NorthernMan54's input...

puterboy commented 1 month ago

Also, since arrival times of different devices are generally stochastic and statistically independent, after enough time you will have an instance where all your devices send a signal at approximately the same time... so the worst case eventually happens...

NorthernMan54 commented 1 month ago

@puterboy Pls keep in mind that the majority of the rtl_433_ESP code base is a direct clone from rtl_433, which runs on machines with a large memory footprint. So the stack usage is high. If you look, a couple of features from rtl_433 require a huge stack ( almost all the memory on a ESP32 ). So current stack size is a bit of a comprise.

The code base takes each received signal, allocates a spot on the heap ( not stack ), and queues it for processing thru the rtl_433 code base ( the 100+ device decoders ). Am thinking that one of the decoders, when it sees a particular signal, is consuming the stack. The processing queue has a max size of 5, so the memory hit from a large number of signals is managed, and hits the heap ( so this queing is not the source of the stack issue ).

After working recently with esp32-s3 with 8Mb ram, am wonder if running with a larger memory footprint may be beneficial. Hopefully a board with a display and a SX127x chip will be built by a manufacturer, keeping it simple like the lilgygo.

Free Heap: 165,756
Free PSRAM: 8,078,267
Free Stack: 14,612
puterboy commented 1 month ago

What you say about one of the decoders consuming the stack is perhaps consistent with the following: After running for about 36 hours, one of my LilyGo devices suddenly dropped from 3176 bytes free on the stack to 856 bytes on the stack (this would have caused a crash had I not increased the size by 1000KB). This sudden drop suggests perhaps that there is some device that issue a signal that results in a disproportionate amount of stack usage.

puterboy commented 1 month ago

@NorthernMan54 is there any debugging logging that could report how much stack space is being consumed by any given decoder along with the actual device that ended up triggering it. It would even be interesting to know whether the max stack consumption is being triggered by a device that gets decoded or by an unknown one that goes through all the candidate decoders and fails to get decoded.

NorthernMan54 commented 1 month ago

@puterboy That degree of logging does not exist within the code base. The trick would be to log the stack size between each decoder run, and then review the logs and determine which decoder was invoked that increased stack space.

puterboy commented 1 month ago

That's exactly what I was thinking. Seems like it could be worthwhile to add such logging given the persistent and difficult to debug challenges with stack consumption.

puterboy commented 4 weeks ago

I have been running with rtl_433_Decoder_Stack size 10000 for 10 days on 2 different Lilygo Lora ESP32 devices and the low water mark after 10 days is just over 600 bytes -- to give a little more margin of error for alternative sensor environments and/or worst cases, I created the following PR with stack size 11,500.

https://github.com/NorthernMan54/rtl_433_ESP/pull/156

puterboy commented 4 weeks ago

The above referenced two PRs (https://github.com/1technophile/OpenMQTTGateway/pull/2081 and https://github.com/NorthernMan54/rtl_433_ESP/pull/156) should close out this bug...

puterboy commented 4 weeks ago

@puterboy That degree of logging does not exist within the code base. The trick would be to log the stack size between each decoder run, and then review the logs and determine which decoder was invoked that increased stack space.

@NorthernMan54 are you planning on implementing this? If you do, I would be happy to enable the debugging on my end and run my spare Lilygo Lora ESP32 to gather data (assuming it would be via serial report logging).

puterboy commented 3 weeks ago

If you accept PR https://github.com/1technophile/OpenMQTTGateway/pull/2082 that will bump rtl_433_ESP to v0.3.3 which includes my patch to signalDecoder.cpp to increase size of rtl_433_Decoder_Stack as well as to expose this as a configuration variable.

puterboy commented 3 weeks ago

OK - closing this now that both PR's accepted and bug solved!