Aircoookie / WLED

Control WS2812B and many more types of digital RGB LEDs with an ESP8266 or ESP32 over WiFi!
https://kno.wled.ge
MIT License
14.62k stars 3.14k forks source link

ESP32 board rebooting on effect / preset changes #1431

Closed yetdog closed 1 year ago

yetdog commented 3 years ago

Describe the bug Replaced a working (but lousy wifi) ESP8266 with an ESP32. The ESP32 board reboots on setting effects / presets changes via HTTP interface and HomeAssistant. Sometimes changes work, but other times it will reboot the ESP32

To Reproduce Choosing preset from web interface or HomeAssistant. It will occasionally work. Something like Colortwinkles kills it every time.

Expected behavior Effect switching w/o ESP reboot

WLED version

Additional context Environment version is esp32 v3.2.3-14-gd3e562907

luigo commented 3 years ago

I am facing exactly the same issue. It happens with some of the effects, not all. I am running a Dig-Quad board with WLED 0.11.0.

yetdog commented 3 years ago

Some added context - I have a node-red flow setup to rotate effects/presets (some of which cause a crash when manually/rapidly changing) every 10 minutes, and it ran successfully all night. My current uptime sits at about 10 hours on the ESP32. It seems to crash the ESP32 when I change effects/presets too rapidly (on the order of seconds, not minutes).

Aircoookie commented 3 years ago

Thank you, I can confirm that there are apparently some serious stability issues on the ESP32 in 0.11.0. It will be hard to debug since it appears to be a very intermittent problem, but for me it didn't occur when manipulating effects or presets, but when attempting to load the UI after not accessing it for an hour or so.

Very urgent problem, need to find the cause ASAP :)

adammcclendon commented 3 years ago

I also have this problem. I have a program to send http requests sync'd to music so it is sending the request frequently, probably not any faster than every 5 to 10 seconds. It doesn't cause any issues at that speed and it works fine until it hits one of the problem presets. It freezes at that preset and has to be power cycled to get it back online. One of the times it crashed, everything on the ESP reset. I had to configure the wifi settings, led setup and presets.

ESP8266 on fresh install of 0.11.0

Aircoookie commented 3 years ago

@yetdog do you have an MQTT server configured?

yetdog commented 3 years ago

I did with my 8266’s but have not configured it in the ESP32, no.

andyshinn commented 3 years ago

I think I am running into the same thing. I managed to crash just changing effect speed and intensity a bunch of times.

yetdog commented 3 years ago

Just had it bomb on my esp8266 as well while switching presets.

Back to 0.10.2. All good except the presets don’t save segment info so I can’t easily change spacing / grouping per presets.

handsomejackuk commented 3 years ago

happens to me a few times only on specific effect i had about 15 presets on a cycle and manually found which effect crashed the board, i then deleted the preset and it seemed to work after... i didnt do a fresh install an update from an older version i wonder if previous effects when ported over to the new version have issues... maybe a fresh install would fix... not tried that yet...

darrylb83 commented 3 years ago

I too am having this issue. ESP32 on a QuinLED-Dig-Quad.

chrisgonzales commented 3 years ago

I am too having this issue esp32 with and a dig-quad, i'm open to test anything when you get around to looking at this. I'll add to it that when using the sync feature it 90% of the time reboots, with no syncing it doesn't do it all the time. I'm using the multichannel fork but if its a core thing figure i'd add my 2 cents.

Also not sure if it's related but with peek can be kinda "flickery" i'm assuming that has more to do with updates but might be a related thing with getting the colors to display.

pbolduc commented 3 years ago

Although yetdog reported a crash on esp8266, I am wondering if the issue could be related async web server running on Core-0 and the regular loop() running on Core-1. I am speculating that something could be updated like _segment_runtimes[n].reset(); which frees a data buffer and the app core not expected that. In a lot of the effects it does

if (!SEGENV.allocateData(dataSize)) return mode_static(); //allocation failed

In allocateData if the new size does not match the current, it will deallocate the buffer and re-allocate it. Each mode assumes the allocated buffer will be the correct size if it could be allocated.

Or perhaps there is race condition in the file system library. Only speculation here. Perhaps those that have crashing could post their presets.

pbolduc commented 3 years ago

I was able to reproduce using the web user interface. Create two presets, enable preset cycle and click between the two. Happens quick.

image

Preset 1

{"on":true,"bri":128,"transition":7,"mainseg":0,"seg":[{"id":0,"start":0,"stop":30,"grp":1,"spc":0,"on":true,"bri":255,"col":[[255,160,0],[0,0,0],[0,0,0]],"fx":68,"sx":128,"ix":128,"pal":0,"sel":true,"rev":false,"mi":false},{"stop":0},{"stop":0},{"stop":0},{"stop":0},{"stop":0},{"stop":0},{"stop":0},{"stop":0},{"stop":0},{"stop":0},{"stop":0},{"stop":0},{"stop":0},{"stop":0},{"stop":0}]}

Preset 2

{"on":true,"bri":128,"transition":7,"mainseg":0,"seg":[{"id":0,"start":0,"stop":30,"grp":1,"spc":0,"on":true,"bri":255,"col":[[255,160,0],[0,0,0],[0,0,0]],"fx":0,"sx":128,"ix":128,"pal":0,"sel":true,"rev":false,"mi":false},{"stop":0},{"stop":0},{"stop":0},{"stop":0},{"stop":0},{"stop":0},{"stop":0},{"stop":0},{"stop":0},{"stop":0},{"stop":0},{"stop":0},{"stop":0},{"stop":0},{"stop":0}]}
[22:01:38]rst:0xc (SW_CPU_RESET),boot:0x13 (SPI_FAST_FLASH_BOOT)
[22:01:38]configsip: 0, SPIWP:0xee
[22:01:38]clk_drv:0x00,q_drv:0x00,d_drv:0x00,cs0_drv:0x00,hd_drv:0x00,wp_drv:0x00
[22:01:38]mode:DOUT, clock div:2
[22:01:38]load:0x3fff0018,len:4
[22:01:38]load:0x3fff001c,len:1044
[22:01:38]load:0x40078000,len:8896
[22:01:38]load:0x40080400,len:5828
[22:01:38]entry 0x400806ac
[22:01:38]Ada
[22:02:16]Guru Meditation Error: Core  1 panic'ed (LoadProhibited). Exception was unhandled.
[22:02:16]Core 1 register dump:
[22:02:16]PC      : 0x4018a098  PS      : 0x00060030  A0      : 0x800ecc9a  A1      : 0x3ffb1ea0  
[22:02:16]A2      : 0x00000014  A3      : 0x00000000  A4      : 0x3ffd39f4  A5      : 0x084ae0f7  
[22:02:16]A6      : 0x3ffb1f0c  A7      : 0x3ffc59c0  A8      : 0x00000014  A9      : 0x3ffb1df0  
[22:02:16]A10     : 0x00000001  A11     : 0x084ae0f7  A12     : 0x084ae0f7  A13     : 0x3ffb0000  
[22:02:16]A14     : 0x3ffc30f0  A15     : 0x3ffc143c  SAR     : 0x00000010  EXCCAUSE: 0x0000001c  
[22:02:16]EXCVADDR: 0x0000001c  LBEG    : 0x4000c28c  LEND    : 0x4000c296  LCOUNT  : 0x00000000  
[22:02:16]
[22:02:16]Backtrace: 0x4018a098:0x3ffb1ea0 0x400ecc97:0x3ffb1ed0 0x400ead33:0x3ffb1f30 0x400f44a5:0x3ffb1f60 0x400f4ca2:0x3ffb1f90 0x4010f685:0x3ffb1fb0 0x4008a34d:0x3ffb1fd0
[22:02:16]
[22:02:16]Rebooting...

LoadProhibited means the CPU was trying to load from an illegal address in memory.

The CPU is trying to read from address 0x0000001c.

https://www.esp32.com/viewtopic.php?t=10403#:~:text=ed%20(LoadProhibited).-,Exception%20was%20unhandled.,an%20illegal%20address%20in%20memory.&text=The%20CPU%20was%20trying%20to,0x00000008%20which%20is%20definitely%20illegal.

pbolduc commented 3 years ago

If I got the matching firmware, this is where the bug is caused by

PC: 0x4018a098: PinManagerClass::isPinOk(unsigned char, bool) at wled00\pin_manager.cpp line 42
EXCVADDR: 0x0000001c

Decoding stack results
0x4018a098: PinManagerClass::isPinOk(unsigned char, bool) at wled00\pin_manager.cpp line 42
0x400ecc97: ArduinoJson6170_90::operator== (ArduinoJson6170_90::MemberProxy , unsigned char const&) at wled00\src/dependencies/json/ArduinoJson-v6.h line 2195
0x400ead33: handleNightlight() at wled00\led.cpp line 300
0x400f44a5: WLED::loop() at wled00\wled.cpp line 140
0x400f4ca2: WLED::instance() at wled00/wled.h line 592
0x4010f685: base64_encode_blockend at C:\Users\176920\.platformio\packages\framework-arduinoespressif32\cores\esp32\libb64\cencode.c line 94
0x4008a34d: vPortTaskWrapper at /home/runner/work/esp32-arduino-lib-builder/esp32-arduino-lib-builder/esp-idf/components/freertos/port.c line 143

Which doesn't really make sense, at least for isPinOk. What is the build definition that creates WLED_0.11.0_ESP32_ledpin16.bin in the releases?

pbolduc commented 3 years ago

in wled_server.cpp - https://github.com/Aircoookie/WLED/blob/3a3948e74fb4bd72e70f5516bd7296d98a3ab9e6/wled00/wled_server.cpp#L88-L90

and in presets.cpp - https://github.com/Aircoookie/WLED/blob/3a3948e74fb4bd72e70f5516bd7296d98a3ab9e6/wled00/presets.cpp#L9-L17

These two can occur concurrently. That is why generally in the latter, I like to code it more like,

JsonDocument* doc = fileDoc;
  if (doc) {
    errorFlag = readObjectFromFileUsingId("/presets.json", index, doc) ? ERR_NONE : ERR_FS_PLOAD;
    JsonObject fdo = doc->as<JsonObject>();

So that even if another thread/core nulls it out, you already captured value and using the reference. However, the error is not null reference, it trying access address 1c (28).

yetdog commented 3 years ago

For the record, the last few evenings I have switched to injecting "preset" changes via POSTing JSON. So I'm not really using presets; I've simply taken the JSON for each preset and used it to set the effects/brightness/segment info via HTTP POST. This seems to serve as a workaround at best, but now there's the issue of it dropping off my wifi, but not repeatedly or constantly. It's been up for 24h+, but stopped responding to a rolling ping I've had going last night at 4:25am, and in order to get it back I need to go outside and yank the plug.

andyshinn commented 3 years ago

but now there's the issue of it dropping off my wifi

Have you tried disabling WiFi sleep in WiFi Settings?

image

yetdog commented 3 years ago

but now there's the issue of it dropping off my wifi

Have you tried disabling WiFi sleep in WiFi Settings?

image

Absolutely - disabled on all of mine (power isn't an issue - at least I'm not that greedy @ this point) ;)

yetdog commented 3 years ago

I was able to reproduce using the web user interface. Create two presets, enable preset cycle and click between the two. Happens quick.

image

Preset 1

{"on":true,"bri":128,"transition":7,"mainseg":0,"seg":[{"id":0,"start":0,"stop":30,"grp":1,"spc":0,"on":true,"bri":255,"col":[[255,160,0],[0,0,0],[0,0,0]],"fx":68,"sx":128,"ix":128,"pal":0,"sel":true,"rev":false,"mi":false},{"stop":0},{"stop":0},{"stop":0},{"stop":0},{"stop":0},{"stop":0},{"stop":0},{"stop":0},{"stop":0},{"stop":0},{"stop":0},{"stop":0},{"stop":0},{"stop":0},{"stop":0}]}

Preset 2

{"on":true,"bri":128,"transition":7,"mainseg":0,"seg":[{"id":0,"start":0,"stop":30,"grp":1,"spc":0,"on":true,"bri":255,"col":[[255,160,0],[0,0,0],[0,0,0]],"fx":0,"sx":128,"ix":128,"pal":0,"sel":true,"rev":false,"mi":false},{"stop":0},{"stop":0},{"stop":0},{"stop":0},{"stop":0},{"stop":0},{"stop":0},{"stop":0},{"stop":0},{"stop":0},{"stop":0},{"stop":0},{"stop":0},{"stop":0},{"stop":0}]}
[22:01:38]rst:0xc (SW_CPU_RESET),boot:0x13 (SPI_FAST_FLASH_BOOT)
[22:01:38]configsip: 0, SPIWP:0xee
[22:01:38]clk_drv:0x00,q_drv:0x00,d_drv:0x00,cs0_drv:0x00,hd_drv:0x00,wp_drv:0x00
[22:01:38]mode:DOUT, clock div:2
[22:01:38]load:0x3fff0018,len:4
[22:01:38]load:0x3fff001c,len:1044
[22:01:38]load:0x40078000,len:8896
[22:01:38]load:0x40080400,len:5828
[22:01:38]entry 0x400806ac
[22:01:38]Ada
[22:02:16]Guru Meditation Error: Core  1 panic'ed (LoadProhibited). Exception was unhandled.
[22:02:16]Core 1 register dump:
[22:02:16]PC      : 0x4018a098  PS      : 0x00060030  A0      : 0x800ecc9a  A1      : 0x3ffb1ea0  
[22:02:16]A2      : 0x00000014  A3      : 0x00000000  A4      : 0x3ffd39f4  A5      : 0x084ae0f7  
[22:02:16]A6      : 0x3ffb1f0c  A7      : 0x3ffc59c0  A8      : 0x00000014  A9      : 0x3ffb1df0  
[22:02:16]A10     : 0x00000001  A11     : 0x084ae0f7  A12     : 0x084ae0f7  A13     : 0x3ffb0000  
[22:02:16]A14     : 0x3ffc30f0  A15     : 0x3ffc143c  SAR     : 0x00000010  EXCCAUSE: 0x0000001c  
[22:02:16]EXCVADDR: 0x0000001c  LBEG    : 0x4000c28c  LEND    : 0x4000c296  LCOUNT  : 0x00000000  
[22:02:16]
[22:02:16]Backtrace: 0x4018a098:0x3ffb1ea0 0x400ecc97:0x3ffb1ed0 0x400ead33:0x3ffb1f30 0x400f44a5:0x3ffb1f60 0x400f4ca2:0x3ffb1f90 0x4010f685:0x3ffb1fb0 0x4008a34d:0x3ffb1fd0
[22:02:16]
[22:02:16]Rebooting...

LoadProhibited means the CPU was trying to load from an illegal address in memory.

The CPU is trying to read from address 0x0000001c.

https://www.esp32.com/viewtopic.php?t=10403#:~:text=ed%20(LoadProhibited).-,Exception%20was%20unhandled.,an%20illegal%20address%20in%20memory.&text=The%20CPU%20was%20trying%20to,0x00000008%20which%20is%20definitely%20illegal.

For more context, I'm seeing this exact same thing in 0.10.2 as well on my ESP32. Certain effects / changes just seem to blow things up:

[08:45:57]Guru Meditation Error: Core  1 panic'ed (InstrFetchProhibited). Exception was unhandled.
[08:45:57]Core 1 register dump:
[08:45:57]PC      : 0x00000000  PS      : 0x00060830  A0      : 0x80100aa8  A1      : 0x3ffb1e10  
[08:45:57]A2      : 0x3ffc2518  A3      : 0x00000000  A4      : 0x3ffb1ee0  A5      : 0x00000020  
[08:45:57]A6      : 0x3ffc251c  A7      : 0x3ffb8f40  A8      : 0x801009dd  A9      : 0x3ffb1e00  
[08:45:57]A10     : 0x3ffb8f40  A11     : 0x3ffb1e4c  A12     : 0xa5e353f0  A13     : 0x3f80624d  
[08:45:57]A14     : 0x00000000  A15     : 0x00000000  SAR     : 0x0000001d  EXCCAUSE: 0x00000014  
[08:45:57]EXCVADDR: 0x00000000  LBEG    : 0x400029ac  LEND    : 0x400029cb  LCOUNT  : 0x00000000  
[08:45:57]
[08:45:58]Backtrace: 0x00000000:0x3ffb1e10 0x40100aa5:0x3ffb1f00 0x400ee0cc:0x3ffb1f30 0x4017ef65:0x3ffb1f50 0x400eed8d:0x3ffb1f70 0x400ef5de:0x3ffb1f90 0x40105649:0x3ffb1fb0 0x4008a091:0x3ffb1fd0
[08:45:58]
[08:45:58]Rebooting...

Starting to wonder if these boards might be crap.

pbolduc commented 3 years ago

Starting to wonder if these boards might be crap.

This is caused by a null reference. EXCVADDR: 0x00000000 This dump looks like a "use after free" error.

Guru Meditation Errors

InstrFetchProhibited This CPU exception indicates that CPU could not load an instruction because the the address of the instruction did not belong to a valid region in instruction RAM or ROM.

Usually this means an attempt to call a function pointer, which does not point to valid code. PC (Program Counter) register can be used as an indicator: it will be zero or will contain garbage value (not 0x4xxxxxxx).

LoadProhibited, StoreProhibited This CPU exception happens when application attempts to read from or write to an invalid memory location. The address which was written/read is found in EXCVADDR register in the register dump. If this address is zero, it usually means that application attempted to dereference a NULL pointer. If this address is close to zero, it usually means that application attempted to access member of a structure, but the pointer to the structure was NULL. If this address is something else (garbage value, not in 0x3fxxxxxx - 0x6xxxxxxx range), it likely means that the pointer used to access the data was either not initialized or was corrupted.

pbolduc commented 3 years ago

For more context, I'm seeing this exact same thing in 0.10.2 as well on my ESP32. Certain effects / changes just seem to blow things up

My speculation on this is the web server (runs on core 0 - pro cpu) the is attempting to modify the segment work area buffers while the effect is calculating on the app cpu (core 1). It just the start of my work day. I will see if I can find time to test this theory after work tonight.

pbolduc commented 3 years ago

I have made a small patch that appears to resolve this issue. Dont have time right now to create a PR. When segment_runtime.reset() is called, all it does is set a flag indicating that before the next time the effect function is called, the data should be cleared (this also clears out the data buffer). Then before running the effect function on the current segment, it resets if required.

image

image

The problem causing my crashes is above, whenever a effect on a segment/strip is changed, the data buffer used on the more advanced features is freed and nulled out. In certain effects, it assumes the buffer is allocated at the start of the function is is always valid. We have to ensure we do not free the data buffer while the effect function is executing. The more LEDs you have the higher likely hood the effect is currently running when the segment/strip's effect is changed.

Aircoookie commented 3 years ago

@pbolduc You can't imagine how grateful I am for you getting to the bottom of the/an issue! I would most likely not have thought in a multicore mindset. There might be more such race conditions in the software since all of it is written in the single-core "i can do this now since nothing else could try to access it simultaneously" mindset.

Indeed after doing a factory reset I was happy that I couldn't reproduce a single crash for 3 days. Switching beetween "Candle Multi" and "Colortwinkles" rapidly I can usually get it to reset after a few times and this gets worse with more LEDs as more time is spent in the effect function. Would love to incorporate that fix (if you do make a PR later, SEGENV.resetIfRequired() should come before SEGMENT.isActive() so that the data buffer of segments that were deleted is freed. If not, no worries - I hope I have your ok to include the fix manually 🙂)

Good point about the fileDoc pointer as well. That has potential to fail as well, albeit less than the data buffer issue. Just using a local pointer wouldn't fix the problem though, as the JsonDocument is scoped in a way that it is destructed just after the pointer is set to nullptr (Arduino JSON is using smart pointers - gotta admit those go over my head.) Filesystem could be an issue too - Both CPUs trying to access it simultaneously is most certainly no bueno. Simplest way I could think of mitigating this is a lockout flag, similar to your solution for the reset above. Basically the async core would set it before accessing the document/FS and if the main core tries to access it as well in the meantime, it will just return.

pbolduc commented 3 years ago

I was trying to play around with the atomic feature of C++ - https://www.cplusplus.com/reference/atomic/ many CPUs, ESP32 and ESP8266 have special cpu instructions that allow for synchronization across cores/threads. This can help when when you want do things like atomic_exchange to implement locks or latches.

Anyways, in a few hours when I am done my day job, I will clean up my working WLED directory and submit a PR for this specific segment reset issue in a few hours. Hopefully the people on this thread can hammer on it a bunch to see if it indeed fixes the change effect/preset issue then we can merge and get it into everyone's hand, 0.11.1 ?

yetdog commented 3 years ago

You guys are brilliant. Thank you so much. I've got it custom compiled and have been hammering on it here at my desk (no LEDs connected but it's configured with the proper count) with serial attached, and no reboots at all so far. Will pop it outside and see how things run tonight and report back.

enz1ey commented 3 years ago

This might be my issue as well, then. Sounds very similar to what I was seeing. #1333

Have any of you noticed the first few LEDs freezing white when this happens?

Aircoookie commented 3 years ago

@enz1ey yes, the first LEDs briefly turning white often happens on reset. Your issue might very well be the same as this one

yetdog commented 3 years ago

You guys are brilliant. Thank you so much. I've got it custom compiled and have been hammering on it here at my desk (no LEDs connected but it's configured with the proper count) with serial attached, and no reboots at all so far. Will pop it outside and see how things run tonight and report back.

FYI things worked well last night. I'm changing effects every 10m so it's certainly not "rapid", but when I'm in demo mode for the neighbors a crash is embarrassing ;). I've got two Christmas light controllers - one for 200 LED string lights in the front yard (ESP32 that I've been working on / testing), and an ESP8266 in the garage that runs the roof LED strips (900 LEDs) that I've yet to push @pbolduc's code to. Will do that today and then really start testing away tonight with more rapid changes for the purposes of showing off 👍

So far, so good!

CommittotheIndian commented 3 years ago

So I’ve been experiencing this issue since Saturday (it’s Monday here). I’m running WLED 11 on two DigUnos (ESP8266). I’m also transmitting xLights from an FPP controller.

On Saturday I set up 6 presets to cycle when my xLights show isn’t playing. Within 2 songs my controller disconnects. I usually have to power cycle to reconnect but not always.

I got the same issue when just using the preset effects. I believe my order was colorful, colortwinkles, can’t remember, merry Christmas, rainbow, and by this point it was crashed. I think it was crashing between colorful and colortwinkles.

I tried downgrading to 10.2 but still ran into the issue. It was an OTA downgrade not a clean install so I’ll try that.

Now I have a NodeMCU (at work) running presets on a single 5m strand with the same presets and have yet to have an issue. ✊🪵

andyshinn commented 3 years ago

I think maybe @pbolduc meant to link this issue in https://github.com/Aircoookie/WLED/pull/1465 (instead of itself). But the fix has been merged and can be compiled in the master branch. There isn't a formal released version with it, yet. I've been running with the fix for a couple days now and no crashes so far 🤞

blazoncek commented 1 year ago

There were quite a few changes for ESP32 since 2020. If the issue persist, please re-open with additional trace dumps.