Aircoookie / WLED

Control WS2812B and many more types of digital RGB LEDs with an ESP8266 or ESP32 over WiFi!
https://kno.wled.ge
MIT License
14.57k stars 3.12k forks source link

esp32 heap corruption (caused by brightness calculation?) #734

Closed sansillusion closed 3 years ago

sansillusion commented 4 years ago

HI, thanks for the great code ! I run a esp8266 with 120 ws2812b and file server enabled without issues. I aslo run a esp32 with 89 ws2812b anf file server enabled but get random reboots. I have just tried to pull latest mster and recompile using cpu at 160mhz instead of 240 like before and flash at 40mhz instead of 80mhz. It did not fix the issue unfortunately but I had pulled arduino serial monitor just for fun and did not expect anything since I dont have debug enabled but got the following after a little while (mabe 20min)...

CORRUPT HEAP: Bad head at 0x3ffd9fcc. Expected 0xabba1234 got 0x3ffda740
abort() was called at PC 0x40083809 on core 0

ELF file SHA256: 0000000000000000000000000000000000000000000000000000000000000000

Backtrace: 0x4008a3f8:0x3ffaf780 0x4008a675:0x3ffaf7a0 0x40083809:0x3ffaf7c0 0x4008392d:0x3ffaf7f0 0x4011797f:0x3ffaf810 0x40113d11:0x3ffafad0 0x40108c8d:0x3ffafb20 0x40092031:0x3ffafb50 0x40083e82:0x3ffafb70 0x40083749:0x3ffafb90 0x4000bec7:0x3ffafbb0 0x401422a7:0x3ffafbd0 0x4014232b:0x3ffafbf0 0x40142373:0x3ffafc10 0x4014341a:0x3ffafc30 0x40143482:0x3ffafc50 0x400f31f5:0x3ffafc70 0x4013d82c:0x3ffafc90 0x4008f431:0x3ffafcc0

Rebooting...
ets Jun  8 2016 00:22:57

rst:0xc (SW_CPU_RESET),boot:0x13 (SPI_FAST_FLASH_BOOT)
configsip: 0, SPIWP:0xee
clk_drv:0x00,q_drv:0x00,d_drv:0x00,cs0_drv:0x00,hd_drv:0x00,wp_drv:0x00
mode:DIO, clock div:2
load:0x3fff0018,len:4
load:0x3fff001c,len:1044
load:0x40078000,len:10044
load:0x40080400,len:5872
entry 0x400806ac
Ada

I will try erasing flash just in case but I suspect it might be esp32 related.

sansillusion commented 4 years ago

backtrace decoded :

Decoding 19 results
0x4008a3f8: invoke_abort at /home/runner/work/esp32-arduino-lib-builder/esp32-arduino-lib-builder/esp-idf/components/esp32/panic.c line 695
0x4008a675: abort at /home/runner/work/esp32-arduino-lib-builder/esp32-arduino-lib-builder/esp-idf/components/esp32/panic.c line 695
0x40083809: lock_acquire_generic at /home/runner/work/esp32-arduino-lib-builder/esp32-arduino-lib-builder/esp-idf/components/newlib/locks.c line 143
0x4008392d: _lock_acquire_recursive at /home/runner/work/esp32-arduino-lib-builder/esp32-arduino-lib-builder/esp-idf/components/newlib/locks.c line 171
0x4011797f: _vfiprintf_r at /Users/ivan/e/newlib_xtensa-2.2.0-bin/newlib_xtensa-2.2.0/xtensa-esp32-elf/newlib/libc/stdio/../../../.././newlib/libc/stdio/vfprintf.c line 860 (discriminator 2)
0x40113d11: fiprintf at /Users/ivan/e/newlib_xtensa-2.2.0-bin/newlib_xtensa-2.2.0/xtensa-esp32-elf/newlib/libc/stdio/../../../.././newlib/libc/stdio/fiprintf.c line 50
0x40108c8d: __assert_func at /Users/ivan/e/newlib_xtensa-2.2.0-bin/newlib_xtensa-2.2.0/xtensa-esp32-elf/newlib/libc/stdlib/../../../.././newlib/libc/stdlib/assert.c line 59 (discriminator 8)
0x40092031: multi_heap_free at /home/runner/work/esp32-arduino-lib-builder/esp32-arduino-lib-builder/esp-idf/components/heap/multi_heap_poisoning.c line 315
0x40083e82: heap_caps_free at /home/runner/work/esp32-arduino-lib-builder/esp32-arduino-lib-builder/esp-idf/components/heap/heap_caps.c line 232
0x40083749: _free_r at /home/runner/work/esp32-arduino-lib-builder/esp32-arduino-lib-builder/esp-idf/components/newlib/syscalls.c line 42
0x401422a7: mem_free at /home/runner/work/esp32-arduino-lib-builder/esp32-arduino-lib-builder/esp-idf/components/lwip/lwip/src/core/mem.c line 176
0x4014232b: do_memp_free_pool at /home/runner/work/esp32-arduino-lib-builder/esp32-arduino-lib-builder/esp-idf/components/lwip/lwip/src/core/memp.c line 459
0x40142373: memp_free at /home/runner/work/esp32-arduino-lib-builder/esp32-arduino-lib-builder/esp-idf/components/lwip/lwip/src/core/memp.c line 489
0x4014341a: tcp_close_shutdown at /home/runner/work/esp32-arduino-lib-builder/esp32-arduino-lib-builder/esp-idf/components/lwip/lwip/src/core/tcp.c line 1759
0x40143482: tcp_close at /home/runner/work/esp32-arduino-lib-builder/esp32-arduino-lib-builder/esp-idf/components/lwip/lwip/src/core/tcp.c line 1759
0x400f31f5: _tcp_close_api(tcpip_api_call_data*) at C:\Users\Steve\Documents\Arduino\libraries\AsyncTCP-master\src/AsyncTCP.cpp line 874
0x4013d82c: tcpip_thread at /home/runner/work/esp32-arduino-lib-builder/esp32-arduino-lib-builder/esp-idf/components/lwip/lwip/src/api/tcpip.c line 483
0x4008f431: vPortTaskWrapper at /home/runner/work/esp32-arduino-lib-builder/esp32-arduino-lib-builder/esp-idf/components/freertos/port.c line 355 (discriminator 1)
sansillusion commented 4 years ago

Still get reboots after erased flash. "Lake" effect seems to speed up crash. I was on lake clicked "preview" preview loaded ok, clicked again to stop preview and it crashed.

sansillusion commented 4 years ago

I have compiled without the file server option and it seems to be stable for now, will report back if it crashes.

weini22 commented 4 years ago

I´m experiencing reboots with the provide esp32 binary since release 0.9.1. I had no such issues with 0.9 b1.

sansillusion commented 4 years ago

Still crashed even with file server disabled. Just took a little longer.

Aircoookie commented 4 years ago

@sansillusion sorry for the issue!

Are you using MQTT or the Alexa integration? Sometimes the "Enable MQTT" checkbox in Sync settings is active even though no server is configured. For Alexa, changing the Invocation name from "Light" to something else may help. If you don't use it, I would even recommend completely disabling it.

sansillusion commented 4 years ago

None of these settings are activated. I only use sync to sync the esp8266. I will compile a version with everything disabled just to see but I do plan on using other functionality in the future.

weini22 commented 4 years ago

I now tested with MQTT disabled: Same issue, it reboots after 3-10h. I even reboots when it is switchedd to "off" in the GUI. Not sure, if it processes the animations in this case at all.

weini22 commented 4 years ago

If you want to replicate the issue: The time until reboot seems to depend on the effect chosen. When I run the "Pride 2015", it reboots after 1-2h. When I use other, less resource intensive effects, it takes much longer.

sansillusion commented 4 years ago

Still crarshes even when all features disabled in compile time. Just take a little longer. Looks like a potential memory leak or maybe multitasking issue. I am just an amateur so I can't really help except by testing.

sansillusion commented 4 years ago

I find it funny that my esp8266 is not crashing even with more features enabled.

Aircoookie commented 4 years ago

I'll have to do some serious debugging to do on this one.... @sansillusion what is your esp32 arduino core version?

sansillusion commented 4 years ago

I will have to check once I get back home. I had pulled the latest master Saturday.

sansillusion commented 4 years ago

The esp8266 is still going strong without crashes but the esp32 in unstable unfortunately. Current esp32 core version : v3.3.1-61-g367c3c09c

sansillusion commented 4 years ago

I have tried multiple different configuration options and it always crashes. from the decoded backtrace it looks like it would be AsyncTCP that is bugged but I am no programmer and dont really know where to look for a solution.

sansillusion commented 4 years ago

I also got these in the console : 19:27:26.160 -> [W][AsyncTCP.cpp:630] _poll(): rx timeout 4 19:27:43.706 -> [W][AsyncTCP.cpp:630] _poll(): rx timeout 4

sansillusion commented 4 years ago

These seem to have appeared when I enabled NTP time server.

weini22 commented 4 years ago

Impressive, looks to me you identified the root cause. I disabled the NTP option and have now 10h uptime with continous usage. That was not possible before. I´ll keep you posted how this develops further on.

Aircoookie commented 4 years ago

Interesting, my ESP32 testing device has 26 days uptime with NTP disabled. Just enabled it and it synced the time successfully. Now we'll see if it reboots.

sansillusion commented 4 years ago

I tried disabling NTP yesterday, it did last much longer before a crash but still crashed 24h later. Much better than before but still a heap issue. I have done some digging and only found a closed bug report (that was never fixed) where esp32 Asynchronous web server was found to create a heap corruption. Someone found that lwip for eap32 is not meant for multi process, and that a complete rewrite of lwip and many libraries using it would be necessary. The only solution provided was to expect a heap corruption after usage. Does not look good to me.

Le jeu. 19 mars 2020 1 h 19 p.m., Aircoookie notifications@github.com a écrit :

Interesting, my ESP32 testing device has 26 days uptime with NTP disabled. Just enabled it and it synced the time successfully. Now we'll see if it reboots.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/Aircoookie/WLED/issues/734#issuecomment-601309075, or unsubscribe https://github.com/notifications/unsubscribe-auth/AHBFDCVSKHSQRY6YF2FNSRTRIJH3PANCNFSM4K5GFO6Q .

Aircoookie commented 4 years ago

@sansillusion you are right. After enabling NTP yesterday it did not crash for 14 more hours. However I noticed that the free heap was 177776 so I rebooted and right after rebooting the heap was reported as 202028, a 24kB difference that can only be explained with a memory leak. Can you send me the link to that closed issue?

sansillusion commented 4 years ago

That is what I found... https://github.com/espressif/arduino-esp32/issues/1101

weini22 commented 4 years ago

I see, we´re not yet there. My esp32 rebootet after about 26h (with NTP disabled).

weini22 commented 4 years ago

It looks to me like the issue is partially gone with 0.10! If I disable the ntp server, I´ve 3 days uptime now. That was unthinkable with 0.9.1. However, when I´ve the ntp server activated, it still reboots after a few hours. Free heap size is always around 194-196k on my esp32.

As I don´t need the ntp server, I´m more than happy personally!

@Aircoookie: Thanks a lot again for your great work!

sansillusion commented 4 years ago

Ok I think I finally found what was my problem. Up till recently they where still crashing regularly. When I changed router to see if it was the issue I also changed some settings in one of my wled device but not the second one. Well turns out deactivating "Auto brightness adjustment" fixed my issue. No more reboot or hang in more than a week after these changes. I just changed the setting in my second device since it was still crashing/rebooting often. Will report back in a while to see if it worked on the second one too this would confirm that for me.

sansillusion commented 4 years ago

The only side effect that gives me (it's not a problem at all) is that home assistant reports a current usage of 0ma. Now even running Hyperion for a week it still respond all the time. Btw thanks for your great work again ! You have made such a complete solution for lighting i love it !

Aircoookie commented 4 years ago

Thank you for finding a link to the brightness limiter and awesome that you seem to be able to get rid of the problem! Yeah, the current calculation is only active if the limiter is enabled. What current limit did you have the brightness adjuster set to when the issues occured?

sansillusion commented 4 years ago

Go in settings/led preferences and uncheck "Enable automatic brightness limiter" I am noe at over a week without rebooting/hanging.

sansillusion commented 4 years ago

@Aircoookie I had it set at 7000 since my power supply is 8A (kept it safe) but realised that with only 61 LED's I could run it at full brightness without risking going over. I decided to turn it off to try to maximize cpu cycles so Hyperion would run as smoothly as possible. Now no more reboot/hang and Hyperion is smooth as f%#&.

sansillusion commented 4 years ago

Yep ! I have not had any crash with none of my esp32's ! So it does seem that the feature that calculate the current consumption is part of the issue. If you think you get the issue fixed just let me know I'll give it a test. But I am now extremely happy with WLED ! It's such a complete solution ! And now that I can have home assistant and Hyperion working without a single issue I can keep improving my setup and add more controllers to the mix ! Thanks a lot for your great work. Will do another donation soon. (Next month)

Legsmaniac commented 4 years ago

This is really strange. I don't understand how a simple current calculation can cause a crash. I've been having some crash and WiFi drop problems too so nevertheless, I decided to try this. Sadly, it didn't make any difference for me but what did was turning off Alexa and/or making sure the MCU is properly cleared prior to a clean upload of WLED. More research needed I think.

weini22 commented 4 years ago

I fully agree with @Legsmaniac: My reboot issues are gone since I disabled the "send to Alexa" and "send to hue". I still have the current calculation active all the time and have now uptimes of 20+ days. I also experienced the situation that I had reboot issues even with Alexa disabled after changing and saving some settings. Simply changes something -> changing back -> saving again, solved that issue. And I always had at least 185k of heap space available on my NodeMCU when I checked the setting.

My personal conclusions:

  1. I don´t believe that this is a heap issue, it looks more to me that a certain combination of features is causing the issue when they´re activated at the same time
  2. There seems to be a sporadic issue with saved settings are causing stability problems. It seems to be not based on the settings, but looks more like the saved representation is causing the issue.

Having a backup/restore option for the full set of configuration data would be a great support to further track this down. However, I have a fully stable setup now with 2 NodeMCUs, so I´m more than happy.

sansillusion commented 4 years ago

I am not saying that it's a definitive solution for everyone but for me disabling Alexa or Hue, in fact I tried disabling everything I could and nothing worked for me. But now that I have disabled the current limiting feature, everything else is enabled and the esp32's I am running are all stable and very responsive now. Before disabling the current limiting feature, I had some stiffness in Hyperion but now It's live and smooth.

sansillusion commented 4 years ago

I must add that after disabling current limiting, my second esp crashed a few hours after but I had not rebooted the device after the change. Once rebooted it stayed connected till today (my fault I rebooted my router trying to find what was that new esp connected to my router, turns out it was my new sonoff switch lol)

stale[bot] commented 3 years ago

Hey! This issue has been open for quite some time without any new comments now. It will be closed automatically in a week if no further activity occurs. Thank you for using WLED!