Aircoookie / WLED

Control WS2812B and many more types of digital RGB LEDs with an ESP8266 or ESP32 over WiFi!
https://kno.wled.ge
MIT License
14.95k stars 3.23k forks source link

Resets with IDF 4.4/Arduino Core 2.x platform on network request #2518

Closed Aircoookie closed 10 months ago

Aircoookie commented 2 years ago

What happened?

WLED frequently resets upon network requests (e.g. to /json while loading main UI, applying presets...) while (SK6812) LEDs are on.

To Reproduce Bug

Compile WLED with the custom IDF 4.4 + arduino 2.0.2 platform courtesy of Tasmota (this branch). The LITTLEFS library by lorol is replaced by the built-in LittleFS in the 2.x.x arduino core.

The reset reason is the low level RMT interrupts failing during a network request, although I don't yet understand the exact cause. Exception decoder trace below.

This has been reproduced on ESP32 and ESP32-S2 and the platforms:

https://github.com/tasmota/platform-espressif32/releases/download/v2.0.2/platform-tasmota-espressif32-2.0.2.zip
https://github.com/tasmota/platform-espressif32/releases/download/v2.0.2.1/platform-tasmota-espressif32-2.0.2.1.zip
https://github.com/tasmota/platform-espressif32/releases/download/v2.0.2.2/platform-tasmota-espressif32-2.0.2.zip

Expected Behavior

No resets

Install Method

Self-Compiled

What version of WLED?

arduinocore2 branch

Which microcontroller/board are you seeing the problem on?

ESP32

Relevant log/trace output

Guru Meditation Error: Core  1 panic'ed (Cache disabled but cached memory region accessed). 

Core  1 register dump:
PC      : 0x4018be44  PS      : 0x00060035  A0      : 0x40086098  A1      : 0x3ffbf52c  
A2      : 0x3ffbe048  A3      : 0x00000000  A4      : 0x00000080  A5      : 0x00000001
A6      : 0x00000001  A7      : 0x00000000  A8      : 0x80081aff  A9      : 0x00000020  
A10     : 0x3ff56000  A11     : 0x3ffb34f0  A12     : 0xffffffff  A13     : 0x00000020
A14     : 0x3ffbe054  A15     : 0x3ffb3358  SAR     : 0x00000008  EXCCAUSE: 0x00000007  
EXCVADDR: 0x00000000  LBEG    : 0x400813e0  LEND    : 0x400813f4  LCOUNT  : 0x00000000

Backtrace:0x4018be41:0x3ffbf52c |<-CORRUPTED
  #0  0x4018be41:0x3ffbf52c in rmt_ll_enable_tx_err_interrupt at /home/runner/work/esp32-arduino-lib-builder/esp32-arduino-lib-builder/esp-idf/components/hal/esp32/include/hal/rmt_ll.h:253

Anything else?

@Jason2866 I suspect that this issue is due to a programming error on my end surfacing with the newer IDF/Arduino releases or possibly with the exchanged LittleFS library. Still, if you know what could be a likely cause, I'd be super happy to know as I see you did quite a few changes to sdkconfig (which I really like, as it drastically reduces binary size!)

My first guess was that the issue was due to CONFIG_DISABLE_HAL_LOCKS being set in 2.0.2.1, however the issue is also present in https://github.com/tasmota/platform-espressif32/releases/download/v2.0.2/platform-tasmota-espressif32-2.0.2.zip, where that option was unset as far as I can tell.

Code of Conduct

blazoncek commented 2 years ago

@Aircoookie I am not seeing any resets no matter how fast I hit WLED with /json requests on my fork. I have tried as fast as 10 reloads/s while simultaneously running a playlist with 600 LEDs attached and playlist switching presets every few seconds. I did rewrite preset loading logic and optimized JSON buffer usage so that hardly anything serious happen in network callbacks (apart from ArduinoJson handling).

My build environment is:

[env:wemos_d1_mini32]
board = esp32dev
platform = espressif32@3.4.0
board_build.partitions = tools/WLED_ESP32_4MB_1MB_FS.csv
board_build.filesystem = littlefs
build_unflags = ${common.build_unflags}
build_flags = ${common.build_flags_esp32} -D WLED_RELEASE_NAME=ESP32
  -D WLED_DISABLE_BLYNK
  -D WLED_DISABLE_CRONIXIE
  -D WLED_DISABLE_HUESYNC
  -D WLED_DISABLE_LOXONE
  -D USERMOD_DALLASTEMPERATURE
  -D TEMPERATURE_PIN=23
  -D LEDPIN=16
  -D RLYPIN=19
  -D BTNPIN=17
  -D IRPIN=18
  -D USERMOD_FOUR_LINE_DISPLAY
  -D USE_ALT_DISPlAY
  -D USERMOD_AUTO_SAVE
  ;-D USERMOD_ANIMATED_STAIRCASE
  -D USERMOD_PIRSWITCH
  -D PIR_SENSOR_PIN=26
  -D USERMOD_MULTI_RELAY  # 14, 15, 32
  -D MULTI_RELAY_MAX_RELAYS=3
  -UWLED_USE_MY_CONFIG
lib_deps = ${esp32.lib_deps}
  U8g2@~2.28.8
  paulstoffregen/OneWire@~2.3.6
  Wire

[env:wemos_d1_mini32_debug]
extends = env:wemos_d1_mini32
build_flags = ${common.build_flags_esp32} -D WLED_RELEASE_NAME=ESP32
  -D WLED_DEBUG
  -D LEDPIN=16
  -D RLYPIN=19
  -D BTNPIN=17
  -D IRPIN=18
  -D USERMOD_DALLASTEMPERATURE
  -D TEMPERATURE_PIN=23
  -D USERMOD_FOUR_LINE_DISPLAY
  -D USERMOD_ROTARY_ENCODER_UI
  -D USE_ALT_DISPlAY
  -D USERMOD_AUTO_SAVE
  -D USERMOD_PIRSWITCH
  -D PIR_SENSOR_PIN=36
  -D USERMOD_MULTI_RELAY  # 14, 15, 32
  -D MULTI_RELAY_MAX_RELAYS=3
  -D STATUSLED=-1
  -D WLED_DISABLE_BROWNOUT_DET
  -UWLED_USE_MY_CONFIG
Aircoookie commented 2 years ago

@blazoncek does not happen with espressif32@3.4.0 on my end either. This issue is exclusive to arduino core 2.0.2, or IDF 4.4, or the custom Tasmota library configuration

Jason2866 commented 2 years ago

@Aircoookie Looks like a bug in the Arduino core 2.0.x. Core 2.0.x is heavily refactored. I doubt that the problem is related to wled code. You can check if the issues are related to our customization with this setup.

platform = https://github.com/tasmota/platform-espressif32/releases/download/v2.0.2.2/platform-tasmota-espressif32-2.0.2.zip
platform_packages = framework-arduinoespressif32 @ https://github.com/espressif/arduino-esp32.git

Using this setup uses latest orig. Arduino ESP32 from espressif repo

EDIT: Just asked our dev for the "Lights" part in Tasmota. We do not have issues in Tasmota when using RMT

Aircoookie commented 2 years ago

@Jason2866 Thank you for the info! I've tried that setup and the issue persists, so we can rule out your customizations as a possible cause. I agree that the RMT refactorings seems to be the likely cause, I'll try if I can come up with a MVCE for the core team.

blazoncek commented 2 years ago

I had opened an issue with @makuna a while ago and it turns out the latest RMT code (also) has problems feeding data to RMT buffers. Especially when ISRs take more time than needed. I don't recall issue # but he had it open with ESP32 core regarding that.

s-hadinger commented 2 years ago

FYI I can't reproduce this on Tasmota using RMT and latest Core/idf, however hard I hammer it on the network side. Tested on Atom Matrix (Esp32 Pico D4) with an animation over 25 leds every 50ms. This does not confirm a bug in Arduino or IDF

blazoncek commented 2 years ago

Found the thread... https://github.com/Makuna/NeoPixelBus/issues/524#issuecomment-942557735

Makuna commented 2 years ago

@Aircoookie Do a symbol dump on that method "rmt_ll_enable_tx_err_interrupt", if you actually find it listed what is the address? It shouldn't be listed (it is marked inline) but sometimes the compiler will ignore the suggestion of inline. There is a known GNU C++ bug that static inline functions may not honor attributes (inherited or specifically marked) and thus not be placed into IRAM memory location. Thus when run will require it to be loaded from cache (note your error). I ran into this myself and had to do some shenanigans to get the methods marked with IRAM (no longer inline, no longer listed in a header).

Aircoookie commented 2 years ago

@Makuna thank you for the helpful pointer! Not entirely positive this is what you are referring to, but as expected, nm /.pio/build/esp32dev/libFrameworkArduino.a does not list this function, and does not list any other rmt_ll functions either, however it does list various other RMT functions from file esp32-hal-rmt.c.o: nm_framework.txt

I am also confused about this line in the error: Backtrace:0x4018be41:0x3ffbf52c |<-CORRUPTED, other instances of the error Cache disabled but cached memory region accessed I researched never had a corrupted backtrace.

@s-hadinger thank you for testing! The difference might very well be that WLED uses ESPAsyncWebserver, which in itself operates on interrupts upon network requests while Tasmota uses (as far as I can tell) the synchronous ESP8266Webserver.

Aircoookie commented 2 years ago

Perhaps whatever function is inlining rmt_ll_enable_tx_err_interrupt is not placed in IRAM...

Aircoookie commented 2 years ago

Not have conclusive evidence, but it seems like this only happens if the network request accesses flash data in any way (filesystem operation, or progmem strings)

Aircoookie commented 2 years ago

Also strange that core 1 (handling network interrupts) panics. I would have assumed RMT interrupts to use core 0.

kroimon commented 2 years ago

I have the issue with 0.13.1 where it frequently crashes and reboots when being controlled via Home Assistant, but I am not sure if it is due to the same reason. It definitely has to do with API calls though.

The Environment string on the info pages says "esp32 V3.3.6-16-gcc5440f6a2".

Is there anything I can do to help debug the issue?

Jason2866 commented 2 years ago

Latest Tasmota framework (based on 2.0.3 rc1 and IDF 4.4.1) can be used with

platform = https://github.com/tasmota/platform-espressif32/releases/download/v2.0.3rc1/platform-espressif32-2.0.3new.zip
Kyry11 commented 2 years ago

I have built latest master using the above "v2.0.3rc1" from Tasmota. It produces the same result on my M5Stamp (C3U) as described by @Aircoookie at the top. Constant restarts, seemingly on any request sent from UI.

Kyry11 commented 2 years ago

I have now also tried a build with the just released 2.0.3 (not rc) from Tasmota. Same result as above, unsurprisingly.

fallspectrum commented 2 years ago

Getting some trace output with the latest official (non-RC) 2.0.3 tasmota release (https://github.com/tasmota/platform-espressif32/releases/tag/v.2.0.3) when replicating this issue on esp32-c3 (previously with 2.0.2 I was only getting stack memory dump, mentioned here: https://github.com/Aircoookie/WLED/issues/2596#issuecomment-1120123902)

Though, unsure if the trace output is accurate, based on WIP PR: https://github.com/platformio/platform-espressif32/pull/612

Guru Meditation Error: Core  0 panic'ed (Load access fault). Exception was unhandled.

Core  0 register dump:
MEPC    : 0x42048a02  RA      : 0x42048a56  SP      : 0x3fcad5a0  GP      : 0x3fc8b400  
=> 0x42048a02: String::buffer() const at /Users/trav/.platformio/packages/framework-arduinoespressif32/cores/esp32/WString.h:328 (discriminator 2)
      (inlined by) String::c_str() const at /Users/trav/.platformio/packages/framework-arduinoespressif32/cores/esp32/WString.h:236 (discriminator 2)
      (inlined by) AsyncWebServerRequest::_removeNotInterestingHeaders() at .pio/libdeps/esp32c3/ESP Async WebServer/src/WebRequest.cpp:182 (discriminator 2)
=> 0x42048a56: LinkedList<AsyncWebHeader*, LinkedListNode>::Iterator::operator++() at .pio/libdeps/esp32c3/ESP Async WebServer/src/StringArray.h:53 (discriminator 3)
      (inlined by) AsyncWebServerRequest::_removeNotInterestingHeaders() at .pio/libdeps/esp32c3/ESP Async WebServer/src/WebRequest.cpp:181 (discriminator 3)
TP      : 0x3fc89348  T0      : 0x40388812  T1      : 0xffffffe0  T2      : 0xffffffff  
=> 0x40388812: vPortExitCritical at /home/runner/work/esp32-arduino-lib-builder/esp32-arduino-lib-builder/esp-idf/components/freertos/port/riscv/port.c:281
S0/FP   : 0x3fcb151c  S1      : 0x3fcb19d4  A0      : 0x00000001  A1      : 0x6254be41  
A2      : 0x3fcb1870  A3      : 0x00000001  A4      : 0x3fc99954  A5      : 0x600c2000  
A6      : 0x47314252  A7      : 0x34392f76  S2      : 0x3fc9a1c4  S3      : 0x3fcb19e4  
=> 0x47314252: ?? ??:0
S4      : 0x3fcb1a80  S5      : 0x00000001  S6      : 0x0000000a  S7      : 0x00000004  
S8      : 0x00000000  S9      : 0x00000000  S10     : 0x00000000  S11     : 0x00000000  
T3      : 0x4f4e6f4d  T4      : 0x45203a79  T5      : 0x654b2d74  T6      : 0x656b636f  
=> 0x4f4e6f4d: ?? ??:0
=> 0x45203a79: ?? ??:0
MSTATUS : 0x00001881  MTVEC   : 0x40380001  MCAUSE  : 0x00000005  MTVAL   : 0x6254be50  
=> 0x40380001: _vector_table at ??:?
MHARTID : 0x00000000  

Stack memory:
3fcad5a0: 0x00000000 0x00000000 0x00000000 0x0404b1be 0x0000000a 0x00000001 0x3fcb1a08 0x00000002
3fcad5c0: 0x3fcb0bf7 0x00000002 0x3fcb19d4 0x42048ad8 0x3fcb0bf7 0x00000002 0x3fcb1a08 0x4204f1be
=> 0x42048ad8: AsyncWebServerRequest::_parseLine() at .pio/libdeps/esp32c3/ESP Async WebServer/src/WebRequest.cpp:567
      (inlined by) AsyncWebServerRequest::_parseLine() at .pio/libdeps/esp32c3/ESP Async WebServer/src/WebRequest.cpp:549
=> 0x4204f1be: String::concat(char const*) at /Users/trav/.platformio/packages/framework-arduinoespressif32/cores/esp32/WString.cpp:365
3fcad5e0: 0x0000000a 0x00000001 0x3fcb1a08 0x00000002 0x3fcb0bf7 0x00000002 0x3fcb19d4 0x42048bd6
=> 0x42048bd6: AsyncWebServerRequest::_onData(void*, unsigned int) at .pio/libdeps/esp32c3/ESP Async WebServer/src/WebRequest.cpp:122
3fcad600: 0x3fcb1900 0x3c0d2000 0x3fcb19d4 0x4204899c 0x00000001 0x3fc92e18 0x42046a68 0x42046b8e
=> 0x4204899c: AsyncWebServerRequest::AsyncWebServerRequest(AsyncWebServer*, AsyncClient*) at .pio/libdeps/esp32c3/ESP Async WebServer/src/WebRequest.cpp:77
=> 0x42046a68: std::_Function_base::_Base_manager<AsyncWebServerRequest::AsyncWebServerRequest(AsyncWebServer*, AsyncClient*)::{lambda(void*, AsyncClient*)#8}>::_M_manager(std::_Any_data&, std::_Function_base::_Base_manager<AsyncWebServerRequest::AsyncWebServerRequest(AsyncWebServer*, AsyncClient*)::{lambda(void*, AsyncClient*)#8}> const&, std::_Manager_operation) at /Users/trav/.platformio/packages/toolchain-riscv32-esp@8.4.0+2021r2-patch3/riscv32-esp-elf/include/c++/8.4.0/bits/std_function.h:196
=> 0x42046b8e: std::_Function_handler<void (void*, AsyncClient*), AsyncWebServerRequest::AsyncWebServerRequest(AsyncWebServer*, AsyncClient*)::{lambda(void*, AsyncClient*)#8}>::_M_invoke(std::_Any_data const&, void*&&, AsyncClient*&&) at /Users/trav/.platformio/packages/toolchain-riscv32-esp@8.4.0+2021r2-patch3/riscv32-esp-elf/include/c++/8.4.0/bits/std_function.h:295
3fcad620: 0x00000002 0x3fcb196c 0x3fcb1958 0x00000001 0x00000000 0x3fcb09c4 0x3fcb1900 0x42048da6
=> 0x42048da6: std::_Function_handler<void (void*, AsyncClient*, void*, unsigned int), AsyncWebServerRequest::AsyncWebServerRequest(AsyncWebServer*, AsyncClient*)::{lambda(void*, AsyncClient*, void*, unsigned int)#7}>::_M_invoke(std::_Any_data const&, void*&&, AsyncClient*&&, std::_Any_data const&, unsigned int&&) at /Users/trav/.platformio/packages/toolchain-riscv32-esp@8.4.0+2021r2-patch3/riscv32-esp-elf/include/c++/8.4.0/bits/std_function.h:299
3fcad640: 0x00000000 0x3fcb09c4 0x3fcb1900 0x42040db0 0x3fcb19d4 0x3fcb1900 0x3fcb0a12 0x000001e7
=> 0x42040db0: AsyncClient::_recv(tcp_pcb*, pbuf*, signed char) at .pio/libdeps/esp32c3/AsyncTCP/src/AsyncTCP.cpp:970
3fcad660: 0x00000002 0x00000005 0x3fc99000 0x3fc99664 0x00000001 0x3fcb1900 0x3fcab7a8 0x42040e94
=> 0x42040e94: AsyncClient::_s_recv(void*, tcp_pcb*, pbuf*, signed char) at .pio/libdeps/esp32c3/AsyncTCP/src/AsyncTCP.cpp:1248
3fcad680: 0x00000001 0x3fcb1900 0x3fcab7a8 0x42041538 0x00000000 0x3fcab7a8 0x00000000 0x00000000
=> 0x42041538: _async_service_task(void*) at .pio/libdeps/esp32c3/AsyncTCP/src/AsyncTCP.cpp:166
      (inlined by) _async_service_task at .pio/libdeps/esp32c3/AsyncTCP/src/AsyncTCP.cpp:201
3fcad6a0: 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000
3fcad6c0: 0x00000000 0x00000000 0x00000000 0x403887ba 0x00000000 0x00000000 0x00000000 0x00000000
=> 0x403887ba: prvTaskExitError at /home/runner/work/esp32-arduino-lib-builder/esp32-arduino-lib-builder/esp-idf/components/freertos/port/riscv/port.c:157
3fcad6e0: 0x00000000 0xa5a5a5a5 0xa5a5a5a5 0xa5a5a5a5 0xffffff02 0x000000ff 0x27146414 0x01010107
3fcad700: 0x00000001 0x00000000 0x35d1cfdc 0x00000000 0x3f15c000 0x3f800000 0x3fc00000 0x4b000000
=> 0x4b000000: ?? ??:0
3fcad720: 0xcb000000 0x000000ff 0x8800759c 0x7e37e43c 0x7149f2ca 0x00000000 0x00000000 0xfff00000
3fcad740: 0x00000000 0x7ff00000 0x60000000 0x3ff71547 0xf85ddf44 0x3e54ae0b 0x00000000 0x3fd00000
3fcad760: 0x55555555 0x3fd55555 0x652b82fe 0x3ff71547 0x00000000 0x43400000 0x4a454eef 0x3fca7e28
=> 0x43400000: ?? ??:0
=> 0x4a454eef: ?? ??:0
3fcad780: 0x93c9db65 0x3fcd864a 0xa91d4101 0x3fd17460 0x518f264d 0x3fd55555 0xdb6fabff 0x3fdb6db6
3fcad7a0: 0x33333303 0x3fe33333 0xe0000000 0x3feec709 0xdc3a03fd 0x3feec709 0x145b01f5 0xbe3e2fe0
3fcad7c0: 0x652b82fe 0x3c971547 0xc2f8f359 0x01a56e1f 0x00000000 0x3fe62e43 0xfefa39ef 0x3fe62e42
3fcad7e0: 0x0ca86c39 0xbe205c61 0x72bea4d0 0x3e663769 0xc5d26bf1 0x3ebbbd41 0xaf25de2c 0x3f11566a
3fcad800: 0x16bebd93 0x3f66c16c 0x5555553e 0x3fc55555 0xcc000000 0x4c000000 0x3f317180 0x3717f7d1
=> 0x4c000000: ?? ??:0
3fcad820: 0x3eaaaaab 0x3e178897 0x3e3a3325 0x3e924925 0x3f2aaaab 0x3e1cd04f 0x3e638e29 0x3ecccccd
3fcad840: 0x3fb8aa00 0x36eca570 0x3fb8aa3b 0x4b800000 0x3e53f142 0x3e6c3255 0x3e8ba305 0x3edb6db7
=> 0x4b800000: ?? ??:0
3fcad860: 0x3f19999a 0x3f763800 0x3f76384f 0x369dc3a0 0x3338aa3c 0x0da24260 0x3f317200 0x3f317218
3fcad880: 0x35bfbe8c 0x3331bb4c 0x35ddea0e 0x388ab355 0x3b360b61 0x3e2aaaab 0x00000000 0x7ff80000
3fcad8a0: 0x00000000 0x43500000 0x8800759c 0xfe37e43c 0xc2f8f359 0x81a56e1f 0x00000000 0x3c900000
=> 0x43500000: ?? ??:0
3fcad8c0: 0x7fc00000 0xf149f2ca 0x8da24260 0x33000000 0x7f7fffff 0x00000000 0xffc00000 0x41dfffff
=> 0x41dfffff: ?? ??:0
3fcad8e0: 0x94a03595 0x3fdfffff 0x35afe535 0x3fe00000 0x94a03595 0x3fcfffff 0x00000000 0x3ff80000
3fcad900: 0x636f4361 0x3fd287a7 0x8b60c8b3 0x3fc68a28 0x509f79fb 0x3fd34413 0x00000000 0x401c0000
=> 0x401c0000: ?? ??:0
3fcad920: 0x00000000 0x40140000 0x00000010 0x00000000 0xa5a5a5a5 0x00000154 0x3fcad3b0 0x3fc9a1c4
=> 0x40140000: ?? ??:0
3fcad940: 0x3fc95f44 0x3fc95f44 0x3fcad938 0x3fc95f3c 0x00000016 0x3fcab888 0x3fcab888 0x3fcad938
3fcad960: 0x00000000 0x00000003 0x3fcab934 0x6e797361 0x63745f63 0xf97a0070 0x0005d81a 0x00000000
3fcad980: 0x3fcad930 0x00000003 0x00000000 0x3fcb1e00 0x4205b1b8 0x00000000 0x3fc9aac0 0x3fc9ab28
=> 0x4205b1b8: pthread_local_storage_thread_deleted_callback at /home/runner/work/esp32-arduino-lib-builder/esp32-arduino-lib-builder/esp-idf/components/pthread/pthread_local_storage.c:129
stale[bot] commented 2 years ago

Hey! This issue has been open for quite some time without any new comments now. It will be closed automatically in a week if no further activity occurs. Thank you for using WLED!

softhack007 commented 2 years ago
 (inlined by) String::c_str() const at /Users/trav/.platformio/packages/framework-arduinoespressif32/cores/esp32/WString.h:236 (discriminator 2)
 (inlined by) AsyncWebServerRequest::_removeNotInterestingHeaders() at .pio/libdeps/esp32c3/ESP Async WebServer/src/WebRequest.cpp:182 (discriminator 2)

reminds me of the pending use-after-free bug in Async Webserver : https://github.com/me-no-dev/ESPAsyncWebServer/issues/951 and proposed fix in https://github.com/me-no-dev/ESPAsyncWebServer/pull/952

blazoncek commented 2 years ago

@Aircoookie perhaps you could pull this into your fork?

Jason2866 commented 2 years ago

@softhack007 Is there a easy (without connecting any leds) way to reproduce the resets? I would like to test with a recent builded framework (using a newer idf commit than espressif uses in release arduino core 2.0.5). Many bug fixes in IDF Maybe the actual Tasmota version does work?

platform = https://github.com/tasmota/platform-espressif32/releases/download/v2.0.5.1/platform-espressif32-2.0.5.1.zip

Probably fixes wifi stability issues for esp32 and esp32c3

softhack007 commented 2 years ago

Hi @Jason2866, I think the problem in async webserver was found by code analysis and inspection. So it might be hard to reliably reproduce the crash. Sorry I don't have a clear scenario for you, maybe someone else here can help with that.

We have several users reporting such crashes, and the stacktrace provided by @fallspectrum clearly points into the same direction. It seems to happen with idf v4.4.x and arduino-esp32 2.0.x on new MCUs like s2, c3, but also seen on "classic" esp32. Maybe memory management in idf was changed or it is just pure luck that we don't see such crashes with older frameworks.

The root cause seems to be a use-after-free coding error in async webserver, so I see very little chance that it can be solved with updates in other components.


Edit: it could be that we are talking about two different bugs here, as the original report from @Aircoookie was having a crash in a different area:

0 0x4018be41:0x3ffbf52c in rmt_ll_enable_tx_err_interrupt at /home/runner/work/esp32-arduino-lib-builder/esp32-arduino-lib-builder/esp-idf/components/hal/esp32/include/hal/rmt_ll.h:253

I've seen this one on 'classic ESP32' and on -s3 when using new framework. My solution was to use latest NeoPixelBus (master).

Jason2866 commented 2 years ago

There are changes in IDF regarding rmt. The changes are in latest Tasmota core.

blazoncek commented 2 years ago

I have reviewed AsyncWebServer's way of handling unneeded headers and although not apparent there is a possibility of deleting a pointer while still using it. I would have rewritten AsyncWebServerRequest::_removeNotInterestingHeaders() to use more explicit iterator for loop. There is something not right within AsyncWebServer as it works very odd on ESP32-S2 as I was testing it. Sometimes it would stall for a second or more other times it would just not return anything to a web request. Especially if LittleFS access was also involved.

I did not debug thoroughly yet to see if error lies in WLED code (callbacks) or AsyncWebServer itself.

Aircoookie commented 2 years ago

@Aircoookie perhaps you could pull this into your fork?

Done. https://github.com/Aircoookie/ESPAsyncWebServer.git @ ~2.0.6 Let's see if it helps!

blazoncek commented 2 years ago

Tested, it does no harm on ESP32 but it also does not help with web server on ESP32-S2. I am beginning to suspect that the problem lies lower in the WiFi stack. Possibly in WiFi itself as I can occasionally see inability to connect to a hardcoded SSID.

Jason2866 commented 2 years ago

We have zero issues with the S2 and wifi. It does connect fine with DHCP and fixed ip addresses. Connection is stable too. If AP is switched off reconnecting does work too. The issue you encounter is coming from "somewhere". We do not use EasyAsyncWebServer. Imho it is the reason.

blazoncek commented 2 years ago

Do not get me wrong, my ESP32-S2 (Lolin Wemos S2 mini) always connects to my WiFi after a cold boot, but after warm boot not always. I am also seeing it drop off of network while it is still reacting to button presses and plays WLED effects. In such cases I can also sometimes observe there is no more debug data dumped to serial interface. Since debug is dumped in the same loop() as WLED effects are run or buttons are registered this becomes weirder still. Lolin S2 mini uses CDC USB which is inoperative until firmware boots or fails for some reason.

Jason2866 commented 2 years ago

Yes, and this is related to WLED. Zero issues cold or warmboot or whatever the S2 is running without a hickup with Tasmota. The CDC is working too, no problem.

softhack007 commented 2 years ago

Yes, and this is related to WLED.

As you only talk about Tasmota, I think what it means is "related to a scenario we don't have in Tasmota". This could be many things, I guess. Some aspects that we should think about:

Just a few ideas that might help us dig down to the root cause of crashes.

blazoncek commented 2 years ago

I am guessing that the most probable suspects for issues are LittleFS and Async* library (could also be AsyncTCP as it is used by AsyncWebServer).

If I have S2 unit run a single effect it will run happily until rebooted or powered off. But as soon as I run playlist which loads presets from LittleFS every so often it will soon lock up/reboot (without any UI/web interaction). Similar thing happens if I start playing with UI (trying not to trigger any LittleFS use) some page loads will stall and/or never complete loading (files from PROGMEM strings).

Unfortunately I am no good at JTAG debugging nor have proper equipment so can't use that.

BTW, yesterday I added a few DEBUG_PRINT commands to web server callbacks and a single call to captivePortal() takes a ms or two on S2 while it completes in the same ms on regular ESP32. Funny thing is that some of those DEBUG_PRINTs never output anything which baffles me.

Jason2866 commented 2 years ago

It is NOT LittleFS. That is sure. We use it in every Tasmota32 version as default file system and it is used heavily.

think about:

blazoncek commented 2 years ago

I have ruled out NeoPixelBus (2.6.9) as LEDs are updated correctly without issues. Lockups/reboots occur with or without LEDs being updated. I am not sold out on LittleFS, though. It may be accessed from within network callback which is interrupt driven (correct? I am not sure about that).

Below is partial debug output (trimmed for clarity and comments added).

---WLED 0.14.0-bl0 2209261 INIT---
esp32 v4.4.1-472-gc9140caf8c
arduino-esp32 v2.0.4
CPU:   ESP32-S2 rev.0, 1 core(s), 240MHz.
FLASH: 4MB, Mode 2 (DIO), speed 80MHz.
heap 132376
Total PSRAM: 2045kB
Free PSRAM : 2045kB
PIN ALLOC: Pin 39 successfully allocated by 0x89 (137)
Registering usermods ...
Reading config
Reading settings from /wsec.json...
JSON buffer locked. (3)
JSON buffer released. (3)
JSON buffer locked. (1)
Reading settings from /cfg.json...
Removing all.
PIN ALLOC: Pin 16 successfully allocated by 0x82 (130)
Successfully inited strip 0 (len 150) with type 22 and pins 16,255 (itype 17)
PIN ALLOC: Pin 40 successfully allocated by 0x82 (130)
Successfully inited strip 1 (len 150) with type 22 and pins 40,255 (itype 17)
PIN ALLOC: Pin 0 successfully allocated by 0x85 (133)
PIN ALLOC: Pin 18 successfully allocated by 0x85 (133)
PIN ALLOC: Pin 7 successfully allocated by 0x86 (134)
PIN ALLOC: Pin 9 successfully allocated by 0x87 (135)
Starting usermod config.
Temperature config loaded.
PIRsensorSwitch config loaded.
4LineDisplay config loaded.
Autosave config (re)loaded.
JSON buffer released. (1)
Initializing strip
-- Copy segment constructor --
Destroying segment:
-- Copy segment constructor --
-- Move segment constructor --
Destroying segment:
Destroying segment:
-- Copy segment constructor --
Destroying segment:
Usermods setup
Allocating temperature pin...
PIN ALLOC: Pin 11 successfully allocated by 0x03 (3)
PIN ALLOC: Pin 5 successfully allocated by 0x05 (5)
PIN ALLOC: Pin 4 successfully allocated by 0x1f (31)
PIN ALLOC: Pin 2 successfully allocated by 0x1f (31)
PIN ALLOC: Pin 6 successfully allocated by 0x1f (31)
lastReconnectAttempt == 0
Access point disabled (init).
Connecting to kristan...
Not connected AP.
Opening access point WLED-AP
Init AP interfaces
Last reconnect too old.
Connecting to kristan...
---DEBUG INFO---
Runtime: 30000
Unix time: 30,000
Free heap: 55492
Total PSRAM: 2045kB
Free PSRAM: 2038kB
Wifi state: 0
State time: 30002
NTP last sync: 999000000
Client IP: 0.0.0.0
Loops/sec: 1518
UM time[ms]: 0/2
Strip time[ms]: 0/1
Segments: 2 -> 288B
Modes: 4*175=700B
Data: 4*175=700B
Map: 2*0=0B

Connected! IP address: 192.168.70.58
Init STA interfaces
mDNS started
Access point disabled (handle).
send NTP
NTP recv, l=48
Sunrise: 06:51
Sunset: 18:49
Local time: 15:42
Trigger time: 06:51
Trigger time: 18:49
...
(above is boot sequence)
...
Local time: 15:49
Trigger time: 06:51
Trigger time: 18:49
JSON buffer locked. (14)
-- Copy segment constructor --
Destroying segment:
JSON buffer released. (14)
---DEBUG INFO---
Runtime: 480092
Unix time: 1664286569,700
Free heap: 41088
Total PSRAM: 2045kB
Free PSRAM: 2039kB
Wifi state: 3
State time: 60008
NTP last sync: 33629
Client IP: 192.168.70.58
Loops/sec: 1433
UM time[ms]: 0/2
Strip time[ms]: 0/8
Segments: 2 -> 152B
Modes: 4*175=700B
Data: 4*175=700B
Map: 2*0=0B
JSON buffer locked. (14)
-- Copy segment constructor --
Destroying segment: 8 (0x3fd81f88)      <----- this is odd, should be empty
JSON buffer released. (14)
---DEBUG INFO---
Runtime: 510098
Unix time: 1664286599,705
Free heap: 41088
Total PSRAM: 2045kB
Free PSRAM: 2039kB
Wifi state: 3
...
(web request to turn WLED on and start playing effect Gravcenter, which needs Audioreactive usermod)
...
Client IP: 192.168.70.58
Loops/sec: 1025
UM time[ms]: 0/2
Strip time[ms]: 0/5
Segments: 2 -> 152B
Modes: 4*175=700B
Data: 4*175=700B
Map: 2*0=0B
[AR userLoop] hickup detected -> was inactive for last 26 millis!
---DEBUG INFO---
Runtime: 570127
Unix time: 1664286659,735
Free heap: 41088
...
(a hiccup detected in usermod)
...
---DEBUG INFO---
Runtime: 750177
Unix time: 1664286839,785
Free heap: 41084
Total PSRAM: 2045kB
Free PSRAM: 2039kB
Wifi state: 3
State time: 60008
NTP last sync: 33629
Client IP: 192.168.70.58
Loops/sec: 1087
UM time[ms]: 0/2
Strip time[ms]: 0/4
Segments: 2 -> 152B
Modes: 4*175=700B
Data: 4*175=700B
Map: 2*0=0B
Local time: 15:54
Trigger time: 06:51
Trigger time: 18:49
Disconnected!
Access point disabled (init).
Connecting to kristan...
Last reconnect too old.
Access point disabled (init).
Connecting to kristan...
[AR userLoop] hickup detected -> was inactive for last 59 millis!
[AR userLoop] hickup detected -> was inactive for last 618 millis!
Last reconnect too old.
Access point disabled (init).
Connecting to kristan...
[AR userLoop] hickup detected -> was inactive for last 614 millis!
...
(a disconnect from WiFi but LEDs are still updated, AR usermod still samples and processes audio)

From this point on WiFi can no longer reconnect (credentials hardcoded into binary) but WLED still reacts to buttons and I2S interface. Apart form one web POST request to select effect and turn LED display on (using Packet sender), there were no clients connected to the unit.

Jason2866 commented 2 years ago

If i see mDNS in espressif arduino code, i have the No1 candidate for issues. We removed in all builds (esp8266 and esp32x). Since this day all strange issues we had in the past are gone. mDNS is a mess. It will never work reliable. Edit: Only use the LittleFS version which is included with the framework. The Lorol version is not fully compatible anymore with core 2.0.x

blazoncek commented 2 years ago

Commented out mDNS and look what happens. USB serial no longer works, WLED is running effect as normal, web UI is no longer accessible after a few minutes.

https://youtu.be/u47B5J08EAg

Jason2866 commented 2 years ago

Strange problem when using ESPAsyncWebServer with a S3 https://github.com/espressif/arduino-esp32/issues/7268 Imho ESPAsyncWebServer is just working with ESP8266 / ESP32

blazoncek commented 2 years ago

Using S2 with very selective HTTP JSON API calls (and keeping it at minimum) I can operate it stable for hours. So the problem is definitely network related (not necessarily in network stack).

blazoncek commented 2 years ago

Looks like ESP32-C3 works flawlessly with:

platform = espressif32@5.1.1
framework = arduino
board = esp32-c3-devkitm-1
board_build.partitions = tools/WLED_ESP32_4MB_1MB_FS.csv
build_flags = ${common.build_flags} ${esp32c3.build_flags} #-D WLED_RELEASE_NAME=ESP32-C3
  -D WLED_WATCHDOG_TIMEOUT=0
Kyry11 commented 2 years ago

Looks like ESP32-C3 works flawlessly with:

platform = espressif32@5.1.1
framework = arduino
board = esp32-c3-devkitm-1
board_build.partitions = tools/WLED_ESP32_4MB_1MB_FS.csv
build_flags = ${common.build_flags} ${esp32c3.build_flags} #-D WLED_RELEASE_NAME=ESP32-C3
  -D WLED_WATCHDOG_TIMEOUT=0

OMG! This is amazing! I have been waiting to get my C3 M5 Stamp to drive my WLED installation for months.. and now it's all working with no restarts!

Thank you so much for the config - it indeed works flawlessly! 🎉

blazoncek commented 2 years ago

There is an issue with classic ESP32 and writing to LittleFS while there are interrupts involved (e.g. driving LEDS). Whenever you try to save preset core 1 panics (cache access). Using platform espressif32@5.1.1 (core 2.0.4, IDF 4.4).

13bporter commented 2 years ago

Only able to get 2 pins to actually work. I can rearrange them to different pins and they'll work but when I add a 3rd one it freezes up when saving and reverts back to what I had before. I've tried driving 30leds on each channel but that doesn't seem to make a difference.

blazoncek commented 2 years ago

Only able to get 2 pins to actually work. I can rearrange them to different pins and they'll work but when I add a 3rd one it freezes up when saving and reverts back to what I had before. I've tried driving 30leds on each channel but that doesn't seem to make a difference.

What has this to do with OP?

softhack007 commented 10 months ago

WLED is working well on many -S3/-S2/-C3 with platform 5.3.0. I assume that bugs in the core libraries were causing problems when network requests and RMT interrupts collided.

Closing.