espressif / esp-idf

Espressif IoT Development Framework. Official development framework for Espressif SoCs.
Apache License 2.0

[ESP-WIFI-MESH] Guru Meditation Error found when Fixed root node was not available. (IDFGH-13169) #14109

Open andy-danieal opened 4 months ago

andy-danieal commented 4 months ago

Answers checklist.

General issue report

I encountered an issue that was not resolved after RTC_SW_CPU_RST. The only thing that worked was a hard reset, which meant powering off the device.

ESP-Cache-Issue.txt

I (11013294) mesh: <WIFI_EVENT_SCAN_DONE>status:fail, num:3, id:199
I (11013295) mesh: [IO]enable self-organizing, search parent<adaptive>
Guru Meditation Error: Core  0 panic'ed (Memory protection fault). 
Unknown operation at address 0xffffffff not permitted (INVALID_MEM_TYPE)

Core  0 register dump:
PC      : 0x4003229f  PS      : 0x00060a34  A0      : 0x800320ff  A1      : 0x3fff50a0  
A2      : 0x3ffd5d0c  A3      : 0x3fff58d0  A4      : 0x00060a20  A5      : 0x00000000  
A6      : 0x00000000  A7      : 0x3fff58c8  A8      : 0x7ffeb19c  A9      : 0x00000000  
A10     : 0x00000001  A11     : 0xffffffff  A12     : 0x00000000  A13     : 0x00000000  
A14     : 0x00000001  A15     : 0x3ffdffd8  SAR     : 0x0000001d  EXCCAUSE: 0x00000007  
EXCVADDR: 0x00000000  LBEG    : 0x00000000  LEND    : 0x00000000  LCOUNT  : 0x40025314  

Backtrace: 0x4003229c:0x3fff50a0 0x400320fc:0x3fff50c0 0x40025b7e:0x3fff50e0 0x400346a1:0x3fff5100 0x400faf5f:0x3fff5120 0x400e9ef9:0x3fff5140 0x400ce221:0x3fff5170 0x400d480a:0x3fff52e0 0x4002d8b6:0x00000000 |<-CORRUPTED

ELF file SHA256: 8765ae40b9c74068

Rebooting...
ESP-ROM:esp32s2-rc4-20191025
Build:Oct 25 2019
rst:0xc (RTC_SW_CPU_RST),boot:0x8 (SPI_FAST_FLASH_BOOT)
Saved PC:0x400254b2
SPIWP:0xee
mode:DIO, clock div:1
load:0x3ffe6110,len:0x186c
load:0x4004b000,len:0x4
load:0x4004b004,len:0xb0c
load:0x4004f000,len:0x3268
entry 0x4004b1cc
I (26) boot: ESP-IDF v5.2.1-dirty 2nd stage bootloader
I (26) boot: compile time Jun 14 2024 11:58:06
I (26) boot: chip revision: v1.0
I (30) boot.esp32s2: SPI Speed      : 80MHz
I (35) boot.esp32s2: SPI Mode       : DIO
I (40) boot.esp32s2: SPI Flash Size : 4MB
I (44) boot: Enabling RNG early entropy source...
I (50) boot: Partition Table:
I (53) boot: ## Label            Usage          Type ST Offset   Length
I (61) boot:  0 nvs              WiFi data        01 02 0000d000 00004000
I (68) boot:  1 otadata          OTA data         01 00 00011000 00002000
I (76) boot:  2 phy_init         RF data          01 01 00013000 00001000
I (83) boot:  3 edf              WiFi data        01 02 00014000 00004000
I (90) boot:  4 storage          Unknown data     01 81 00018000 00020000
I (98) boot:  5 ota_0            OTA app          00 10 00040000 00170000
I (105) boot:  6 ota_1            OTA app          00 11 001b0000 00170000
I (113) boot: End of partition table
I (117) esp_image: segment 0: paddr=00040020 vaddr=3f000020 size=4a5dch (304604) map
I (187) esp_image: segment 1: paddr=0008a604 vaddr=3ffc9bc0 size=03e34h ( 15924) load
I (190) esp_image: segment 2: paddr=0008e440 vaddr=40024000 size=01bd8h (  7128) load
I (194) esp_image: segment 3: paddr=00090020 vaddr=40080020 size=e1c78h (924792) map
I (385) esp_image: segment 4: paddr=00171ca0 vaddr=40025bd8 size=13fe8h ( 81896) load
I (417) boot: Loaded app from partition at offset 0x40000
I (417) boot: Disabling RNG early entropy source...
I (428) cpu_start: Unicore app
I (428) cache: Instruction cache     : size 8KB, 4Ways, cache line size 32Byte
I (431) cache: Data cache        : size 8KB, 4Ways, cache line size 32Byte
Guru Meditation Error: Core  0 panic'ed (Cache disabled but cached memory region accessed). 

Core  0 register dump:
PC      : 0x40039b8d  PS      : 0x00060d34  A0      : 0x80025ea2  A1      : 0x3fffe260  
A2      : 0x02000000  A3      : 0x00000000  A4      : 0x02000000  A5      : 0x00000003  
A6      : 0x00000000  A7      : 0x00000000  A8      : 0x00000014  A9      : 0x00000001  
A10     : 0x00000028  A11     : 0x00000028  A12     : 0x00000800  A13     : 0x00000003  
A14     : 0x00000000  A15     : 0x00000004  SAR     : 0x00000007  EXCCAUSE: 0x00000007  
EXCVADDR: 0x00000000  LBEG    : 0x00000800  LEND    : 0x00000003  LCOUNT  : 0x40025314  

Backtrace: 0x40039b8a:0x3fffe260 0x40025e9f:0x3fffe270 0x40085de1:0x3fffe290 0x400864a8:0x3fffe2b0 0x40025293:0x3fffe2f0 0x40050439:0x3fffe330 0x40050879:0x3fffe370 0x4004b235:0x3fffe4a0 0x40011371:0x3fffe560 0x4000f788:0x3fffe6f0 0x400072a5:0x3fffe710

zhangyanjiaoesp commented 4 months ago

@andy-danieal Can you provide the .elf file of the firmware that was running when the issue happened?
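
For readers following along: the ELF file matters because the `Backtrace:` line in the panic output only lists raw `PC:SP` address pairs, and the symbols needed to turn those PCs into function names live in the ELF that matches the crashing firmware. The following is a minimal host-side sketch in C, not part of ESP-IDF (`backtrace_pcs` is a hypothetical helper name), that pulls the PC values out of such a line so they can be passed to the toolchain's addr2line tool together with the ELF.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Extract the PC half of every "PC:SP" pair found in a panic "Backtrace:" line.
 * Hypothetical helper for illustration only; it is not an ESP-IDF API.
 * Returns the number of PC values stored in out (at most max_out). */
size_t backtrace_pcs(const char *line, unsigned long *out, size_t max_out)
{
    assert(line != NULL && out != NULL);

    /* Start after the "Backtrace:" label when it is present. */
    const char *p = strstr(line, "Backtrace:");
    p = (p != NULL) ? p + strlen("Backtrace:") : line;

    size_t n = 0;
    while (n < max_out) {
        const char *tok = strstr(p, "0x");
        if (tok == NULL) {
            break;                      /* no more hex values on the line */
        }
        char *end = NULL;
        unsigned long pc = strtoul(tok, &end, 16);
        if (*end != ':') {
            p = tok + 2;                /* not a PC:SP pair, keep scanning */
            continue;
        }
        out[n++] = pc;
        strtoul(end + 1, &end, 16);     /* consume the SP half of the pair */
        p = end;
    }
    return n;
}
```

Applied to the first backtrace in this report, the helper would return nine PC values, from 0x4003229c through 0x4002d8b6; trailing text such as `|<-CORRUPTED` is ignored because it contains no `0x` token.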

andy-danieal commented 4 months ago

I have sent you an email. Please check your email and let me know if you have any questions.

zhangyanjiaoesp commented 4 months ago

@andy-danieal I didn't receive any email that includes a .elf file.

andy-danieal commented 4 months ago

We sent an email to Ma Li on Jul 1, 2024. Please check that email.

Also, we are having the same issue with the root mesh node: after being active for two days, it stops working. Every 10 seconds, 30 ESP devices send a command and receive a response.

ceerqingtingml commented 4 months ago

I am Ma Li. I didn't receive your email. If you prefer not to upload the elf file to GitHub, you can send it directly to zhangyanjiao@espressif.com.

andy-danieal commented 4 months ago

@zhangyanjiaoesp, I have sent an email. Please check it.

zhangyanjiaoesp commented 4 months ago

@andy-danieal That's weird. I didn't get your email either. You can upload your elf file here and share the link here.

You can also set the deletion conditions: image

andy-danieal commented 4 months ago

@zhangyanjiaoesp,

Download Link

zhangyanjiaoesp commented 4 months ago

I have downloaded it.

zhangyanjiaoesp commented 4 months ago

@andy-danieal Is this elf file from the firmware that was running when the crash occurred? Was it the root or a child node?

andy-danieal commented 4 months ago

ESP-Cache-Issue.txt

We have already shared the log from the child node where the crash occurred.

zhangyanjiaoesp commented 4 months ago

@andy-danieal Please use this wifi lib to test, thanks. wifi_lib_s2_0710.zip

wifi firmware version: fabad8c

andy-danieal commented 4 months ago

@zhangyanjiaoesp, How do I add this wifi lib to an existing project? Can you please share the steps with us?

zhangyanjiaoesp commented 4 months ago

@andy-danieal Replace the wifi libs in idf/components/esp_wifi/lib/esp32s2 with the ones from the zip file.

andy-danieal commented 4 months ago

@zhangyanjiaoesp, We have tested the wifi lib, but we are still getting the same issue. We noticed that the issue appears after the device has been running for 6 to 7 hours.

Also, we have attached the logs we captured using PuTTY. Unfortunately, we were unable to capture the entire log due to disconnections. We will attempt to recreate the issue and capture the complete log.

zhangyanjiaoesp commented 4 months ago

OK, waiting for your logs. I have added some debug logs in the wifi lib, which may help us find the root cause.

andy-danieal commented 3 months ago

@zhangyanjiaoesp,

We have attached a log file. This issue was found on the root node. gw-19-7.txt

zhangyanjiaoesp commented 3 months ago

@andy-danieal Are you using the wifi lib I provided? Can you enable the wifi info-level log? I can't get any useful information (including the debug information I added) from the log. Please also enable the following option so that we can get some backtrace information when the crash happens. image

andy-danieal commented 3 months ago

@zhangyanjiaoesp, Thank you for your support. We have added a root node for a long-term test, and unfortunately, the original issue has not reappeared. However, we did encounter another issue.


Guru Meditation Error: Core  0 panic'ed (StoreProhibited). Exception was unhandled.

Core  0 register dump:
PC      : 0x40032a62  PS      : 0x00060d33  A0      : 0x80032400  A1      : 0x3ffe14e0  
A2      : 0x3fff7ab4  A3      : 0x00000030  A4      : 0x3f03cac2  A5      : 0x00000006  
A6      : 0x00000000  A7      : 0x3ffd5a1c  A8      : 0x00000001  A9      : 0x3ffd5a1c  
A10     : 0x00000002  A11     : 0x01000217  A12     : 0x00000024  A13     : 0x3ffd5a40  
A14     : 0x00000001  A15     : 0x4002c0e4  SAR     : 0x0000001f  EXCCAUSE: 0x0000001d  
EXCVADDR: 0x01000223  LBEG    : 0x00000024  LEND    : 0x3ffd5a40  LCOUNT  : 0x40026b5c  

Backtrace: 0x40032a5f:0x3ffe14e0 0x400323fd:0x3ffe1500 0x40025a05:0x3ffe1520 0x40025a60:0x3ffe1540 0x40025a95:0x3ffe1560 0x40034905:0x3ffe1580 0x400285c1:0x3ffe15a0 0x400e2d67:0x3ffe15c0 0x400e2dac:0x3ffe15e0 0x400e3872:0x3ffe1600 0x400e8259:0x3ffe17c0 0x400ea0c6:0x3ffe17e0 0x400e8467:0x3ffe1840 0x40037fa1:0x3ffe1860 0x400360a4:0x3ffe1880 0x4002e1c6:0x3ffe18b0

As per the core dump (core-dump-decode.txt):

==================== CURRENT THREAD STACK =====================
#0  remove_free_block (sl=1, fl=2, block=0x3fff7ab4, control=0x3ffd5a1c) at C:/Espressif/frameworks/esp-idf-v5.2.1/components/heap/tlsf/tlsf.c:332
#1  block_locate_free (size=<optimized out>, control=<optimized out>) at C:/Espressif/frameworks/esp-idf-v5.2.1/components/heap/tlsf/tlsf.c:567
#2  tlsf_malloc (tlsf=0x3ffd5a1c, size=<optimized out>) at C:/Espressif/frameworks/esp-idf-v5.2.1/components/heap/tlsf/tlsf.c:1005
#3  0x40032400 in multi_heap_malloc_impl (heap=0x3ffd5a08, size=48) at C:/Espressif/frameworks/esp-idf-v5.2.1/components/heap/multi_heap.c:210
#4  0x40025a08 in heap_caps_malloc_base (size=48, caps=6144) at C:/Espressif/frameworks/esp-idf-v5.2.1/components/heap/heap_caps.c:179
#5  0x40025a63 in heap_caps_malloc (size=48, caps=6144) at C:/Espressif/frameworks/esp-idf-v5.2.1/components/heap/heap_caps.c:202
#6  0x40025a98 in heap_caps_malloc_default (size=48) at C:/Espressif/frameworks/esp-idf-v5.2.1/components/heap/heap_caps.c:228
#7  0x40034908 in malloc (size=48) at C:/Espressif/frameworks/esp-idf-v5.2.1/components/newlib/heap.c:24
#8  0x400285c4 in wifi_malloc (size=48) at C:/Espressif/frameworks/esp-idf-v5.2.1/components/esp_wifi/esp32s2/esp_adapter.c:65
#9  0x400e2d6a in mesh_malloc ()
#10 0x400e2daf in esp_mesh_create_context ()
#11 0x400e3875 in esp_mesh_wifi_recv_cb ()
#12 0x400e825c in hostap_deliver_data ()
#13 0x400ea0c9 in hostap_input ()
#14 0x400e846a in ap_rx_cb ()
#15 0x40037fa4 in ppRxPkt ()
#16 0x400360a7 in ppTask ()
#17 0x4002e1c9 in vPortTaskWrapper (pxCode=0x40035fb4 <ppTask>, pvParameters=0x0) at C:/Espressif/frameworks/esp-idf-v5.2.1/components/freertos/FreeRTOS-Kernel/portable/xtensa/port.c:134

zhangyanjiaoesp commented 3 months ago

@andy-danieal Please configure heap memory debugging as shown below and test again, thanks.

image
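
The screenshot is not reproduced here, so the exact setting that was requested is unknown. As background, the core dump above faults in `remove_free_block` inside `tlsf_malloc`, which is the kind of crash that typically shows up when heap metadata was damaged earlier by an overflow or a use-after-free. ESP-IDF's heap debugging features (heap poisoning, plus `heap_caps_check_integrity_all()`) aim to catch such damage closer to where it happens. The block below is only a host-side sketch of the general canary idea behind that kind of check, not ESP-IDF's implementation; the `guarded_*` names and the canary values are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Arbitrary marker values chosen for this sketch. */
#define GUARD_HEAD 0xC0FFEE01u
#define GUARD_TAIL 0xDEADBEEFu

/* Bookkeeping stored directly in front of the user buffer. */
typedef struct {
    uint32_t head;   /* canary before the user data */
    uint32_t size;   /* user-visible size in bytes */
} guard_head_t;

/* Allocate size bytes surrounded by a head and a tail canary. */
void *guarded_malloc(size_t size)
{
    const size_t overhead = sizeof(guard_head_t) + sizeof(uint32_t);
    if (size > UINT32_MAX - overhead) {
        return NULL;                         /* size field would overflow */
    }
    uint8_t *raw = malloc(overhead + size);
    if (raw == NULL) {
        return NULL;
    }
    guard_head_t h = { GUARD_HEAD, (uint32_t)size };
    uint32_t tail = GUARD_TAIL;
    memcpy(raw, &h, sizeof h);
    memcpy(raw + sizeof h + size, &tail, sizeof tail);
    return raw + sizeof h;
}

/* Return true when both canaries around ptr are still intact. */
bool guarded_check(const void *ptr)
{
    const uint8_t *user = ptr;
    guard_head_t h;
    uint32_t tail;
    memcpy(&h, user - sizeof h, sizeof h);
    if (h.head != GUARD_HEAD) {
        return false;                        /* underflow or stray write */
    }
    memcpy(&tail, user + h.size, sizeof tail);
    return tail == GUARD_TAIL;               /* false on buffer overflow */
}

/* Release a buffer obtained from guarded_malloc. */
void guarded_free(void *ptr)
{
    if (ptr != NULL) {
        free((uint8_t *)ptr - sizeof(guard_head_t));
    }
}
```

With a scheme like this, a write of even one byte past a 48-byte buffer flips the tail canary, so the next `guarded_check` on that buffer fails and points at the overflowing allocation rather than at a later, unrelated `malloc` call.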

Sherry616 commented 2 months ago

Thanks for reporting. We will close this issue due to lack of feedback; feel free to reopen it with more updates. Thanks for using our Espressif product!

andy-danieal commented 1 month ago

@zhangyanjiaoesp, We tested a fixed root node, and it hung after running for 10 to 20 days; it worked fine again after being power-cycled. The same issue occurred on child devices when the root node hung; some devices were affected, but not all.

Attached are the root node's UART dump and log file. log.txt log-uart-dump.txt

andy-danieal commented 1 month ago

Any update?

zhangyanjiaoesp commented 1 month ago

@andy-danieal The current crash issue appears to be different from the previous one. Are you using the version where the debug logs were added last time?

@andy-danieal Please use this wifi lib to test, thanks. wifi_lib_s2_0710.zip

wifi firmware version: fabad8c

andy-danieal commented 1 month ago

@zhangyanjiaoesp, In both cases, the first exception was a Memory protection fault. After that, the device reset continuously due to the "Cache disabled but cached memory region accessed" exception. We couldn't recover the device without power-cycling it, which is a terrible situation on the client side.

We can't update that library on the client side, but we will replicate the setup and wait for the issue to reproduce.

Also, we need a way to recover from the cached memory region exception without power-cycling the device. What is the root cause of that error?

zhangyanjiaoesp commented 1 month ago

@andy-danieal Although each crash is caused by accessing an illegal address, the disassembly paths differ each time. This is likely due to memory corruption, which makes it difficult to identify the root cause without a reliable reproduction method. The cached memory region exception is a consequence of the first crash, so we need to pinpoint the root cause of the first crash. Could you provide a stable reproduction method?