Open InterfacerCompany opened 6 months ago
Here is my SDK config file:
We have found that with more than 12 devices on the same network the Wi-Fi crash seems to happen more frequently. With 25 devices, about 2 of the 25 crash each day at random hours; recovery sometimes takes 4 hours and sometimes 3 minutes, without a reboot and without any explanation.
Some crashes do not recover: the Wi-Fi interface shows as connected, but there is no TCP/IP or UDP reply.
BLE still works perfectly in this case, but it is NimBLE.
We are using 5.2.1 and have not yet found the root cause.
Hi @filzek @InterfacerCompany, so your crash happens on both v5.1.3 and v5.2.1.
Could you describe your application scenario? I assume you are using NimBLE and Wi-Fi coexistence, connecting to a TCP server, and uploading data at regular intervals?
Hi @Xiehanxin,
We've been working hard to ensure the WiFi driver and the lwIP stack don't cause any issues. We've made some intriguing discoveries, but we're not ready to share details just yet. Tomorrow, we'll conduct an extensive debugging session across 57 devices that have been specially configured via the sdkconfig. This setup aims to stabilize the system and prevent crashes, and we're hopeful it will resolve the issues.
As of now, the latest firmware build, which only includes changes to the SDKCONFIG, has been running smoothly without any problems for the past 48 hours. Although these adjustments might seem a bit unusual or counterintuitive, they appear to be effective. We'll provide a more detailed update once we verify the fixes in tomorrow's session.
Also, we have added these boards to ESP Insights. We can share the results if you want to be part of the dashboard, or, if you already have access, it is dashboard-id=18558382-e648-46c0-9a8e-f9243b2f0dd0
We still have a problem with NimBLE power: sometimes the power gets very low, around -50 to -80 dBm.
Hi @Xiehanxin. I haven't tested on v5.2.1 yet. My project previously used v5.0.6, with a similar situation: it crashed randomly. I updated it to v5.1.3 hoping that it would be stable. The logic of my application is as follows:
You can see the decoded core dump from after the system crashed. I will provide more from different devices.
My devices are always in Wi-Fi STA mode and connected to a router.
Hi @InterfacerCompany @filzek, it seems that it is running out of memory. Could you add a task that periodically calls esp_get_minimum_free_heap_size() and prints the remaining memory?
Hi @Xiehanxin. I don't think so. My app monitors the free heap size and fragmentation all the time. I also have special logic that reboots the ESP32 if memory becomes too low or fragmentation too high, but I have never seen that logic trigger a reboot. It is only there for critical cases; normally my app does not run out of memory. Here is the memory status during system runtime:
[FREE-HEAP ]-0000076003-C0-I- U8: 82.47 KBytes(min: 77.07 KBytes, max: 104.76 KBytes), U32: 6.30 KBytes, Max block: 62.00 KBytes, Frag: 25%
[FREE-HEAP ]-0000086005-C0-I- U8: 82.45 KBytes(min: 77.07 KBytes, max: 104.76 KBytes), U32: 6.30 KBytes, Max block: 62.00 KBytes, Frag: 25%
[FREE-HEAP ]-0000096008-C0-I- U8: 82.47 KBytes(min: 77.07 KBytes, max: 104.76 KBytes), U32: 6.30 KBytes, Max block: 62.00 KBytes, Frag: 25%
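The periodic heap monitoring discussed above can be sketched roughly as follows. This is a minimal illustration, not the poster's actual code: the names `heap_frag_percent`, `heap_monitor_task` and `start_heap_monitor`, the 10-second interval, and the stack size are assumptions. The fragmentation formula is also an assumption; it is one common definition, and it happens to reproduce the 25% shown in the log (82.47 KBytes free, 62.00 KBytes largest block). The ESP-IDF part is guarded by `ESP_PLATFORM` so the helper can also be compiled on a host.

```c
#include <stddef.h>
#include <stdio.h>

#ifdef ESP_PLATFORM
#include "esp_heap_caps.h"
#include "esp_system.h"
#include "freertos/FreeRTOS.h"
#include "freertos/task.h"
#endif

/* Assumed definition: percentage of the free heap that lies outside the
 * largest contiguous free block (integer math, rounds the ratio down). */
unsigned heap_frag_percent(size_t free_bytes, size_t largest_block)
{
    if (free_bytes == 0 || largest_block >= free_bytes) {
        return 0;
    }
    return 100u - (unsigned)((largest_block * 100u) / free_bytes);
}

#ifdef ESP_PLATFORM
/* Hypothetical monitor task: prints heap statistics every 10 seconds. */
static void heap_monitor_task(void *arg)
{
    (void)arg;
    for (;;) {
        size_t free_now = heap_caps_get_free_size(MALLOC_CAP_8BIT);
        size_t largest  = heap_caps_get_largest_free_block(MALLOC_CAP_8BIT);
        /* esp_get_minimum_free_heap_size() is the low-water mark since boot. */
        printf("heap: free=%u min_ever=%u largest=%u frag=%u%%\n",
               (unsigned)free_now,
               (unsigned)esp_get_minimum_free_heap_size(),
               (unsigned)largest,
               heap_frag_percent(free_now, largest));
        vTaskDelay(pdMS_TO_TICKS(10000));
    }
}

/* Stack size and priority are illustrative guesses. */
void start_heap_monitor(void)
{
    xTaskCreate(heap_monitor_task, "heap_mon", 3072, NULL, 1, NULL);
}
#endif
```

A low-water mark that stays well above zero, as in the log above, argues against plain heap exhaustion, although it does not rule out heap corruption.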
I have a similar thing on ESP32 (ESP32-D0WDQ6 v1.0) and esp32s3. The straight esp32cam ('aithinker' style) board is much worse than the s3 board. In my case, it's hosting a webserver, reading video and streaming mjpg to the browser, whilst listening for BLE adverts and occasionally connecting to one BLE device at a time to gather data. However, the failure can happen randomly at any time. For example, it just crashed immediately after boot and NimBLE startup while I was accessing a plain text webpage, with no video flowing, probably no adverts received, and no BLE connections in progress.
The symptom is that from the serial all seems normal, but the device can no longer deliver data, and is no longer available to ping on the wifi. Unfortunately, I can't supply detailed debug logs, as I'm not competent at modifying aspects of the compilation. The firmware in use is Tasmota (latest dev uses IDF 5.2.1). I understand that I am pushing it really hard; but I would hope that this would just degrade performance, not result in complete network loss.
If I don't enable Nimble at startup, then the issue does not occur, so it feels like a coexistence issue.
Continuing this topic.
I've switched to v5.1.4 - the issue is NOT resolved.
I've collected more core-dump data from different devices. Interestingly, 5 devices connected to the same network crashed (assert failed) at the same time in the same function, 'block_trim_free'.
So, devices 86, 87, 89 -> 'assert failed' in the 'wifi' thread in the 'block_trim_free' function.
Device 90 -> 'assert failed' in the 'mdns' thread in the 'block_trim_free' function.
Device 91 -> strange core dump file, it does not provide all information. Just this "Panic reason: assert failed: block_trim_free tlsf". So, same function 'block_trim_free'.
In the attachment, you can find decoded core dump files for each device. Also here are some statistics data for all devices. There you can see up-time, reboot reason, memory usage, rps for servers.
```
// 86 ---------------------------------
{"ts":87162508,"upTime":"1day,00:09:21","locTime":"2024.06.02 13:38:43","rst":{"cntr":1,"code":4,"descr":"Software reset due to exception/panic","cdf":true,"cpu0":{"code":12, "descr":"Software reset CPU"},"cpu1":{"code":14, "descr":"for APP CPU, reseted by PRO CPU"}},"fs":{"total":"11.81 MBytes", "used":"1.13 MBytes", "free":"10.68 MBytes", "recCntr":0},"perfomance": {"cpu0Load":83, "cpu1Load":0},"mem":{"u8": {"cur": "86.11 KBytes", "min": "79.58 KBytes", "max": "110.72 KBytes", "maxBlock": "72.00 KBytes", "frag":"17%"},"u32": {"cur":"16.04 KBytes"}},"httpServ":{"rps":0,"rpsLatest":1,"rpsMax":3},"mbTcpServ":{"rps":18,"rpsLatest":18,"rpsMax":43},"sysWdtRst": {"cntr":0, "code":0, "descr":"RESET_REASON_NOT_PERF", "date":"2024-05-31", "time":"14:00:34"}}
// 87 ---------------------------------
{"ts":87202247,"upTime":"1day,00:09:52","locTime":"2024.06.02 13:39:30","rst":{"cntr":1,"code":4,"descr":"Software reset due to exception/panic","cdf":true,"cpu0":{"code":12, "descr":"Software reset CPU"},"cpu1":{"code":14, "descr":"for APP CPU, reseted by PRO CPU"}},"fs":{"total":"11.80 MBytes", "used":"1.13 MBytes", "free":"10.67 MBytes", "recCntr":0},"perfomance": {"cpu0Load":82, "cpu1Load":0},"mem":{"u8": {"cur": "87.96 KBytes", "min": "76.25 KBytes", "max": "110.98 KBytes", "maxBlock": "76.00 KBytes", "frag":"14%"},"u32": {"cur":"16.04 KBytes"}},"httpServ":{"rps":0,"rpsLatest":1,"rpsMax":9},"mbTcpServ":{"rps":0,"rpsLatest":1,"rpsMax":42},"sysWdtRst": {"cntr":0, "code":0, "descr":"RESET_REASON_NOT_PERF", "date":"2024-05-31", "time":"14:01:39"}}
// 89 ---------------------------------
{"ts":87235205,"upTime":"1day,00:10:27","locTime":"2024.06.02 13:39:55","rst":{"cntr":1,"code":4,"descr":"Software reset due to exception/panic","cdf":true,"cpu0":{"code":12, "descr":"Software reset CPU"},"cpu1":{"code":14, "descr":"for APP CPU, reseted by PRO CPU"}},"fs":{"total":"11.81 MBytes", "used":"1.13 MBytes", "free":"10.68 MBytes", "recCntr":0},"perfomance": {"cpu0Load":81, "cpu1Load":0},"mem":{"u8": {"cur": "87.91 KBytes", "min": "80.85 KBytes", "max": "110.33 KBytes", "maxBlock": "76.00 KBytes", "frag":"14%"},"u32": {"cur":"16.04 KBytes"}},"httpServ":{"rps":0,"rpsLatest":1,"rpsMax":3},"mbTcpServ":{"rps":1,"rpsLatest":1,"rpsMax":43},"sysWdtRst": {"cntr":0, "code":0, "descr":"RESET_REASON_NOT_PERF", "date":"2024-05-31", "time":"14:03:56"}}
// 90 ---------------------------------
{"ts":87304159,"upTime":"1day,00:11:36","locTime":"2024.06.02 13:41:05","rst":{"cntr":1,"code":4,"descr":"Software reset due to exception/panic","cdf":true,"cpu0":{"code":12, "descr":"Software reset CPU"},"cpu1":{"code":14, "descr":"for APP CPU, reseted by PRO CPU"}},"fs":{"total":"11.81 MBytes", "used":"1.13 MBytes", "free":"10.68 MBytes", "recCntr":0},"perfomance": {"cpu0Load":82, "cpu1Load":0},"mem":{"u8": {"cur": "88.94 KBytes", "min": "74.91 KBytes", "max": "110.54 KBytes", "maxBlock": "76.00 KBytes", "frag":"15%"},"u32": {"cur":"16.04 KBytes"}},"httpServ":{"rps":0,"rpsLatest":1,"rpsMax":3},"mbTcpServ":{"rps":1,"rpsLatest":1,"rpsMax":50},"sysWdtRst": {"cntr":0, "code":0, "descr":"RESET_REASON_NOT_PERF", "date":"2024-05-31", "time":"14:04:34"}}
// 91 ---------------------------------
{"ts":87277908,"upTime":"1day,00:11:12","locTime":"2024.06.02 13:40:44","rst":{"cntr":1,"code":4,"descr":"Software reset due to exception/panic","cdf":true,"cpu0":{"code":12, "descr":"Software reset CPU"},"cpu1":{"code":14, "descr":"for APP CPU, reseted by PRO CPU"}},"fs":{"total":"11.80 MBytes", "used":"40.00 KBytes", "free":"11.77 MBytes", "recCntr":0},"perfomance": {"cpu0Load":82, "cpu1Load":0},"mem":{"u8": {"cur": "88.23 KBytes", "min": "79.47 KBytes", "max": "109.73 KBytes", "maxBlock": "72.00 KBytes", "frag":"19%"},"u32": {"cur":"16.04 KBytes"}},"httpServ":{"rps":0,"rpsLatest":1,"rpsMax":3},"mbTcpServ":{"rps":1,"rpsLatest":1,"rpsMax":36},"sysWdtRst": {"cntr":0, "code":0, "descr":"RESET_REASON_NOT_PERF", "date":"2024-05-31", "time":"14:07:18"}}
```
Does anybody have an idea how this can be fixed?
decoded_86.txt decoded_87.txt decoded_89.txt decoded_90.txt decoded_91.txt
@Xiehanxin Any feedback about the log in https://github.com/espressif/esp-idf/issues/13721#issuecomment-2143901081 ?
> Hi @InterfacerCompany @filzek, it seems that it is running out of memory. Could you add a task that periodically calls esp_get_minimum_free_heap_size() and prints the remaining memory?
showMemoryRAMStatus =======================================================
showMemoryRAMStatus The current date/time is: Wed Jun 5 11:16:38 2024
showMemoryRAMStatus There is 15 hardwares ports created
showMemoryRAMStatus FreeHeapSize => 2664956 bytes
showMemoryRAMStatus Internal Heap Size => 19308 bytes
showMemoryRAMStatus =======================================================
Memory is not the problem: the heap usage is constant and there is always free memory.
We use ESP32 3.0 with PSRAM enabled and allocate all of the heap in SPIRAM, so the free internal/default memory stays constant.
We suspect that the BLE driver and the Wi-Fi driver sometimes run concurrently and crash. We have also found some very odd behavior in the Wi-Fi connection:
1) Using a Deco M5 router, in mesh or standalone mode, with more than 12 devices connected to it, the crashes become more frequent.
2) Using a Starlink router, the crashes almost never happen.
So, something related to Wi-Fi is crashing the boards.
Hi guys. I continue facing this problem. I've switched to v5.2.2 - the issue is NOT resolved.
Devices 85,86,87:
Crashed task handle: 0x3ffd929c, name: 'wifi', GDB name: 'process 1073582748'
Crashed task is not in the interrupt context
Panic reason: assert failed: block_trim_free tlsf.c:496 (block_is_free(block) && "block must be free")
exccause 0x1d (StoreProhibitedCause)

Device 88:
Crashed task handle: 0x3ffb3c28, name: 'mdns', GDB name: 'process 1073429544'
Crashed task is not in the interrupt context
Panic reason: assert failed: insert_free_block tlsf.c:358 (current && "free list cannot have a null entry")
exccause 0x1d (StoreProhibitedCause)

Devices 90,91:
Crashed task handle: 0x3ffd8dd4, name: 'wifi', GDB name: 'process 1073581524'
Crashed task is not in the interrupt context
exccause 0x1c (LoadProhibitedCause)
So it seems that during memory operations some heap data becomes corrupted, and the firmware crashes when it tries to free a broken pointer, as in block_trim_free tlsf.c:496 (block_is_free(block) && "block must be free"), or during other operations such as insert_free_block tlsf.c:358 (current && "free list cannot have a null entry") in the mdns case.
Here are the decoded core dump files: 85_decoded.txt 86_decoded.txt 87_decoded.txt 88_decoded.txt 90_decoded.txt
Also, it seems that in the v5.2.2 core-dump file the 'used stack' size reported for each task is not correct. Examples:

v5.2.2
TCB         NAME                       PRIO C/B        STACK USED/FREE
0x3ffd929c  wifi                       23/1073582736   1073581360/4760
0x3ffb7c5c  IDLE                       0/1073463552    1073463152/1124
0x3ffb61e4  mb_tcp_serv_thread         1/1073624992    1073624400/3488
0x3ffbd450  log_thread                 1/1073468480    1073467968/1524
0x3ffd3b98  common_no_blocking_thread  1/1073560464    1073560000/2604
0x3ffd5ba8  tiT                        18/1073568672   1073568144/2540
0x3ffe72b4  common_blocking_thread     1/1073638368    1073637888/3612
0x3ffd4b70  charger_thread             1/1073564512    1073563792/2852
0x3ffb5f84  mb_cli_thread              1/1073608656    1073608128/2540
0x3ffdd948  wifi_app_thread            1/1073605568    1073604912/2400
0x3ffd70c4  sys_evt                    20/1073574064   1073573504/3776
0x3ffddda4  ssh_serv_thread            1/1073630128    1073628352/3340
0x3ffed3dc  nimble_host                21/1073664976   1073664416/3528
0x3ffb4ab8  mdns                       1/1073433264    1073432688/3516
0x3ffb5d48  async_tcp                  1/1073620896    1073620416/7700
0x3ffb7698  esp_timer                  22/1073457920   1073457456/3116
0x3ffe7050  btController               23/1073655232   1073654720/3064

v5.1.4
TCB         NAME                       PRIO C/B  STACK USED/FREE
0x3ffd9228  wifi                       23/23     528/5612
0x3ffea560  btController               23/23     512/3060
0x3ffb7698  esp_timer                  22/22     464/3116
0x3ffb5d44  async_tcp                  1/1       528/7652
0x3ffb5f80  mb_cli_thread              1/1       528/2528
0x3ffd3b24  common_no_blocking_thread  1/1       560/2496
0x3ffe66e8  common_blocking_thread     1/1       608/3484
0x3ffddaec  mb_tcp_serv_thread         1/1       624/3456
0x3ffb7c5c  IDLE                       0/0       400/1124
0x3ffd4afc  charger_thread             1/1       720/2856
0x3ffbd444  log_thread                 1/1       512/1520
0x3ffdd8d4  wifi_app_thread            1/1       656/2404
0x3ffd5b34  tiT                        18/18     528/2528
0x3ffddd6c  ssh_serv_thread            1/1       1776/3340
0x3ffec5f8  nimble_host                21/21     560/3532
0x3ffd7050  sys_evt                    20/20     560/3780
0x3ffb3c24  mdns                       1/1       528/3552
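One way to see why the v5.2.2 'used' values look wrong is to print them in hexadecimal. The sketch below is illustrative only. The function names are hypothetical, and the DRAM window (0x3FFAE000 to 0x3FFFFFFF) is an assumption based on the ESP32 internal data-RAM layout. Under that assumption, 1073581360 (the 'wifi' entry) is 0x3FFD8D30, which sits in the same region as the TCB addresses in the first column, so the column appears to hold a pointer rather than a byte count. The thread does not confirm the cause.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Assumed ESP32 internal data-RAM window; adjust for other chips. */
#define DRAM_LOW  0x3FFAE000u
#define DRAM_HIGH 0x3FFFFFFFu

/* Returns true if the value falls inside the assumed DRAM address window. */
bool looks_like_dram_address(uint32_t value)
{
    return value >= DRAM_LOW && value <= DRAM_HIGH;
}

/* Prints one 'STACK USED' field in decimal and hex with a rough verdict. */
void print_stack_field(const char *task, uint32_t value)
{
    printf("%-12s %10u = 0x%08x %s\n", task, (unsigned)value, (unsigned)value,
           looks_like_dram_address(value) ? "(address-like)"
                                          : "(plausible size)");
}
```

Calling `print_stack_field("wifi", 1073581360u)` flags the v5.2.2 value as address-like, while the v5.1.4 value for the same task, 528, is classified as a plausible size.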
Can somebody check the decoded core dump files and help me with this problem? Any ideas are welcome. Maybe somebody from the esp-idf team is here.
Some additional information about my project. I'm trying to keep heap memory as big as possible. Unfortunately, I can't add SPI SRAM to my current HW. So, here are options that are activated to save RAM memory:
LWIP_TCPIP_CORE_LOCKING and LWIP_TCPIP_CORE_LOCKING_INPUT do not reduce heap usage, but the system seems to become more stable when they are enabled. Also, I'm using IRAM_8BIT memory for my data buffers. Here is the amount of RAM during my system runtime:
"mem":{"u8": {"cur": "87.68 KBytes", "min": "67.14 KBytes", "max": "111.09 KBytes", "maxBlock": "72.00 KBytes", "frag":"18%"},"u32": {"cur":"14.03 KBytes"}}
where:
u8 - internal heap memory
u32 - IRAM_8BIT memory
Does anybody see a problem with this configuration?
Answers checklist.
IDF version.
v5.1.3
Espressif SoC revision.
ESP32-D0WD-V3
Operating System used.
Windows
How did you build your project?
Command line with idf.py
If you are using Windows, please specify command line type.
CMD
Development Kit.
custom PCB
Power Supply used.
External 5V
What is the expected behavior?
I expected that my device would work continuously for a long period of time without crashes and reboots.
What is the actual behavior?
After working for a while, my device crashes and reboots.
Steps to reproduce.
Unfortunately, I don't have steps to reproduce. It happens at irregular intervals: it can crash a few times during a day, or it can work for 24 hours and then crash.
Debug Logs.
More Information.
I have several devices in the field and they crashed at different times, most of them in the 'wifi' thread, sometimes in the 'tiT' thread. I use Wi-Fi and BLE (the host is NimBLE, BLE only) at the same time. The Wi-Fi power save mode is WIFI_PS_MIN_MODEM. I have also enabled LWIP_TCPIP_CORE_LOCKING and LWIP_TCPIP_CORE_LOCKING_INPUT. In my opinion, my device became more stable after enabling them (the time before a crash increased).
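For readers trying to reproduce this setup, the two lwIP options named above would typically appear in the project's sdkconfig as shown below. This is a sketch that assumes the usual `CONFIG_` prefix for Kconfig symbols; the poster's full sdkconfig is not included in this report.

```
CONFIG_LWIP_TCPIP_CORE_LOCKING=y
CONFIG_LWIP_TCPIP_CORE_LOCKING_INPUT=y
```

The power save mode is not an sdkconfig entry. It is normally selected at runtime with `esp_wifi_set_ps(WIFI_PS_MIN_MODEM)`.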