espressif / esp-idf

Espressif IoT Development Framework. Official development framework for Espressif SoCs.
Apache License 2.0
13.71k stars 7.3k forks source link

WiFi sample code causes a StoreProhibited exception while authenticating WiFi credentials (IDFGH-9708) #11045

Open hollingerc opened 1 year ago

hollingerc commented 1 year ago

Answers checklist.

IDF version.

v5.1-dev-1908-g439a709c42

Operating System used.

Linux

How did you build your project?

Command line with idf.py

If you are using Windows, please specify command line type.

None

Development Kit.

ESP32-S3-WROOM-1 on an ESP-DevKit-C

Power Supply used.

USB

What is the expected behavior?

I would expect the ESP32 to successfully log into my WiFi AP without crashing. My code previously ran successfully on a WT32-SC01 board from Waveshare. This board uses an ESP32-WROVER-B, and has an SPI LCD with capacitive touch screen (I2C touch controller).

My code is running LVGL in a task, running a 24-bit RGB LCD with an 8-bit data bus and capacitive touch screen (I2C touch controller), and running the sample WiFi code from esp-idf/examples/wifi/getting_started/station.

What is the actual behavior?

Instead of logging into my WiFi AP, a StoreProhibited exception is caused when the ESP32-S3 is authenticating the WiFi credentials. As can be seen from the debug log, the WiFi process was started, but only got so far. This is repeatable, happens every time I run the code.

My code runs an example from esp-idf/examples/wifi/getting_started/station and worked without crashing on another board that used an ESP32-WROVER-B.

Steps to reproduce.

  1. From a Linux terminal window, compile, flash and run with this command: idf.py -p /dev/ttyUSB0 flash monitor
  2. After the program crashes, repeat #1. Crashes every time. ... This is what app_main looks like:

void app_main(void) { BaseType_t error = pdPASS; esp_err_t nvs_err = ESP_OK;

ESP_LOGI(TAG, "app_main()");

nvs_err = nvs_flash_init(); if(nvs_err == ESP_ERR_NVS_NO_FREE_PAGES || nvs_err == ESP_ERR_NVS_NEW_VERSION_FOUND) { ESP_ERROR_CHECK(nvs_flash_erase()); nvs_err = nvs_flash_init(); } ESP_ERROR_CHECK(nvs_err);

/ The LCD and touch screen are initialized here. LVGL is initialize and a task is created running on Core 1. / (void)btd_lvgl_port_init();

/ Turn on the LCD backlight. / (void)btd_lcd_backlight(BTD_LCD_BACKLIGHT_ON);

/ Create three LVGL screens with buttons and text on each. Load one of the screens. / (void)btd_gui_main_start();

/ This starts the WiFi process. The code here is just a copy of esp-idf/examples/wifi/getting_started/station. / btd_wifi_start(); } I can provide software listings for the functions called in app_main if required.

Debug Logs.

--- idf_monitor on /dev/ttyUSB0 115200 ---
--- Quit: Ctrl+] | Menu: Ctrl+T | Help: Ctrl+T followed by Ctrl+H ---
ESP-ROM:esp32s3-20210327
Build:Mar 27 2021
rst:0x1 (POWERON),boot:0xa (SPI_FAST_FLASH_BOOT)
SPIWP:0xee
mode:DIO, clock div:1
load:0x3fce3810,len:0x167c
load:0x403c9700,len:0xbe0
load:0x403cc700,len:0x2dfc
entry 0x403c9900
I (24) boot: ESP-IDF v5.1-dev-1908-g439a709c42-dirty 2nd stage bootloader
I (25) boot: compile time Mar 17 2023 18:30:25
I (26) boot: chip revision: v0.1
I (30) boot.esp32s3: Boot SPI Speed : 80MHz
I (35) boot.esp32s3: SPI Mode       : DIO
I (39) boot.esp32s3: SPI Flash Size : 2MB
I (44) boot: Enabling RNG early entropy source...
I (49) boot: Partition Table:
I (53) boot: ## Label            Usage          Type ST Offset   Length
I (60) boot:  0 nvs              WiFi data        01 02 00009000 00006000
I (68) boot:  1 phy_init         RF data          01 01 0000f000 00001000
I (75) boot:  2 factory          factory app      00 00 00010000 00177000
I (83) boot: End of partition table
I (87) esp_image: segment 0: paddr=00010020 vaddr=3c0b0020 size=341e0h (213472) map
I (134) esp_image: segment 1: paddr=00044208 vaddr=3fc99900 size=056f4h ( 22260) load
I (139) esp_image: segment 2: paddr=00049904 vaddr=40374000 size=06714h ( 26388) load
I (145) esp_image: segment 3: paddr=00050020 vaddr=42000020 size=ad7c8h (710600) map
I (275) esp_image: segment 4: paddr=000fd7f0 vaddr=4037a714 size=0f104h ( 61700) load
I (298) boot: Loaded app from partition at offset 0x10000
I (298) boot: Disabling RNG early entropy source...
I (310) octal_psram: vendor id    : 0x0d (AP)
I (310) octal_psram: dev id       : 0x02 (generation 3)
I (310) octal_psram: density      : 0x03 (64 Mbit)
I (315) octal_psram: good-die     : 0x01 (Pass)
I (321) octal_psram: Latency      : 0x01 (Fixed)
I (326) octal_psram: VCC          : 0x01 (3V)
I (331) octal_psram: SRF          : 0x01 (Fast Refresh)
I (337) octal_psram: BurstType    : 0x01 (Hybrid Wrap)
I (343) octal_psram: BurstLen     : 0x01 (32 Byte)
I (348) octal_psram: Readlatency  : 0x02 (10 cycles@Fixed)
I (354) octal_psram: DriveStrength: 0x00 (1/1)
I (359) esp_psram: Found 8MB PSRAM device
I (364) esp_psram: Speed: 40MHz
I (368) cpu_start: Pro cpu up.
I (372) cpu_start: Starting app cpu, entry point is 0x403756c4
0x403756c4: call_start_cpu1 at /home/craig/esp/esp-idf/components/esp_system/port/cpu_start.c:145

I (0) cpu_start: App cpu up.
I (1109) esp_psram: SPI SRAM memory test OK
I (1118) cpu_start: Pro cpu start user code
I (1118) cpu_start: cpu freq: 240000000 Hz
I (1119) cpu_start: Application information:
I (1122) cpu_start: Project name:     bt_display
I (1127) cpu_start: App version:      1
I (1132) cpu_start: Compile time:     Mar 17 2023 18:30:16
I (1138) cpu_start: ELF file SHA256:  5584ac9da645ee21...
I (1144) cpu_start: ESP-IDF:          v5.1-dev-1908-g439a709c42-dirty
I (1151) cpu_start: Min chip rev:     v0.0
I (1156) cpu_start: Max chip rev:     v0.99 
I (1161) cpu_start: Chip rev:         v0.1
I (1165) heap_init: Initializing. RAM available for dynamic allocation:
I (1173) heap_init: At 3FCAB3C8 len 0003E348 (248 KiB): D/IRAM
I (1179) heap_init: At 3FCE9710 len 00005724 (21 KiB): STACK/DRAM
I (1186) heap_init: At 3FCF0000 len 00008000 (32 KiB): DRAM
I (1192) heap_init: At 600FE010 len 00001FF0 (7 KiB): RTCRAM
I (1199) esp_psram: Adding pool of 8192K of PSRAM memory to heap allocator
I (1207) spi_flash: detected chip: generic
I (1211) spi_flash: flash io: dio
W (1215) spi_flash: Detected size(8192k) larger than the size in the binary image header(2048k). Using the size in the binary image header.
I (1229) coexist: coexist rom version e7ae62f
I (1234) cpu_start: Starting scheduler on PRO CPU.
I (0) cpu_start: Starting scheduler on APP CPU.
I (1254) esp_psram: Reserving pool of 32K of internal memory for DMA/internal allocations
I (1254) btd_display.c: app_main()
I (1274) btd_event.c: btd_event_init()
I (1274) btd_lvgl_port.c: lvgl_port_init()
I (1274) btd_lcd.c: btd_lcd_init()
I (1274) btd_lcd.c: Install RGB LCD panel driver
I (1304) esp_lcd_rgb_panel: new rgb panel(0) @0x3fcabb08, fb0 @0x3c0f0900, fb1 @0x0, fb_size=391680, bb0 @0x0, bb1 @0x0, bb_size=0
I (1304) btd_lcd.c: Register event callbacks
I (1304) btd_lcd.c: Initialize RGB LCD panel
I (1314) gpio: GPIO[14]| InputEn: 0| OutputEn: 1| OpenDrain: 0| Pullup: 0| Pulldown: 0| Intr:0 
I (1324) btd_lvgl_port.c: lvgl_port_disp_init()
I (1324) btd_lvgl_port.c: Allocate separate LVGL draw buffers.
I (1334) btd_touch.c: btd_touch_init()
I (1344) m_i2c.c: m_i2c_master_init()
I (1344) gt911.c: gt911_init()
I (1344) gt911.c: Vendor ID: 0.
I (1354) gt911.c: Touch resolution x: 480, y: 272
I (1354) btd_lvgl_port.c: lvgl_port_indev_init()
I (1364) btd_lvgl_port.c: LCD resolution x: 480, y: 272
I (1364) btd_lvgl_port.c: Touch resolution x: 480, y: 272
I (1374) btd_lvgl_port.c: Touch scaling factors x: 100, y: 100
I (1384) btd_lvgl_port.c: LVGL screen resolution x: 480, y: 272
I (1384) btd_lvgl_port.c: lvgl_tick_init()
I (1394) btd_gui.c: btd_gui_main_start()
I (22664) btd_wifi: btd_wifi_start()
I (22664) pp: pp rom version: e7ae62f
I (22664) net80211: net80211 rom version: e7ae62f
I (22674) wifi:wifi driver task: 3fcf6e58, prio:23, stack:6656, core=0
I (22684) wifi:wifi firmware version: b52fcbc
I (22684) wifi:wifi certification version: v7.0
I (22684) wifi:config NVS flash: enabled
I (22684) wifi:config nano formating: disabled
I (22684) wifi:Init data frame dynamic rx buffer num: 32
I (22694) wifi:Init management frame dynamic rx buffer num: 32
I (22694) wifi:Init management short buffer num: 32
I (22704) wifi:Init static tx buffer num: 16
I (22704) wifi:Init tx cache buffer num: 32
I (22714) wifi:Init static tx FG buffer num: 2
I (22714) wifi:Init static rx buffer size: 1600
I (22724) wifi:Init static rx buffer num: 10
I (22724) wifi:Init dynamic rx buffer num: 32
I (22724) wifi_init: rx ba win: 6
I (22734) wifi_init: tcpip mbox: 32
I (22734) wifi_init: udp mbox: 6
I (22744) wifi_init: tcp mbox: 6
I (22744) wifi_init: tcp tx win: 5744
I (22744) wifi_init: tcp rx win: 5744
I (22754) wifi_init: tcp mss: 1440
I (22754) wifi_init: WiFi IRAM OP enabled
I (22764) wifi_init: WiFi RX IRAM OP enabled
I (22764) phy_init: phy_version 503,13653eb,Jun  1 2022,17:47:08
I (22804) wifi:mode : sta (f4:12:fa:e3:a5:38)
I (22804) wifi:enable tsf
I (22814) btd_wifi: event_handler(): connect to the AP
I (22814) wifi:new:<1,0>, old:<1,0>, ap:<255,255>, sta:<1,0>, prof:1
I (22814) wifi:state: init -> auth (b0)
I (22824) wifi:state: auth -> assoc (0)
I (22824) wifi:state: assoc -> run (10)
Guru Meditation Error: Core  0 panic'ed (StoreProhibited). Exception was unhandled.

Core  0 register dump:
PC      : 0x4038445e  PS      : 0x00060233  A0      : 0x80383d0a  A1      : 0x3fcf66a0  
0x4038445e: remove_free_block at /home/craig/esp/esp-idf/components/heap/tlsf/tlsf.c:332
 (inlined by) block_locate_free at /home/craig/esp/esp-idf/components/heap/tlsf/tlsf.c:567
 (inlined by) tlsf_malloc at /home/craig/esp/esp-idf/components/heap/tlsf/tlsf.c:1004

A2      : 0x3c0f0014  A3      : 0x0000002c  A4      : 0x3c159990  A5      : 0x00000010  
A6      : 0xfffffffc  A7      : 0x0000cdcd  A8      : 0xffcddc39  A9      : 0x0000001c  
A10     : 0xffcddc39  A11     : 0x00000000  A12     : 0xb33fffff  A13     : 0x00000003  
A14     : 0x3fcf69c8  A15     : 0x3fcf69d4  SAR     : 0x0000001f  EXCCAUSE: 0x0000001d  
EXCVADDR: 0xffcddc45  LBEG    : 0x400570e8  LEND    : 0x400570f3  LCOUNT  : 0x00000000  

Backtrace: 0x4038445b:0x3fcf66a0 0x40383d07:0x3fcf66c0 0x40376499:0x3fcf66e0 0x403764d7:0x3fcf6700 0x40376781:0x3fcf6720 0x4203caae:0x3fcf6740 0x4204a4a5:0x3fcf6780 0x4204a58e:0x3fcf67d0 0x4204aa89:0x3fcf6810 0x4204a7b9:0x3fcf6830 0x4204aa5d:0x3fcf6850 0x42048fb1:0x3fcf6890 0x42049091:0x3fcf68b0 0x42047535:0x3fcf68d0 0x42047777:0x3fcf68f0 0x42043462:0x3fcf6950 0x4204358d:0x3fcf6980 0x42043385:0x3fcf69b0 0x42086547:0x3fcf6a10 0x42086861:0x3fcf6b00 0x420875aa:0x3fcf6b30 0x42087e12:0x3fcf6bb0 0x4205baf5:0x3fcf6c00 0x40386015:0x3fcf6c30 0x4038618d:0x3fcf6c80 0x4003f4bd:0x3fcf6ca0 |<-CORRUPTED
0x4038445b: remove_free_block at /home/craig/esp/esp-idf/components/heap/tlsf/tlsf.c:331
 (inlined by) block_locate_free at /home/craig/esp/esp-idf/components/heap/tlsf/tlsf.c:567
 (inlined by) tlsf_malloc at /home/craig/esp/esp-idf/components/heap/tlsf/tlsf.c:1004

0x40383d07: multi_heap_malloc_impl at /home/craig/esp/esp-idf/components/heap/multi_heap.c:220

0x40376499: heap_caps_malloc_base at /home/craig/esp/esp-idf/components/heap/heap_caps.c:146

0x403764d7: heap_caps_calloc_base at /home/craig/esp/esp-idf/components/heap/heap_caps.c:458

0x40376781: heap_caps_calloc at /home/craig/esp/esp-idf/components/heap/heap_caps.c:467

0x4203caae: gdma_new_channel at /home/craig/esp/esp-idf/components/esp_hw_support/gdma.c:140

0x4204a4a5: crypto_shared_gdma_new_channel at /home/craig/esp/esp-idf/components/mbedtls/port/crypto_shared_gdma/esp_crypto_shared_gdma.c:33
 (inlined by) crypto_shared_gdma_init at /home/craig/esp/esp-idf/components/mbedtls/port/crypto_shared_gdma/esp_crypto_shared_gdma.c:67

0x4204a58e: esp_crypto_shared_gdma_start at /home/craig/esp/esp-idf/components/mbedtls/port/crypto_shared_gdma/esp_crypto_shared_gdma.c:104

0x4204aa89: esp_sha_dma_start at /home/craig/esp/esp-idf/components/mbedtls/port/sha/dma/esp_sha_gdma_impl.c:12

0x4204a7b9: esp_sha_dma_process at /home/craig/esp/esp-idf/components/mbedtls/port/sha/dma/sha.c:320

0x4204aa5d: esp_sha_dma at /home/craig/esp/esp-idf/components/mbedtls/port/sha/dma/sha.c:264

0x42048fb1: esp_internal_sha1_dma_process at /home/craig/esp/esp-idf/components/mbedtls/port/sha/dma/esp_sha1.c:96

0x42049091: mbedtls_sha1_update at /home/craig/esp/esp-idf/components/mbedtls/port/sha/dma/esp_sha1.c:150

0x42047535: mbedtls_md_update at /home/craig/esp/esp-idf/components/mbedtls/mbedtls/library/md.c:489

0x42047777: mbedtls_md_hmac_starts at /home/craig/esp/esp-idf/components/mbedtls/mbedtls/library/md.c:674

0x42043462: hmac_vector at /home/craig/esp/esp-idf/components/wpa_supplicant/esp_supplicant/src/crypto/crypto_mbedtls.c:302

0x4204358d: hmac_sha1_vector at /home/craig/esp/esp-idf/components/wpa_supplicant/esp_supplicant/src/crypto/crypto_mbedtls.c:365

0x42043385: sha1_prf at /home/craig/esp/esp-idf/components/wpa_supplicant/src/crypto/sha1-prf.c:51

0x42086547: wpa_pmk_to_ptk at /home/craig/esp/esp-idf/components/wpa_supplicant/src/common/wpa_common.c:1138

0x42086861: wpa_derive_ptk at /home/craig/esp/esp-idf/components/wpa_supplicant/src/rsn_supp/wpa.c:638

0x420875aa: wpa_supplicant_process_1_of_4 at /home/craig/esp/esp-idf/components/wpa_supplicant/src/rsn_supp/wpa.c:705

0x42087e12: wpa_sm_rx_eapol at /home/craig/esp/esp-idf/components/wpa_supplicant/src/rsn_supp/wpa.c:2106

0x4205baf5: sta_rx_eapol at ??:?

0x40386015: sta_input at ??:?

0x4038618d: sta_rx_cb at ??:?

ELF file SHA256: 5584ac9da645ee21

CPU halted.

More Information.

With a JTAG debugger running in Eclipse, I can trace through the last three functions before the exception occurs and follow the arguments passed from one function to the next. It appears that in the function tlsf_malloc(), before or during the call to block_locate_free(), a struct pointed to by an argument gets trashed. The struct contains pointers that get changed to point to invalid memory. They start out pointing to valid memory. When one of these pointers is used in remove_free_block() to access memory, the exception is raised. If I disable the LVGL task, this problem doesn't occur, WiFi authentication completes successfully. It would seem there is an interaction between the LVGL and WiFi tasks.

I've done some searching on the internet and have found posts of others with the same exception in the same functions, although doing something other than WiFi authentication (and no mention of LVGL). These were older posts and referred to older versions of the IDF (4.3 if I remember correctly), but the conclusion I saw was that the problem was fixed in newer versions of the IDF. I'm using version 5.1.

It could be the problem lies with these functions. The comments in the tlsf.c file indicate that these functions are not thread safe and it is expected the developer has to provide the protection. The failure seems to be in the ESP32 WiFi code (or LGVL?) which I have not written.

I'm happy to provide more of my source code if requested.

chegewara commented 1 year ago

Its probably bug in your lvgl code. From my most recent experience, and im not proud of it, it may be memory corruption cause by passing local variable from one function, which is later used by pointer. In my case i missed that bug in someone else code, but it was lame from my side.

Please check your code.

hollingerc commented 1 year ago

Thanks for the response.

Please elaborate - How was that variable passed? From what function? How was the variable used by the pointer?

I've been through my LVGL code and can't see a problem with it (not that there isn't a problem, I just can't see one if there is one). This is how the LVGL task is registered to the RTOS by xTaskCreatePinnedToCore():

static void lvgl_task(void *pvParam) { (void) pvParam;

while(true) { if (pdTRUE == xSemaphoreTake(gui_semaphore, portMAX_DELAY)) { lv_task_handler(); xSemaphoreGive(gui_semaphore); } vTaskDelay(pdMS_TO_TICKS(10)); }

vTaskDelete(NULL); }

gui_semaphore is declared as a static file scope variable and initialized before the task is created.

If I comment out the line with lv_task_handler(), re-compile, flash and run, the WiFi authenticates with no problems, no crashing. If I comment out the code that creates the screens with buttons, text, etc., (so LVGL basically has nothing to do) and leave lv_task_handler() intact, the program crashes as before. There appears to be a conflict between the LVGL task and the WiFi task.

chegewara commented 1 year ago

Still hard to say. What is this task stack? Maybe try to increase it.

hollingerc commented 1 year ago

I did try that. Currently the stack is at 6 1024. When I raised it to 8 1024, the program still crashed. I could try it with an even bigger stack.

Another thing I tried is I setup and ran a heap trace dump. I had to be careful where I put the final statements so that they were printed to the monitor before the program crashed. The WiFi example calls esp_wifi_connect() in an event handler, I put the dump right after the call. This is the result:

172689 bytes 'leaked' in trace (171 allocations) total allocations 185 total frees 14

I (3140) btd_wifi: total_free_bytes (internal): 92091 I (3140) btd_wifi: total_free_bytes (SPIRAM): 7956076 I (3150) btd_wifi: total_free_bytes (DMA): 84295

Of course there are a lot of leaked bytes since the program halted because of the exception before any more bytes could be freed.

Seems to me there is plenty of RAM available for the program to run. I don't think the exception is caused by running out of RAM. That being said, it's not clear what happened between sending the log to monitor and the exception. Could there have been more leakage that would cause the program to run out of memory?

These memory-related problems are difficult to trouble shoot.

chegewara commented 1 year ago

Next step i would do is to heap trace only with wifi and only with lvgl.

hollingerc commented 1 year ago

Further investigation hasn't convinced me that there is a memory leak, although I could be wrong as I don't have time to pursue this issue more thoroughly. I needed to get my firmware working, and I finally succeeded. Here's what I did.

I simplified my testing firmware by creating a new project consisting only of these two IDF examples: /esp-idf/examples/peripherals/lcd/rgb_panel, and /esp-idf/examples/wifi/getting_started/station

The LCD example uses a chart example from LVGL.

I modified these slightly so that they would work as one project, took app_main() out of the WiFi code and added a call to start the WiFi process in ap_main() of the LCD code. I can provide the source files if needed.

I ran my tests again and got the same StoreProhibited exception. Further experimenting revealed that if I started the LVGL task after starting the WiFi process, the exception would not occur and the program ran correctly. This was not a solution for me though, as I need the user, through the GUI to be able to start the WiFi process at any time. Thus, LVGL needs to be running first.

I spent some time studying the IDF LCD documentation and saw that there were a couple of optional ways of using RAM. I tried implementing two frame buffers then tried implementing a bounce buffer as described in the documentation, but still the same exception occurred. I settled on the LCD using only one frame buffer in PSRAM (the buffer is way to big to put into internal RAM) and no bounce buffer.

LGVL needs one or two small (less than display sized) draw buffers. In the IDF LCD example, these were allocated in PSRAM using heap_caps_malloc(), so I tried allocating them in internal RAM (also using heap_caps_malloc). The same exception occurred.

I noticed in the LVGL documentation, the examples given simply declared these buffers as arrays of type lv_color_t. I tried this and the StoreProhibited exception no longer occurred and the program ran with LVGL running before the WiFi process started. I ported this change over to my code and it ran with no issues and was stable. The user can start WiFi by pressing buttons on the LCD. My code is now working as I expect it to.

My firmware is working now and I do not have the time to pursue this issue any further. I would say there is a bug in LVGL, and/or the LCD code, and/or the WiFi code. There seems to be an interaction between all three through the heap memory.

I'd like to keep the issue open to see if anyone can look into the problem and perhaps find a solution.

kapilkedawat commented 7 months ago

Hi @hollingerc, are you still facing this issue?

lduncan commented 7 months ago

I'm seeing an identical issue here too. LVGL and WiFi @hollingerc how did you alter the buffer allocations so WiFi worked?

Mine is essentially: lv_disp_buf1 = (lv_color_t *)heap_caps_malloc(LVGL_LCD_BUF_SIZE * sizeof(lv_color_t), MALLOC_CAP_DMA | MALLOC_CAP_INTERNAL);

kapilkedawat commented 3 weeks ago

@lduncan @hollingerc do let us know if you still think its a WiFi code issue and you guys are able to reproduce it.

lduncan commented 3 weeks ago

@kapilkedawat I trace my issue to user error 🤦‍♂️ I was using a ESP32-S3-WROOM-1U-N16R8 module which has an Octal PSRAM, which means pins IO35, IO36, and IO37 are not available for other uses.

I was attempting to use these pins for other purposes. Disabling the PSRAM completely solved the issue.