CarlosDerSeher / snapclient

snapclient on ESP32
GNU General Public License v3.0
127 stars 16 forks source link

Stack overflow during init #95

Closed aliask closed 2 months ago

aliask commented 2 months ago

I'm running into a intermittent issue where the STM32 hits a stack overflow during initialisation.

For example:

I (98789) main_task: Returned from app_main()
I (98790) SC: try connecting to static configuration 10.20.0.10:1704
I (99163) SC: netconn connected
I (99165) SC: netconn sent hello message
I (99387) SC: Buffer length:  1000
I (99387) SC: Latency:        0
I (99387) SC: Mute:           0
I (99388) SC: Setting volume: 100
I (99392) dspProc: Set volume to 1.000000
I (99397) dspProc: Set volume to 1.000000
I (99438) SC: fLaC sampleformat: 44100:16:2

***ERROR*** A stack overflow in task flac_decoder_ta has been detected.

Backtrace: 0x40375c0a:0x3fcd74d0 0x4037ca51:0x3fcd74f0 0x4037f6fa:0x3fcd7510 0x4037e2e9:0x3fcd7590 0x4037f808:0x3fcd75b0 0x4037f7fe:0x3fcd76a0 0x3fcd76bd:0x3fcd76e0 |<-CORRUPTED
0x40375c0a: panic_abort at /home/aliask/esp/v5.1.3/esp-idf/components/esp_system/panic.c:472
0x4037ca51: esp_system_abort at /home/aliask/esp/v5.1.3/esp-idf/components/esp_system/port/esp_system_chip.c:84
0x4037f6fa: vApplicationStackOverflowHook at /home/aliask/esp/v5.1.3/esp-idf/components/freertos/FreeRTOS-Kernel/portable/xtensa/port.c:581
0x4037e2e9: vTaskSwitchContext at /home/aliask/esp/v5.1.3/esp-idf/components/freertos/FreeRTOS-Kernel/tasks.c:3841
0x4037f808: _frxt_dispatch at /home/aliask/esp/v5.1.3/esp-idf/components/freertos/FreeRTOS-Kernel/portable/xtensa/portasm.S:450
0x4037f7fe: _frxt_int_exit at /home/aliask/esp/v5.1.3/esp-idf/components/freertos/FreeRTOS-Kernel/portable/xtensa/portasm.S:245

In this case, it was in the flac_decoder_task, however that's not always the case:

[... same boot sequence messages to above ...]
I (337862) dspProc: Set volume to 1.000000
I (337866) dspProc: Set volume to 1.000000

***ERROR*** A stack overflow in task IDLE0 has been detected.
[... same stack trace to above ...]

After the crash the device restarts and usually successfully allocates the memory.

My configured stack sizes seem to be the same or larger than the sample configurations, so I'm not sure what's causing this.

My device has 8MB of PSRAM configured, so I have lots of room to play with.

Stack-related configuration values ``` CONFIG_ESP_SYSTEM_EVENT_TASK_STACK_SIZE=2304 CONFIG_ESP_MAIN_TASK_STACK_SIZE=3584 CONFIG_ESP_MINIMAL_SHARED_STACK_SIZE=2048 CONFIG_ESP_IPC_TASK_STACK_SIZE=1280 CONFIG_ESP_TIMER_TASK_STACK_SIZE=3584 CONFIG_FREERTOS_TIMER_TASK_STACK_DEPTH=2048 CONFIG_LWIP_TCPIP_TASK_STACK_SIZE=3072 CONFIG_PTHREAD_TASK_STACK_SIZE_DEFAULT=3072 CONFIG_PTHREAD_STACK_MIN=768 CONFIG_WEBSOCKET_SERVER_TASK_STACK_DEPTH=6000 CONFIG_MDNS_TASK_STACK_SIZE=4096 CONFIG_SYSTEM_EVENT_TASK_STACK_SIZE=2304 CONFIG_MAIN_TASK_STACK_SIZE=3584 CONFIG_IPC_TASK_STACK_SIZE=1280 CONFIG_TIMER_TASK_STACK_SIZE=3584 CONFIG_TIMER_TASK_STACK_DEPTH=2048 CONFIG_TCPIP_TASK_STACK_SIZE=3072 CONFIG_ESP32_PTHREAD_TASK_STACK_SIZE_DEFAULT=3072 CONFIG_ESP32_PTHREAD_STACK_MIN=768 ```
Full log output when it works ``` I (0) cpu_start: App cpu up. I (1152) esp_psram: SPI SRAM memory test OK I (1162) cpu_start: Pro cpu start user code I (1162) cpu_start: cpu freq: 240000000 Hz I (1162) cpu_start: Application information: I (1165) cpu_start: Project name: snapclient I (1170) cpu_start: App version: ESP_IDF_v4.3.5-15-ge4fc740-dirt I (1177) cpu_start: Compile time: Jul 28 2024 10:26:29 I (1183) cpu_start: ELF file SHA256: 7c09adde6f5ffb35... I (1190) cpu_start: ESP-IDF: v5.1.3-dirty I (1195) cpu_start: Min chip rev: v0.0 I (1200) cpu_start: Max chip rev: v0.99 I (1205) cpu_start: Chip rev: v0.2 I (1210) heap_init: Initializing. RAM available for dynamic allocation: I (1217) heap_init: At 3FCA2670 len 000470A0 (284 KiB): DRAM I (1223) heap_init: At 3FCE9710 len 00005724 (21 KiB): STACK/DRAM I (1230) heap_init: At 3FCF0000 len 00008000 (32 KiB): DRAM I (1236) heap_init: At 600FE010 len 00001FD8 (7 KiB): RTCRAM I (1243) esp_psram: Adding pool of 8192K of PSRAM memory to heap allocator I (1251) spi_flash: detected chip: winbond I (1255) spi_flash: flash io: dio I (1259) sleep: Configure to isolate all GPIO pins in sleep state I (1266) sleep: Enable automatic switching of GPIO sleep configuration I (1273) app_start: Starting scheduler on CPU0 I (1278) app_start: Starting scheduler on CPU1 I (1278) main_task: Started on CPU0 I (1288) esp_psram: Reserving pool of 32K of internal memory for DMA/internal allocations I (1296) main_task: Calling app_main() I (1314) SC: Start codec chip I (1315) HAL: INIT I (1315) HAL: codec_hal done I (1315) AUDIO_BOARD: board-handle done I (1318) SC: Audio board_init done I (1322) AUDIO_HAL: Codec mode is 2, Ctrl:1 I (1327) SC: init player I (1330) PLAYER: enable initial sync timer I (1335) PLAYER: Start player_task I (1339) PLAYER: init player done I (1339) PLAYER: started sync task I (1344) pp: pp rom version: e7ae62f I (1351) net80211: net80211 rom version: e7ae62f I (1378) phy_init: phy_version 640,cd64a1a,Jan 24 2024,17:28:12 I (1417) WIFI: wifi_init_sta finished. I (5665) WIFI: retry to connect to the AP I (5666) WIFI: connect to the AP fail I (10638) WIFI: retry to connect to the AP I (10639) WIFI: connect to the AP fail I (13046) WIFI: retry to connect to the AP I (13047) WIFI: connect to the AP fail I (17297) WIFI: retry to connect to the AP I (17298) WIFI: connect to the AP fail I (18808) WIFI: retry to connect to the AP I (18808) WIFI: connect to the AP fail I (21216) WIFI: retry to connect to the AP I (21216) WIFI: connect to the AP fail I (25529) WIFI: retry to connect to the AP I (25530) WIFI: connect to the AP fail I (30505) WIFI: retry to connect to the AP I (30505) WIFI: connect to the AP fail I (32913) WIFI: retry to connect to the AP I (32914) WIFI: connect to the AP fail I (37161) WIFI: retry to connect to the AP I (37161) WIFI: connect to the AP fail I (39569) WIFI: retry to connect to the AP I (39569) WIFI: connect to the AP fail I (43819) WIFI: retry to connect to the AP I (43819) WIFI: connect to the AP fail I (46227) WIFI: retry to connect to the AP I (46227) WIFI: connect to the AP fail I (50476) WIFI: retry to connect to the AP I (50476) WIFI: connect to the AP fail I (53235) WIFI: retry to connect to the AP I (53236) WIFI: connect to the AP fail I (55644) WIFI: retry to connect to the AP I (55644) WIFI: connect to the AP fail I (60384) WIFI: retry to connect to the AP I (60384) WIFI: connect to the AP fail I (65424) WIFI: retry to connect to the AP I (65424) WIFI: connect to the AP fail I (67832) WIFI: retry to connect to the AP I (67832) WIFI: connect to the AP fail I (72086) WIFI: retry to connect to the AP I (72086) WIFI: connect to the AP fail I (74857) WIFI: retry to connect to the AP I (74858) WIFI: connect to the AP fail I (77265) WIFI: retry to connect to the AP I (77265) WIFI: connect to the AP fail I (82321) WIFI: retry to connect to the AP I (82321) WIFI: connect to the AP fail I (87791) WIFI: retry to connect to the AP I (87792) WIFI: connect to the AP fail I (90199) WIFI: retry to connect to the AP I (90199) WIFI: connect to the AP fail I (94448) WIFI: retry to connect to the AP I (94448) WIFI: connect to the AP fail I (99423) WIFI: retry to connect to the AP I (99424) WIFI: connect to the AP fail I (101831) WIFI: retry to connect to the AP I (101832) WIFI: connect to the AP fail I (106083) WIFI: retry to connect to the AP I (106084) WIFI: connect to the AP fail I (110883) WIFI: retry to connect to the AP I (110883) WIFI: connect to the AP fail I (113291) WIFI: retry to connect to the AP I (113292) WIFI: connect to the AP fail I (118265) WIFI: retry to connect to the AP I (118265) WIFI: connect to the AP fail I (123065) WIFI: retry to connect to the AP I (123065) WIFI: connect to the AP fail I (125473) WIFI: retry to connect to the AP I (125474) WIFI: connect to the AP fail I (129775) WIFI: retry to connect to the AP I (129775) WIFI: connect to the AP fail I (132034) WIFI: retry to connect to the AP I (132035) WIFI: connect to the AP fail I (134442) WIFI: retry to connect to the AP I (134442) WIFI: connect to the AP fail I (138746) WIFI: retry to connect to the AP I (138746) WIFI: connect to the AP fail I (144216) WIFI: retry to connect to the AP I (144216) WIFI: connect to the AP fail I (146624) WIFI: retry to connect to the AP I (146624) WIFI: connect to the AP fail I (148157) WIFI: retry to connect to the AP I (148157) WIFI: connect to the AP fail I (151443) WIFI: retry to connect to the AP I (151443) WIFI: connect to the AP fail I (153850) WIFI: retry to connect to the AP I (153851) WIFI: connect to the AP fail I (158099) WIFI: retry to connect to the AP I (158100) WIFI: connect to the AP fail I (160343) WIFI: retry to connect to the AP I (160344) WIFI: connect to the AP fail I (162751) WIFI: retry to connect to the AP I (162751) WIFI: connect to the AP fail I (164590) WIFI: Connected with IP Address:10.20.0.110 I (164591) esp_netif_handlers: sta ip: 10.20.0.110, mask: 255.255.255.0, gw: 10.20.0.1 I (164594) WIFI: connected to ap SSID: XXXXXXXXXXXXX I (164600) SC: Connected to AP I (164603) HTTP: Initializing SPIFFS I (164640) HTTP: Partition size: total: 534881, used: 50200 I (164641) HTTP: Mount /html to storage success I (164641) HTTP: d_name=/html/index.html d_ino=0 d_type=1 I (164648) HTTP: d_name=/html/ESP-LOGO.txt d_ino=0 d_type=1 I (164654) HTTP: d_name=/html/favicon.ico d_ino=0 d_type=1 I (164673) HTTP: Start http task=10.20.0.110 I (164673) HTTP: Starting server on 10.20.0.110:8000 I (164674) HTTP: Starting HTTP Server on port: '8000' I (164679) NETF: Setup mdns I (164694) NETF: Initializing SNTP I (164694) NETF: Waiting for system time to be set... (1/10) I (166695) NETF: The current date/time in UTC is: Sun Jul 28 00:38:47 2024 I (166696) dspProc: dsp_processor_init: init done I (166698) OTA: idf.py build ; curl snapclient.local:8032 --data-binary @- < build/snapclient.bin I (166707) main_task: Returned from app_main() I (166707) SC: try connecting to static configuration 10.20.0.10:1704 I (166778) SC: netconn connected I (166780) SC: netconn sent hello message I (168777) SC: Buffer length: 1000 I (168778) SC: Latency: 0 I (168778) SC: Mute: 0 I (168779) SC: Setting volume: 100 I (168783) dspProc: Set volume to 1.000000 I (168788) dspProc: Set volume to 1.000000 I (168909) SC: fLaC sampleformat: 44100:16:2 I (169595) PLAYER: player_setup_i2s: dma_buf_len is 576, dma_buf_count is 4 W (169597) PLAYER: no pcm chunk queue created? I (169597) PLAYER: created new queue with 37 I (169602) PLAYER: snapserver config changed, buffer 1000ms, chunk 1152 frames, sample rate 44100, ch 2, bits 16 mute 0 latency 0 ```
CarlosDerSeher commented 2 months ago

I have to say, I am in the middle of dropping the decoder task completely and moving the decoding part to http task but didn't have time to finish this yet. I did this for flac but not the other decoders.

I am doing this because of #54 but haven't found a working solution yet.

I could push these changes to a new branch for you to play with as I don't think it would be usefull to do much work on this in master.

aliask commented 2 months ago

OK, seems like there is some relationship here. I changed my stream to PCM@20ms and I don't see the crash any more.

However, still no audio - buffer semaphore never given, repeats W (34558) PLAYER: latency_buffer_full: can't take semaphore until the WDT kills process.

Unscientifically, I just increased the size of the FLAC task's stack from 9 * 256 to 12 * 256 and this also fixes the crash. However again, no audio:

I (7098) SC: netconn sent hello message
I (7108) SC: Buffer length:  1000
I (7108) SC: Latency:        0
I (7108) SC: Mute:           0
I (7108) SC: Setting volume: 100
I (7118) SC: fLaC sampleformat: 44100:16:2
I (7268) PLAYER: player_setup_i2s: dma_buf_len is 576, dma_buf_count is 4
W (7268) PLAYER: no pcm chunk queue created?
I (7268) PLAYER: created new queue with 37
I (7278) PLAYER: snapserver config changed, buffer 1000ms, chunk 1152 frames, sample rate 44100, ch 2, bits 16 mute 0 latency 0
I (10568) SC: latency buffer full
W (10578) PLAYER: RESYNCING HARD 1: age 693193us, latency 1203663951566us, free 8496531, largest block 8257536, rssi: -73
E (12578) PLAYER: Couldn't get PCM chunk, recv: messages waiting 0, diff2Server: 1203663s, 951.566ms
E (14578) PLAYER: Couldn't get PCM chunk, recv: messages waiting 0, diff2Server: 1203663s, 951.566ms
W (14798) PLAYER: RESYNCING HARD 1: age 4890606us, latency 1203663951566us, free 8499871, largest block 8257536, rssi: -70
W (14988) PLAYER: RESYNCING HARD 1: age 5042574us, latency 1203663941965us, free 8501207, largest block 8257536, rssi: -70

What settings do you recommend trying to get this software to work?

CarlosDerSeher commented 2 months ago

Did you double check your gpio config for i2s?

aliask commented 2 months ago

Short version

I have FLAC "working" @ 26ms, with a strange bug which is probably unrelated to the stack overflow. So it seems that increasing the hard-coded stack size is enough to fix this problem. Going from 9 to 10 is sufficient in my tests.

Long version

I have two STM32 S3 devboards, and out of frustration with WiFi stability I switched to the other board to test if it has the same behaviour. This new board connects to WiFi much more reliably, and while I was switching things over I also converted the DAC connection from floating jumper wires to using a breadboard.

After switching board I started getting choppy audio output! Progress! There is no debug output to suggest the source of this problem, apart from some uncorrelated RESYNCING HARD 1/2 messages. However, when I picked up the board the audio suddenly became clear.

I'm unsure if this impedance change fixes something to do with the I2S signals or something else entirely - but it seems that my hardware configuration is the source of some of my problems. Here's a recording of the audio while I place my finger on some unrelated GPIO pins: stm32s3_audio.mp3.gz. RSSI doesn't change when I do this, so I guess it's not WiFi.

CarlosDerSeher commented 2 months ago

Maybe a broken wire somewhere?

Which DAC?

aliask commented 2 months ago

I think I've figured it out. I'm using a PCM5102A DAC, and when you're not supplying SCK (which I'm not) then you have to ground it. I had left it floating, so I've fixed that up and the stream is a lot more reliable.

I'm still getting significant artifacts (e.g. briefly right after the client connects) and unreliable operation, but I might continue to debug & open a separate issue if I'm not able to figure things out.

Would you like me top open a PR to increase the FLAC stack size? How was the original value calculated?

CarlosDerSeher commented 2 months ago

Would you like me top open a PR to increase the FLAC stack size? How was the original value calculated?

Trial and error. Yes feel free to open a PR