Open KaeLL opened 1 year ago
What is your WiFi signal strength (rssi)? Does the problem go away with better signal quality? We had occasional task WDT timeouts on bad signal quality. Not during OTA updates, but many other activities. Raising the timeout to 60 seconds is our workaround...
Interesting that you mentioned that because I'm doing bandwidth throttling tests, albeit on Ethernet. It doesn't look like it's directly network-related, though. Among other tests, this one involves lowering the watchdog timeout to 2 or 3s in order to catch potentially problematic blocking code, and this one came up. Another one that pops up here and there is MbedTLS's MPI code. It's not obviously clear to me whether these operations should indeed trigger a low timeout watchdog or if a workaround/circumvention is warranted.
We had occasional task WDT timeouts on bad signal quality. Not during OTA updates, but many other activities.
Could you post your debug log (backtrace) when the issue happened?
This was in 4.4.2 or 4.4.3. Only when the rssi was really bad. Raising the task WDT timeout to 60 seconds "solved" the problem.
0x4008211e: panic_abort at /esp-idf/components/esp_system/panic.c:402
0x4008b129: esp_system_abort at /esp-idf/components/esp_system/esp_system.c:128
0x40094c5a: abort at /esp-idf/components/newlib/abort.c:46
0x400f74bc: task_wdt_isr at /esp-idf/components/esp_system/task_wdt.c:176 (discriminator 3)
0x40082f6d: _xt_lowint1 at /esp-idf/components/freertos/port/xtensa/xtensa_vectors.S:1111
0x40084a63: spi_device_polling_start at /esp-idf/components/driver/spi_master.c:930 (discriminator 5)
0x40084bbc: spi_device_polling_transmit at /esp-idf/components/driver/spi_master.c:1000
0x400e469c: panel_io_spi_tx_color at /esp-idf/components/esp_lcd/src/esp_lcd_panel_io_spi.c:315
0x400e458e: esp_lcd_panel_io_tx_color at /esp-idf/components/esp_lcd/src/esp_lcd_panel_io.c:29 (discriminator 2)
0x400e4dfd: panel_st7789_draw_bitmap at /esp-idf/components/esp_lcd/src/esp_lcd_panel_st7789.c:194
0x400e50ea: esp_lcd_panel_draw_bitmap at /esp-idf/components/esp_lcd/src/esp_lcd_panel_ops.c:34 (discriminator 2)
0x4008211e: panic_abort at /esp-idf/components/esp_system/panic.c:402
0x4008b129: esp_system_abort at /esp-idf/components/esp_system/esp_system.c:128
0x40094c5a: abort at /esp-idf/components/newlib/abort.c:46
0x400f74bc: task_wdt_isr at /esp-idf/components/esp_system/task_wdt.c:176 (discriminator 3)
0x40082f6d: _xt_lowint1 at /esp-idf/components/freertos/port/xtensa/xtensa_vectors.S:1111
0x4000bfed: ?? ??:0
0x4008ebe6: vPortClearInterruptMaskFromISR at /esp-idf/components/freertos/port/xtensa/include/freertos/portmacro.h:571
(inlined by) vPortExitCritical at /esp-idf/components/freertos/port/xtensa/port.c:319
0x4008af39: vRingbufferReturnItem at /esp-idf/components/esp_ringbuf/ringbuf.c:1296
0x400f2045: uart_read_bytes at /esp-idf/components/driver/uart.c:1328
0x4008211e: panic_abort at esp-idf/components/esp_system/panic.c:402
0x4008b129: esp_system_abort at esp-idf/components/esp_system/esp_system.c:128
0x40094c5a: abort at esp-idf/components/newlib/abort.c:46
0x400f74bc: task_wdt_isr at esp-idf/components/esp_system/task_wdt.c:176 (discriminator 3)
0x40082f6d: _xt_lowint1 at esp-idf/components/freertos/port/xtensa/xtensa_vectors.S:1111
0x4000bfed: ?? ??:0
0x4008ebe6: vPortClearInterruptMaskFromISR at esp-idf/components/freertos/port/xtensa/include/freertos/portmacro.h:571
(inlined by) vPortExitCritical at esp-idf/components/freertos/port/xtensa/port.c:319
0x4008c101: xQueueGenericSend at esp-idf/components/freertos/queue.c:946
0x40092fe9: sys_mutex_unlock at esp-idf/components/lwip/port/esp32/freertos/sys_arch.c:96
0x40092ebc: sock_inc_used_locked at esp-idf/components/lwip/lwip/src/api/sockets.c:391
(inlined by) tryget_socket_unconn_locked at esp-idf/components/lwip/lwip/src/api/sockets.c:489
0x401693ad: lwip_select_inc_sockets_used_set at esp-idf/components/lwip/lwip/src/api/sockets.c:1977
0x40092907: lwip_select_inc_sockets_used_set at esp-idf/components/lwip/lwip/src/api/sockets.c:1970
(inlined by) lwip_select_inc_sockets_used at esp-idf/components/lwip/lwip/src/api/sockets.c:1997
(inlined by) lwip_select at esp-idf/components/lwip/lwip/src/api/sockets.c:2051
0x40107b55: select at esp-idf/components/lwip/port/esp32/no_vfs_syscalls.c:77
0x4010e4d4: httpd_server at esp-idf/components/esp_http_server/src/httpd_main.c:205
(inlined by) httpd_thread at esp-idf/components/esp_http_server/src/httpd_main.c:250
What do you pass as image_size
to esp_ota_begin
? Deleting the entire partition up-front (e.g. by passing OTA_SIZE_UNKNOWN
) can lead to quite some delays which might trigger the watchdog.
Hello @KaeLL
Erasing the whole partition at once can be the cause of the watchdog getting triggered. There are two ways you can try to solve this issue:
1) Setting bulk_flash_erase
in esp_https_ota_config_t
structure as false.
2) Increasing the watchdog timeout before starting the OTA process, and the resetting timeout to its original value once the OTA process is complete.
Please let me know if these suggestions helps you solve the issue. Thanks!
@hmalpani What's your comment about this issue: https://github.com/espressif/esp-idf/issues/6096
Hello @KaeLL Erasing the whole partition at once can be the cause of the watchdog getting triggered. There are two ways you can try to solve this issue:
- Setting
bulk_flash_erase
inesp_https_ota_config_t
structure as false.
In my case, the bulk_flash_erase is false. BTW, if you confirm bulk_flash_erase=true can cause this issue, this needs fix. (Why do you provide a config option that can cause watchdog timeout?)
- Increasing the watchdog timeout before starting the OTA process, and the resetting timeout to its original value once the OTA process is complete.
The purpose of watchdog timeout is to prevent any task occupies CPU for a long period which is likely to be a mis-behavior or bug. But now you ask increasing the watchdog timeout rather than try to figure out why the OTA process can cause watchdog timeout? This seems does not make sense.
@AxelLin is right. Acting as if the flash and watchdog components are this unaware of each other doesn't really make sense. What's weird to me is that spi_flash
already accounts for long erases by optionally yielding, but not writes. Forgive my ignorance, but why?
What's weird to me is that spi_flash already accounts for long erases by optionally yielding, but not writes. Forgive my ignorance, but why?
When this option (yield during an erase operations) was added, we have considered that the in the typical usage of spi_flash API, erase and programming operations are alternating, and since erase operations take much longer it is sufficient to yield only when erasing. We didn't consider usage such as when the OTA process performs a bulk erase and then the bulk write.
However it should be noted that during the OTA, TCP/IP and Wi-Fi tasks may also be active, and with sufficient network throughput it might not be sufficient to yield in the task calling spi_flash API only. (While spi_flash_write yields, other high priority tasks may run, and the idle task won't get a chance to run.)
In the end, it becomes a system level decision: if the whole system is busy enough that the idle task doesn't get a chance to run, is the idle task watchdog triggering still the expected behavior? I think it is probably a simpler option to assume such behavior than adding a "yield occasionally" type of setting to every component which may occupy a lot of CPU time.
However, we don't illustrate the way to deal with this problem anywhere in the examples or the docs, and it's definitely something to improve.
All that said, I don't think the original issue reported by @KaeLL is the "real" task watchdog timeout (idle task starvation) which could be treated by yielding:
E (12000) task_wdt: CPU 0: ipc0
E (12000) task_wdt: CPU 1: IDLE
looks like some race condition in spi_flash driver has caused us to get stuck in the ipc task forever. If this was a real watchdog timeout, one of the cores would be running the OTA task and the other one the IPC task.
(Some of the other backtraces posted in subsequent comments do point to the idle task being starved by network/flash operations, so that point is also valid. It's probably better to split the two types of failures into separate issues, though.)
All that said, I don't think the original issue reported by @KaeLL is the "real" task watchdog timeout (idle task starvation) which could be treated by yielding:
E (12000) task_wdt: CPU 0: ipc0 E (12000) task_wdt: CPU 1: IDLE
Similar issues: https://github.com/espressif/esp-idf/issues/7335#issuecomment-1044454282 https://github.com/espressif/esp-idf/issues/4984#issuecomment-1156445364
we don't illustrate the way to deal with this problem anywhere in the examples or the docs and it's definitely something to improve
👍 Which way is that, btw? Edit:
looks like some race condition in spi_flash driver has caused us to get stuck in the ipc task forever
Yey, my favorite type of bug
Hi @KaeLL! Could you try this fix and give feedback if it helps? 28515.zip This fix helps in cases when an unpinned to a certain CPU task runs a flash operation and gets stuck due to the IPC task not call the callback function which the flash task is waiting for.
@KonstantinKondrashov I just tested with and without the patch, and I'm not sure about the results. On one hand, I got the error again, with the same backtrace, both with and without the patch. On the other, considering igrr's description of what it would look like if this was a real watchdog timeout, it's different this time since the APP_CPU is running the LwIP task. With that said, it's worth remembering that the error goes away if I set the watchdog timeout to something greater than 2 seconds.
Again, here's the defconfig
@KonstantinKondrashov Any follow up regarding the feedback in https://github.com/espressif/esp-idf/issues/10890#issuecomment-1908810272 ?
Answers checklist.
IDF version.
release/v4.4
Operating System used.
Linux
How did you build your project?
Command line with idf.py
If you are using Windows, please specify command line type.
None
Development Kit.
Custom Board
Power Supply used.
External 5V
What is the expected behavior?
Watchdog not trigger.
What is the actual behavior?
Watchdog triggers on IPC task after
esp_ota_begin
Steps to reproduce.
defconfig
``` CONFIG_BOOTLOADER_LOG_LEVEL_WARN=y CONFIG_BOOTLOADER_APP_ROLLBACK_ENABLE=y CONFIG_ESPTOOLPY_FLASHMODE_QIO=y CONFIG_ESPTOOLPY_FLASHFREQ_80M=y CONFIG_ESPTOOLPY_FLASHSIZE_4MB=y # CONFIG_ESPTOOLPY_FLASHSIZE_DETECT is not set CONFIG_ESPTOOLPY_AFTER_NORESET=y CONFIG_ESPTOOLPY_MONITOR_BAUD_CONSOLE=y CONFIG_PARTITION_TABLE_CUSTOM=y CONFIG_PARTITION_TABLE_CUSTOM_FILENAME="partitions_ota_encrypted.csv" CONFIG_PARTITION_TABLE_OFFSET=0xC000 CONFIG_BT_ENABLED=y CONFIG_BTDM_CTRL_BLE_MAX_CONN=1 # CONFIG_BTDM_CTRL_MODEM_SLEEP is not set # CONFIG_BTDM_BLE_SCAN_DUPL is not set # CONFIG_BTDM_BLE_ADV_REPORT_FLOW_CTRL_SUPP is not set # CONFIG_BT_BLE_SMP_ENABLE is not set CONFIG_BT_STACK_NO_LOG=y CONFIG_BT_ACL_CONNECTIONS=1 # CONFIG_BT_MULTI_CONNECTION_ENBALE is not set CONFIG_BT_BLE_DYNAMIC_ENV_MEMORY=y CONFIG_BT_BLE_HOST_QUEUE_CONG_CHECK=y CONFIG_SPI_MASTER_IN_IRAM=y # CONFIG_SPI_SLAVE_ISR_IN_IRAM is not set CONFIG_ESP32_REV_MIN_1=y CONFIG_ESP32_DEFAULT_CPU_FREQ_240=y CONFIG_ESP32_ULP_COPROC_ENABLED=y CONFIG_ESP32_ULP_COPROC_RESERVE_MEM=128 # CONFIG_ESP32_DEBUG_OCDAWARE is not set CONFIG_ESP32_RTC_CLK_SRC_INT_8MD256=y # CONFIG_ESP_ERR_TO_NAME_LOOKUP is not set CONFIG_ETH_DMA_BUFFER_SIZE=256 CONFIG_ETH_DMA_RX_BUFFER_NUM=3 CONFIG_ETH_DMA_TX_BUFFER_NUM=3 # CONFIG_ETH_USE_SPI_ETHERNET is not set # CONFIG_ESP_EVENT_POST_FROM_ISR is not set CONFIG_ESP_HTTP_CLIENT_ENABLE_BASIC_AUTH=y # CONFIG_ESP_HTTP_CLIENT_ENABLE_DIGEST_AUTH is not set CONFIG_HTTPD_MAX_REQ_HDR_LEN=2048 CONFIG_ESP_IPC_TASK_STACK_SIZE=2048 # CONFIG_ESP_PHY_CALIBRATION_AND_DATA_STORAGE is not set CONFIG_ESP_MAIN_TASK_STACK_SIZE=10240 CONFIG_ESP_CONSOLE_UART_CUSTOM=y CONFIG_ESP_CONSOLE_UART_BAUDRATE=921600 CONFIG_ESP_INT_WDT_TIMEOUT_MS=30 CONFIG_ESP_TASK_WDT_PANIC=y CONFIG_ESP_TASK_WDT_TIMEOUT_S=2 CONFIG_ESP_PANIC_HANDLER_IRAM=y # CONFIG_ESP32_WIFI_NVS_ENABLED is not set # CONFIG_ESP32_WIFI_IRAM_OPT is not set # CONFIG_ESP32_WIFI_RX_IRAM_OPT is not set CONFIG_ESP_COREDUMP_ENABLE_TO_UART=y CONFIG_ESP_COREDUMP_CHECKSUM_SHA256=y CONFIG_ESP_COREDUMP_MAX_TASKS_NUM=30 CONFIG_ESP_COREDUMP_STACK_SIZE=1024 CONFIG_FREERTOS_CORETIMER_1=y CONFIG_FREERTOS_HZ=200 # CONFIG_FREERTOS_ASSERT_ON_UNTESTED_FUNCTION is not set CONFIG_FREERTOS_WATCHPOINT_END_OF_STACK=y CONFIG_HEAP_ABORT_WHEN_ALLOCATION_FAILS=y CONFIG_LOG_DEFAULT_LEVEL_DEBUG=y CONFIG_LOG_TIMESTAMP_SOURCE_SYSTEM=y CONFIG_LWIP_TCPIP_TASK_AFFINITY_CPU0=y CONFIG_MBEDTLS_CERTIFICATE_BUNDLE_DEFAULT_CMN=y # CONFIG_MBEDTLS_HARDWARE_MPI is not set CONFIG_MBEDTLS_TLS_CLIENT_ONLY=y # CONFIG_MQTT_TRANSPORT_WEBSOCKET is not set CONFIG_MQTT_SKIP_PUBLISH_IF_DISCONNECTED=y CONFIG_MQTT_TASK_CORE_SELECTION_ENABLED=y CONFIG_NVS_ASSERT_ERROR_CHECK=y CONFIG_EMAC_RECV_TASK_CORE=0 ```Debug Logs.