Open neorevx opened 4 years ago
@neorevx Thanks for the detailed report, we will look into it.
There are apparently no problems with the firmware in the SPI memory. After a manual reboot (pin EN) it starts working. Checking the logs again, it started the program from the factory partition. Only now did I notice that it is the same failure!
(Cache disabled but cached memory region accessed)
The problem is the disabled cache combined with the soft restart.
I don't understand how the system manages SPI memory when writing the OTA partition. Certainly there must be techniques to manage the reading of the code of the running program and the writing of the new binary.
There may be a problem with the system cache and reading the SPI during this step. This also explains why other tasks get slower.
System cache is disabled...
My UART interrupt handler is in IRAM! The uart_pattern_enqueue function is too! It may be related to IRAM.
I believe that you can add a check in the bootloader by default to activate the cache. So the second failure will not happen, even if the first one does.
One more thing: 0x400855e8: spi_flash_op_block_func at D:/OneDrive/ESP32-Master/components/spi_flash/cache_utils.c:103 (discriminator 1)
The stack for the UART-related interrupt comes from the spi_flash component.
Another thing, not directly related to this issue: pattern det doesn't work if you compile optimized for performance. Maybe a "volatile" keyword is missing on some variables in the UART driver, I don't know.
I read some docs over the internet and concluded:
If I disable IRAM code for SPI, will the application run slowly (due to reading code from SPI)? Or will it impact only writing?
A flash write operation disables the cache, so the code and data in cached memory become inaccessible. So, ESP_EARLY_LOGW(UART_TAG, "Fail to enqueue pattern position, pattern queue is full."); may cause a crash. We will think of a solution to this issue. For now you can avoid the crash by disabling the ESP_LOG_WARN level of printing.
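One way to sidestep the problem described above is to avoid logging inside the ISR at all: the ISR only bumps a counter that lives in internal RAM, and a normal task reports it later, once the cache is back. The following is a minimal sketch of that idea, not the fix adopted in the UART driver. The function and variable names are hypothetical, and the `#else` branch exists only so the sketch also compiles on a host.

```c
#include <stdatomic.h>
#include <stdint.h>

#ifdef ESP_PLATFORM
#include "esp_attr.h"   /* provides IRAM_ATTR on ESP-IDF builds */
#else
#define IRAM_ATTR       /* no-op so the sketch compiles on a host */
#endif

/* Hypothetical counter. Non-const statics are placed in internal RAM,
 * so they stay reachable while the flash cache is disabled. */
static atomic_uint s_pattern_overflow_count;

/* Hypothetical helper: call this from the IRAM ISR instead of logging
 * a string that may live in flash. */
void IRAM_ATTR note_pattern_overflow_from_isr(void)
{
    atomic_fetch_add(&s_pattern_overflow_count, 1u);
}

/* Call this from a normal task. It returns the number of overflows seen
 * since the last call and resets the counter; the task can then log it. */
uint32_t take_pattern_overflow_count(void)
{
    return (uint32_t)atomic_exchange(&s_pattern_overflow_count, 0u);
}
```

A task could call `take_pattern_overflow_count()` periodically and print a warning when the result is non-zero.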
thanks!!.
any update here?
I might have hit this or a similar error as my stacktrace is:
` E (117340) task_wdt: Task watchdog got triggered. The following tasks did not reset the watchdog in time:
E (117340) task_wdt: - IDLE (CPU 0)
E (117340) task_wdt: Tasks currently running:
E (117340) task_wdt: CPU 0: ipc0
E (117340) task_wdt: CPU 1: IDLE
E (117340) task_wdt: Aborting.
abort() was called at PC 0x40129cbc on core 0
0x40129cbc: task_wdt_isr at /home/dbmtc/esp/esp-idf/components/esp_common/src/task_wdt.c:182 (discriminator 1)
Backtrace:0x40081882:0x3ffb0980 0x400892b9:0x3ffb09a0 0x4008f8f2:0x3ffb09c0 0x40129cbc:0x3ffb0a30 0x400829c5:0x3ffb0a50 0x40084768:0x3ffaf8a0 0x400834f6:0x3ffaf8c0 0x40083448:0x3ffaf8e0
0x40081882: panic_abort at /home/dbmtc/esp/esp-idf/components/esp_system/panic.c:404
0x400892b9: esp_system_abort at /home/dbmtc/esp/esp-idf/components/esp_system/system_api.c:112
0x4008f8f2: abort at /home/dbmtc/esp/esp-idf/components/newlib/abort.c:46
0x40129cbc: task_wdt_isr at /home/dbmtc/esp/esp-idf/components/esp_common/src/task_wdt.c:182 (discriminator 1)
0x400829c5: _xt_lowint1 at /home/dbmtc/esp/esp-idf/components/freertos/port/xtensa/xtensa_vectors.S:1105
0x40084768: xt_int_enable_mask at /home/dbmtc/esp/esp-idf/components/xtensa/include/xtensa/xtensa_api.h:170 (inlined by) intr_cntrl_ll_enable_int_mask at /home/dbmtc/esp/esp-idf/components/hal/esp32/include/hal/interrupt_controller_ll.h:100 (inlined by) interrupt_controller_hal_enable_int_mask at /home/dbmtc/esp/esp-idf/components/hal/include/hal/interrupt_controller_hal.h:192 (inlined by) esp_intr_noniram_enable at /home/dbmtc/esp/esp-idf/components/esp_system/intr_alloc.c:815
0x400834f6: spi_flash_op_block_func at /home/dbmtc/esp/esp-idf/components/spi_flash/cache_utils.c:124
0x40083448: ipc_task at /home/dbmtc/esp/esp-idf/components/esp_ipc/ipc.c:74 `
My wild guess: since the ipc0 task has a very high priority, it might take a long time to execute without ever resetting the watchdog, as this code can be spotted in cache_utils.c:
while (!s_flash_op_can_start) {
    // Busy loop and wait for spi_flash_op_block_func to disable cache
    // on the other CPU
}
Indeed, if this takes too much time, the crash is quite understandable.
If polling a bool is the right thing to do (I am not even sure a bool write is atomic enough), I suggest either resetting the watchdog or adding a delay (hoping the IDLE task will call wdt_reset in time).
I am also using the UART while updating, but I think the relation between those two things is a secondary side effect of the busy wait.
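To make the concern about atomicity and unbounded spinning concrete, here is a small host-side sketch using C11 atomics. It is not how cache_utils.c is written: `s_can_start`, `signal_can_start` and `wait_can_start` are hypothetical names, and whether a bounded wait is acceptable while the cache is disabled is an assumption that would need checking against the real code.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the flag polled in the busy loop. */
static atomic_bool s_can_start;

/* Producer side: publish the flag with release ordering. */
void signal_can_start(void)
{
    atomic_store_explicit(&s_can_start, true, memory_order_release);
}

/* Consumer side: poll the flag at most max_spins times.
 * Returns true if the flag was observed set, false if the budget ran out,
 * so the caller can feed a watchdog or report an error instead of hanging. */
bool wait_can_start(uint32_t max_spins)
{
    for (uint32_t i = 0; i < max_spins; i++) {
        if (atomic_load_explicit(&s_can_start, memory_order_acquire)) {
            return true;
        }
    }
    return false;
}
```

The explicit atomic type removes the doubt about whether a plain bool write is visible to the other core, and the spin budget turns a silent hang into a condition the caller can handle.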
Environment
Problem Description + Debug Logs
The use of OTA has a negative impact on UART performance. Maybe it's just excessive CPU usage when writing SPI, or there may be a problem with silicon, I don't know.
Context: In my project I use all UART peripherals: 0-console, 1-GPS module, 2-GSM modem (sim800L). The GPS and the GSM modem use the hardware pattern det feature. The system also has an OTA update, normally running when the GSM modem is connected in PPP mode (without pattern det). When the update starts (OTA), the UART driver is unable to read all content received by the UART and the system generates "hw fifo full".
Okay, this is a problem, but of lesser impact. In some previous versions of ESP-IDF (for example 3.2), only a few messages were displayed and the system recovered soon (once the OTA settled). However, in version 4.2dev, the pattern det feature generates a critical failure. Look:
It may be related to issue https://github.com/espressif/esp-idf/issues/4412. The error is at this log line: https://github.com/espressif/esp-idf/blob/6330b3345e87eb4401e7be7c8b6fea2870c35d9f/components/driver/uart.c#L347
Since this log uses ets_printf, it should not be a problem.
For some even worse reason, OTA is able to write a few bytes to the OTA partition, and at the next boot (automatic restart after the failure) the system starts from the "partial" OTA partition, or OTA may have corrupted the factory partition. Either way, it doesn't start anymore. Until now, in my tests, the OTA validation was correct. But it seems that this is no longer the case: it starts a partition with an invalid checksum. Maybe the OTA updater corrupted the factory partition. Failed to start:
I'm not even sure if there is a valid binary when the OTA updater starts. That is, it could be that http sent error content in html format and was saved as a binary app.
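A cheap guard against saving an HTML error page as firmware is to look at the first received byte before writing anything: an ESP32 application image starts with the magic byte 0xE9. The sketch below shows only that check. `looks_like_app_image` and `APP_IMAGE_MAGIC` are hypothetical names, and this does not replace checking the HTTP status or the image checksum validation performed later.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* First byte of an ESP32 application image header. */
#define APP_IMAGE_MAGIC 0xE9u

/* Hypothetical helper: returns true only if the first downloaded chunk
 * starts with the image magic byte. An HTML error body begins with '<'
 * or whitespace and is rejected. */
bool looks_like_app_image(const uint8_t *data, size_t len)
{
    return data != NULL && len > 0 && data[0] == APP_IMAGE_MAGIC;
}
```

The OTA code could run this on the first chunk of the HTTP body and abort the update before erasing or writing the target partition when it returns false.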
I believe the most important points here are:
Expected Behavior
Steps to reproduce
Use the NMEA and OTA examples together. The UART impact may occur even without PPP.