Write main SPI memory (OTA) impacting UART, now generate fatal error in UART pattern det and invalid OTA (IDFGH-2951)

neorevx commented 4 years ago

Environment

Development Kit: Custom
Kit version: Custom
Module or chip used: ESP32-WROOM-32U
IDF version: v4.2-dev-792-g6330b3345 (least today)
Build System: idf.py
Compiler version: xtensa-esp32-elf-gcc (crosstool-NG esp-2019r2) 8.2.0
Operating System: Windows
(Windows only) environment type: ESP Command Prompt
Using an IDE?: VS Code
Power Supply: Custom 3.3v

Problem Description + Debug Logs

The use of OTA has a negative impact on UART performance. Maybe it's just excessive CPU usage when writing SPI, or there may be a problem with silicon, I don't know.

Context: In my project I use all UART peripherals: 0-console, 1-GPS module, 2-GSM modem (sim800L). The GPS and the GSM modem use the hardware pattern det feature. The system also has an OTA update, normally running when the GSM modem is connected in PPP mode (without pattern det). When the update starts (OTA), the UART driver is unable to read all content received by the UART and the system generates "hw fifo full".

Okay, this is a problem, but of lesser impact. Until some previous versions of ESP-IDF (example 3.2), only few messages was displayed and the system recovered soon (when the OTA normalizes). However, in version 4.2dev, the pattern det feature generates a critical failure. Look:

Guru Meditation Error: Core  0 panic'ed (Cache disabled but cached memory region accessed).

Core  0 register dump:
PC      : 0x40008150  PS      : 0x00060034  A0      : 0x8008758b  A1      : 0x3ffb1710
A2      : 0x3f420e08  A3      : 0x00000000  A4      : 0x00000004  A5      : 0x00000000
A6      : 0x00000000  A7      : 0x00000000  A8      : 0x3ffb51f4  A9      : 0x3ffb1760
A10     : 0x000000ad  A11     : 0x00060023  A12     : 0x00060021  A13     : 0x3ffcc92e
A14     : 0x00000a0d  A15     : 0x36372a38  SAR     : 0x00000000  EXCCAUSE: 0x00000007
EXCVADDR: 0x00000000  LBEG    : 0x4000c2e0  LEND    : 0x4000c2f6  LCOUNT  : 0x00000000

Backtrace:0x4000814d:0x3ffb1710 0x40087588:0x3ffb17a0 0x40087942:0x3ffb17c0 0x40082c61:0x3ffb1810 0x400855e8:0x3ffbf410 0x400824fd:0x3ffbf430 0x4009a995:0x3ffbf450
0x40087588: uart_pattern_enqueue at D:/OneDrive/ESP32-Master/components/driver/uart.c:347 (discriminator 1)

0x40087942: uart_rx_intr_handler_default at D:/OneDrive/ESP32-Master/components/driver/uart.c:866

0x40082c61: _xt_lowint1 at D:/OneDrive/ESP32-Master/components/freertos/xtensa/xtensa_vectors.S:1105

0x400855e8: spi_flash_op_block_func at D:/OneDrive/ESP32-Master/components/spi_flash/cache_utils.c:103 (discriminator 1)

0x400824fd: ipc_task at D:/OneDrive/ESP32-Master/components/esp_common/src/ipc.c:62

0x4009a995: vPortTaskWrapper at D:/OneDrive/ESP32-Master/components/freertos/xtensa/port.c:143

Core  1 register dump:
PC      : 0x4000c056  PS      : 0x00060134  A0      : 0x80008547  A1      : 0x3ffe32a0
A2      : 0x8ae742f9  A3      : 0x3ffe3270  A4      : 0x00000001  A5      : 0x3ffe32b0
A6      : 0x00000003  A7      : 0x00000000  A8      : 0x00000000  A9      : 0xfc000000
A10     : 0xb0000000  A11     : 0x00000000  A12     : 0x00000001  A13     : 0x3ffe3290
A14     : 0x00000002  A15     : 0x00000000  SAR     : 0x00000020  EXCCAUSE: 0x00000007
EXCVADDR: 0x00000000  LBEG    : 0x4000c2e0  LEND    : 0x4000c2f6  LCOUNT  : 0xffffffff

Backtrace:0x4000c053:0x3ffe32a0 0x40008544:0x3ffe32b0 0x40086562:0x3ffe32d0 0x4008a37e:0x3ffe32f0 0x4008a5d5:0x3ffe3330 0x40086207:0x3ffe3350 0x400f113b:0x3ffe3390 0x400f0a0b:0x3ffe33b0 0x400e9e44:0x3ffe33d0 0x400ea001:0x3ffe33f0 0x400e4741:0x3ffe3420 0x4009a995:0x3ffe34a0
0x40086562: delay_ms at D:/OneDrive/ESP32-Master/components/spi_flash/spi_flash_os_func_app.c:88

0x4008a37e: spi_flash_chip_generic_wait_idle at D:/OneDrive/ESP32-Master/components/spi_flash/spi_flash_chip_generic.c:273

0x4008a5d5: spi_flash_chip_generic_erase_block at D:/OneDrive/ESP32-Master/components/spi_flash/spi_flash_chip_generic.c:134

0x40086207: esp_flash_erase_region at D:/OneDrive/ESP32-Master/components/spi_flash/esp_flash_api.c:349

0x400f113b: esp_partition_erase_range at D:/OneDrive/ESP32-Master/components/spi_flash/partition.c:422

0x400f0a0b: esp_ota_begin at D:/OneDrive/ESP32-Master/components/app_update/esp_ota_ops.c:158

0x400e9e44: esp_https_ota_perform at D:/OneDrive/ESP32-Master/components/esp_https_ota/src/esp_https_ota.c:274

0x400ea001: esp_https_ota at D:/OneDrive/ESP32-Master/components/esp_https_ota/src/esp_https_ota.c:399

0x400e4741: do_firmware_update at d:\onedrive\esp32-projects\vcar\build/../main/ota/vupdate.c:31

0x4009a995: vPortTaskWrapper at D:/OneDrive/ESP32-Master/components/freertos/xtensa/port.c:143

Since this log use ets_printf, it should not be an problem.

For some even worse reason, OTA is able to write a few bytes to the OTA partition and at the next boot (automatic restart after failure) the system starts on the "partial" OTA partition or OTA may corrupt the factory partition. And it doesn't start anymore. Until then, in my tests, the OTA validation was correct. But now it seems that it is no longer, it starts a partition with invalid checksum. Maybe OTA updater corrupted factory partition. Failed to start:

I (32) boot: ESP-IDF v4.2-dev-792-g6330b3345 2nd stage bootloader
I (32) boot: compile time 14:30:37
I (33) boot: chip revision: 1
I (37) boot_comm: chip revision: 1, min. bootloader chip revision: 0
I (44) boot.esp32: SPI Speed      : 40MHz
I (49) boot.esp32: SPI Mode       : DIO
I (53) boot.esp32: SPI Flash Size : 4MB
I (58) boot: Enabling RNG early entropy source...
I (63) boot: Partition Table:
I (67) boot: ## Label            Usage          Type ST Offset   Length
I (74) boot:  0 nvs              WiFi data        01 02 00011000 00008000
I (81) boot:  1 otadata          OTA data         01 00 00019000 00002000
I (89) boot:  2 phy_init         RF data          01 01 0001b000 00001000
I (96) boot:  3 factory          factory app      00 00 00020000 00130000
I (104) boot:  4 ota_0            OTA app          00 10 00150000 00130000
I (111) boot:  5 ota_1            OTA app          00 11 00280000 00130000
I (119) boot:  6 nvs_key          NVS keys         01 04 003b0000 00004000
I (127) boot:  7 config           WiFi data        01 02 003b4000 00040000
I (134) boot: End of partition table
I (138) boot: Defaulting to factory image
I (143) boot_comm: chip revision: 1, min. application chip revision: 0
I (150) esp_image: segment 0: paddr=0x00020020 vaddr=0x3f400020 size=0x34770 (214896) map
I (241) esp_image: segment 1: paddr=0x00054798 vaddr=0x3ffb0000 size=0x051d8 ( 20952) load
I (250) esp_image: segment 2: paddr=0x00059978 vaddr=0x40080000 size=0x00404 (  1028) load
0x40080000: _WindowOverflow4 at D:/OneDrive/ESP32-Master/components/freertos/xtensa/xtensa_vectors.S:1730

I (251) esp_image: segment 3: paddr=0x00059d84 vaddr=0x40080404 size=0x06294 ( 25236) load
I (268) esp_image: segment 4: paddr=0x00060020 vaddr=0x400d0020 size=0xc23a0 (795552) map
0x400d0020: _stext at ??:?

I (571) esp_image: segment 5: paddr=0x001223c8 vaddr=0x40086698 size=0x17ad4 ( 96980) load
0x40086698: emac_hal_rx_complete_cb at D:/OneDrive/ESP32-Master/components/esp_eth/src/esp_eth_mac_esp32.c:480

I (629) boot: Loaded app from partition at offset 0x20000
I (630) boot: Disabling RNG early entropy source...
I (630) cpu_start: Pro cpu up.
I (634) cpu_start: Application information:
I (638) cpu_start: Project name:     vcar
I (643) cpu_start: App version:      0fb43f6-dirty
I (649) cpu_start: Compile time:     Mar 24 2020 15:32:18
I (655) cpu_start: ELF file SHA256:  d93936a5ddb5182c...
I (661) cpu_start: ESP-IDF:          v4.2-dev-792-g6330b3345
I (667) cpu_start: Starting app cpu, entry point is 0x40081bb4
0x40081bb4: call_start_cpu1 at D:/OneDrive/ESP32-Master/components/esp32/cpu_start.c:273

I (657) cpu_start: App cpu up.
I (678) heap_init: Initializing. RAM available for dynamic allocation:
I (684) heap_init: At 3FFAE6E0 len 00001920 (6 KiB): DRAM
I (691) heap_init: At 3FFBE400 len 00021C00 (135 KiB): DRAM
I (697) heap_init: At 3FFE0440 len 00003AE0 (14 KiB): D/IRAM
I (703) heap_init: At 3FFE4350 len 0001BCB0 (111 KiB): D/IRAM
I (710) heap_init: At 4009E16C len 00001E94 (7 KiB): IRAM
I (716) cpu_start: Pro cpu start user code
Guru Meditation Error: Core  0 panic'ed (Cache disabled but cached memory region accessed).

Core  0 register dump:
PC      : 0x4009df09  PS      : 0x00060d34  A0      : 0x800d46c8  A1      : 0x3ffe3bc0
0x4009df09: xt_ints_on at D:/OneDrive/ESP32-Master/components/freertos/xtensa/xtensa_intr_asm.S:164

A2      : 0x0300000c  A3      : 0x0100000c  A4      : 0x00000019  A5      : 0x00000000
A6      : 0x00000016  A7      : 0x00000000  A8      : 0x15122500  A9      : 0x00060d20
A10     : 0x3ffb5994  A11     : 0x00008000  A12     : 0x00000001  A13     : 0x00000001
A14     : 0x00000018  A15     : 0x00000000  SAR     : 0x0000001d  EXCCAUSE: 0x00000007
EXCVADDR: 0x00000000  LBEG    : 0x4000c2e0  LEND    : 0x4000c2f6  LCOUNT  : 0xffffffff

Backtrace:0x4009df06:0x3ffe3bc0 0x400d46c5:0x3ffe3bd0 0x40081ab9:0x3ffe3bf0 0x40081dc5:0x3ffe3c40 0x4007922e:0x3ffe3c80 |<-CORRUPTED
0x4009df06: xt_ints_on at D:/OneDrive/ESP32-Master/components/freertos/xtensa/xtensa_intr_asm.S:163

0x400d46c5: esp_cache_err_int_init at D:/OneDrive/ESP32-Master/components/esp32/cache_err_int.c:70

0x40081ab9: start_cpu0_default at D:/OneDrive/ESP32-Master/components/esp32/cpu_start.c:390

0x40081dc5: call_start_cpu0 at D:/OneDrive/ESP32-Master/components/esp32/cpu_start.c:261

I'm not even sure if there is a valid binary when the OTA updater starts. That is, it could be that http sent error content in html format and was saved as a binary app.

I believe the most important here are:

UART pattern det fails when the queue is full;
Bootloader starting OTA with invalid checksum.

Expected Behavior

UART works normally when write SPI;
UART don't fail do enqueue det pattern;
Partial OTA can't be booted. Or factory parition has been corrupted by OTA updater.

Steps to reproduce

Use NMEA with OTA examples together. UART impact may occur without ppp.

Alvin1Zhang commented 4 years ago

@neorevx Thanks for detailed reports, we will look into.

neorevx commented 4 years ago

There are apparently no problems with the firmware in the SPI memory. After manual reboot (pin EN) it get working. Checking the logs again, he started the program from the factory partition. Only now that I noticed that it is the same failure!

(Cache disabled but cached memory region accessed)

Problem is cached disabled and soft restart.

I don't understand how the system manages SPI memory when writing the OTA partition. Certainly there must be techniques to manage the reading of the code of the running program and the writing of the new binary. There may be a problem with the system cache and reading the SPI during this step. This also explains why other tasks get slower. System cache is disabled... My UART interruption is in IRAM! The uart_pattern_enqueuefunction too! It may be related to IRAM. I believe that you can add a check in the bootloader by default to activate the cache. So the second failure will not happen, even if the first one does.

More one thing: 0x400855e8: spi_flash_op_block_func at D:/OneDrive/ESP32-Master/components/spi_flash/cache_utils.c:103 (discriminator 1) The stack for UART related interrupt come from spi_flash component.

neorevx commented 4 years ago

Another thing not related direct to this issue: pattern det don't work if you compile optimized for performance. Maybe are missing some "volatile" keyword for some vars in UART driver, I don't know.

neorevx commented 4 years ago

I read some docs over the internet and concluded:

Cache is disabled for SPI write;
When cache is disabled and IRAM intr is triggered, only IRAM code can be executed. I think ets_printf is not in IRAM, then call this function generates an exception. ets_printf should be put in IRAM or logs should be removed from interrupt. Or remove intr from IRAM;
When I disabled "place SPI ... in IRAM" I still getting this exception;
When I disabled "place UART ... in IRAM" (but SPI in IRAM), this exception does not happens anymore. I still getting some "fifo full", but it's no problem at moment.

If I disabled IRAM code for SPI, the application will runs slowly? (due reading code from SPI). Or it will impact only writing?

koobest commented 4 years ago

Flash writing operation will disable the cache, the code and data in cache Inaccessible. So, ESP_EARLY_LOGW(UART_TAG, "Fail to enqueue pattern position, pattern queue is full."); may cause a crash. We will think of a solution to this issue. Currently you can avoid crashes by disable ESP_LOG_WARN level of printing. thanks!!.

zheyuejiang commented 3 years ago

any update here?

negativekelvin commented 3 years ago

Here's a hint https://github.com/espressif/esp-idf/commit/a0e66fef0880dba2caf93e9cb9c65cff7dc9b880

DBMTC commented 2 years ago

I might have hit this or a similar error as my stacktrace is: ` E (117340) task_wdt: Task watchdog got triggered. The following tasks did not reset the watchdog in time: E (117340) task_wdt: - IDLE (CPU 0) E (117340) task_wdt: Tasks currently running: E (117340) task_wdt: CPU 0: ipc0 E (117340) task_wdt: CPU 1: IDLE E (117340) task_wdt: Aborting.

abort() was called at PC 0x40129cbc on core 0 0x40129cbc: task_wdt_isr at /home/dbmtc/esp/esp-idf/components/esp_common/src/task_wdt.c:182 (discriminator 1)

Backtrace:0x40081882:0x3ffb0980 0x400892b9:0x3ffb09a0 0x4008f8f2:0x3ffb09c0 0x40129cbc:0x3ffb0a30 0x400829c5:0x3ffb0a50 0x40084768:0x3ffaf8a0 0x400834f6:0x3ffaf8c0 0x40083448:0x3ffaf8e0 0x40081882: panic_abort at /home/dbmtc/esp/esp-idf/components/esp_system/panic.c:404

0x400892b9: esp_system_abort at /home/dbmtc/esp/esp-idf/components/esp_system/system_api.c:112

0x4008f8f2: abort at /home/dbmtc/esp/esp-idf/components/newlib/abort.c:46

0x40129cbc: task_wdt_isr at /home/dbmtc/esp/esp-idf/components/esp_common/src/task_wdt.c:182 (discriminator 1)

0x400829c5: _xt_lowint1 at /home/dbmtc/esp/esp-idf/components/freertos/port/xtensa/xtensa_vectors.S:1105

0x40084768: xt_int_enable_mask at /home/dbmtc/esp/esp-idf/components/xtensa/include/xtensa/xtensa_api.h:170 (inlined by) intr_cntrl_ll_enable_int_mask at /home/dbmtc/esp/esp-idf/components/hal/esp32/include/hal/interrupt_controller_ll.h:100 (inlined by) interrupt_controller_hal_enable_int_mask at /home/dbmtc/esp/esp-idf/components/hal/include/hal/interrupt_controller_hal.h:192 (inlined by) esp_intr_noniram_enable at /home/dbmtc/esp/esp-idf/components/esp_system/intr_alloc.c:815

0x400834f6: spi_flash_op_block_func at /home/dbmtc/esp/esp-idf/components/spi_flash/cache_utils.c:124

0x40083448: ipc_task at /home/dbmtc/esp/esp-idf/components/esp_ipc/ipc.c:74 `

my wild guess: as the task ipc0 is very high priority it might take a long time to execute without ever resetting the watchdog as in cache_utils.c this code can be spot:

while (!s_flash_op_can_start) { // Busy loop and wait for spi_flash_op_block_func to disable cache // on the other CPU }

indeed if this takes too much time that crash is very reasonable.

if a poll on a bool is the right thing to do (I don't even think a bool write is atomic enough) I suggest to either reset the watchdog or add a delay (hoping IDLE tak will call wdt_reset in time).

I am also using UART while updating, but I think the relation between those two things is a secondary side effect of a busy wait.

espressif / esp-idf