espressif / esp-idf

Espressif IoT Development Framework. Official development framework for Espressif SoCs.
Apache License 2.0
13.49k stars 7.26k forks source link

IRAM safe for non IRAM safe operations while Cache is Disable (IDFGH-12538) #13543

Closed filzek closed 1 month ago

filzek commented 6 months ago

Answers checklist.

General issue report

Why do not create a multpoint wrapper that enable a standby entry for a request in cache disable action?

Create a stud IRAMSAFE that hold the call to the NONSAFE function and keep delayed in track while the cache is disable so the performatic should be slower but this will allow the system to be really usable in the end.

Currently the ESP32 IRAM is a complete mess, in the end ESP32 cannot run correctly at all in full production environment with the SDK 5.2.1, the IRAM is always overload and the problems is show all the way in the system.

The cache problem while using the flash write is a complete design problem and havent in the last 5 years be closer to get 100% fixed.

I am really tired of it, always a dev come to scene and say that it will be solved, solved, solved, and isnt solved forever, and for the sure will be never solved.

Basic functions doesnt work.

So, if you build a system with just one function this will works like a charm, but the esp32 wasnt made for it, was made to do hard tasks and for sure it can hold it, but the cache and iram problem is the caos on earth.

I am very disapointed with the new SDK v5.2.1 and I am sure that all the dev team have the same frustration I have now.

I really cant see a shine sun in above the horizon line anymore, I am about to drop the esp32 and start a new nove to another soc!

filzek commented 6 months ago

SDK CONFIG basic, very basic usage and the IRAM is complete useless

sdkconfig_IRAM_DOESNT_FIT.txt

filzek commented 6 months ago

GPTIMER even in IRAM SAFE not working with GPIO call, even all flash safe driver GPIO in IRAM and also PSIRAM safe operations in IRAM, bring problems of cache.

Guru Meditation Error: Core 0 panic'ed (Cache disabled but cached memory region accessed).

Core 0 register dump: PC : 0x40093faa PS : 0x00060035 A0 : 0x80085140 A1 : 0x3ffc1600 0x40093faa: memcpy at /builds/idf/crosstool-NG/.build/HOST-x86_64-w64-mingw32/xtensa-esp-elf/src/newlib/newlib/libc/machine/xtensa/memcpy.S:165

A2 : 0x3ffc1610 A3 : 0x3f415ed0 A4 : 0x00000018 A5 : 0x3ffc1610 A6 : 0xbad00bad A7 : 0xbad00bad A8 : 0x00000000 A9 : 0x3ffc15d0 A10 : 0x3ffbb968 A11 : 0xffffffff A12 : 0x00000000 A13 : 0x00000000 A14 : 0x003fffff A15 : 0x0000cdcd SAR : 0x00000012 EXCCAUSE: 0x00000007 EXCVADDR: 0x00000000 LBEG : 0x40093fa4 LEND : 0x40093fc0 LCOUNT : 0x00000000 0x40093fa4: memcpy at /builds/idf/crosstool-NG/.build/HOST-x86_64-w64-mingw32/xtensa-esp-elf/src/newlib/newlib/libc/machine/xtensa/memcpy.S:162 0x40093fc0: memcpy at /builds/idf/crosstool-NG/.build/HOST-x86_64-w64-mingw32/xtensa-esp-elf/src/newlib/newlib/libc/machine/xtensa/memcpy.S:197

Backtrace: 0x40093fa7:0x3ffc1600 0x4008513d:0x3ffc1610 0x40085d45:0x3ffc1650 0x40085a81:0x3ffc1680 0x40086cd4:0x3ffb7180 0x40087ad1:0x3ffb71a0 0x40098f66:0x3ffb71e0 0x40093fa7: memcpy at /builds/idf/crosstool-NG/.build/HOST-x86_64-w64-mingw32/xtensa-esp-elf/src/newlib/newlib/libc/machine/xtensa/memcpy.S:163 0x4008513d: gpio_zc_isr_handler at D:/Dropbox/Dev/Producao5/main/include/zerocrossing.c:507 0x40085d45: gpio_isr_loop at C:/Espressif/frameworks/esp-idf-v5.2/components/driver/gpio/gpio.c:470 (inlined by) gpio_intr_service at C:/Espressif/frameworks/esp-idf-v5.2/components/driver/gpio/gpio.c:492 0x40085a81: _xt_medint3 at C:/Espressif/frameworks/esp-idf-v5.2/components/xtensa/xtensa_vectors.S:1409 0x40086cd4: spi_flash_op_block_func at C:/Espressif/frameworks/esp-idf-v5.2/components/spi_flash/cache_utils.c:138 (discriminator 1) 0x40087ad1: ipc_task at C:/Espressif/frameworks/esp-idf-v5.2/components/esp_system/esp_ipc.c:83 0x40098f66: vPortTaskWrapper at C:/Espressif/frameworks/esp-idf-v5.2/components/freertos/FreeRTOS-Kernel/portable/xtensa/port.c:134

the line 507 is:

gptimer_alarm_config_t us_triac_setup = { .alarm_count = 8400, // period of MICROSSECONDS .flags.auto_reload_on_alarm = true, // disable auto-reload .reload_count = 0, // counter will reload with 0 on alarm event };

so, its impossible that this happen in the sdk.

suda-morris commented 6 months ago

The isr_handler you passed to the gpio_isr_handler_add function should also be placed in IRAM

filzek commented 6 months ago

The isr_handler you passed to the gpio_isr_handler_add function should also be placed in IRAM

Hi Suda-morris,

Off course it is already in IRAM,

r = gpio_install_isr_service(ESP_INTR_FLAG_LEVEL3 | ESP_INTR_FLAG_IRAM );

gpio_isr_handler_add(dispositivos[roda_dispositivos].port_io, gpio_zc_isr_handler, NULL);

static void IRAM_ATTR gpio_zc_isr_handler(void)

so, all is corrected set as always, the problem still with the SDK.

A new found is if we set

gptimer_alarm_config_t us_triac_setup = { .alarm_count = 8400, // period of MICROSSECONDS .flags.auto_reload_on_alarm = true, // disable auto-reload .reload_count = 0, // counter will reload with 0 on alarm event };

in a Global area the crash doesnt happen at all, so the call inside the ISR for it is bringing the crash somehow.

Also there is another crashes happen in the ISR that we are right now investigating and will produce the fully report as well.

florentbr commented 6 months ago

Use low level functions from timer_ll.h and gpio_ll.h. These are all you need to drive a triac with a group timer isr and gpio isr:

// gpio zero crossing ISR:
timer_ll_trigger_soft_capture
timer_ll_get_counter_value
timer_ll_trigger_soft_reload
timer_ll_set_alarm_value
timer_ll_enable_alarm
gpio_ll_set_level

// timer group ISR:
timer_ll_clear_intr_status
gpio_ll_set_level
timer_ll_set_alarm_value
timer_ll_enable_alarm

They are inlined, so no cache access.

If you want something more reliable without interrupts, then use the MCPWM unit to drive your dimmer.

filzek commented 6 months ago

this is not the problem, the problem are direct related with the gptimer collections that somehow are broken in ISR call even ISR SAFE enable in sdkconfig.

Reporting the major problem in code leaked for palce IRAM and not IRAM, that way become difficulto to do simultaneous tasks and the IRAM safe code become broke.

filzek commented 6 months ago

timer_ll_enable_alarm

GPTIMER does use the same hal timer and all are in IRAM, so the code behind the gptimer struct now is somehow corrupted and it is NON-ISR-SAFE

florentbr commented 6 months ago

GPTIMER does use the same hal timer and all are in IRAM, s

I don't see a single IRAM function in the GPTIMER API: https://github.com/espressif/esp-idf/blob/release/v5.2/components/driver/gptimer/gptimer.c

The default isr is not in IRAM either: https://github.com/espressif/esp-idf/blob/2152112ff8913325ff791891a70693edd4b863f9/components/driver/gptimer/gptimer.c#L519

suda-morris commented 6 months ago

@florentbr The function placement is configured by a linker fragment file: https://github.com/espressif/esp-idf/blob/release/v5.2/components/driver/gptimer/linker.lf

suda-morris commented 6 months ago

A new found is if we set

gptimer_alarm_config_t us_triac_setup = { .alarm_count = 8400, // period of MICROSSECONDS .flags.auto_reload_on_alarm = true, // disable auto-reload .reload_count = 0, // counter will reload with 0 on alarm event };

in a Global area the crash doesnt happen at all, so the call inside the ISR for it is bringing the crash somehow.

looks like the us_triac_setup was somehow compiled into a .rodata which is placed in the Flash. If the compiler determines that a structure defined within a function has a constant value and is not modified, it may decide to place that structure in the read-only data section (rodata) instead of allocating it on the stack. (I see you're using -O2 in your sdkconfig).

FYI, gptimer_set_alarm_action is placed in the IRAM already.

ESP-Marius commented 6 months ago

And regarding your original request:

Create a stud IRAMSAFE that hold the call to the NONSAFE function and keep delayed in track while the cache is disable so the performatic should be slower but this will allow the system to be really usable in the end.

So you want a way to delay the function until the cache is enabled? Wouldnt that be the same as just not using the ESP_INTR_FLAG_IRAM flag? :thinking: Or maybe im misunderstanding your request here.

filzek commented 5 months ago

@ESP-Marius Below it the idea

In a live operational environment, the complexity increases significantly when simultaneous actions occur in regions where the cache is disabled. To mitigate issues arising from these conditions, it is critical to ensure that functions executed within interrupt service routines (ISRs) are placed in IRAM. This ensures they remain accessible even when the cache is disabled. Additionally, if a function attempts to access resources that are unavailable due to the cache being disabled, it should be designed to wait until the cache is re-enabled before proceeding. Implementing this approach will prevent errors associated with cache-disabled states and enhance system stability and reliability.

  1. Implement a Monitoring and Control Mechanism: Develop a system that monitors attempts to access cached areas when the cache is disabled. This mechanism should flag any such attempts and hold (delay) the execution until the cache is re-enabled. This ensures that no operations proceed that would result in errors or system instability due to cache inaccessibility.

  2. Notification System for Developers/Users: Establish a notification protocol within the system to inform developers or users about critical actions that are taken automatically. This should include detailing which functions were running when the cache was disabled and which function(s) attempted to access the disabled cache area. This transparency aids in understanding system behavior under various operating conditions.

  3. Enhance Debugging with Compiler Metadata: Utilize compiler-provided metadata, such as file, func, and line, to annotate the code where cache access issues are detected. This makes it easier to trace back issues to their source in the code, facilitating quicker debugging and resolution.

These additions aim to enhance system reliability through better control, monitoring, and communication, while also providing tools to aid in development and troubleshooting.

igrr commented 5 months ago

Additionally, if a function attempts to access resources that are unavailable due to the cache being disabled, it should be designed to wait until the cache is re-enabled before proceeding.

I think ChatGPT misses one point here. The cache is typically disabled by a task performing Flash read/write operations until this operation is complete. If an ISR preempts this task, it is of no use to wait for the cache to be re-enabled: it won't be re-enabled until the ISR returns, control is passed back to the original task and the task finishes the flash operation. (Unless the chip has two CPUs and the flash operation was started on the other CPU; however, even then, waiting for possibly tens of milliseconds in an ISR is unlikely to be a good idea.)


On some of the chips, there is a Kconfig option CONFIG_SPI_FLASH_AUTO_SUSPEND which, when enabled, instructs the hardware to automatically suspend the flash read/write operation and re-enable the cache, resuming the execution of cached code. When control is returned to the original task which started the flash operation, cache is disabled again and the read/write operation is resumed. This all works if the flash chip supports Erase Suspend / Program Suspend features. If this feature is enabled, it's not necessary for the ISR code to be in IRAM for it to be executed during flash operations. See docs and the list of flash chips for more information.

filzek commented 5 months ago

@igrr the key point is solve the crash on cache disable and try to solve IRAM free operation, somehow the idea is to rething the basic concept for the safety and longevity of ESP32 family, either the new bootloader could solve som but not all until the reflash over ota could be made.

filzek commented 5 months ago

The best part is to have a semaphore handle action on every each task that allow immediate stop upon ISR/Cache Disable action and wait until it is reenable, so this could lead to a perfect run environment.

Or somehow a sudden middle task halt and take with a command set back to the ISR, so no cache disable will occur.

ESP-Marius commented 5 months ago

The best part is to have a semaphore handle action on every each task that allow immediate stop upon ISR/Cache Disable action and wait until it is reenable, so this could lead to a perfect run environment.

I am still not completely understanding what you are trying to communicate here.

You want all tasks to stop when cache is disabled to ensure no problems with accessing flash right? But this is exactly what ESP-IDF does today. All tasks/normal ISR will be paused while cache is disabled. Only the ESP_INTR_FLAG_IRAM ISRs will be allowed to run, because they are critical/time sensitive and cannot wait until cache is enabled again.

filzek commented 1 month ago

We have to move some IRAM functions out to free IRAM space and thats okay, but the ISR are more stable in the latest commit as of:

Merge: 1bb33a31b8 1a111187fa Author: Michael (XIAO Xufeng) xiaoxufeng@espressif.com Date: Fri Aug 16 19:35:27 2024 +0800

Merge branch 'refactor/rtc_init_before_mspi_tuning_v5.2' into 'release/v5.2'

fix(startup): move rtc initialization before MSPI timing tuning to improve stability (v5.2)

See merge request espressif/esp-idf!32553

I will close this issue and also open another with some very good information to follow