RIOT-OS / RIOT

RIOT - The friendly OS for IoT
https://riot-os.org
GNU Lesser General Public License v2.1
4.85k stars 1.97k forks source link

stm32152re: hardfault when DBGMCU_CR_DBG* bits are set and branch after __WFI() #14015

Closed fjmolinas closed 7 months ago

fjmolinas commented 4 years ago

Description

This issue wants to document a recurrent issue that has been seen on stm32l152re platforms.

The history so far:

When #7385 was introduced when going to sleep (i.e. calling __WFI()), irq_enable was changed to irq_restore and that broke stm32l152re.

8518 fixed the issue by introducing a __NOP() after __WFI(). This fixed the issue until #11159 where the call to cortexm_sleep was changed and now __NOP() didn't fix the issue but instead somehow triggered it.

11159 re-introduced the issue by changing the way pm_set_lowest() was called since pm_set() was now implemented for STM32L1. A single __NOP() did not fix the issue anymore.

In #11820 it was discovered that the changes in #7385 didn't actually break the code, but broke the code only when DBGMCU_CR_DBG_STANDBY | DBGMCU_CR_DBG_STOP | DBGMCU_CR_DBG_SLEEP where enabled. By default openocd sets these bits after an examine-end event. This is done by default for all stm32 boards.

In #11919 Since the problem was the branch, with 5d96127 irq_restore was replaced by __set_PRIMASK(state); which inlines the function call avoiding the jump and the whole issue all together.

13999 inline the implementation of irq_restore so the fix in #11919 will be removed.

The faulty scenario

So from debugging output at some point after wake-up the pc gets corrupted and un-reachable instructions gets executed. This only happens when DBGMCU_CR_DBG_STANDBY | DBGMCU_CR_DBG_STOP | DBGMCU_CR_DBG_SLEEP are set. An example for the debugging output is here:

2019-07-09 17:38:58,314 - INFO # Attempting to reconstruct state for debugging...
2019-07-09 17:38:58,315 - INFO # In GDB:
2019-07-09 17:38:58,316 - INFO #   set $pc=0x7822d5e
2019-07-09 17:38:58,317 - INFO #   frame 0
2019-07-09 17:38:58,317 - INFO #   bt
2019-07-09 17:38:58,318 - INFO # 
2019-07-09 17:38:58,319 - INFO # ISR stack overflowed by at least 16 bytes.

Hints to cause

Looking around in stm32 and cortex-m3 erratas and datasheet. I found a mention of a similar issue with stm32f4 in this ERRATA section 2.1.3. In this errata there are some hints as issue that happen the WFE/WFI are placed at 4 byte alignment and problems with the pref-etch buffer. Although this is a differtent cpu (cortex-m4), it made me snoop around the pref-etch buffer and made me think a similar issue might be happening on cortex-m3

In the case of cortex-m3 the pref-etch buffer can fetch two 32bits instructions or 4 16bits instructions but only in sequential code execution. In our code the PRFTEN and ACC64 are enabled so we are reading 64 bits at a time. It stm32l1xxx reference manual it is stated:

When the code is not sequential (branch), the instruction may not be present neither in the
current instruction line used nor in the prefetched instruction line. In this case, the penalty in
terms of number of cycles is at least equal to the number of Wait States.

This lead me to believe that for some reason when the branch instruction is present it is executing a corrupted pre-fetch buffer instruction, or in other terms an instruction that isn't present. For some reason this only happens when the HCLK and FCLK stays enabled in sleep mode. This might have something to do with different wake-up times since the clock is always enabled for the core?? I wasn't able to find many details of what happens on wake-up, and what could be different when HCLK stays enabled.

Steps to reproduce the issue

This issue does not show up currently in master unless a single __NOP is added after https://github.com/fjmolinas/RIOT/blob/e7a1b40cde17dc5f407c9b3884a2603ab656ac7e/cpu/cortexm_common/include/cpu.h#L172. This may change depending on current master, since for a while a single __NOP() fixed the issue.

Expected results

No crash ever..

Actual results

Has crashed in the past.

Possible FIXES

If the issue shows up again 3 __NOP could fix the issue.

cbiffle commented 1 year ago

Random bystander here: it is possible you're hitting the same STM32 WFI bug that I've been hitting. I've got a workaround you might try.

Short version: put an isb after your wfi.

Long version: https://cliffle.com/blog/stm32-wfi-bug/

Hope that helps!