STM32L0xx Hardfault after sleep while running the debugger

marcemmers commented 6 years ago

Description of defect

Expected behavior: normal use of HAL_PWR_EnterSLEEPMode()

Actual behavior: On certain occasions the device crashes with a hardfault after calling HAL_PWR_EnterSLEEPMode(). This only happens when running with the debugger attached. After extensive testing, it seems that the flash sometimes glitches and the stack is popped one extra time, leaving the SP out of sync.

The first image shows the disassembly of the routine as it should be and this functions without issue. The live watch is to indicate that the flash is indeed read correctly.

The second image shows the disassembly again, this time with wrong instructions, while the live watch still shows the correct flash contents. The new instructions are an additional WFE and an additional POP {R4}. The instructions below them also execute, and the BX LR at address 0x80097F2 returns with the SP off by one word, causing the hardfault later on.
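To illustrate why a single extra POP {R4} is enough to crash later, here is a minimal host-side sketch of a full-descending stack. The frame layout of the caller, the register values, and all names are made up for illustration; only the extra pop itself comes from the report.

```c
#include <assert.h>
#include <stdint.h>

/* Host-side model of a full-descending stack. All names and values are
 * hypothetical; nothing here is taken from the HAL or from Mbed OS. */
#define SIM_STACK_WORDS 8

typedef struct {
    uint32_t words[SIM_STACK_WORDS];
    int sp; /* index of the current top-of-stack word */
} sim_stack_t;

static void sim_push(sim_stack_t *s, uint32_t value)
{
    assert(s->sp > 0);
    s->words[--s->sp] = value;
}

static uint32_t sim_pop(sim_stack_t *s)
{
    assert(s->sp < SIM_STACK_WORDS);
    return s->words[s->sp++];
}

/* Returns the word the caller ends up loading into PC.
 * extra_pops models the spurious POP {R4} from the bad disassembly. */
uint32_t caller_return_address(int extra_pops)
{
    sim_stack_t s;
    s.sp = SIM_STACK_WORDS;

    sim_push(&s, 0xDEADBEEFu); /* older, unrelated stack content */
    sim_push(&s, 0x08001235u); /* caller: saved LR (made-up address) */
    sim_push(&s, 0x11111111u); /* caller: saved R3 */
    sim_push(&s, 0x44444444u); /* callee: PUSH {R4} */

    (void)sim_pop(&s);         /* callee: POP {R4} */
    for (int i = 0; i < extra_pops; i++) {
        (void)sim_pop(&s);     /* spurious POP {R4} */
    }

    (void)sim_pop(&s);         /* caller: POP {R3, PC}, first R3 ... */
    return sim_pop(&s);        /* ... then PC */
}
```

With `extra_pops` set to 0 the caller gets its saved return address back. With 1, every later pop reads one word too high, so the caller loads unrelated stack content into PC.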

Target(s) affected by this defect ?

STM32L072CZ

Toolchain(s) (name and version) displaying this defect ?

IAR 8.11.2

What version of Mbed-os are you using (tag or sha) ?

b8d218038bcd8c2ad447d91ab94c927118c1ad61

What version(s) of tools are you using. List all that apply (E.g. mbed-cli)

n/a

How is this defect reproduced ?

I am running mbed in Tickless mode. To be able to do this I have defined mbed_get_m0_tick_irqn to return LCD_IRQn, since I won't use that interrupt. The only thing I do in main is set the ULP bit in the PWR->CR register and add a call_every to the shared queue to toggle an LED and wake up every so often:

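#include "mbed.h"

DigitalOut led1(PA_3);
EventQueue *pQueue = mbed_event_queue();

// Tickless mode: use LCD_IRQn, which is unused here, as the tick IRQ.
extern "C"
{
    IRQn_Type mbed_get_m0_tick_irqn()
    {
        return LCD_IRQn;
    }
}

// Toggle the LED each time the event fires.
void Led1On()
{
    led1 = !led1;
}

int main()
{
    PWR->CR |= 0x200; // set the ULP bit
    pQueue->call_every(5000, Led1On);
    return 0;
}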

After that the only thing running is the idle thread after returning from main. The issue only occurs when MBED_DEBUG is defined or if deep sleep is locked. This means HAL_PWR_EnterSTOPMode is not affected.

Not setting the ULP bit avoids the issue, but then we can't reach the low power consumption we want. Adding a couple of NOPs after the WFI also seems to solve the issue, just as was the case in #5396.
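For A/B testing with and without ULP, the magic number can be given a name. A minimal sketch, assuming the 0x200 mask from the snippet above is the ULP bit (bit 9 of PWR->CR); `ULP_MASK` and `with_ulp` are hypothetical names, not HAL or CMSIS identifiers:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical name for the 0x200 mask used in the report. */
#define ULP_MASK (1u << 9)

static_assert(ULP_MASK == 0x200u, "ULP mask must match the value in the report");

/* Returns the register value with the ULP bit set or cleared,
 * leaving every other bit untouched. */
uint32_t with_ulp(uint32_t pwr_cr_value, bool enable)
{
    return enable ? (pwr_cr_value | ULP_MASK) : (pwr_cr_value & ~ULP_MASK);
}
```

On the target this would be used as `PWR->CR = with_ulp(PWR->CR, false);` to compare behavior with the bit cleared.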

The only other reference I could find online was this: https://community.st.com/thread/41602-stm32l1-hardfault-when-returning-from-wfi-only-when-debugger-attached which wasn't answered but seems to be the same issue as I am having now.

jeromecoutant commented 6 years ago

ST_INTERNAL_REF 43922 [Mirrored to Jira]

marcemmers commented 6 years ago

I did think it might have been linked to the prefetch buffer holding the instructions through the sleep because the first two instructions were processed correctly (BN 0x80097e6 and POP {r4}).

However, disabling prefetch didn't change this, and it's still the same instructions that are being switched. [Mirrored to Jira]

0xc0170 commented 6 years ago

> IAR 8.11.2

Be aware, we do not yet support IAR 8. [Mirrored to Jira]

adbridge commented 5 years ago

Internal Jira reference: https://jira.arm.com/browse/IOTPART-5627

vznncv commented 4 years ago

I have a similar problem with Hardfault after sleep.

Toolchain

arm-none-eabi-gcc (GNU Tools for ARM Embedded Processors 6-2017-q2-update) 6.3.1

Target

STM32F3Discovery

mbed-os:

sha: 532654ebb31c, tag: mbed-os-6.0.0-alpha-3

Expected behavior

Normal use of hal_sleep()

Error message

++ MbedOS Fault Handler ++

FaultType: HardFault

Context:
R0: 0
R1: 1
R2: E000ED00
R3: 0
R4: 0
R5: 0
R6: 20001F3C
R7: 0
R8: 0
R9: 0
R10: 0
R11: 0
R12: 0
SP   : 20002730
LR   : 8019D89
PC   : 8019D94
xPSR : 61000000
PSP  : 20002710
MSP  : 20009FC0
CPUID: 410FC241
HFSR : 40000000
MMFSR: 82
BFSR : 0
UFSR : 0
DFSR : 0
AFSR : 0
MMFAR: 7F7
Mode : Thread
Priv : Privileged
Stack: PSP

-- MbedOS Fault Handler --

++ MbedOS Error Info ++
Error Status: 0x80FF013D Code: 317 Module: 255
Error Message: Fault exception
Location: 0x8019D94
Error Value: 0x20002C40
Current Thread: rtx_idle Id: 0x20002198 Entry: 0x8014321 StackSize: 0x280 StackMem: 0x200024E0 SP: 0x20002730
For more info, visit: https://mbed.com/s/error?error=0x80FF013D&osver=999999&core=0x410FC241&comp=2&ver=60300&tgt=DISCO_F303VC
-- MbedOS Error Info --
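The HFSR and MMFSR values in the dump can be decoded by hand. A minimal sketch, assuming the standard ARMv7-M bit layout (FORCED is bit 30 of HFSR; IACCVIOL, DACCVIOL and MMARVALID are bits 0, 1 and 7 of MMFSR). The type and function names are hypothetical helpers, not Mbed OS or CMSIS APIs:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* ARMv7-M fault status bits (HFSR and the MemManage byte of CFSR). */
#define HFSR_FORCED     (1u << 30)
#define MMFSR_IACCVIOL  (1u << 0)
#define MMFSR_DACCVIOL  (1u << 1)
#define MMFSR_MMARVALID (1u << 7)

static_assert(HFSR_FORCED == 0x40000000u, "FORCED is bit 30 of HFSR");

/* Hypothetical summary type for this sketch. */
typedef struct {
    bool forced;                       /* escalated to HardFault */
    bool instruction_access_violation; /* fault on instruction fetch */
    bool data_access_violation;        /* fault on load/store */
    bool fault_address_valid;          /* MMFAR holds the faulting address */
    uint32_t fault_address;            /* 0 when not valid */
} fault_summary_t;

fault_summary_t decode_fault(uint32_t hfsr, uint32_t mmfsr, uint32_t mmfar)
{
    fault_summary_t out;
    out.forced = (hfsr & HFSR_FORCED) != 0;
    out.instruction_access_violation = (mmfsr & MMFSR_IACCVIOL) != 0;
    out.data_access_violation = (mmfsr & MMFSR_DACCVIOL) != 0;
    out.fault_address_valid = (mmfsr & MMFSR_MMARVALID) != 0;
    out.fault_address = out.fault_address_valid ? mmfar : 0u;
    return out;
}
```

Fed the values from the dump (HFSR 40000000, MMFSR 82, MMFAR 7F7), this reports a forced HardFault caused by a data access violation, with 0x7F7 as the valid faulting address.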

Disassembled code around 0x8019D94

08019d70 <hal_sleep (mbed-os/targets/TARGET_STM/sleep.c::hal_sleep)>:
 8019d70:   b508        push    {r3, lr}
 8019d72:   f7f8 fc69   bl  8012648 <core_util_critical_section_enter>
 8019d76:   4b08        ldr r3, [pc, #32]   ; (8019d98 <hal_sleep+0x28>)
 8019d78:   681b        ldr r3, [r3, #0]
 8019d7a:   f013 0f01   tst.w   r3, #1
 8019d7e:   d106        bne.n   8019d8e <hal_sleep+0x1e>
 8019d80:   2101        movs    r1, #1
 8019d82:   2000        movs    r0, #0
 8019d84:   f7fd f890   bl  8016ea8 <HAL_PWR_EnterSLEEPMode>
 8019d88:   f7f8 fc76   bl  8012678 <core_util_critical_section_exit>
 8019d8c:   bd08        pop {r3, pc}
 8019d8e:   2101        movs    r1, #1
 8019d90:   4608        mov r0, r1
 8019d92:   f7fd f889   bl  8016ea8 <HAL_PWR_EnterSLEEPMode>
 8019d96:   e7f7        b.n 8019d88 <hal_sleep+0x18>
 8019d98:   40007000    .word   0x40007000

Description

Sometimes I get this error after the first call to ThisThread::sleep_for in the code, when I run greentea tests or debug with pyocd (version 0.26.0). I don't get the error when OpenOCD (version 0.10.0) is used for debugging. The error isn't stable: it can disappear after any change to the code.

For example, the following fragment fails with a HardFault error:

...

// test entry point
int main()
{
    ThisThread::sleep_for(10); // after "step over" a debugger can show a "Segmentation fault"

    //DigitalOut led(LED2);
    //while (true) {
    //    led = !led;
    //    ThisThread::sleep_for(10);
    //}

    // base config validation
    validate_test_pins(true, true, false);
    validate_test_apn_settings();

    // host handshake
    // note: it should be invoked here or in the test_setup_handler
    GREENTEA_SETUP(600, "default_auto");
    // run tests
    return !Harness::run(specification);
}

whereas the following does not:


...

// test entry point
int main()
{
    ThisThread::sleep_for(10); // after "step over" a debugger doesn't show any errors

    DigitalOut led(LED2);
    while (true) {
        led = !led;
        ThisThread::sleep_for(10);
    }

    // base config validation
    validate_test_pins(true, true, false);
    validate_test_apn_settings();

    // host handshake
    // note: it should be invoked here or in the test_setup_handler
    GREENTEA_SETUP(600, "default_auto");
    // run tests
    return !Harness::run(specification);
}

vznncv commented 4 years ago

Note: I fixed my issue by adding some NOP operations to the HAL_PWR_EnterSLEEPMode function:

void HAL_PWR_EnterSLEEPMode(uint32_t Regulator, uint8_t SLEEPEntry)
{
    /* Check the parameters */
    assert_param(IS_PWR_SLEEP_ENTRY(SLEEPEntry));

    /* Clear SLEEPDEEP bit of Cortex System Control Register */
    SCB->SCR &= (uint32_t) ~((uint32_t)SCB_SCR_SLEEPDEEP_Msk);

    /* Select SLEEP mode entry -------------------------------------------------*/
    if (SLEEPEntry == PWR_SLEEPENTRY_WFI) {
        /* Request Wait For Interrupt */
        __WFI();
    } else {
        /* Request Wait For Event */
        __SEV();
        __WFE();
        __WFE();
    }

    // At least 4 NOP operations (size of one FLASH prefetch buffer) fix this bug,
    // but we add more NOP operations for more confidence.
    __NOP();
    __NOP();
    __NOP();
    __NOP();
    __NOP();
    __NOP();
    __NOP();
    __NOP();
}

jeromecoutant commented 4 years ago

@vznncv very interesting.

Could the same change as #12662 also solve your issue?

@LMESTM

LMESTM commented 4 years ago

I guess the same change as above would most probably solve the issue. I also still think that the PC scripts that load the binary into the target are most probably modifying the DBGMCU_CR register, and that it should be set back to its default value when the tool disconnects, or at least when it is only used for flashing and not for debugging.

0xc0170 commented 4 years ago

What action should resolve this issue? (We have the #12662 fix, but only for F4.)

LMESTM commented 4 years ago

> What action should resolve this issue (we have #12662 fix but only for F4)

Applying a change similar to #12662 for L0 is most probably a good workaround.

Finding out which software component of the download chain modifies the DBGMCU_CR register, and changing that component to restore the register state, would be a better fix I think, but I have no clue where this happens.
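If a host tool does leave debug-in-low-power bits set, firmware could clear them itself at startup. A minimal sketch, assuming the relevant DBGMCU_CR bits are DBG_SLEEP, DBG_STOP and DBG_STANDBY at bits 0 to 2 (check the reference manual for the exact part). This is not a statement about what #12662 changes; the macro and function names are hypothetical:

```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout: DBG_SLEEP, DBG_STOP, DBG_STANDBY in bits 0..2. */
#define DBG_SLEEP_BIT   (1u << 0)
#define DBG_STOP_BIT    (1u << 1)
#define DBG_STANDBY_BIT (1u << 2)
#define DBG_LOW_POWER_MASK (DBG_SLEEP_BIT | DBG_STOP_BIT | DBG_STANDBY_BIT)

static_assert(DBG_LOW_POWER_MASK == 0x7u, "three low-power debug bits");

/* Returns the register value with the low-power debug bits cleared,
 * leaving all other bits as they were. */
uint32_t clear_low_power_debug_bits(uint32_t dbgmcu_cr_value)
{
    return dbgmcu_cr_value & ~DBG_LOW_POWER_MASK;
}
```

On the target this would be applied as `DBGMCU->CR = clear_low_power_debug_bits(DBGMCU->CR);`. On some families the DBGMCU peripheral clock may have to be enabled before the register can be accessed. Clearing the bits also means an attached debugger can lose its connection once the core sleeps, so this only makes sense for builds that are not being debugged.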

ciarmcom commented 3 years ago

@marcemmers thank you for raising this issue. Please take a look at the following comments:

Could you add some more detail to the description? A good description should be at least 25 words. What target(s) are you using? What toolchain(s) are you using? What version of Mbed OS are you using (tag or sha)? It would help if you could also specify the versions of any tools you are using. How can we reproduce your issue?

NOTE: If there are fields which are not applicable then please just add 'n/a' or 'None'. This indicates to us that at least all the fields have been considered. Please update the issue header with the missing information; the issue will not be mirrored to our internal defect tracking system or investigated until this has been fully resolved.

adbridge commented 3 years ago

We've updated our automation; I will fix the requirements.

ciarmcom commented 3 years ago

Thank you for raising this detailed GitHub issue. I am now notifying our internal issue triagers. Internal Jira reference: https://jira.arm.com/browse/IOTOSM-2457

ARMmbed / mbed-os