riotboot: ECC faults (eg. in STM32L5 or STM32WB) not handled gracefully

chrysn commented 2 years ago

Description

Flash in STM32L5 and STM32WB, which works in 64bit words with 8bit ECC, can produce NMIs when reading faulty flash -- in particular I expect (sic!; didn't verify) that power loss during writing would create such a situation.

(And I am firmly on the stand point that "Please do not power off the device while saving" is a user request befitting only to 20th century game consoles).

Steps to reproduce the issue

Note that this is pure speculation, but founded in the reference manuals and code paths.

Use riotboot in an A/B setup, write image B, and cut power while writing the header.

(If riotboot ever grew verification of the full image, this would become more of a problem; with the header it's really a very short race).

Expected results

On next startup, image B is detected as invalid because its header doesn't check out.

"Actual" results

riotboot reads image B's header, encounters a flash read error due to the ECC violation, runs into the default NMI handler and reboots ad infinitum.

Possible resolution

Sadly, the flash read errors are really non-maskable, even though riotboot might manage to perform a "careful read" on that flash area.

A possible solution would be to define, when "careful reads" are enabled, to define

static uint8_t *careful_read_start;
size_t careful_read_length;

(possibly under mutex guard, so that no two concurrent careful read operations ever happen, or as a linked list of struct careful_read areas). The NMI would in that case be extended such that if the NMI cause was a flash error (i.e. the ECCD bit is set), it checks whether the faulty address (ADDR_ECC) is in the (or any) "careful read" slices.

From there, I'm not too sure: If an NMI can return safely (can it at all? If so, how can it be sure that only this one bit triggered the NMI, and not something else as well?), it just does (which probably returns garbage to the read; possibly also setting some flag around careful reads that the bootloader later checks). If it can not, then the NMI handler would write zeros into that address or erase the page (we'd need a different term for "careful read" ... "nuclear read"? "read but on error take the whole thing to a safer state and reboot" reads?) and reset.

Either way, the outcome would be that the one good image still comes up.

kaspar030 commented 2 years ago

(And I am firmly on the stand point that "Please do not power off the device while saving" is a user request befitting only to 20th century game consoles and Windows).

fixed that for you

kaspar030 commented 2 years ago

So, the header is only written if the rest of the image has been verified. and the header itself is checksum protected. Is it possible to disable ECC only while reading the header?

chrysn commented 2 years ago

As far as I understand the data sheet, the only way to "disable" ECC is to do something inside the NMI.

kfessel commented 2 years ago

https://www.st.com/resource/en/reference_manual/rm0434-multiprotocol-wireless-32bit-mcu-armbased-cortexm4-with-fpu-bluetooth-lowenergy-and-802154-radio-solution-stmicroelectronics.pdf 3.3.3

seems like the do something is:

check which of the three sources for the nmi was triggered
if flash ecc (FLASH_ECCR has ECCD set) reset that ECCD-bit and continue (maybe react to that bad ecc)

maybe one also needs to have a Flash Interupt handler that checks for ECCC

chrysn commented 2 years ago

Continuing needs a lot of care:

We'd need to be sure that there are really only the known sources of NMIs (the cited document lists them on page 364, others may not).
We need to be sure that no race conditions occur (when clearing one cause and the other happens)
We need to be sure not to miss unrelated faults. (For example, a second flash ECC fault happening outside the currently-being-read area would go unnoticed, briefly disabling all ECC effectively; but the priority of the NMI might help here, maybe it'd double-fault right away?).
(And I don't know yet what happens if NMI even returns).

I guess that "mark as faulty and reboot" might be an acceptable result for way less effort. To avoid the dichotomy between "full NMI clearing" and "start wiping memory on ECC fault", there may be a third option like storing the address in reboot-safe memory, rebooting, and relying on the caller of "careful read" to look there before attempting access. (Not through-through, just to be sure we keep our options open).

kfessel commented 2 years ago

Continuing needs a lot of care:

* We'd need to be sure that there are really only the known sources of NMIs (the cited document lists them on page 364, others may not).

If the NMI cannot be identified -> do something safe (e.g. reboot/reset or stop) the document talks about three sources eg:

if flash eccd -> handle that and return
else if sram parity -> handle that and return
else if hse css ->  handle that and return
else reset / reboot / stop

* We need to be sure that no race conditions occur (when clearing one cause and the other happens)

While in NMI nothing but reset can happen (or hardfault which will halt the whole system until reset)

* We need to be sure not to miss _unrelated_ faults. (For example, a second flash ECC fault happening outside the currently-being-read area would go unnoticed, briefly disabling _all_ ECC effectively; but the priority of the NMI might help here, maybe it'd double-fault right away?).

If i am not misreading the document, it is either an:

ECCC (one bit gone in that memory cell and that will be corrected (possibly issuing a Flash-Int and setting ECCC) or
ECCD (two bit error in that memory cell, can only be detected and leads to NMI and ECCD set). If I read that correctly: There can never be ECCC and ECCD set (just ECCC will continue silently and if not handled and ECCD might happen without setting the FLAG ("While ECCC or ECCD is set, FLASH_ECCR is not updated if a new ECC error occurs. FLASH_ECCR is updated only when ECC flags are cleared") but the NMI will be called -> Flash Int needs to handle ECCC if you want to know if ECCD was the NMI-Source.

If one memory cell yields an ECCC and another one also ECCC it is still ECCC. No double fault upgrade of ECCC to ECCD each read is either correct, correctable or faulty.

* (And I don't know yet what happens if NMI even returns).

What happens when an interrupt returns?

chrysn commented 2 years ago

if flash eccd -> handle that and return
else if sram parity -> handle that and return
else if hse css ->  handle that and return
else reset / reboot / stop

This implies that they are mutually exclusive. Maybe an ECCD and HSE CSS fault can occur at at the same time. If NMIs are level triggered by the presence of any of these bits, then we can implement such logic easily, but I haven't found the right documentation that states that that is so.

ECCC (one bit gone in that memory cell and that will be corrected (possibly issuing a Flash-Int and setting ECCC) or

yes, that's a separate interrupt. (Might make sense to record it so that the device can ask for a firmware update for its other slot at the earliest convenience).

ECCD (two bit error in that memory cell, can only be detected and leads to NMI and ECCD set). If I read that correctly: There can never be ECCC and ECCD set (just ECCC will continue silently and if not handled and ECCD might happen without setting the FLAG ("While ECCC or ECCD is set, FLASH_ECCR is not updated if a new ECC error occurs. FLASH_ECCR is updated only when ECC flags are cleared") but the NMI will be called -> Flash Int needs to handle ECCC if you want to know if ECCD was the NMI-Source.

If one memory cell yields an ECCC and another one also ECCC it is still ECCC. No double fault upgrade of ECCC to ECCD each read is either correct, correctable or faulty.

I'm only considering the ECCD here.

Between the time the first ECCD occurs and some unrelated ECCD, the second ECCD would pass silently -- something our "clear and continue" handler would introduce.

The example is contrived (especially it doesn't happen in riotboot, but the larger issue affects anyone who writes to flash, and riotboot happens to be the only full application we ship that does), but should still illustrate what I'm saying here:

Assume we do a "careful read" while some DMA operation is in progress that writes data from flash to UART (maybe part of stdout). The "careful read" operation reads out the faulty ROM entry, sets FLASH_ECCR with address and ECCD, and trips the NMI. While the NMI is being entered, the stdout DMA also finds a faulty cell, but does not set ECCD because it already is set.

The second ECCD fault would thus go unnoticed.

Granted, it is highly unlikely that two faults happen in the same chip in the first place (one can be part of an aborted flash operation, but only one flash operation fails per surprise power-off), and even more unlikely that they race.

What happens when an interrupt returns?

Control flow continues where it was before -- but all the interrupts I've ever handled were "regular" ones that effectively came to pass between two instructions (i.e. instruction is processed, PC is increased, and state is saved as part of the interrupt procedure); here the interrupt happens during processing of the instruction when the instruction can not complete, and I wouldn't know how to tell (except exerimentally) whether the faulting instruction completes, if so with which value, or not (in which case the load would ... be retried and fail again unless fixed?).

So if all this can be managed and understood -- sure, would be nice. But if any of that causes trouble, I'd personally go for the low-tech solution of clearing the memory cell (users of that extra read functionwould need to be aware that any write unit may be cleared out) and starting over.

(Clever users might want to provide mechanisms to re-obtain and re-store the very values that should have gone into that flash area, but that too is a 20% use case).

chrysn commented 2 years ago

Follow-up post because GitHub's new quote detection completely ignores how the signature separator looks like, and interprets my markdown \

as something it is not -- repeating what is hidden in the above post and irrecoverable because editing a mailed comment breaks formatting worse yet.

So if all this can be managed and understood -- sure, would be nice. But if any of that causes trouble, I'd personally go for the low-tech solution of clearing the memory cell (users of that extra read functionwould need to be aware that any write unit may be cleared out) and starting over.

(Clever users might want to provide mechanisms to re-obtain and re-store the very values that should have gone into that flash area, but that too is a 20% use case).

jeandudey commented 2 years ago

can it at all? If so, how can it be sure that only this one bit triggered the NMI, and not something else as well?

A HardFault triggers if the BusFault NMI is not enabled in the SCB, so with the BusFault enabled and the SCB->BFSR and SCB->BFAR can be used to identify where the error happened. So one can identify the page/sector by comparing the BFAR register and the areas that riotboot might write (e.g.: the firmware header), and then decide how to handle the situation (e.g.: deleting the sector to remove the ECC error, write a 0 on the address, etc).

RIOT-OS / RIOT