adafruit / uf2-samdx1

MSC bootloader (based on UF2) for SAMD21

Flash corruption on SAMD51 #217

Open sjev opened 1 month ago

sjev commented 1 month ago

This issue is similar (and probably related) to #170

The main difference is that in this case it occurs on SAMD51 and memory corruption occurs at 0x4000 with the result that the board loses the circuitpython install.

Hardware used is based on Feather M4 CAN, schematics are here

I haven't measured the power-up and power-down voltage curves, but I suspect it's a manifestation of the brownout issue.

Note: before I ran update-bootloader-feather_m4_can-v3.16.0.uf2 on the v3.16 bootloader, multiple devices were bricking with memory corruption at 0x000000, the same behavior as described in #170. After running the update, one board shows no problems after ~200 power cycles, while this one particular board fails within 20. I'm not sure what exactly changed between 3.16..bin and update-.. .uf2

Symptoms

  1. device was 'soft bricked' by a power cycle.
  2. restored bootloader to v3.16 with programmer, device loaded UF2 bootloader and appeared as "FTHRCANBOOT".
  3. updated bootloader with update_....uf2
  4. put circuitpython on it, code started running.
  5. removed usb, power-cycled approx 5 times. After 'normal' restarts device jumped into reset mode and appeared as "FTHRCANBOOT" again. Rebooting and pressing reset button has no effect. Essentially, python install was lost.
  6. put circuitpython on it, repeated previous step with same result.

I have reproduced this several times, the issue usually occurs within 20 power cycles.

Clearing and setting BOD33_DIS bit with the programmer did not change anything.

Analysis

I've compared a corrupted memory dump to a working one. There is a difference at address 0x4000. Just one line in the Intel HEX files is different.

Summarized, data diff at address 0x4000

F8 FF 02 20 CD 57 00 00 75 DD 04 00 69 DF 04 00  (working)
F8 FF 02 00 00 00 00 00 75 DD 00 00 60 DF 04 00  (broken)
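The two lines above can be compared byte by byte with a few lines of Python. This is only a minimal sketch, not the analysis script mentioned below; the helper name `diff_bytes` is made up for illustration. It lists which addresses changed and checks whether every change only cleared bits (1 to 0).

```python
# The two 16-byte rows from the dumps, starting at address 0x4000.
working = bytes.fromhex("F8 FF 02 20 CD 57 00 00 75 DD 04 00 69 DF 04 00")
broken = bytes.fromhex("F8 FF 02 00 00 00 00 00 75 DD 00 00 60 DF 04 00")
BASE = 0x4000


def diff_bytes(good, bad, base):
    """Return (address, good_byte, bad_byte) for every byte that differs."""
    return [
        (base + i, g, b)
        for i, (g, b) in enumerate(zip(good, bad))
        if g != b
    ]


changes = diff_bytes(working, broken, BASE)
for addr, g, b in changes:
    print(f"0x{addr:04X}: 0x{g:02X} -> 0x{b:02X}")

# True when no changed byte gained a set bit, i.e. bits were only cleared.
only_cleared = all((g & b) == b for _, g, b in changes)
print("bits only cleared:", only_cleared)
```

Five bytes differ, and in this one row every change only clears bits. If the application's vector table starts at 0x4000 (an assumption based on the bootloader occupying the first 16 KB), the first two words would be the initial stack pointer and the reset vector, which would fit with the board staying in the bootloader.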

Below is output from my analysis script with details:

working

original hex file:
[line 1025] :10400000F8FF0220CD57000075DD040069DF0400D1

Decoded HEX line:
Byte Count: 16
Address: 0x4000
Record Type: 0 (Data)
Data: F8 FF 02 20 CD 57 00 00 75 DD 04 00 69 DF 04 00
Checksum: 0xD1 (Valid)

Decoded Data:
  Address 0x4000: 0xF8
  Address 0x4001: 0xFF
  Address 0x4002: 0x02
  Address 0x4003: 0x20
  Address 0x4004: 0xCD
  Address 0x4005: 0x57
  Address 0x4006: 0x00
  Address 0x4007: 0x00
  Address 0x4008: 0x75
  Address 0x4009: 0xDD
  Address 0x400A: 0x04
  Address 0x400B: 0x00
  Address 0x400C: 0x69
  Address 0x400D: 0xDF
  Address 0x400E: 0x04
  Address 0x400F: 0x00

broken

corrupted .hex file:
[line 1025] :10400000F8FF02000000000075DD000060DF040022

Decoded HEX line:
Byte Count: 16
Address: 0x4000
Record Type: 0 (Data)
Data: F8 FF 02 00 00 00 00 00 75 DD 00 00 60 DF 04 00
Checksum: 0x22 (Valid)

Decoded Data:
  Address 0x4000: 0xF8
  Address 0x4001: 0xFF
  Address 0x4002: 0x02
  Address 0x4003: 0x00
  Address 0x4004: 0x00
  Address 0x4005: 0x00
  Address 0x4006: 0x00
  Address 0x4007: 0x00
  Address 0x4008: 0x75
  Address 0x4009: 0xDD
  Address 0x400A: 0x00
  Address 0x400B: 0x00
  Address 0x400C: 0x60
  Address 0x400D: 0xDF
  Address 0x400E: 0x04
  Address 0x400F: 0x00
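For anyone who wants to repeat this decoding, here is a minimal sketch of an Intel HEX record parser. It is not the script that produced the output above; the function name `parse_ihex_record` and the returned dictionary layout are illustrative assumptions. It only handles a single record and reports the 16-bit address field as-is, which is enough here because 0x4000 sits in the first 64 KB.

```python
def parse_ihex_record(line):
    """Decode one Intel HEX record and verify its checksum."""
    line = line.strip()
    if not line.startswith(":"):
        raise ValueError("record must start with ':'")
    raw = bytes.fromhex(line[1:])
    count = raw[0]
    if len(raw) != count + 5:
        raise ValueError("record length does not match byte count")
    return {
        "count": count,
        # 16-bit offset; extended address records are not handled here.
        "address": int.from_bytes(raw[1:3], "big"),
        "type": raw[3],
        "data": raw[4:4 + count],
        "checksum": raw[-1],
        # All bytes, checksum included, must sum to 0 modulo 256.
        "valid": (sum(raw) & 0xFF) == 0,
    }


good = parse_ihex_record(":10400000F8FF0220CD57000075DD040069DF0400D1")
bad = parse_ihex_record(":10400000F8FF02000000000075DD000060DF040022")

for name, rec in (("working", good), ("broken", bad)):
    print(name, hex(rec["address"]), rec["data"].hex(" ").upper(),
          "valid" if rec["valid"] else "INVALID")
```

Both records pass the checksum test, so the difference is in the dumped flash contents rather than a damaged hex file.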

Follow-up

I've taken a look at main.c; there seems to be a section for brownout protection for SAMD51.

I'm willing to invest some time to fix this, but fiddling with bootloaders is not something that I've done before...

I'd like to discuss possible solutions here before I start (randomly) changing stuff.

sjev commented 1 month ago

A bit further on I found the cause of this and was able to reliably avoid and reproduce the issue.

Short description:

  1. measured 3.3V rise time during turn-on: it looked great, with a linear rise time of 1.5 ms.
  2. looked at the fall time during turn off - the voltage fell within 5 ms to about 1V and then stayed there for a long time.
  3. added a 150 ohm load to 3.3V, this resulted in quicker fall after 1V level.
  4. tried to reproduce the issue with load - did not occur after around 100 cycles.
  5. removed load - issue occurred within 10 cycles.
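To put the 150 ohm load from the steps above into numbers, here is a small Ohm's-law sketch. The 3.3 V and 1 V figures are the nominal rail and the approximate plateau level from the measurements; the function names are made up for illustration.

```python
V_RAIL = 3.3     # volts, nominal rail
V_PLATEAU = 1.0  # volts, approximate level where the unloaded rail lingered
R_LOAD = 150.0   # ohms, the added load


def load_current_ma(voltage, resistance):
    """Current through a resistive load, in milliamps."""
    return voltage / resistance * 1000.0


def load_power_mw(voltage, resistance):
    """Power dissipated in a resistive load, in milliwatts."""
    return voltage ** 2 / resistance * 1000.0


i_on = load_current_ma(V_RAIL, R_LOAD)
p_on = load_power_mw(V_RAIL, R_LOAD)
i_plateau = load_current_ma(V_PLATEAU, R_LOAD)

print(f"at {V_RAIL} V: {i_on:.1f} mA, {p_on:.1f} mW")
print(f"at {V_PLATEAU} V: {i_plateau:.1f} mA")
```

So the load costs roughly 22 mA while the board is running and still pulls several milliamps at the 1 V level, which is consistent with the rail falling faster once it is fitted.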

So the hypothesis is that brownout protection is not acting as it should during switch off.

In the screenshot below the traces show voltage curves at turn-off with and without 150 ohm load.

data_59266

sjev commented 1 month ago

image

current configuration bits.

Note: I've changed BOD33_ACTION to "RESET" later and was still able to cause the issue.

sjev commented 1 month ago

@dhalbert, I saw your comment on the Adafruit forum. As you suggested, let's discuss further in this thread.

by danhalbert » Mon Jul 15, 2024 12:37 pm

Hi - are you literally using a Feather M4 CAN, or are you reproducing the design on your own board?

The "bricking" you describe is some kind of problem in the SAMD51 chip design: if there is a power glitch at the right time, an internal flash write or erase can occur that erases the first unprotected block in flash. This is not related to CircuitPython per se: if you wrote an Arduino program, for instance, it might have the same problem.

I would suggest doing oscilloscope monitoring of the power-on waveform, and also whether noise is getting into the line.

I'm pretty sure that the corruption occurs on the power-down cycle, as I was able to reliably reproduce it without a bleed resistor and could not with it. Still, a bleed resistor is just a short-term workaround.

One possible cause that I can think of is a current leak through one of the input pins. We'll remove it in the next iteration of the design and see if it has any effect.

dhalbert commented 1 month ago

Your testing is interesting.

What firmware are you using in normal operation? Is it CircuitPython, Arduino, or something else? Perhaps the firmware is changing with the brownout detection.

I looked at the bootloader code again. It enables a BOD33 level at around 2.7V. It does not enable hysteresis, which maybe it should. We could try bumping up the BOD33 and enabling hysteresis. On the SAMD21, hysteresis is just on or off. At around 2.7V BOD33, it looks like it's about 70mV. On the SAMD51, there is a 4-bit field with 6mV steps. I don't have any experience in choosing this value but we could try about the same 70mV.
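As a rough check of what "about the same 70mV" would mean for a 4-bit field with 6mV steps, here is a minimal sketch. It assumes the hysteresis is simply the field value times the step size, which should be confirmed against the datasheet; the function name is hypothetical.

```python
HYST_STEP_MV = 6  # step size quoted above
HYST_BITS = 4     # field width quoted above


def nearest_hyst_code(target_mv, step_mv=HYST_STEP_MV, bits=HYST_BITS):
    """Return (code, resulting_mv) closest to target_mv.

    Assumes hysteresis = code * step_mv (linear, starting at zero).
    """
    max_code = (1 << bits) - 1
    code = round(target_mv / step_mv)
    code = max(0, min(max_code, code))
    return code, code * step_mv


code, mv = nearest_hyst_code(70)
print(f"field value {code} gives about {mv} mV")
```

Under that assumption a field value of 12 gives 72mV, and the largest setting (15) would give 90mV.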

When I discussed this kind of problem with Microchip in the past, it was an issue about power glitches on power-up. That was the motivation for the current code, which is all about waiting enough time for the voltage to stabilize on power-up. Your problem seems to be on power-down. Your scope trace does not show any glitches on power down, but I wonder if the longer timebase chosen is hiding something, though I don't see any evidence of that.

What kind of power supply are you using? Have you tried a different power supply to see if that makes any difference?

Microchip also said they had seen this flash erase problem when there was insufficient decoupling capacitance on Vddcore. Are your decoupling caps close to the SAMD51 chip?

Are the power pins on the SAMD51 wired the same way as the Adafruit board, or are they somewhat different? We go by the reference designs in the datasheet.

Is it possible to test this on a board other than yours, with the same power supply and external connections? For instance, do you see this problem on the SAMD51 Feather CAN?

Here is the Feather CAN power arrangement. There are more decoupling caps not shown here as well:

image

sjev commented 1 month ago

@dhalbert Thanks for your input!

What firmware are you using in normal operation? Is it CircuitPython, Arduino, or something else? Perhaps the firmware is changing with the brownout detection.

I'm using CircuitPython 9.0.5 with the latest version of the UF2 bootloader (3.16, updated with update-xxx.uf2). The code that I'm running is just a blinky on the NeoPixel, with no write access whatsoever and not touching the fuses.

... When I discussed this kind of problem with Microchip in the past, it was an issue about power glitches on power-up. That was the motivation for the current code, which is all about waiting enough time for the voltage to stabilize on power-up. Your problem seems to be on power-down. Your scope trace does not show any glitches on power down, but I wonder if the longer timebase chosen is hiding something, though I don't see any evidence of that.

I'll record some longer traces, just to be sure.

What kind of power supply are you using? Have you tried a different power supply to see if that makes any difference?

I'm using two different lab supplies with same results. Important to note that I'm turning the device on and off in a rough manner, manually connecting and disconnecting power wires.

Microchip also said they had seen this flash erase problem when there was insufficient decoupling capacitance on Vddcore. Are your decoupling caps close to the SAMD51 chip?

Yes, but these may be smaller than on a feather. The schematics are here btw.

Are the power pins on the SAMD51 wired the same way as the Adafruit board, or are they somewhat different? We go by the reference designs in the datasheet.

Yes, we also try to follow the reference and Feather designs as closely as possible.

Is it possible to test this on a board other than yours, with the same power supply and external connections? For instance, do you see this problem on the SAMD51 Feather CAN?

This should be possible, but I'd need to somehow simulate the power-down curve on the feather. This would require some hacking, and I don't have a function generator atm that I could use for that.

dhalbert commented 1 month ago

Important to note that I'm turning the device on and off in a rough manner, manually connecting and disconnecting power wires.

That could cause power glitches, though you didn't trace any. I've seen that myself just bobbling a USB plug a bit.

The scope trace picture that you posted, is that TEST_VDDCORE1, or is it VCC3V3?

There is a lot going on in your power supply circuitry, and there is opportunity for noise, maybe pins going out of range. Is it possible to supply just 3.3V to the SAMD51 and see if you can duplicate the problem?

Is it possible to test this on a board other than yours, with the same power supply and external connections? For instance, do you see this problem on the SAMD51 Feather CAN?

This should be possible, but I'd need to somehow simulate the power-down curve on the feather. This would require some hacking, and I don't have a function generator atm that I could use for that.

At least measure the power-down curve, and see if you can reproduce the flash erasure problem. But we haven't had reports of flash erasure since we re-did the bootloader.

CircuitPython does set BOD33, but it sets it to the same value as the bootloader setting.

Another thing to try would be to write an Arduino program that's equally simple and see if you get the same problem. Probably yes, but that would eliminate CircuitPython itself as cause.

dhalbert commented 1 month ago

Another small possibility: the BOD12 calibration value is set at the factory. From the datasheet:

Brown-out detector internal to the voltage regulator for VDDCORE. BOD12 is calibrated in production and its calibration parameters are stored in the NVM User Row. This data should not be changed if the User Row is written to in order to assure correct behavior.

If you have accidentally erased this value when doing initial chip programming, that might cause a problem.

sjev commented 1 month ago

@dhalbert thank you so much for these pointers. I'll definitely investigate these further when I get back to this issue. That will probably be in a couple of weeks from now, as I'm waiting for more boards to be made. I'll probably start with an automated setup that switches power with a relay and waits for some feedback from the board that is tested.
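The kind of automated setup I have in mind could look roughly like the sketch below. Everything hardware-specific is a placeholder: `set_power` (drive the relay) and `board_alive` (check for feedback from the board under test) are hypothetical callables to be supplied by the test rig, and the timings are guesses.

```python
import time


def run_power_cycles(set_power, board_alive, max_cycles,
                     on_s=2.0, off_s=1.0, sleep=time.sleep):
    """Power-cycle a board until it stops responding.

    set_power(bool) switches the relay; board_alive() returns True when
    the board reports in after power-up. Both are supplied by the caller.
    Returns the 1-based cycle number of the first failure, or None.
    """
    for cycle in range(1, max_cycles + 1):
        set_power(True)
        sleep(on_s)            # give the board time to boot
        if not board_alive():
            set_power(False)
            return cycle
        set_power(False)
        sleep(off_s)           # let the rail discharge
    return None


# Dry run with fake hardware: the fake board "dies" on its third boot.
boots = []
fake_alive = lambda: (boots.append(1) or len(boots)) < 3
first_failure = run_power_cycles(lambda on: None, fake_alive,
                                 max_cycles=10, sleep=lambda s: None)
print("first failure at cycle:", first_failure)
```

Passing the relay and feedback functions in as arguments keeps the loop testable without hardware and lets the same loop drive several boards.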

sjev commented 2 days ago

A quick update - We've completed our 0-series and I've used one of them to randomly cycle power on 5 other boards. They went through 4k cycles without any issues. But just when I was about to call this issue an 'incident', one of the boards failed after 5.5k cycles.

The positive news is that with the latest bootloader the board is not getting bricked, just cpy install gets corrupted. Dropping a new uf2 file fixes the issue. As a short-term solution we've added a bleed resistor that can be turned on with a solder jumper. I've enabled it on the board that has failed, we'll see how it holds.

20240902_194522

sjev commented 2 days ago

BTW, is it possible to protect flash memory where cpy resides?

dhalbert commented 2 days ago

BTW, is it possible to protect flash memory where cpy resides?

We haven't provided a mechanism to do that. But the NVMCTRL.RUNLOCK NVM LOCKS bits on the User Page allow you to lock regions of flash. You could try changing these bits manually after loading CircuitPython. See sections 9.4 and 25.6.2 in the datasheet.
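To work out which lock bits would cover a given address range, something like the following sketch could help. The flash size (512 KB) and the number of lock regions (32, equally sized) are assumptions to check against the datasheet sections mentioned above, and the function name is made up. The sketch only computes region indices; it says nothing about bit polarity or how to write the User Page.

```python
FLASH_SIZE = 512 * 1024  # bytes; assumption, check the part on your board
N_REGIONS = 32           # assumption: one lock bit per equally sized region


def regions_for_range(start, end, flash_size=FLASH_SIZE, n_regions=N_REGIONS):
    """Return the indices of the lock regions touched by [start, end)."""
    if not 0 <= start < end <= flash_size:
        raise ValueError("range must lie inside flash")
    region_size = flash_size // n_regions
    first = start // region_size
    last = (end - 1) // region_size
    return list(range(first, last + 1))


# Region holding the first application word at 0x4000.
print(regions_for_range(0x4000, 0x4004))
# Regions for a hypothetical firmware image from 0x4000 up to 0x60000.
print(regions_for_range(0x4000, 0x60000))
```

With those assumptions each region is 16 KB, so 0x4000 falls in region 1. Which regions are safe to lock depends on the firmware image size and on any flash areas the firmware writes at run time.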

sjev commented 1 day ago

Another thing to try would be to write an Arduino program that's equally simple and see if you get the same problem. Probably yes, but that would eliminate CircuitPython itself as cause.

Done with expected result (flash corruption), so it's not circuitpython.

Quick summary of issue occurrence (6 test boards):

sjev commented 1 day ago

You could try changing these bits manually after loading CircuitPython.

@dhalbert could you please provide a pointer on how to set the register from CP?

dhalbert commented 1 day ago

@dhalbert could you please provide a pointer on how to set the register from CP?

There's no way to do that from CircuitPython code. These are "fuse" bits, so there's special setup needed to change them. There is code in the bootloader to do this: see the code that corrects errors in the fuses.

I meant that after the CircuitPython UF2 is loaded, you could connect to the board with, say, a hardware debugger and set those bits. For instance, I think you can write a script using a J-Link utility to do this. It's also possible to do it by hand from the Microchip IDEs. Or you could make a special build of CircuitPython that checks the bits and sets them if necessary. Undoing them is also needed if you want to be able to update CircuitPython. But I don't have a recipe for you off the bat.

It still sounds like there might be something marginal about the power supply or the decoupling capacitors, which is causing a power dip.

Is there any difference in the date codes of the SAMD51s that indicates the one bad board has a different rev chip?

I think this is also something you could bring up with Microchip as a support case. They might have some advice for you. Also read the datasheet errata carefully.