adafruit / uf2-samdx1

MSC bootloader (based on UF2) for SAMD21

Feather M4 express: bytes above BOOTPROT being zeroed #95

Closed heymanrl closed 4 years ago

heymanrl commented 4 years ago

Complete details provided in this post: https://forums.adafruit.com/viewtopic.php?f=57&t=158718

Feather M4 Express does not launch user code because memory locations at the beginning of unprotected FLASH memory are being cleared (set to 0x0) during power-on.

Arduino IDE v1.8.10
Adafruit Feather M4 Express; part #3857
Adafruit FeatherWing OLED 128x32; part #2900
Adafruit bootloader v3.7.0
Adafruit_SSD1306 library v2.0.2

No additional hardware is required. Just stack a Feather and OLED, install 'my version 3' then cycle power on/off. The failure usually occurs within 100 on/off power cycles but it is random and may require thousands of power-on-cycles to observe the failure. I automated cycling power at 500ms ON, 500ms OFF.

'my version 3' application code provided below

[EDIT: turned into an attachment for easier downloading -- @dhalbert]

v3.ino.txt

tannewt commented 4 years ago

Hi @heymanrl, thank you for all the investigation you've done on this.

We've seen the issue with blanking the very first bit of memory before on the SAMD51 and set BOOTPROT to work around this issue. (This was when CircuitPython ran after the bootloader.) This is the first time I've heard of blanking the memory just after the protected area.

My assumption was that it is a mix of software and hardware that causes the issue. Perhaps the DMA controller is the issue and it's only used in some sketches. Or maybe it's a bug with the USB peripheral. I doubt it's the software alone because both the UF2 and CircuitPython code run on the SAMD21 without issue.

The next step, in my mind, would be to replicate it with lower level primitives and home in on the lowest level register write that causes the issue. Reaching out to Microchip would also be good. Maybe they've seen this issue too.

Unfortunately, I don't think we at Adafruit have the cycles to debug this though. It is rare enough to impact very few people, very sporadically.

It sounds like you've got a very thorough testing setup for this. Let us know if we can do anything to support your investigation.

dhalbert commented 4 years ago

@heymanrl In https://forums.adafruit.com/viewtopic.php?f=57&t=158718&start=15#p784997, you mentioned selfmain.c and startup_samd51.c as possible culprits. selfmain.c is only included in the update-bootloader... executables and is not part of the regular bootloader. The code in startup_samd51.c is clearing RAM. I'm not sure what would be causing the glitch that would write flash, since the bootloader has to go to some effort to enable writing.

If you leave the OLED code in, but remove the OLED device, do you ever get this problem? (I assume the code fails early because it can't talk to the OLED.) I'll study the code later to see what might get it into a state where it wants to write.

heymanrl commented 4 years ago

I can try with the OLED device removed.

I have repeated this with several BOOTPROT settings. With BOOTPROT set to zero, memory starting at 0x0000 gets cleared. Setting it to 16K clears location 0x4000; 24K clears 0x6000; 120K clears 0x1E000.

I can't trace into the bootloader code, but I assume it's because I need to create two projects in the AS7 Solution (one for the bootloader and the other for my code). After that, I should be able to track it down with the ICE, look at the Call Stack and maybe even break on a data change at the suspect memory locations.
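The addresses reported above line up with the first byte past the protected region. A minimal sketch of that arithmetic, assuming flash starts at address 0 and the BOOTPROT size is given in KiB (the function name is made up for illustration):

```c
#include <assert.h>
#include <stdint.h>

/* First flash address above a protected boot region of `bootprot_kib` KiB.
 * Assumes flash is mapped at address 0, so the answer is simply the
 * region size in bytes. */
uint32_t first_unprotected_addr(uint32_t bootprot_kib)
{
    assert(bootprot_kib <= UINT32_MAX / 1024u); /* guard against overflow */
    return bootprot_kib * 1024u;
}
```

With the sizes tried above, 16 gives 0x4000, 24 gives 0x6000 and 120 gives 0x1E000, which matches where the cleared bytes were seen.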

If anyone can help me get the bootloader added to my AS7 solution, I think I can find the problem.

dhalbert commented 4 years ago

@heymanrl I spent a little more time on this. Thank you for all your debugging. I have some questions and some speculations.

As @tannewt mentioned, we have seen something similar before, here: https://github.com/adafruit/circuitpython/issues/869. In that case BOOTPROT was not set at all. The first 8 bytes of flash were zeroed, and also at least the 512th byte (hard to tell because it's surrounded by zeros). If indeed these are the same problem (though the number of bytes zeroed is different, maybe?), then this problem may not be CircuitPython, as we originally thought.

Your experimental results setting different values of BOOTPROT are very intriguing. The only code that deals with the BOOTPROT value is in selfmain.c, which isn't even included in the regular bootloader, only in the update-bootloader... program. It is as if something is writing to flash starting at 0, and keeps incrementing the address and trying until it succeeds, once it has gotten past the BOOTPROT region.

Questions for you, if you are able to respond:

  1. How many bytes do you see zeroed just above the BOOTPROT region? I think you mentioned 16 bytes in the forum thread. Is that always true, and do you see the same number when you vary the size of the BOOTPROT region? Did you see any other bytes cleared? Getting a dump of all of flash and comparing it with an undamaged snapshot would be interesting.

  2. If I understand correctly, this happens only with the program above, and you narrowed it down to a particular line in the program, as described in the forum thread. Using Blink or slight variations on your SSD1306 program causes the problem to disappear. Is that correct?

  3. How are you power-cycling automatically? Are you removing 5V power completely, and how are you doing that, in terms of hardware? I was going to try to set up something similar, and was thinking of trying some other ways to do this, to see if they change what happens:

    • a. toggling the EN pin on the Feather, which enables/disables the 3.3V regulator (which powers the OLED board as well), and
    • b. toggling the RESET pin, which would maintain the 3.3V power.

Thanks again for your perseverance in examining this. I cannot suggest how to get your code and the bootloader into a single AS7 solution: I'm not familiar with Atmel Studio. But if we could set a watchpoint on the smashed locations, we could then perhaps catch the offending write, and narrow it down to the program or the bootloader code.
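The dump comparison suggested in question 1 could be sketched roughly as follows. This is only an illustration: it assumes both images have already been loaded into memory as raw bytes (parsing the .hex files is left out), and the names `zeroed_run` and `find_zeroed_runs` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One contiguous run of bytes that were non-zero in the reference image
 * but read back as 0x00 in the damaged image. */
struct zeroed_run {
    size_t offset;
    size_t length;
};

/* Scans two equally sized flash images and records up to `max_runs` runs
 * of freshly zeroed bytes. Returns the number of runs stored.
 * Bytes that were already 0x00 in the reference image cannot be told
 * apart from damage, so they end a run. */
size_t find_zeroed_runs(const uint8_t *good, const uint8_t *bad, size_t len,
                        struct zeroed_run *runs, size_t max_runs)
{
    assert(good != NULL && bad != NULL && runs != NULL);
    size_t count = 0;
    size_t i = 0;
    while (i < len && count < max_runs) {
        if (good[i] != 0 && bad[i] == 0) {
            size_t start = i;
            while (i < len && good[i] != 0 && bad[i] == 0) {
                i++;
            }
            runs[count].offset = start;
            runs[count].length = i - start;
            count++;
        } else {
            i++;
        }
    }
    return count;
}
```

Run over a clean snapshot and a post-failure dump, this would list each damaged offset together with how many bytes were cleared there, which is the information the questions above are after.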

heymanrl commented 4 years ago

How many bytes do you see zeroed just above the BOOTPROT region? I think you mentioned 16 bytes in the forum thread.
When 16k is protected, 16 bytes at address 0x4000 get cleared.

Is that always true, and do you see the same number when you vary the size of the BOOTPROT region?
No, after looking at the three .hex files I kept.

Did you see any other bytes cleared?
Yes, and very interesting with 120k protected. Contents at locations 0xE000, 0xE200, 0xE400, 0xE600 and 0xE800 have 16 bytes cleared.

Getting a dump of all of flash and comparing it with an undamaged snapshot would be interesting.
I have three .hex files I recorded after a failure: one with 16k protected (0x4000), one with 24k protected (0x6000) and one with 120k protected (0xE000). I did not keep a clean dump (before any testing) to compare the entire memory space before any failure. The .hex files are attached.

If I understand correctly, this happens only with the program above, and you narrowed it down to a particular line in the program, as described in the forum thread. Using Blink or slight variations on your SSD1306 program causes the problem to disappear. Is that correct?
Yes. The "my version 7" submitted has some OLED display commands inactive and does not fail. The "my version 3" submitted has all OLED display commands active and fails.

How are you power-cycling automatically? Are you removing 5V power completely, and how are you doing that, in terms of hardware?
Yes, I cycle power input to the 'Bat' pin. I use an IR4427 FET driver IC. The IR4427 can source/sink around 1.5 amps. It accepts 3.3v TTL logic level inputs (but can handle inputs up to Vsupply). I use +9vdc as the supply voltage. I connect one of the channel outputs through a 3.9v zener (to drop the voltage to around 5vdc) to the Feather M4 'Bat' input. I then drive the IR4427 with a 1 Hz, 50% duty cycle signal from a function generator. That cycles power 500ms ON and 500ms OFF.

a. toggling the EN pin on the Feather, which enables/disables the 3.3V regulator (which powers the OLED board as well), and b. toggling the RESET pin, which would maintain the 3.3V power.
Both worth trying.

Thanks again for your perseverance in examining this. I cannot suggest how to get your code and the bootloader into a single AS7 solution: I'm not familiar with Atmel Studio. But if we could set a watchpoint on the smashed locations, we could then perhaps catch the offending write, and narrow it down to the program or the bootloader code.
Exactly. That's the best plan going forward.

Power cycle failure docs.zip

dhalbert commented 4 years ago

For reference, "version 7" code that does not crash, copied from https://forums.adafruit.com/viewtopic.php?f=57&t=158718&start=15#p783840

v7.ino.txt

PrinceAli321 commented 4 years ago

Has there been any progress with this issue? I am seeing something very similar to heymanrl with my Feather M4 Express.

After a number of power cycles the feather 'forgets' its sketch and needs to be reflashed.

dhalbert commented 4 years ago

This has moved up on my priority list; I want to set up a test rig like heymanrl's.

PrinceAli321 commented 4 years ago

Any updates on this? This issue is still massively affecting us.


dhalbert commented 4 years ago

No updates, but still high priority. Getting CircuitPython 5.0.0 done is first priority. @PrinceAli321 Could you provide any other clues? Does it affect only certain programs? Are you using Arduino or CircuitPython? Do you have other devices connected or can it happen on a bare board? Does it happen only on power cycle or can it happen when the board is reset?

Any simple programs you can supply that exhibit the behavior, together with what is connected to the board (if anything), will help in debugging this. Thanks.

dhalbert commented 4 years ago

@heymanrl I have started working on this. I am running v3.ino from above. Instead of using a FET to power-cycle the board, I am powering it from USB and toggling the EN (enable) pin on the Feather from another board, which enables or disables the 3.3V regulator. I reproduced the problem fairly quickly after some false starts:

  1. Tried duty cycle of 0.4 seconds on and 0.1 seconds off. That didn't trigger it. Needed at least 0.5 and 0.5 seconds.
  2. Tried with the J-Link connected. That didn't work. As soon as I disconnected the J-Link, it took only a few minutes to happen.

I am seeing 16 bytes zeroed at offset 0x4000 (16384) (just past the bootloader), and 8 more bytes zeroed at offset 0x4200 (16384+512). But that's just one sample of the failure.

UPDATE: Second try. I added a couple of lines at the beginning of setup() to turn on D13. In this case, only 3 bytes zeroed, again starting at 0x4000, not 16 bytes.

dhalbert commented 4 years ago

Pinging @PrinceAli321 again. Could you provide any other clues? Does it affect only certain programs? Are you using Arduino or CircuitPython? Do you have other devices connected or can it happen on a bare board? Does it happen only on power cycle or can it happen when the board is reset?

Any simple programs you can supply that exhibit the behavior, together with what is connected to the board (if anything), will help in debugging this. Thanks.

dhalbert commented 4 years ago

@heymanrl I can get v3 to fail, usually after a few dozen to a few hundred power cycles. More interesting, I tried your v7 and it also failed (after 157 cycles), which is not your experience. I was suspicious of it having much to do with the sketch, and that confirms it. I also instrumented the v3 sketch to catch any zeroing, and it failed without hitting any of my checks, so that seems to confirm it is a bootloader problem.

I'm trying various ways of instrumenting the bootloader, but it's difficult due to the power cycling, which upsets the J-Link. Will continue. This running diary is to help me keep track of things as well.

PrinceAli321 commented 4 years ago


Hi Dhalbert,

Things my sketch has in common with @heymanrl's are:

I've not seen this issue on any other feather projects.

I did notice that when power-cycling the feather the pic can be held up by the inputs from A3/A4, and when rebooted with load on A3/A4 the feather ended up in a similar corrupted state (this time the NEO light was stuck on white, otherwise same symptoms). After that I tried sequencing the power to ensure inputs were low while the board started up, which seemed to cure the 'white light' problem, but still after power-cycling over and over I'll see the red light of death. I've only ever noticed the problem on power-cycles.

Hope that helps, let me know if you want to know more

dhalbert commented 4 years ago

@PrinceAli321 Thank you! Are you using any external boards such as a display? I'm trying to see if the display is the culprit, or it's pin reading. I have been making minor variations to @heymanrl's program with seemingly large effects (like skipping a pin read).

PrinceAli321 commented 4 years ago

Nope, not using any external boards/display. Just the 2 DACS and ADCs really.


haukehaseler commented 4 years ago

Hello, I just wanted to add a "me too": I use the Itsybitsy M4 quite frequently in different projects. Occasionally, a board forgets its firmware when powered up. It does not seem to happen under different reset conditions. A quick fix of this problem would be great. A microcontroller which loses its firmware is not particularly trustworthy. It would be a shame having to switch to a different board. I think that both the Itsybitsy and the drag-and-drop functionality of the bootloader are great. Thank you very much.

haukehaseler commented 4 years ago

Hello, to me, it sounds like it might be a brown-out problem. Looking at the schematics of the OLED-Wing, it does draw power from the 3V3 regulator and it adds 10uF to GND, which may lead to slow rise times on the 3V3 line. According to the data sheet, the SAMD51 wouldn't like that.

The bootloader code does not seem to consider brown-out resets. In main.c, lines 121 to 134, the logic appears to be "go straight to the main application if the reset cause was a POR". Maybe it should be inverted, as in "don't go straight to the main application if the cause was an external reset (EXT)". That way, brown-out and other reset sources circumvent the bootloader. Maybe like this:

    if (RESET_CONTROLLER->RCAUSE.bit.EXT) {
        if (*DBL_TAP_PTR == DBL_TAP_MAGIC) {
            *DBL_TAP_PTR = 0;
            return; // stay in bootloader
        } else {
            if (*DBL_TAP_PTR != DBL_TAP_MAGIC_QUICK_BOOT) {
                *DBL_TAP_PTR = DBL_TAP_MAGIC;
                delay(500);
            }
            *DBL_TAP_PTR = 0;
        }
    } else {
        *DBL_TAP_PTR = 0;
    }

Does someone have the time to test this? I do not have the necessary hardware here. Best regards.

dhalbert commented 4 years ago

@haukehaseler Thanks for your suggestions, which are very helpful. I am actively working on this, and have also been in touch with Microchip. I am looking at brownout settings and the RCAUSE settings.

@PrinceAli321 is not using any external boards but still has the problem, so it's not always extra capacitance, though that may increase the chances of a problem.

I have been pruning down @heymanrl's test programs in various ways. Removing various uses of GPIO, DAC, or I2C seems to make the problem go away, which is very odd. If the problem were strictly in the bootloader then the user program should not matter. Power-down, not just power-up, may be part of the issue.

Please feel free to continue to speculate.

haukehaseler commented 4 years ago

Dear Dan,

first of all, thank you for spending so much effort on this problem. Here is what I can report up to now:

The things I will try to test next are:

I hope this helps, best regards,

Hauke


dhalbert commented 4 years ago

@haukehaseler Thank you. If you are willing to share your programs before and after changes, that would be great. I will look at the assembly output to check the differences.

I worked on this more over the weekend, and am going to start working with the brown-out detectors and their hysteresis settings, and with inserting a short delay on power-up in the bootloader to ensure the power is more stable.

Seemingly random changes in the user program sometimes seem to affect the probability of failure. I have wondered if it has to do with certain instructions being on certain memory boundaries.

I have one board cycling another. I toggle the EN line (3.3V regulator disable) on a Feather M4 with a short power cycling program which also monitors the state of a data pin on the Feather M4. The user program sets that pin high in setup(). The cycling program will stop when it detects that the pin is low, so it's easy to leave it running and then see exactly when it failed.
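The cycling logic described above can be sketched in a hardware-independent way. Everything here is an assumption made for illustration: the `cycler_hal` hooks, the function names, and the small simulated target at the end, which only exists so the loop can be tried on a host machine.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hardware hooks supplied by the caller; all names are hypothetical. */
struct cycler_hal {
    void (*set_enable)(bool on);    /* drive the target's EN pin */
    void (*delay_ms)(uint32_t ms);  /* blocking delay */
    bool (*read_alive)(void);       /* pin the user program sets high in setup() */
};

/* Power-cycles the target until the alive pin stays low after power-up.
 * Returns the number of good cycles completed before the first failure,
 * or max_cycles if no failure was seen. On failure the target is left
 * powered so its state can be inspected. */
uint32_t run_power_cycles(const struct cycler_hal *hal, uint32_t max_cycles,
                          uint32_t on_ms, uint32_t off_ms)
{
    assert(hal != NULL && hal->set_enable && hal->delay_ms && hal->read_alive);
    for (uint32_t cycle = 0; cycle < max_cycles; cycle++) {
        hal->set_enable(true);
        hal->delay_ms(on_ms);
        if (!hal->read_alive()) {
            return cycle;
        }
        hal->set_enable(false);
        hal->delay_ms(off_ms);
    }
    return max_cycles;
}

/* Simulated target for host-side experiments: it "loses its firmware"
 * on power-up number sim_fail_at (0 means never). */
uint32_t sim_powerups;
uint32_t sim_fail_at;
static void sim_set_enable(bool on) { if (on) { sim_powerups++; } }
static void sim_delay_ms(uint32_t ms) { (void)ms; }
static bool sim_read_alive(void) { return sim_powerups != sim_fail_at; }
const struct cycler_hal sim_hal = { sim_set_enable, sim_delay_ms, sim_read_alive };
```

On real hardware the three hooks would wrap whatever GPIO and delay calls the cycling board provides, with 500 ms on and 500 ms off being the timing that reproduced the failure in this thread.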

Ve2mrx commented 4 years ago

Hi everyone,

That problem is so weird that I would expect to see something written about in some µC errata note ;-) I suppose someone checked them just in case?

Just my 0.03$CAN (~0.02$US), Martin


dhalbert commented 4 years ago

There's nothing in the errata that's like this, unfortunately. I opened a case with Microchip and they said they'd only seen this on an M0+ on a board that didn't have a decoupling cap on VDDcore. Our boards have plenty of decoupling caps, but not necessarily the exact ones in their sample schematics.

I have ordered a SAME54 Xplained Pro Microchip dev board, to see if it can be reproduced on that official board. The chip is essentially the same.

Ve2mrx commented 4 years ago

Well, thanks for looking into this.

As an owner of a Metro M0 Express who had random reboots, I wonder if it was only my bad newbie programming that was the cause now. Probably it was that I thought I was a better programmer than I really was, and did things I didn't fully grasp ;-) At least, I got a J-Link mini EDU to try to find out! I think I fixed it by better managing timing and buffers.

Anyway, good luck and success!

Martin


haukehaseler commented 4 years ago

Hello everyone, a quick update: Fixing the floating point constants does not remedy the problem; it just makes it less likely to happen. Looking at the 3.3V regulator output makes me fairly certain that the problem occurs at power-down (the voltage does not ramp down monotonically, but produces a little bump around 1.5V). I have now enabled the brown-out detection and set the threshold to 2.7V:

    SUPC->BOD33.bit.LEVEL = 200;  // 2.7V: 1.5V + LEVEL * 6mV
    SUPC->BOD33.bit.ENABLE = 1;   // enable brown-out detection

I also found another forum entry describing the problem, and solving it by enabling the BOD: https://www.avrfreaks.net/forum/samd51-not-starting-after-power-cycle

If this is true, the problem is not directly related to the bootloader, and there is nothing wrong with the uC itself. It is simply a "bad" power supply, or an unfortunate combination of connected components that run off 3.3V. Why the brown-out leads to zeroed program memory is beyond me. I will stick to my wild guesswork that it has to do with the FPU being active at the moment of power loss.

Best, Hauke
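The LEVEL value follows from the formula in the code comment above (1.5V + LEVEL * 6mV). A small sketch of that conversion in both directions, taking the base and step from that comment and assuming LEVEL is an 8-bit field; the helper names are invented here:

```c
#include <assert.h>
#include <stdint.h>

#define BOD33_BASE_MV 1500u  /* 1.5V base, taken from the comment above */
#define BOD33_STEP_MV 6u     /* 6mV per LEVEL count, taken from the comment above */

/* Threshold in millivolts for a given LEVEL value. */
uint32_t bod33_level_to_mv(uint8_t level)
{
    return BOD33_BASE_MV + (uint32_t)level * BOD33_STEP_MV;
}

/* Smallest LEVEL whose threshold is at or above target_mv.
 * Assumes LEVEL fits in 8 bits. */
uint8_t bod33_mv_to_level(uint32_t target_mv)
{
    assert(target_mv >= BOD33_BASE_MV);
    uint32_t level = (target_mv - BOD33_BASE_MV + BOD33_STEP_MV - 1u) / BOD33_STEP_MV;
    assert(level <= 255u);
    return (uint8_t)level;
}
```

For the setting used here, LEVEL 200 gives 1500 + 200 * 6 = 2700 mV, and asking for 2700 mV returns 200.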

dhalbert commented 4 years ago

@haukehaseler Wow, thank you for finding that forum post! I had done a lot of searching but not stumbled upon it.

We do set BOD33 in CircuitPython as you showed, but it is not done in the bootloader. We had seen the spurious flash writes when CircuitPython was running too, but those failures might have been due to a brief brownout while the bootloader was running, before CircuitPython started. So we should set BOD33 as soon as possible in the bootloader. This was one of the things I was going to try, and I'll now make it the highest priority.

haukehaseler commented 4 years ago

Yes, I now enabled the BOD in the bootloader and yes, I took the relevant three lines of code from a CircuitPython github page (laziness, thy name is programmer). If my long-term tests do not fail now, the issue is solved for me - good luck with any further work.

dhalbert commented 4 years ago

Some initial testing shows raising the BOD33 level to 2.7V appears to fix the flash write problem! I don't have a test bootloader for you all yet, though, because this or perhaps some other changes I am also doing are breaking double-tap. Stay tuned.

dhalbert commented 4 years ago

Here is a test bootloader for all of you to try. It uses the BOD33 brownout detector circuitry to busy-wait until the voltage has stabilized above 2.7V for at least 100 ms. Once that point is reached, reset-on-brownout below 2.7V is enabled.
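The "stable for at least 100 ms" condition can be modelled on a host as a simple run-length check. This sketch is not the bootloader's code: it assumes one "supply below threshold?" sample per millisecond (how that sample is read from the BOD33 hardware is left out), and `first_stable_ms` is a name invented for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* `below[t]` is true when the supply was below the threshold at millisecond t.
 * Returns the index of the sample at which the supply has been continuously
 * above the threshold for `stable_ms` samples, or -1 if that never happens
 * within the n samples given. Any dip restarts the count. */
long first_stable_ms(const bool *below, size_t n, uint32_t stable_ms)
{
    assert(below != NULL && stable_ms > 0);
    uint32_t good = 0;
    for (size_t t = 0; t < n; t++) {
        good = below[t] ? 0 : good + 1;
        if (good >= stable_ms) {
            return (long)t;
        }
    }
    return -1;
}
```

Only after this point would reset-on-brownout be switched on, so a supply that bounces while ramping up restarts the wait instead of triggering repeated resets.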

I've been running this with @heymanrl's v3.ino for over 18000 cycles without failure.

Simply enabling reset-on-brownout immediately on startup doesn't work as well; that can cause multiple resets while powering up, which confuses the double-click detection software.

Unfortunately we also cannot set the fuses to enable the brownout detector automatically, because of a SAMD51 erratum: if BOD33 is enabled in the fuses, it can make it impossible to connect a debugger to the chip.

Here's a bootloader updater to try for Feather M4. Unzip the file below to get a .uf2. Then double-click to get the BOOT drive, and then drag the .uf2 onto it. Thank you all for your help and testing.

update-bootloader-feather_m4-v3.7.0.uf2.zip