Failed `factory_reset` leaves device in state unable to be updated

betrusted-io / xous-core

The Xous microkernel

Apache License 2.0

531 stars 85 forks source link

Failed `factory_reset` leaves device in state unable to be updated #113

Closed nitz closed 2 years ago

nitz commented 2 years ago

I was using the simple factory_reset.ps1 to wipe/update my freshly received Precursor. During one of the steps (I believe when it was doing python usb_update.py --soc soc_csr.bin --force,) the erase succeeded, but the write failed around 75%.

Since then, any attempt to perform any action with usb_update.py causes the error:

Could not find a valid csr.csv descriptor on the device, aborting!

I've re-built the full image using cargo xtask hw-image precursors/soc.svd, but any of the tasks to stage the kernel or loader result in the same error as running usb_update.py directly.

Attempting to reset the SoC via cargo xtask nuke-soc fails with the following output:

    Finished dev [unoptimized + debuginfo] target(s) in 0.27s
     Running `target\debug\xtask.exe nuke-soc`
Installing factory-reset SoC gateware (secrets will be lost)!
Error: Error { kind: NotFound, message: "program not found" }
error: process didn't exit successfully: `target\debug\xtask.exe nuke-soc` (exit code: 1)

From shellchat on the device, ver soc and ver xous both report 0.9.7-*. However, ver ec Reports 0.9.5, so it definitely never updated.

Attempting ecup auto (Or ecup fw) fails immediately as it doesn't find a staged image.

Is there a proper way to reset csr.csv on the device, or to start from scratch ignoring what's (not?) there now?

nitz commented 2 years ago

I was able to use the "Failsafe" method to flash the EC firmware, which now reports 0.9.7-2 when I run ver ec.

However, the main problem still persists that usb_update.py fails with the csr.csv error when I attempt any flashing operation.

I did attempt to run provision_xous.sh when I had the debug cable connected, but it failed for each image as follows:

pi@0w-0:~/precursor/betrusted-scripts $ sudo ./provision_xous.sh
WARNING: this script is for un-bricking devices, and will overwrite any secret keys stored in the gateware
4f774a3b0d607622f5952fada6799523  ../precursors/soc_csr.bin
27f5183070f7bc831b82a284df48d856  ../precursors/loader.bin
e482ef9b1b322bcc1f776687b680ec77  ../precursors/xous.img
This script does a factory reset. You will lose your root keys if you update the FPGA. Proceed? (y/N) y
Programming .bin file to FPGA: ../../precursors/soc_csr.bin
Using JTAGSPI mode with variant 'xc7s50', to address 0x00000000, verify is on
Traceback (most recent call last):
  File "./jtag_gpio.py", line 1417, in <module>
    main()
  File "./jtag_gpio.py", line 1093, in main
    do_spi_bitstream(ifile, jtagspi=args.jtagspi_variant, address=args.address, verify=verify, do_reset=args.reset_prog, raw_binary=args.raw_binary)
  File "./jtag_gpio.py", line 489, in do_spi_bitstream
    virtualflash = Mx66umFlashDevice(virtualspi, id)
  File "/home/pi/precursor/betrusted-scripts/jtag-tools/serialflash.py", line 1020, in __init__
    raise SerialFlashUnknownJedec(jedec)
serialflash.SerialFlashUnknownJedec: Unknown flash device: b'ffffff'
Programming .bin file to FPGA: ../../precursors/loader.bin
Using JTAGSPI mode with variant 'xc7s50', to address 0x00500000, verify is on
Traceback (most recent call last):
  File "./jtag_gpio.py", line 1417, in <module>
    main()
  File "./jtag_gpio.py", line 1093, in main
    do_spi_bitstream(ifile, jtagspi=args.jtagspi_variant, address=args.address, verify=verify, do_reset=args.reset_prog, raw_binary=args.raw_binary)
  File "./jtag_gpio.py", line 489, in do_spi_bitstream
    virtualflash = Mx66umFlashDevice(virtualspi, id)
  File "/home/pi/precursor/betrusted-scripts/jtag-tools/serialflash.py", line 1019, in __init__
    if not Mx66umFlashDevice.match(jedec):
  File "/home/pi/precursor/betrusted-scripts/jtag-tools/serialflash.py", line 614, in match
    manufacturer, device, capacity = jedec[:3]
ValueError: not enough values to unpack (expected 3, got 1)
Programming .bin file to FPGA: ../../precursors/xous.img
Using JTAGSPI mode with variant 'xc7s50', to address 0x00980000, verify is off
Traceback (most recent call last):
  File "./jtag_gpio.py", line 1417, in <module>
    main()
  File "./jtag_gpio.py", line 1093, in main
    do_spi_bitstream(ifile, jtagspi=args.jtagspi_variant, address=args.address, verify=verify, do_reset=args.reset_prog, raw_binary=args.raw_binary)
  File "./jtag_gpio.py", line 489, in do_spi_bitstream
    virtualflash = Mx66umFlashDevice(virtualspi, id)
  File "/home/pi/precursor/betrusted-scripts/jtag-tools/serialflash.py", line 1020, in __init__
    raise SerialFlashUnknownJedec(jedec)
serialflash.SerialFlashUnknownJedec: Unknown flash device: b'ffffff'

I thought it might have something to do with the fact that I'm using a Pi 0, so I modified the calls to jtag_gpio.py To use -c, but the result was the same, just slowly.

nitz commented 2 years ago

Okay, might be making some progress.

I grabbed the csr.csv from one of the latest CI builds, here: https://ci.betrusted.io/job/betrusted-soc/lastBuild/artifact/build/

Stuck that in my tools folder, and modified usb_update.py as follows:

diff --git "a/tools/usb_update.py" "b/tools/usb_update.py"
index 4de24216..0281af0c 100644
--- "a/tools/usb_update.py"
+++ "b/tools/usb_update.py"
@@ -206,10 +206,12 @@ class PrecursorUsb:
         digest = hasher.digest()
         if digest != csr_data[0x7fc0:]:
             print("Could not find a valid csr.csv descriptor on the device, aborting!")
-            exit(1)
+            #exit(1)

-        csr_len = int.from_bytes(csr_data[:4], 'little')
-        csr_extracted = csr_data[4:4+csr_len]
+        #csr_len = int.from_bytes(csr_data[:4], 'little')
+        #csr_extracted = csr_data[4:4+csr_len]
+        csr_extracted_f = open("csr.csv", "rb")
+        csr_extracted = csr_extracted_f.read()
         decoded = csr_extracted.decode('utf-8')
         # strip comments
         stripped = []

Using those manually loaded register addresses seems like it's getting me close enough to be able to do some things.

I tried the flash that failed from factory_reset.ps1, and it again failed. Notice the "100%" line after the trackback:

╰─ .\usb_update.py --soc .\soc_csr.bin --force                                                                                                                                       ─╯
Using SoC  registers
Halting CPU.
Programming SoC gateware .\soc_csr.bin
ID code bytes 1-2: 0x8080c2c2
ID code bytes 2-3: 0x3b3b8080
Erasing 100% (2621440 of 2621440) |##############################################################################################################| Elapsed Time: 0:00:10 Time:  0:00:10
Erase finished
Writing  88% (2316800 of 2621440) |#################################################################################################             | Elapsed Time: 0:00:19 ETA:   0:00:02Traceback (most recent call last):
  File "C:\Users\cmd\source\repos\xous-core\tools\usb_update.py", line 637, in <module>
    main()
  File "C:\Users\cmd\source\repos\xous-core\tools\usb_update.py", line 587, in main
    pc_usb.flash_program(locs['LOC_SOC'][0], image, verify=verify)
  File "C:\Users\cmd\source\repos\xous-core\tools\usb_update.py", line 362, in flash_program
    self.burst_write(flash_region, data[written:(written+chunklen)])
  File "C:\Users\cmd\source\repos\xous-core\tools\usb_update.py", line 128, in burst_write
    numwritten = self.dev.ctrl_transfer(bmRequestType=(0x00 | 0x43), bRequest=0,
  File "C:\Python310\lib\site-packages\usb\core.py", line 1082, in ctrl_transfer
    ret = self._ctx.backend.ctrl_transfer(
  File "C:\Python310\lib\site-packages\usb\backend\libusb0.py", line 601, in ctrl_transfer
    return _check(_lib.usb_control_msg(
  File "C:\Python310\lib\site-packages\usb\backend\libusb0.py", line 447, in _check
    raise USBError(errmsg, ret)
usb.core.USBError: [Errno None] b'libusb0-dll:err [control_msg] sending control message failed, win error: A device attached to the system is not functioning.\r\n\n'
Writing 100% (2621440 of 2621440) |##############################################################################################################| Elapsed Time: 0:00:20 Time:  0:00:20

I then tried: usb_update.py --factory-new. The erase completed successfully, but once again, during the write, it fails seemign to lose the USB device:

╰─ .\usb_update.py --factory-new --force                                                                                                                                             ─╯
Could not find a valid csr.csv descriptor on the device, aborting!
Using SoC  registers
Halting CPU.
ID code bytes 1-2: 0x8080c2c2
ID code bytes 2-3: 0x3b3b8080
Erasing 100% (134217728 of 134217728) |##########################################################################################################| Elapsed Time: 0:08:29 Time:  0:08:29
Erase finished
retrieving https://ci.betrusted.io/releases/v0.9.5/soc_csr.bin
burning at 0
ID code bytes 1-2: 0x8080c2c2
ID code bytes 2-3: 0x3b3b8080
Erasing 100% (2621440 of 2621440) |##############################################################################################################| Elapsed Time: 0:00:10 Time:  0:00:10
Erase finished
Writing  85% (2243840 of 2621440) |##############################################################################################                | Elapsed Time: 0:00:18 ETA:   0:00:03Traceback (most recent call last):
  File "C:\Users\cmd\source\repos\xous-core\tools\usb_update.py", line 637, in <module>
    main()
  File "C:\Users\cmd\source\repos\xous-core\tools\usb_update.py", line 622, in main
    pc_usb.flash_program(sections[0], image, verify=False)
  File "C:\Users\cmd\source\repos\xous-core\tools\usb_update.py", line 362, in flash_program
    self.burst_write(flash_region, data[written:(written+chunklen)])
  File "C:\Users\cmd\source\repos\xous-core\tools\usb_update.py", line 128, in burst_write
    numwritten = self.dev.ctrl_transfer(bmRequestType=(0x00 | 0x43), bRequest=0,
  File "C:\Python310\lib\site-packages\usb\core.py", line 1082, in ctrl_transfer
    ret = self._ctx.backend.ctrl_transfer(
  File "C:\Python310\lib\site-packages\usb\backend\libusb0.py", line 601, in ctrl_transfer
    return _check(_lib.usb_control_msg(
  File "C:\Python310\lib\site-packages\usb\backend\libusb0.py", line 447, in _check
    raise USBError(errmsg, ret)
usb.core.USBError: [Errno None] b'libusb0-dll:err [control_msg] sending control message failed, win error: A device attached to the system is not functioning.\r\n\n'
Writing 100% (2621440 of 2621440) |##############################################################################################################| Elapsed Time: 0:00:18 Time:  0:00:18

Only this time, since --factory-new does the full erase, I now hard lock in the bootloader with sig version mismatch.

Guess I've gotta get the rpi back out.

I think I'm barking up the right tree trying to flash the soc_csr at 0x00000000, right?

bunnie commented 2 years ago

Hi @nitz, sorry for the delay in getting back to you. Sorry the update didn't go smoothly for you. I wonder what caused it to abort in the first place.

So: a few things going on.

The missing/corrupt csr.csv is a shadow copy of the register set that is stashed at the very highest address of the gateware region. It's included because the USB updater needs to refer to that to know where to poke to write stuff to FLASH, since it's possible that can change between SoC builds.
What happened when the burn aborted at ~75% is your FPGA image burned (great!) but the csr.csv image didn't (oops).
So there is a provision for a --force option to usb_update.py which, I just noticed, incorrectly states it will try 0.8 offsets; it actually tries the 0.9 offsets, which is what you want because you have a 0.9-era image. Originally, this would be the approach I would recommend you take -- just tell it to --force, which would tell it to ignore the corrupted csr.csv area and maybe it would have recovered gracefully.

The summary of where you are so far is, from my understanding:

You've tried to do a usb_update.py twice now, and in both cases, you had it crap out around the 75-80% mark
You're unable to get the failsafe mode to work because you're getting an unknown device error with a code of 0xffffffff.
The last thing you were able to do was an erase of the SoC gateware, but no burn, which means now, you have no USB interface because the FPGA has lost its gateware image. Thus the only recourse is to try to recover via the failsafe method.

There's two things we need to debug, one is why you're unable to finish the USB update, and second, why you're not able to get the failsafe method to work. The failsafe issue is more urgent, but the USB update may be easier to diagnose. For the USB issue I'd ask that you try a different cable and/or host to eliminate a configuration issue, and then if you don't mind sharing the details of your host and cable I'd like to know what you're using so we can try to reproduce the error locally and further understand how to improve it. However, this is not possible until we can get you past the failsafe failure.

For the failsafe issue, a syndrome of 0xffffffff on the idcode readback would normally make me think there is a chance your cable is not firmly inserted/aligned, and/or there is an issue with the debug board. 0xffffffff is what you would get if a pin went floating and was simply not connected (open solder fault).

That being said, you're using a Pi 0, and it's possible that the pin configurations are quite different on that versus the Pi 3 that we have been testing on. The good news is -- I have a Pi 0 here on hand. The bad news is, I've never booted it up. So, I will try to configure this to reproduce your issue on the Pi 0, and report back once I have some further conclusions.

bunnie commented 2 years ago

OK. I do have a Pi 0 hooked up, and I can confirm that I'm able to access the FPGA using the default configs within the provision_xous.sh script inside the betrusted_scripts repo.

A couple of quetions:

Did you try running ./vbus.sh 1 to power on the device? You will need some form of persistent power, either from the Rpi, or from the Precursor, to keep the device up because during programming it will try to turn itself off. You can also alternatively plug power into the USB port of the device but don't simultaneously do ./vbus.sh 1 because then you'd have the power source of the Precursor fighting with the power source of the Pi 0.
I noticed that I had to solder on the header for my Pi 0. If you had to solder on your header, please check the quality of all the solder joints -- it wasn't a straight shot to solder it because the ground planes on the Pi 0 suck the heat out of the iron and it takes a bit of patience to get a good joint on every pin.

bunnie commented 2 years ago

Alright, I just encountered the exact 0xffffffff error you saw, and it was a poorly seated flex cable that was the root cause. If you have any finger oil or gunk on the cable or it's not perfectly aligned it can lead to this issue. I basically just flipped up the FPC connector and wiggled the FPC in the socket and made sure it was well-aligned and pushed the clip back down and it worked again.

nitz commented 2 years ago

Hey I appreciate all you've already put into helping me try to figure this out!

Hi @nitz, sorry for the delay in getting back to you. Sorry the update didn't go smoothly for you. I wonder what caused it to abort in the first place.

No worries, this is half the fun, breaking my new things :)

So there is a provision for a --force option to usb_update.py

Force wasn't able to work for me, even if I skipped the csr.csv loading, because flash_program used some addresses for erasing it seems before it got to writing.

I'm just sitting back down at my desk, so I'll see if I can use a cotton swab to clean the ends of the FPC and give that a shot again.

For power, I was only using power from the battery. I've physically disconnected the battery now (since I didn't want it draining sitting on the bootloader failure all night), so I'll try with USB connected now.

Last night I tried a few more times with the provision_xous.sh, and was able to actually get two different responses. The 0xffffff as well as a case where it raised an exception due to jedec only having 1 value. (printing it out, that value appeared to be a null byte.) I dug into the script a bit more and saw the bit about how it was originally written on a pi 3, so I went and googled around a bit. According to one random commenter I found, the pi0 used the same original MMIO base address as the Pi 1 & 2, so I gave a shot hardcoding looking for "TYPE" == "Zero W" to see if that made a difference, but it didn't seem to. All that said is, I do think that something here is what's going on with the failsafe bit, as I'm guessing there's just something unhappy about the SPI pins or periph. from the Pi.

I've got a Pi 4 somewhere around here. I stuck it in a closet running something, so I just have to physically find it and I can use that as an alternate. the Pi0w I have came with the headers soldered on, but it definitely could be a poor interconnect there, as I had to use a set of risers.

nitz commented 2 years ago

An update: Welp, couldn't find my pi4, but did find my pi400. 🤷🏻‍♀️ Cloned the repos and grabbed the images on it.

I had an "off by one" issue putting the hat on, that I think was inadvertently back-powering the Pi from the USB through the precursor, because when I went to adjust the FPC the small cluster of ICs near the ZIF was hot. Correctly reseated it though, and it looks like flashing either the EC or the WFX images works peachy.

Now back to the fun stuff.

I'm no longer getting the 0xffffff, and am rather just getting the manufacturer, device, capacity = jedec[:3] \\ ValueError: not enough values to unpack (expected 3, got 1). (I'm just directly running jtag_gpio.py via: sudo ./jtag_gpio.py -f ../../precursors/soc_csr.bin --raw-binary --spi-mode -r)

Googled again and found this thread about how the BCM2711's GPIO is at 0x7E200000. I printed my GPIO.RPI_INFO, and found it reporting: {'P1_REVISION': 3, 'REVISION': 'c03130', 'TYPE': 'Unknown', 'MANUFACTURER': 'Sony', 'PROCESSOR': 'BCM2711', 'RAM': '4G'}, so as a duct tape solution, I modified jtag_gpio.py as follows:

diff --git a/jtag_gpio.py b/jtag_gpio.py
index fa2af21..e3c93f4 100755
--- a/jtag_gpio.py
+++ b/jtag_gpio.py
@@ -1045,12 +1045,19 @@ def main():
                     nky_hmac = line[2].rstrip().rstrip(';')

     rev = GPIO.RPI_INFO
+    print(rev)
     if rev['P1_REVISION'] == 1:
         gpio_pointer = pi_mmio_init(0x20000000)
+        print("Using MMIO base: 0x20000000")
+    elif rev['P1_REVISION'] == 3 and rev['PROCESSOR'] == 'BCM2711':
+        gpio_pointer = pi_mmio_init(0x7E000000)
+        print("Using MMIO base: 0x7E000000")
     elif rev['P1_REVISION'] == 3 or rev['P1_REVISION'] == 2:
         gpio_pointer = pi_mmio_init(0x3F000000)
+        print("Using MMIO base: 0x3F000000")
     elif rev['P1_REVISION'] == 4:
         gpio_pointer = pi_mmio_init(0xFE000000)
+        print("Using MMIO base: 0xFE000000")
     else:
         print("Unknown Raspberry Pi rev, can't set GPIO base")
         compat = True

But still no joy:

pi@pi400:~/precursor/betrusted-scripts/jtag-tools $ sudo ./jtag_gpio.py -f ../../precursors/soc_csr.bin --raw-binary --spi-mode -r
{'P1_REVISION': 3, 'REVISION': 'c03130', 'TYPE': 'Unknown', 'MANUFACTURER': 'Sony', 'PROCESSOR': 'BCM2711', 'RAM': '4G'}
Using MMIO base: 0x7E000000
Programming .bin file to FPGA: ../../precursors/soc_csr.bin
Using JTAGSPI mode with variant 'xc7s50', to address 0x00000000, verify is on
Traceback (most recent call last):
  File "./jtag_gpio.py", line 1424, in <module>
    main()
  File "./jtag_gpio.py", line 1100, in main
    do_spi_bitstream(ifile, jtagspi=args.jtagspi_variant, address=args.address, verify=verify, do_reset=args.reset_prog, raw_binary=args.raw_binary)
  File "./jtag_gpio.py", line 489, in do_spi_bitstream
    virtualflash = Mx66umFlashDevice(virtualspi, id)
  File "/home/pi/precursor/betrusted-scripts/jtag-tools/serialflash.py", line 1019, in __init__
    if not Mx66umFlashDevice.match(jedec):
  File "/home/pi/precursor/betrusted-scripts/jtag-tools/serialflash.py", line 614, in match
    manufacturer, device, capacity = jedec[:3]
ValueError: not enough values to unpack (expected 3, got 1)

nitz commented 2 years ago

Well, according to /proc/iomem:

pi@pi400:~/precursor/betrusted-scripts/jtag-tools $ cat /proc/iomem
<--snip-->
00000000-00000000 : fe104000.rng rng@7e104000
00000000-00000000 : fe200000.gpio gpio@7e200000
00000000-00000000 : serial@7e201000
  00000000-00000000 : fe201000.serial serial@7e201000
00000000-00000000 : fe204000.spi spi@7e204000
00000000-00000000 : fe215000.aux aux@7e215000
00000000-00000000 : fe300000.mmcnr mmcnr@7e300000
00000000-00000000 : fe340000.mmc mmc@7e340000
00000000-00000000 : fe804000.i2c i2c@7e804000
<--snip-->

Looks like It should be using 0xFE000000. (So still not the 0x3F000000 that revision 3 would imply.) Unfortunately the behavior with that address was the same. I'm wondering if I've got something funny config'd with this install that's preventing the use of those pins. (iirc it was a full "desktop" setup, but I'm just using it headless now.)

bunnie commented 2 years ago

Catching up -- one thing I'd note right away is that with the battery disconnected, the system will spontaneously reboot itself periodically, which would interrupt any updating process. The battery needs to be connected for everything to work smoothly.

Basically with the battery disconnected but powered by USB, eventually the charger decides something is terribly wrong and moves to hard cutoff power. This happens, but then the presence of USB power revives it. So you effectively see a reboot glitch.

bunnie commented 2 years ago

Eep....off by one on the RPi header...could be bad. That could damage the FPGA, let me have a look tho.

bunnie commented 2 years ago

Hm most likely you would have damaged something on the debug board. I don't see a path directly from the Rpi to any of the FPGA pins, but, the worst case is you shunted 5V into the Rpi Tx/Rx and/or overstressed some 3.3V regulators. But 5V directly into any FPGA pins would likely be Bad.

nitz commented 2 years ago

Makes sense on the charger/battery circuit, so I've gone ahead and re-attached it.

And yeah I think you're right, I imagine that was likely what was super-heating when I discovered it (I unplugged it quickly rather than sticking around and finding out)

If I did pop the FPGA, I wouldn't be able to know yet, right? My assumption is that this flashing is supposed to be going over spi directly to the flash? (so assuming I didn't murk the flash itself, until I actually get bits on it again, the fpga wouldn't have a bitstream to run as is?)

nitz commented 2 years ago

Once I re-attached the battery it does at least get to the bootloader sig failure again (though I imagine that's handled by the EC, as it's what drives the display right?)

So no real harm (yet 🙃)!

bunnie commented 2 years ago

If the display came back the SoC is at least up and running. I'm trying to find a way for you to diagnose the JTAG pins.

bunnie commented 2 years ago

Can you try this:

pi@betrusted-dev:~/code/betrusted-scripts/jtag-tools $ sudo ./jtag_gpio.py -f jtag.jtg -d -r
Executing .jtg command file: jtag.jtg
DEBUG:root:start: [<JtagLeg.RS: 2>, '0', '0'] () / 0
DEBUG:root:tms reset
DEBUG:root:start: [<JtagLeg.IR: 1>, '001001', ' id'] (IDCODE) /  id
DEBUG:root:result: 0x11
DEBUG:root:start: [<JtagLeg.DR: 0>, '0000000000000000000000000000000000000000000000000000000000000000', ' '] () /
DEBUG:root:result: 0x362f093
DEBUG:root:start: [<JtagLeg.IR: 1>, '110010', ' dna'] (FUSE_DNA) /  dna
DEBUG:root:result: 0x11
DEBUG:root:start: [<JtagLeg.DR: 0>, '0000000000000000000000000000000000000000000000000000000000000000', ' '] () /
DEBUG:root:result: 0x3a131a2a73ad3273
DEBUG:root:start: [<JtagLeg.IR: 1>, '110011', ' user'] (FUSE_USER) /  user
DEBUG:root:result: 0x11
DEBUG:root:start: [<JtagLeg.DR: 0>, '00000000000000000000000000000000', ' '] () /
DEBUG:root:result: 0x0
DEBUG:root:start: [<JtagLeg.IR: 1>, '110001', ' key'] (FUSE_KEY) /  key
DEBUG:root:result: 0x11
DEBUG:root:start: [<JtagLeg.DR: 0>, '0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000', ' '] () /
DEBUG:root:result: 0x0
DEBUG:root:start: [<JtagLeg.IR: 1>, '110100', ' cntl'] (FUSE_CNTL) /  cntl
DEBUG:root:result: 0x11
DEBUG:root:start: [<JtagLeg.DR: 0>, '00000000000000', ' '] () /
DEBUG:root:result: 0xc0

nitz commented 2 years ago

Interestingly after I've reattached the battery, I'm back to the 'ffffff' on the flash device.

Here's the result of that run:

pi@pi400:~/precursor/betrusted-scripts/jtag-tools $ sudo ./jtag_gpio.py -f jtag.jtg -d -r
{'P1_REVISION': 3, 'REVISION': 'c03130', 'TYPE': 'Unknown', 'MANUFACTURER': 'Sony', 'PROCESSOR': 'BCM2711', 'RAM': '4G'}
Using MMIO base: 0xFE000000
Executing .jtg command file: jtag.jtg
DEBUG:root:start: [<JtagLeg.RS: 2>, '0', '0'] () / 0
DEBUG:root:tms reset
DEBUG:root:start: [<JtagLeg.IR: 1>, '001001', ' id'] (IDCODE) /  id
DEBUG:root:result: 0x11
DEBUG:root:start: [<JtagLeg.DR: 0>, '0000000000000000000000000000000000000000000000000000000000000000', ' '] () /
DEBUG:root:result: 0x362f093
DEBUG:root:start: [<JtagLeg.IR: 1>, '110010', ' dna'] (FUSE_DNA) /  dna
DEBUG:root:result: 0x11
DEBUG:root:start: [<JtagLeg.DR: 0>, '0000000000000000000000000000000000000000000000000000000000000000', ' '] () /
DEBUG:root:result: 0x2a157c9854a25c57
DEBUG:root:start: [<JtagLeg.IR: 1>, '110011', ' user'] (FUSE_USER) /  user
DEBUG:root:result: 0x11
DEBUG:root:start: [<JtagLeg.DR: 0>, '00000000000000000000000000000000', ' '] () /
DEBUG:root:result: 0x0
DEBUG:root:start: [<JtagLeg.IR: 1>, '110001', ' key'] (FUSE_KEY) /  key
DEBUG:root:result: 0x11
DEBUG:root:start: [<JtagLeg.DR: 0>, '0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000', ' '] () /
DEBUG:root:result: 0x0
DEBUG:root:start: [<JtagLeg.IR: 1>, '110100', ' cntl'] (FUSE_CNTL) /  cntl
DEBUG:root:result: 0x11
DEBUG:root:start: [<JtagLeg.DR: 0>, '00000000000000', ' '] () /
DEBUG:root:result: 0xc0

Edit to add: I ran it using the different MMIO base addresses too, and the output was the same for each.

bunnie commented 2 years ago

oh. interesting. so, this shows that the JTAG ID and the DNA are reading out just fine. so good news is probably the device is not heavily damaged. hm but the ffffffff error. let me look at that message a bit more...

nitz commented 2 years ago

Oh wait a moment here, I wasn't passing -r when trying to write the soc_csr.bin, but now that I am I'm running into python issues because apparently I installed progressbar2 in user space not root. This may be uh... progress.

nitz commented 2 years ago

Eyyyy!

{'P1_REVISION': 3, 'REVISION': 'c03130', 'TYPE': 'Unknown', 'MANUFACTURER': 'Sony', 'PROCESSOR': 'BCM2711', 'RAM': '4G'}
Using MMIO base: 0xFE000000
Programming .bin file to FPGA: ../../precursors/soc_csr.bin
Using JTAGSPI mode with variant 'xc7s50', to address 0x00000000, verify is on
Using erase block size of 65536 bytes
Using erase block size of 65536 bytes, erasing 40 blocks from 0x00000000 to 0x00280000 (rounded up from 0x00280000)
Erasing 100% (2621440 of 2621440) |#######################################################################################################################################################################################################################| Elapsed Time: 0:00:14 Time:  0:00:14
Programming 100% (2621440 of 2621440) |###################################################################################################################################################################################################################| Elapsed Time: 0:00:07 Time:  0:00:07
Reading back data for verification...
Comparing data...
Programming concluded

Gonna flash the rest and see how it goes.

bunnie commented 2 years ago

eeeeyyyyyy!!! :100:

nitz commented 2 years ago

Alright, looks like flashing each bit is happy. (though it skips verifying the xous image). Still didn't want to boot so I grabbed stable images rather than CI just to see, and flashed again:

pi@pi400:~/precursor/betrusted-scripts $ sudo ./provision_xous.sh
WARNING: this script is for un-bricking devices, and will overwrite any secret keys stored in the gateware
4f774a3b0d607622f5952fada6799523  ../precursors/soc_csr.bin
27f5183070f7bc831b82a284df48d856  ../precursors/loader.bin
e482ef9b1b322bcc1f776687b680ec77  ../precursors/xous.img
This script does a factory reset. You will lose your root keys if you update the FPGA. Proceed? (y/N) y
{'P1_REVISION': 3, 'REVISION': 'c03130', 'TYPE': 'Unknown', 'MANUFACTURER': 'Sony', 'PROCESSOR': 'BCM2711', 'RAM': '4G'}
Using MMIO base: 0xFE000000
Programming .bin file to FPGA: ../../precursors/soc_csr.bin
Using JTAGSPI mode with variant 'xc7s50', to address 0x00000000, verify is on
Using erase block size of 65536 bytes
Using erase block size of 65536 bytes, erasing 40 blocks from 0x00000000 to 0x00280000 (rounded up from 0x00280000)
Erasing 100% (2621440 of 2621440) |######################################################################################| Elapsed Time: 0:00:16 Time:  0:00:16
Programming 100% (2621440 of 2621440) |##################################################################################| Elapsed Time: 0:00:07 Time:  0:00:07
Reading back data for verification...
Comparing data...
Programming concluded
{'P1_REVISION': 3, 'REVISION': 'c03130', 'TYPE': 'Unknown', 'MANUFACTURER': 'Sony', 'PROCESSOR': 'BCM2711', 'RAM': '4G'}
Using MMIO base: 0xFE000000
Programming .bin file to FPGA: ../../precursors/loader.bin
Using JTAGSPI mode with variant 'xc7s50', to address 0x00500000, verify is on
Using erase block size of 65536 bytes
Using erase block size of 65536 bytes, erasing 20 blocks from 0x00500000 to 0x00640000 (rounded up from 0x00639048)
Erasing 100% (6553600 of 6553600) |######################################################################################| Elapsed Time: 0:00:05 Time:  0:00:05
Programming 100% (1282120 of 1282120) |##################################################################################| Elapsed Time: 0:00:03 Time:  0:00:03
Reading back data for verification...
Comparing data...
Programming concluded
{'P1_REVISION': 3, 'REVISION': 'c03130', 'TYPE': 'Unknown', 'MANUFACTURER': 'Sony', 'PROCESSOR': 'BCM2711', 'RAM': '4G'}
Using MMIO base: 0xFE000000
Programming .bin file to FPGA: ../../precursors/xous.img
Using JTAGSPI mode with variant 'xc7s50', to address 0x00980000, verify is off
Using erase block size of 65536 bytes
Using erase block size of 65536 bytes, erasing 94 blocks from 0x00980000 to 0x00f60000 (rounded up from 0x00f5fdb8)
Erasing 100% (16121856 of 16121856) |####################################################################################| Elapsed Time: 0:00:30 Time:  0:00:30
Programming 100% (6159800 of 6159800) |##################################################################################| Elapsed Time: 0:00:17 Time:  0:00:17
Programming concluded

Screen on the precursor is still halting during boot:

Betrusted/Precursor Bootloader v0.2.1
DEVELOPER KEY DETECTED
Signature check failed, powering down.

It was my understanding that the CI builds are signed with the blank key, so I'm guessing that's not the signature it's referring to?

Edit: the spartan7 is actually producing the slightest bit of warmth now, where as it has been sitting cold before, so I'm guessing it's probably actually loading the bitstream.

bunnie commented 2 years ago

OK so... I'm guessing the screen is all black basically, with just that text at the top?

This is failing to do a sigcheck on the loader -- the bottom half of the screen is where the loader prints, and the top half is where the FPGA/SoC's internal loader prints.

There's a couple options here.

Something is somehow borked in the loader image. Can you confirm the md5sum of your loader.bin? it should be 27f5183070f7bc831b82a284df48d856 loader.bin This is the md5sum of the latest release from https://ci.betrusted.io/releases/latest/
Something is borked in the actual SPINOR flash itself that's preventing it from snapping into high speed mode and doing the readout of data (some kind of corruption on the bus maybe?)
- When you received the device, did you try powering it on at all before doing the update? If so, was it able to boot to something beyond that screen?
- That it can load the SoC and also do programming means at least a few of the high speed data pins are wired correctly, but shortly after boot the device goes into an octal-width mode and maybe we're looking at the possibility of one of those having gone bad. However, if it was able to boot to something before when you received it, that seems unlikely.

bunnie commented 2 years ago

Alright I have to head off to bed now -- please let me know what the result is for checking the loader md5sum once you get a chance. If that checks out and you system was able to boot at least once before, if you have the time and patience I might send you some custom soc_csr.bin images to try and wade through what's happening. Basically you're stuck in the ROM bootloader inside the SoC, and it's coded to be "paranoid" and prevent any boots if the loader seems tampered. I'd flip a bit to turn that off and also add some debug spew so we can see a bit more of what's going on.

nitz commented 2 years ago

OK so... I'm guessing the screen is all black basically, with just that text at the top?

Correct! It never makes it to the second stage loader that draws halfway down the screen.

Here's my image sums:

pi@pi400:~/precursor/precursors $ md5sum *.bin
8b8f8d6afa9beefae92a6488a3d744dd  bt-ec.bin
9a311d89dafe317eac77510e5ea6849f  ec_fw.bin
27f5183070f7bc831b82a284df48d856  loader.bin
4f774a3b0d607622f5952fada6799523  soc_csr.bin
9541a3eee2d1062f4e205f4814738ce9  wf200_fw.bin

When you received the device, did you try powering it on at all before doing the update? If so, was it able to boot to something beyond that screen?

Yep! Even after the partial/failed update it still booted and was mostly usable (with the components out of sync, it wouldn't let me use the radio and such.)

Alright I have to head off to bed now -- please let me know what the result is for checking the loader md5sum once you get a chance. If that checks out and you system was able to boot at least once before, if you have the time and patience I might send you some custom soc_csr.bin images to try and wade through what's happening.

Sure thing! I appreciate all the help too. I know the last thing you wanted to hear was someone immediately broke things 😅

In the meantime, I modified jtag_gpio.py again so I could use the --read flag: (looks like it wasn't implicitly converting from strings)

diff --git a/jtag_gpio.py b/jtag_gpio.py
index fa2af21..3b12a7f 100755
--- a/jtag_gpio.py
+++ b/jtag_gpio.py
@@ -548,6 +548,9 @@ def read_spi_bitstream(ofile, jtagspi='xc7s50', address=0, read_len=0x280000, do
     from serialflash import Mx66umFlashDevice
     global jtag_legs

+    address = int(address, 0)
+    read_len = int(read_len, 0)
+
     if address >= 0x1000000:
         print("Only 3-byte addressing supported, address 0x{:0x} too large. Quitting.".format(address))
         exit(1)

I then ran:

pi@pi400:~/precursor/betrusted-scripts/jtag-tools $ ./jtag_gpio.py --read -f ../../precursors/loader-read.bin --read-len 1282120 --read-addr 0x00500000
{'P1_REVISION': 3, 'REVISION': 'c03130', 'TYPE': 'Unknown', 'MANUFACTURER': 'Sony', 'PROCESSOR': 'BCM2711', 'RAM': '4G'}
Using MMIO base: 0xFE000000
Executing .jtg command file: ../../precursors/loader-read.bin
Reading back data...
Read concluded

And sure enough, what I read back doesn't match what I wrote:

pi@pi400:~/precursor/betrusted-scripts/jtag-tools $ md5sum ../../precursors/load*
27f5183070f7bc831b82a284df48d856  ../../precursors/loader.bin
beaf3c307994cf7536eab940a1dc100c  ../../precursors/loader-read.bin

They are close, but it looks like pages of (usually) 0x100 bytes are missing from the read back one. (by "missing", I mean they read all 0xFF, as if they were still erased.)

Not that it'll help at all, but here's the diff of as a hex dump: bindiff.txt

I figured I'd try building the soc images, but I can't clone betrusted-io/betrusted-soc right now -- fails during the recursive clone/submodule update cause LLVM's repo has some sort of redirect loop problem at https://git.llvm.org/git/compiler-rt/ right now. 😂 I'll give it another shot in a few hours.

Always something, isn't it?

bunnie commented 2 years ago

Ugh...I think you can ignore the failing compiler-rt dependency, we don't use it to actually build our image, and it's a relic of the underlying LiteX system. At some point they retired that submodule, and we don't use it, so it probably broke our check-in but it doesn't break our build.

I quick look at the bindiff.txt makes me think the writing process aborted at some point and wasn't successful, so there is some instability in getting data to the device. Will look more when I get back to the keyboard later today.

bunnie commented 2 years ago

It's weird that you're seeing v0.2.1 for the early loader version -- the latest stable version should be v0.2.3. You may need to run reset_soc.sh to get the latest FPGA image to load.

The loader not reading back properly is weird, because there should be a readback verification step right after the burn, but it seems like that passed. But, the signature check isn't passing, so obviously, there's a problem with the loader and I'm inclined to believe that there is a grain of truth to what you're seeing on the loader readback.

Anyways, give the reset soc thing a try and let me know if at least you're seeing the version of the SOC go to the right level...trying to nail down one thing at at time.

bunnie commented 2 years ago

I'll be afk for the next few hours, just fyi.

nitz commented 2 years ago

No worries at all about the delay between our responses! :) I imagine even our timezone difference is enough to make things wonky!

Okay, so. Before I went to bed last night, I disconnected the battery and unplugged the USB. I never like leaving devices in an "unknown" state connected to power (one little office fire has created a habit there haha.) I did leave the debug hat connected because I wasn't powering through that.

This morning I re-connected the battery, then plugged in the USB. (I'm guessing the default state there is similar to sleep ship, where it needs that connection to re-connect the battery. I really like that, and think more products should do it!)

When I did, the display popped on with a mostly solid black, and a few scattered pixels of white. Which was odd, because resetting it last night I was only seeing the loader version/sig fail.

I thought it odd enough, especially with what you're saying about the version number -- even before I did the full wipe I remember seeing v0.2.3, so I definitely had that version on there at some point.

Anyways I booted up my pi and reflashed the loader (sudo ./provision_xous.sh --fpga-skip --kernel-skip --yes), ran the resets, but saw no change.

I used a pin to hit the reset button on the board. The display didn't change, but I could hear a soft "ticking"? Around 100hz or so. Figured it might be the vibe or speaker, so I pulled those out, but the ticking persisted. Disconnected the debug hat, nothing. In fact, it didn't stop til I pulled the battery. I remembered this same ticking when I first got the device and had moved it to the stainless steel case, but it had stopped when I reconnected everything and it booted then.

Anywho, reconnected everything and tried flashing it all again, from latest:

pi@pi400:~/precursor/betrusted-scripts $ md5sum ../precursors/*
8b8f8d6afa9beefae92a6488a3d744dd  ../precursors/bt-ec.bin
9a311d89dafe317eac77510e5ea6849f  ../precursors/ec_fw.bin
27f5183070f7bc831b82a284df48d856  ../precursors/loader.bin
4f774a3b0d607622f5952fada6799523  ../precursors/soc_csr.bin
9541a3eee2d1062f4e205f4814738ce9  ../precursors/wf200_fw.bin
6a5dc47ffd98d50c192ac35e6f1d81ec  ../precursors/wfm_wf200_C0.sec
6f26ad7dbb0c62d0c066e35b063432a8  ../precursors/xous.img
pi@pi400:~/precursor/betrusted-scripts $ sudo ./config_up5k.sh
8b8f8d6afa9beefae92a6488a3d744dd  ../precursors/bt-ec.bin
Attempting to set the QE bit...
QE bit already set, doing nothing
Erasing @ 02d000 / 02e000  Done
Programming @ 02e000 / 02e000  Done
resetting fpga
pi@pi400:~/precursor/betrusted-scripts $ sudo ./wfx_image.sh
Erasing @ 0e6000 / 0e6850  Done
Programming @ 0e6850 / 04a850  Done
resetting fpga
pi@pi400:~/precursor/betrusted-scripts $ sudo ./provision_xous.sh --yes
WARNING: this script is for un-bricking devices, and will overwrite any secret keys stored in the gateware
4f774a3b0d607622f5952fada6799523  ../precursors/soc_csr.bin
27f5183070f7bc831b82a284df48d856  ../precursors/loader.bin
6f26ad7dbb0c62d0c066e35b063432a8  ../precursors/xous.img
{'P1_REVISION': 3, 'REVISION': 'c03130', 'TYPE': 'Unknown', 'MANUFACTURER': 'Sony', 'PROCESSOR': 'BCM2711', 'RAM': '4G'}
Using MMIO base: 0xFE000000
Programming .bin file to FPGA: ../../precursors/soc_csr.bin
Using JTAGSPI mode with variant 'xc7s50', to address 0x00000000, verify is on
Using erase block size of 65536 bytes
Using erase block size of 65536 bytes, erasing 40 blocks from 0x00000000 to 0x00280000 (rounded up from 0x00280000)
Erasing 100% (2621440 of 2621440) |######################################################################################################| Elapsed Time: 0:00:16 Time:  0:00:16
Programming 100% (2621440 of 2621440) |##################################################################################################| Elapsed Time: 0:00:07 Time:  0:00:07
Reading back data for verification...
Comparing data...
Programming concluded
{'P1_REVISION': 3, 'REVISION': 'c03130', 'TYPE': 'Unknown', 'MANUFACTURER': 'Sony', 'PROCESSOR': 'BCM2711', 'RAM': '4G'}
Using MMIO base: 0xFE000000
Programming .bin file to FPGA: ../../precursors/loader.bin
Using JTAGSPI mode with variant 'xc7s50', to address 0x00500000, verify is on
Using erase block size of 65536 bytes
Using erase block size of 65536 bytes, erasing 20 blocks from 0x00500000 to 0x00640000 (rounded up from 0x00639048)
Erasing 100% (6553600 of 6553600) |######################################################################################################| Elapsed Time: 0:00:06 Time:  0:00:06
Programming 100% (1282120 of 1282120) |##################################################################################################| Elapsed Time: 0:00:03 Time:  0:00:03
Reading back data for verification...
Comparing data...
Programming concluded
{'P1_REVISION': 3, 'REVISION': 'c03130', 'TYPE': 'Unknown', 'MANUFACTURER': 'Sony', 'PROCESSOR': 'BCM2711', 'RAM': '4G'}
Using MMIO base: 0xFE000000
Programming .bin file to FPGA: ../../precursors/xous.img
Using JTAGSPI mode with variant 'xc7s50', to address 0x00980000, verify is off
Using erase block size of 65536 bytes
Using erase block size of 65536 bytes, erasing 96 blocks from 0x00980000 to 0x00f80000 (rounded up from 0x00f74194)
Erasing 100% (16252928 of 16252928) |####################################################################################################| Elapsed Time: 0:00:30 Time:  0:00:30
Programming 100% (6242708 of 6242708) |##################################################################################################| Elapsed Time: 0:00:18 Time:  0:00:18
Programming concluded
pi@pi400:~/precursor/betrusted-scripts $ sudo ./reset_ec.sh && sudo ./reset_soc.sh
pi@pi400:~/precursor/betrusted-scripts $

And still just "sparkles" on the display. FPGA has some warmth to it like it's trying to do something at least. I don't really feel any coming off the EC. Reset switch wasn't doing anything so I removed the debug hat, pulled the usb, and popped/reconnected the battery. As soon as I give it the USB back, it goes back to sparkles. I can't figure out what has changed!

nitz commented 2 years ago

Ran through it all again, but when I first connected battery/usb this time, the sparkles on the screen were different. It's probably nothing (i'm just guessing junk data in the display's memory), but I did take photos of the two patterns just in case.

bunnie commented 2 years ago

hm. we might be seeing multiple issues conflated.

Can you connect to the UART on the Rpi and see anything coming out of that when you run uart_fpga.sh; reset_soc.sh, and then uart_up5k.sh; reset_ec.sh? These two commands would sequentially set the UART mux to the FPGA, reset the FPGA, and then set the UART mux to the EC, and reset it.

The "sparkles" on the display is what it looks like when it powers on without any data being written to it.

One possibility is the flex cable is not inserted all the way, or it has been damaged. The FPC is really sensitive to being tugged -- so for example, you said you've unplugged the battery, if you did that and you overshot with force and caused the display to flop over so it is face down, that can be enough energy to cause cable delamination. You can try gently pressing on the FPC joint to the CPU and you may see an image flicker on the cable, that would definitely point to the cable being an issue.

The display is a "memory LCD", which means that previously when you crashed the SoC, it will happily continue displaying whatever you last wrote in it, forever. Each pixel of the memory LCD has basically a single SRAM bit that allows it to retain images even with the host off, for better power savings.

The 100Hz ticking sounds is actually the system in a slightly confused boot state. If you started from an unplugged battery and you plug it in, there can be a race condition between the pull down that shuts down the boost regulator (for USB host mode) and the power on to the rest of the system. So what happens is the boost regulator starts to come up, the system thinks it sees power, tries to power on the system, but then the boost regulator collapses, rinse lather repeat. The ticking sound is from the inductors having their power cut repeatedly. The fix to that is to plug in USB power, after which point the EC kicks in and forces the boost regulator to the correct state instead of relying on a passive pulldown on the pin to enforce it being off.

nitz commented 2 years ago

I re-attached all the FPCs (display and debug) just in case any of it had come jiggled loose in my fumbling. I also swapped out cables I was using, opting for a power only USB-C to power the Precursor, and powering it from a separate supply than I was using to power the Pi.

Resetting it then didn't do anything; but re-flashing xous has it going, booting and everything!

I really think you were right from the very beginning in that it was the FPC being finnicky. I can't think of any time I've seen a case where it'd mostly work but just flop on some parts of the connection where it could end up with half writing to the flash?! This is why I tell everyone I'd rather something fail spectacularly, cause at least then the problem is easier to find 😂

I fetched the stable images, ran ./config_up5k.sh, wfx_image.sh, ./provision_xous.sh --yes. Powered everything down, and re-assembled the keyboard/speaker/vibe/faceplate.

Of note: even with the stable images, the first loader still showed v0.2.3 -- I still can't figure out where that v0.2.1 came from before.

Tested the vibe, speaker, sensors, accel and will fool around with the radio later but I imagine it's peachy.

So I'm up and going on stable, I'm gonna pay extra close attention when I go to update it in the future. I don't plan to generate keys or anything yet, so I'm gonna go through the whole wipe again over USB.

You can certainly close this if you'd like, but I'll be sure to follow up in the future on it either way a future update goes!

I really appreciate your thorough, thoughtful, and timely (I know you must be busy!!) help through all of this. Truly above and beyond. I only wish I had more useful info about how I got myself into this mess to start with. But hey, at least if it happens again, I know what to do!

Guess I'm out of excuses now and I'm actually gonna have to learn Rust like I told myself I would.

bunnie commented 2 years ago

I think I know where the v0.2.1 came from. That's the version that was burned into your device from the factory. That version number would have persisted in the FPGA until you got it to absorb a new gateware image, which requires hitting the reset button on the side or issuing a reset command via the connected Pi (or a cold boot). Furthermore, even a short removal of power from the memory LCD will not erase its last contents (the screen will even go blank, but then when you apply 5V it'll reappear again), so, you may have thought you removed power from the system, and you may have thought the FPGA booted and painted the v0.2.1, but it's entirely possible the FPGA was not loading a new image for a while but it was just the memory LCD showing you its last contents over and over again -- until you unplugged the battery and left for the day and overnight it would have lost its content.

The memory LCD can be very confusing to debug with, because you don't normally expect an LCD to hold an image so well.

I'll go ahead and close this just to keep my work queue tidy. The good news for you is I had budgeted about 1-2 weeks after shipping to do focused customer support and to collect info on all the things that went wrong. People who sit on their units and try to power it up some months later but encounter problems won't have as tight a turn around time, because by then I'll have hopefully resumed work on development and operations, plus travel & conferences.