betrusted-io / betrusted-wiki

Looking for docs on Precursor/Betrusted? Start here.
https://github.com/betrusted-io/betrusted-wiki/wiki
Apache License 2.0
173 stars 6 forks source link

Updating device failed #20

Closed coder543 closed 2 years ago

coder543 commented 2 years ago

I was using usbipd to redirect the USB device from Windows into the WSL2 environment in order to run factory_reset.sh. I recognized that this might have some risks beyond doing it natively in Windows, but setting up the environment needed on Windows was something I wanted to avoid if possible. I used a very high quality USB-C cable that I typically use for high speed data transfers from my Sony camera.

An interesting consequence of usbipd is that you have to reattach the USB device to WSL2 every time it disconnects. The script would get through one phase, then fail because the device was gone. So, for the sake of simplicity, I wrote a three line BAT script to infinitely keep reattaching the device to WSL2, which resolved this issue.

For reasons I haven't figured out, it was able to flash the loader, and xous.img, but it timed out while writing soc_csr.bin. It erased it successfully, but then only wrote 35% before erroring out.

Unfortunately, this has left the device in a state where factory_reset.sh -s complains that... Could not find a valid csr.csv descriptor on the device, aborting!.

I see in the Wiki that I could go through the full Failsafe update, but that legitimately looks like a lot of hassle. I'm sure I'll get around to it eventually if needed, but...

My actual question: I am curious whether the device is in a state that actually requires a Failsafe update, or whether the factory_reset.sh script simply doesn't handle this case yet? The device actually boots up into the new firmware (it has the updated GUI that wasn't present in the factory build), and I can issue commands on the device... nothing seems soft-bricked. It feels like the factory_reset.sh script just needs to barge on and ignore the lack of a csr.csv descriptor in a case like this, but I've been using the device for a total of like 30 minutes, so I'm definitely speaking from a place of ignorance.

Full output of the time that it failed:

$ sudo ./factory_reset.sh -s
Updating to latest stabilized release
--2022-03-18 17:28:01--  https://ci.betrusted.io/releases/latest/loader.bin
Resolving ci.betrusted.io (ci.betrusted.io)... 132.147.127.39
Connecting to ci.betrusted.io (ci.betrusted.io)|132.147.127.39|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1282120 (1.2M) [application/octet-stream]
Saving to: ‘/tmp/loader.bin’

/tmp/loader.bin               100%[=================================================>]   1.22M   575KB/s    in 2.2s

2022-03-18 17:28:05 (575 KB/s) - ‘/tmp/loader.bin’ saved [1282120/1282120]

Using SoC v0.9.7-0-ga5d09b9 registers
Halting CPU.
Programming loader image /tmp/loader.bin
ID code bytes 1-2: 0x8080c2c2
ID code bytes 2-3: 0x3b3b8080
Erasing 100% (1282120 of 1282120) |##############################################| Elapsed Time: 0:00:04 Time:  0:00:04
Erase finished
Writing 100% (1282120 of 1282120) |##############################################| Elapsed Time: 0:00:26 Time:  0:00:26
Write finished
Skipped verification at user request
Resuming CPU.
Resetting SOC...
waiting for device to reboot
--2022-03-18 17:28:42--  https://ci.betrusted.io/releases/latest/xous.img
Resolving ci.betrusted.io (ci.betrusted.io)... 132.147.127.39
Connecting to ci.betrusted.io (ci.betrusted.io)|132.147.127.39|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 6159800 (5.9M) [application/octet-stream]
Saving to: ‘/tmp/xous.img’

/tmp/xous.img                 100%[=================================================>]   5.87M   979KB/s    in 6.1s

2022-03-18 17:28:50 (979 KB/s) - ‘/tmp/xous.img’ saved [6159800/6159800]

Using SoC v0.9.7-0-ga5d09b9 registers
Halting CPU.
Programming kernel image /tmp/xous.img
ID code bytes 1-2: 0x8080c2c2
ID code bytes 2-3: 0x3b3b8080
Erasing 100% (6159800 of 6159800) |##############################################| Elapsed Time: 0:00:23 Time:  0:00:23
Erase finished
Writing 100% (6159800 of 6159800) |##############################################| Elapsed Time: 0:02:04 Time:  0:02:04
Write finished
Skipped verification at user request
Resuming CPU.
Resetting SOC...
waiting for device to reboot
--2022-03-18 17:31:22--  https://ci.betrusted.io/releases/latest/soc_csr.bin
Resolving ci.betrusted.io (ci.betrusted.io)... 132.147.127.39
Connecting to ci.betrusted.io (ci.betrusted.io)|132.147.127.39|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2621440 (2.5M) [application/octet-stream]
Saving to: ‘/tmp/soc_csr.bin’

/tmp/soc_csr.bin              100%[=================================================>]   2.50M   870KB/s    in 2.9s

2022-03-18 17:31:27 (870 KB/s) - ‘/tmp/soc_csr.bin’ saved [2621440/2621440]

Using SoC v0.9.7-0-ga5d09b9 registers
Halting CPU.
Programming SoC gateware /tmp/soc_csr.bin
ID code bytes 1-2: 0x8080c2c2
ID code bytes 2-3: 0x3b3b8080
Erasing 100% (2621440 of 2621440) |##############################################| Elapsed Time: 0:00:09 Time:  0:00:09
Erase finished
Writing  35% (936704 of 2621440) |################                               | Elapsed Time: 0:00:19 ETA:   0:00:34Traceback (most recent call last):
  File "./usb_update.py", line 635, in <module>
    main()
  File "./usb_update.py", line 585, in main
    pc_usb.flash_program(locs['LOC_SOC'][0], image, verify=verify)
  File "./usb_update.py", line 356, in flash_program
    status = self.flash_rdsr(1)
  File "./usb_update.py", line 154, in flash_rdsr
    return self.peek(self.register('spinor_cmd_rbk_data'), display=False)
  File "./usb_update.py", line 37, in peek
    numread = self.dev.ctrl_transfer(bmRequestType=(0x80 | 0x43), bRequest=0,
  File "/usr/local/lib/python3.8/dist-packages/usb/core.py", line 1082, in ctrl_transfer
    ret = self._ctx.backend.ctrl_transfer(
  File "/usr/local/lib/python3.8/dist-packages/usb/backend/libusb1.py", line 893, in ctrl_transfer
    ret = _check(self.lib.libusb_control_transfer(
  File "/usr/local/lib/python3.8/dist-packages/usb/backend/libusb1.py", line 602, in _check
    raise USBTimeoutError(_strerror(ret), ret, _libusb_errno[ret])
usb.core.USBTimeoutError: [Errno 110] Operation timed out
Writing 100% (2621440 of 2621440) |##############################################| Elapsed Time: 0:00:22 Time:  0:00:22

Attempting to run the command again:

$ sudo ./factory_reset.sh -s
Updating to latest stabilized release
--2022-03-18 17:55:23--  https://ci.betrusted.io/releases/latest/loader.bin
Resolving ci.betrusted.io (ci.betrusted.io)... 132.147.127.39
Connecting to ci.betrusted.io (ci.betrusted.io)|132.147.127.39|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1282120 (1.2M) [application/octet-stream]
Saving to: ‘/tmp/loader.bin’

/tmp/loader.bin               100%[=================================================>]   1.22M   583KB/s    in 2.1s

2022-03-18 17:55:27 (583 KB/s) - ‘/tmp/loader.bin’ saved [1282120/1282120]

Could not find a valid csr.csv descriptor on the device, aborting!
bunnie commented 2 years ago

If you haven't cut power, try running this sequence of commands:

wget https://ci.betrusted.io/releases/latest/soc_csr.bin -O /tmp/soc_csr.bin
./usb_update.py --soc /tmp/soc_csr.bin --force
rm /tmp/soc_csr.bin

This should get your soc_csr.bin into a good state by forcing the write. If you cut power though, you'll have to do the failsafe method. Sorry for the hassle...I'll be able to respond more later today, on the road at the moment.

coder543 commented 2 years ago

I haven't cut power, so I'm happy to try that out. Unfortunately, the result seems the same.

After wget-ing the file successfully:

$ sudo ./usb_update.py --soc /tmp/soc_csr.bin --force
Could not find a valid csr.csv descriptor on the device, aborting!
coder543 commented 2 years ago

I could certainly comment out the digest check that is causing the script to fail, if you think that's a good idea, but I don't want to mess things up by being overly eager if there's a chance of avoiding the failsafe method.

bunnie commented 2 years ago

Humm... I'm on mobile right now. it would not hurt your device to leave it plugged into USB, the charger is autonomous from the soc core. I can respond properly in about 4 hours.

can you try putting the '--force' argument first in the argument list? it should not be position sensitive but it's also weird its getting stuck there. if you look inside the python script it should be using a backup set of register locations embedded in the script instead of looking for the csv file embedded in the device when the force argument is specified...

thanks,

-b.

On March 19, 2022 8:42:31 AM GMT+08:00, Josh Leverette @.***> wrote:

I could certainly comment out the digest check that is causing the script to fail, if you think that's a good idea, but I don't want to mess things up by being overly eager if there's a chance of avoiding the failsafe method.

-- Reply to this email directly or view it on GitHub: https://github.com/betrusted-io/betrusted-wiki/issues/20#issuecomment-1072905451 You are receiving this because you commented.

Message ID: @.***>

coder543 commented 2 years ago

The control flow path is something like this:

def main():
    parser = argparse.ArgumentParser(description="Update/upload to a Precursor device running Xous 0.8/0.9")
    parser.add_argument(
        "--soc", required=False, help="'Factory Reset' the SoC gateware. Note: this will overwrite any secret keys stored in your device!", type=str, nargs='?', metavar=('SoC gateware file'), const='../precursors/soc_csr.bin'
    )
    # [...]

    # !!! this is where things go wrong, it calls `load_csrs()` before checking the `force` argument
    pc_usb.load_csrs() # prime the CSR values
    if "v0.8" in pc_usb.gitrev:
        # [...]
    elif "v0.9" in pc_usb.gitrev:
        # [...]
    elif args.force == True:
        # try the v0.9 offsets
        locs = {
           "LOC_SOC"    : [0x00000000, "soc_csr.bin"],
           "LOC_STAGING": [0x00280000, "pass"],
           "LOC_LOADER" : [0x00500000, "loader.bin"],
           "LOC_KERNEL" : [0x00980000, "xous.img"],
           "LOC_WF200"  : [0x07F80000, "pass"],
           "LOC_EC"     : [0x07FCE000, "pass"],
           "LOC_AUDIO"  : [0x06340000, "short_8khz.wav"],
           "LEN_AUDIO"  : [0x01C40000, "pass"],
           "LOC_PDDB"   : [0x01D80000, "pass"],
        }
    else:
        print("SoC is from an unknow rev '{}', use --force to continue anyways with v0.9 firmware offsets".format(pc_usb.load_csrs()))
        exit(1)
    def load_csrs(self):
        LOC_CSRCSV = 0x20277000 # this address shouldn't change because it's how we figure out our version number

        csr_data = self.burst_read(LOC_CSRCSV, 0x8000)
        hasher = hashlib.sha512()
        hasher.update(csr_data[:0x7FC0])
        digest = hasher.digest()
        if digest != csr_data[0x7fc0:]:
            print("Could not find a valid csr.csv descriptor on the device, aborting!")
            exit(1)
            # !!! then, we exit the program here because the csr.csv descriptor is invalid
        # [...]

I'll definitely leave the Precursor plugged in. It's already 8pm here, so I might be asleep before you're properly online, but we'll see. I definitely appreciate you getting back to my question as quickly as you have.

bunnie commented 2 years ago

I just realized that your problem is not that the programming script can't find the csr area, it's that they don't exist because they got erased. I will need some time to replicate your problem and test a patch. if you can keep in plugged in overnight I will do my best to get you a tested hot fix for this by the time you wake up.

basically I will bundle a full copy of the intended csr/csv set and some sanity checks as the work around to your problem.

bunnie commented 2 years ago

Ok, I hopefully have a fix for you.

Please pull the latest main and run the following commands:

wget https://ci.betrusted.io/releases/latest/soc_csr.bin
python .\usb_update.py --soc soc_csr.bin --force --override-csr .\csr_data_0.9.7.bin

If this is successful, clean up the soc_csr.bin file with rm soc_csr.bin.

What this should do is tell the updating script to ignore whatever is inside the FLASH, and use this pre-specified copied of the CSRs. Based on the logs it looks like you previously had 0.9.7 in there, so this should match what you're running on in the CPU. If you get some weird error, maybe try 0.9.5...because that's what the device shipped with. But I tested this just now on a device where I aborted the burn after an erase, and was able to replicate your error and also fix the error.

I've also redone the factory update script so that the device boot is atomic with the update by partially erasing the loader, leaving the device in a "bad signature" state which holds the CPU in a loop so it doesn't do naughty things like try to watchdog reset itself because the FPGA is inconsistent with the OS, and only after all the artifacts are burned does it simultaneously burn the FPGA and a good loader copy in the same invocation of the script.

Anyways -- I hope this gets you out of your pickle. Sorry for the troubles.

coder543 commented 2 years ago

Ok, that worked great, as far as I can tell!

After that, I performed the factory_reset.sh -s again, which succeeded. Then I used a paperclip as instructed, then applied power to boot it up again. After that, I ran ecup auto on device, which seemed to start applying the staged updates. I looked away for a minute and looked back to see it had a blank screen and was unresponsive, which was mildly confusing, but if I've learned anything about this device... connecting it to power is usually the next step, so I did that and it started up. Since I was unsure if it actually applied successfully, I ran ecup auto again expecting it to instantly say there was no new update to apply, but it went through the process again, so I watched more closely and saw it say at the end that it was shutting down. During the ecup auto, it printed out that I should run ecup reset, so after booting up again, I ran that. Then I used the paperclip again for good measure and applied power again.

Everything seems fine, so I believe it all worked. (As you may notice that I used the paperclip twice, I thought the instructions there were a little confusing as to when you're supposed to use it. After running ecup auto, I believe the correct time to use it is after ecup auto and ecup reset, but I'm not sure if ecup reset is even needed.)

On an unrelated note, just a quick question: one thing I find confusing is that there's only a sleep mode, no menu item to shut down completely. Since this is a development kit, there's a good chance that I will not be using it continuously forever, which leaves the option of letting it just run out of battery in a drawer when I'm not actively using it, or opening it up and physically disconnecting the battery, or... maybe the paperclip turns the device off completely?

Thanks for coming up with a solution so quickly! It was probably my fault in the first place for trying to do this under WSL2.

bunnie commented 2 years ago

I actually test under WSL2 and ubuntu, so, probably more likely this is indicative of a shortcoming of my test environment than anything else. Thanks for confirming it worked for you. i'm going to close this for now, if you have a new issue, please feel free to open a new issue!