OE4T / tegra-boot-tools

Boot-related tools for Tegra platforms
MIT License

Many upgrade failures with "ERR: cannot perform bootloader update" #29

Closed nielsavonds closed 1 year ago

nielsavonds commented 1 year ago

Hi guys,

When we upgrade our devices, we're seeing a lot of failures with the following message: ERR: cannot perform bootloader update

To capture the output of tegra-bootloader-update, we then created a build with a patched Mender script. Here's the patch for reference:

--- meta-mender-community/meta-mender-tegra/recipes-mender/tegra-state-scripts/files/redundant-boot-install-script-uboot        2023-06-29 16:33:18.398067637 +0200
+++ meta-nobi/meta-nobi-tegra/recipes-mender/tegra-state-scripts/files/redundant-boot-install-script-uboot      2023-07-02 16:21:23.247559416 +0200
@@ -60,7 +60,7 @@
     # If the tool reports that the version partitions are corrupted, this is an update on a tegra210
     # device with the old partition layout where the U-Boot environment overwrote the version partition(s),
     # in which case we recover via complete initialization.
-    if chroot "${mnt}" /usr/bin/tegra-bootloader-update --dry-run /opt/ota_package/bl_update_payload 2>&1 | grep -q 'version partitions are corrupted'; then
+    if chroot "${mnt}" /usr/bin/tegra-bootloader-update --dry-run /opt/ota_package/bl_update_payload 2>&1 | tee /tmp/bl_update_output | grep -q 'version partitions are corrupted'; then
        # For the recoverable case, we will have also detected a change the U-Boot environment change
        if [ -n "$install_fwenv" ]; then
            echo "Detected bootloader partition upgrade, reinitializing" >&2
@@ -76,11 +76,15 @@
        fi
     else
        echo "ERR: cannot perform bootloader update" >&2
+       echo "tegra-bootloader-update output:" >&2
+       cat /tmp/bl_update_output >&2
        cleanup
        exit 1
     fi
-elif ! chroot "${mnt}" /usr/bin/tegra-bootloader-update /opt/ota_package/bl_update_payload; then
+elif ! chroot "${mnt}" /usr/bin/tegra-bootloader-update /opt/ota_package/bl_update_payload > /tmp/bl_update_output; then
     echo "ERR: bootloader update failed" >&2
+    echo "tegra-bootloader-update output:" >&2
+    cat /tmp/bl_update_output >&2
     cleanup
     exit 1
 fi

We then get the following output in Mender:

ERR: cannot perform bootloader update
tegra-bootloader-update output:
/opt/ota_package/bl_update_payload: Cannot allocate memory

This led me to the following lines in tegra-bootloader-update.c. The perror call below is what prints the error:

    bupctx = bup_init(argv[optind]);
    if (bupctx == NULL) {
        perror(argv[optind]);
        return 1;
    }

Digging into bup_init I found this:

#define BUFFERSIZE (1024 * 1024 * 1024)
[...]
ctx->buffer = malloc(BUFFERSIZE);

I've tried to work out what this buffer is used for, and as far as I can tell it only holds the BUP payload file after it is loaded into memory. I'm wondering why a buffer of 1GiB is allocated to do that? Can we reduce this size safely?
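
The link between the failed allocation and the logged message can be sketched in a few lines of C. This is only an illustration, not code from tegra-boot-tools: format_alloc_error is a hypothetical helper that builds the same "<path>: <reason>" string that perror(path) writes to stderr, and it assumes the failing malloc left errno set to ENOMEM, as POSIX specifies. The wording of the reason comes from strerror; glibc uses "Cannot allocate memory", and other C libraries may word it differently.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: build the message that perror(path) would write to
 * stderr for the given errno value, i.e. "<path>: <strerror(err)>".
 * Returns the number of characters snprintf would have written.
 */
int format_alloc_error(char *out, size_t outlen, const char *path, int err)
{
    return snprintf(out, outlen, "%s: %s", path, strerror(err));
}
```

Calling it with the payload path and ENOMEM on a glibc system should reproduce the line shown in the Mender log above, which is consistent with the malloc in bup_init being the call that fails.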

Thanks! Niels

madisongh commented 1 year ago

> I'm wondering why a buffer of 1GiB is allocated to do that? Can we reduce this size safely?

That's in the original code, and I think I used that as a "temporary" value for the size until I worked out better logic for it. I should have marked it with FIXME or XXX to remind me to do that, but didn't, so it turned out not to be so temporary after all.

It should be safe to use stat() to get the actual size of the BUP contents, then use that for allocation. If you can manage a patch to do that, great; otherwise, I'll get to it when I have some time.
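
The stat()-based sizing suggested here could look roughly like the sketch below. It is not the actual bup_init code and not necessarily what the eventual patch does: load_file_sized is a hypothetical helper, it uses fstat() on the open descriptor rather than stat() on the path, and it assumes the payload is a non-empty regular file (anything else is rejected with EINVAL, which is a choice made for this example).

```c
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: read a whole regular file into a buffer whose size
 * is taken from fstat() instead of a fixed 1 GiB allocation.
 * On success returns the buffer (caller frees it) and stores its length in
 * *sizep. On failure returns NULL with errno describing the failing step.
 */
void *load_file_sized(const char *path, size_t *sizep)
{
    struct stat st;
    unsigned char *buf = NULL;
    size_t size, total = 0;
    int fd, saved_errno;

    fd = open(path, O_RDONLY);
    if (fd < 0)
        return NULL;
    if (fstat(fd, &st) < 0)
        goto fail;
    /* Assumption for this sketch: only non-empty regular files are valid. */
    if (!S_ISREG(st.st_mode) || st.st_size <= 0) {
        errno = EINVAL;
        goto fail;
    }
    size = (size_t) st.st_size;
    buf = malloc(size);            /* allocate exactly what the file needs */
    if (buf == NULL)
        goto fail;
    while (total < size) {
        ssize_t n = read(fd, buf + total, size - total);
        if (n < 0) {
            if (errno == EINTR)
                continue;          /* interrupted, retry the read */
            goto fail;
        }
        if (n == 0) {              /* file is shorter than fstat() reported */
            errno = EIO;
            goto fail;
        }
        total += (size_t) n;
    }
    close(fd);
    *sizep = size;
    return buf;

fail:
    saved_errno = errno;           /* keep the original error for perror() */
    free(buf);
    close(fd);
    errno = saved_errno;
    return NULL;
}
```

Because errno is preserved on the failure path, a caller can keep reporting errors with perror() as tegra-bootloader-update.c does today, while the allocation shrinks from 1 GiB to the size of the payload itself.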

nielsavonds commented 1 year ago

I've opened a pull request (https://github.com/OE4T/tegra-boot-tools/pull/30) so you can review the code, but haven't tested it yet. I'll test it in the coming week and report back with the results.

Thanks for the prompt response!

nielsavonds commented 1 year ago

I've now tested it on a TX2, TX2 NX, Xavier NX and Jetson Nano and all have successfully updated.

madisongh commented 1 year ago

Thanks for the PR, which looks good. I've merged #30 and run some further tests, and I'll release a new version soon.