Delta upgrade issue when using cboot

nielsavonds commented 3 years ago

Hi,

We recently started working on a new version of our device based on the Xavier NX module. By default, the meta-tegra layer configures this module to use cboot instead of UBoot.

To make this work with Mender, there is the custom libubootenv-fake scripts that are installed on the rootfs. However, when using delta updates, these are not sufficient. For more context, see https://hub.mender.io/t/delta-updates-checksum-mismatch/3312

Basically, what happens, is that the binary delta installer looks at mender_boot_part to determine which partition it should use as its base for applying the delta update. However, in the fake fw_printenv script, this will return nvbootctrl get-current-slot, essentially toggling between 0 or 1. The delta update then fails to apply because the checksum mismatches.

Would it be possible to modify this script to return the correct rootfs partition based on the slot? Or do you think this would trigger other issues? If you see no problems with this, I can try to make a pull request.

Thanks in advance, Best regards, Niels Avonds

dwalkes commented 3 years ago

Thanks @nielsavonds

However, in the fake fw_printenv script, this will return nvbootctrl get-current-slot, essentially toggling between 0 or 1. The delta update then fails to apply because the checksum mismatches.

I would expect fw_printev and nvbootctrl get-current-slot to return the current slot you are booting/running from. So for the example in https://hub.mender.io/t/delta-updates-checksum-mismatch/3312/8

 where /dev/mmcblk0p11 is the currently active partition

{
    "RootfsPartA": "/dev/mmcblk0p1",
    "RootfsPartB": "/dev/mmcblk0p11"
}

I would have expected nvbootctrl get-current-slot to return 1 in this case. So it seems like you've gotten out of sync with your rootfs/boot slot combo. We've had issues with this on uboot based builds, see https://github.com/OE4T/meta-mender-community/pull/8 for instance, but I haven't noticed this on cboot with with https://github.com/OE4T/tegra-demo-distro since we added the test script script at https://github.com/mendersoftware/meta-mender-community/tree/dunfell/meta-mender-tegra/scripts/test.

The script at https://github.com/mendersoftware/meta-mender-community/tree/dunfell/meta-mender-tegra/scripts/test is supposed to catch these issues although I'm not sure of the last time I ran it on the Xavier NX config. It was probably the last time I did a PR to meta-mender-community. See logic at https://github.com/mendersoftware/meta-mender-community/blob/dunfell/meta-mender-tegra/scripts/test/mender_tegra_test.py#L57. Based on what you've written here I'm guessing this script will fail with the combo you are using. Can you confirm?

Also note that the implementation for these scripts is changing from dunfell to later branches as we move to https://github.com/OE4T/tegra-boot-tools. See https://github.com/OE4T/tegra-demo-distro/commit/99c60c6c91bccb5ad855ecf042786ef526e84746 for instance. So we'll want to take this into account with any fix/PR for dunfell. We might want to start with a PR into https://github.com/OE4T/tegra-demo-distro and https://github.com/OE4T/meta-tegra as needed and make sure we have a combination which is working there with the torture test script.

Tagging @madisongh in case he has comments or other suggestions.

madisongh commented 3 years ago

It sounds like the delta updater is relying on a sequence where the mender client or the updater itself is using fw_setenv mender_boot_part to set the new boot partition before the delta updater does its checks. If so, then we need to rework our fake utilities to allow that to work (while, hopefully, not breaking anything else). Right now, the only variable that the fake fw_setenv does anything about is upgrade_available, since in the normal (non-delta) sequence, that's the only variable that needed it.

nielsavonds commented 3 years ago

Hi guys, thanks for your quick comments.

I think there's a slight misunderstanding here. The result of nvbootctrl get-current-slot was correct in all cases. My comments in the linked forum discussion came from different test runs, where the last one was running on /dev/mmcblk0p1 and not /dev/mmcblk0p11. So I would expect nvbootctrl get-current-slot to return 0 in that case.

The issue, however, is that fw_printenv mender_boot_part then also returns '0', the number of the slot, instead of '1', the partition number that Mender expects.

Meanwhile, I've been able to get delta upgrades working on our side by overriding the fw_printenv script in our layer. The new script looks like this (where the two variables at the top get replaced using sed commands in do_install):

#!/bin/sh
BOOT_PART_0=@MENDER_ROOTFS_PART_A_NUMBER@
BOOT_PART_1=@MENDER_ROOTFS_PART_B_NUMBER@

quiet=
if [ "$1" = "-n" ]; then
    quiet="yes"
    shift
    if [ -z "$1" ]; then
        echo "ERR: missing var name with -n" >&2
        exit 1
    fi
fi
if [ -z "$1" ]; then
    boot_slot=`nvbootctrl get-current-slot`
    [[ $boot_slot = 0 ]] && boot_part="$BOOT_PART_0" || boot_part="$BOOT_PART_1"
    echo "mender_boot_part=$boot_part"
    exit 0
fi
while [ -n "$1" ]; do
    case "$1" in
        mender_boot_part|mender_boot_part_hex)
            boot_slot=`nvbootctrl get-current-slot`
            [[ $boot_slot = 0 ]] && boot_part="$BOOT_PART_0" || boot_part="$BOOT_PART_1"
            [ -n "$quiet" ] || echo "$1=$boot_part"
            ;;
        mender_uboot_separator)
            [ -n "$quiet" ] || echo -n "$1="
            echo "something other than just 1"
            ;;
        upgrade_available)
            [ -n "$quiet" ] || echo -n "$1="
        if [ -e "/var/lib/mender/upgrade_available" ]; then
        echo "1"
        else
        echo "0"
        fi
           ;;
        *)
           echo "ERR: no such variable: $1" >&2
           exit 1
    esac
    shift
done

This way, fw_printenv mender_boot_part will return either 1 or 11, depending on the active boot slot.

dwalkes commented 3 years ago

Thanks for explaining @nielsavonds.

The new script looks like this (where the two variables at the top get replaced using sed commands in do_install):

Should the variables be set based on the content of the mender.conf file instead as done in https://github.com/mendersoftware/meta-mender-community/blob/dunfell/meta-mender-tegra/recipes-mender/tegra-state-scripts/files/redundant-boot-install-script#L35? That way the configuration changes will match whatever was used when the device was manufactured, assuming /var/lib/mender is mapped to the /data partition.

nielsavonds commented 3 years ago

Thanks Dan, you're right, that would be a better approach. Here's my second attempt. I've also fix mender_boot_part_hex in this one. I've also done some fixes in the other case statements to make sure the "quiet" parameter works properly.

#!/bin/sh
LABELCHARS="AB"

get_bootpart() {
    current_slot=`nvbootctrl get-current-slot`
    cfglbl="\"RootfsPart${LABELCHARS:$current_slot:1}\""
    devnam=`grep -h "$cfglbl:" /etc/mender/mender.conf /var/lib/mender/mender.conf | tr -d '" ,' | grep -o '[0-9]\+$'`
    echo "$devnam"
}

quiet=
if [ "$1" = "-n" ]; then
    quiet="yes"
    shift
    if [ -z "$1" ]; then
        echo "ERR: missing var name with -n" >&2
        exit 1
    fi
fi
if [ -z "$1" ]; then
    bootpart=`get_bootpart`
    echo "mender_boot_part=$bootpart"
    exit 0
fi
while [ -n "$1" ]; do
    case "$1" in
        mender_boot_part)
            bootpart=`get_bootpart`
            [ -n "$quiet" ] || echo "$1=$bootpart"
            ;;
        mender_boot_part_hex)
            bootpart=`get_bootpart`
            bootpart_hex=`echo 16o${bootpart}p | dc`
            [ -n "$quiet" ] || echo "$1=$bootpart_hex"
            ;;
        mender_uboot_separator)
            [ -n "$quiet" ] || echo "$1=something other than just 1"
            ;;
        upgrade_available)
            if [ -e "/var/lib/mender/upgrade_available" ]; then
                upgrade_available="1"
            else
                upgrade_available="0"
            fi
            [ -n "$quiet" ] || echo "$1=$upgrade_available"
           ;;
        *)
           echo "ERR: no such variable: $1" >&2
           exit 1
    esac
    shift
done

madisongh commented 3 years ago

The issue, however, is that fw_printenv mender_boot_part then also returns '0', the number of the slot, instead of '1', the partition number that Mender expects.

OK, I see. That's certainly simpler to emulate!

dwalkes commented 3 years ago

Thanks @nielsavonds looks good to me. Could you prepare a PR for this and, if possible, could you check to see if it still works with https://github.com/OE4T/tegra-demo-distro/ dunfell-l4t-r32.4.3 branch and the test script at https://github.com/mendersoftware/meta-mender-community/blob/dunfell/meta-mender-tegra/scripts/test/mender_tegra_test.py#L57 ?

nielsavonds commented 3 years ago

I've created the pull request here: https://github.com/OE4T/meta-mender-community/pull/11 Unfortunately, I'm not able to test with the tegra-demo-distro at this time.

dwalkes commented 3 years ago

Upstreamed with https://github.com/mendersoftware/meta-mender-community/pull/221

OE4T / meta-mender-community

Delta upgrade issue when using cboot #10