AsteroidOS / meta-asteroid

OpenEmbedded layer that provides the basis of AsteroidOS
GNU General Public License v2.0

Bricking issues caused by `mkboot.bbclass` #175

Closed. argosphil closed this issue 1 year ago

argosphil commented 1 year ago

This is an attempt to summarize the bug identified by @FlorentRevest as the likely cause of the recent bricking issues. Any mistakes are likely to be entirely my fault.

mkboot.bbclass (as well as mkbootimg.bbclass and abootimg.bbclass) contains the following code:

do_deploy:append() {
    ...
    install -d ${D}/${KERNEL_IMAGEDEST}
    install -m 0644 ${B}/boot.img ${D}/${KERNEL_IMAGEDEST}
}

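Note that while this is a do_deploy rule, the file is created in ${D}, not ${DEPLOYDIR}. ${D} is the staging directory that do_package builds packages from, so anything left there can end up in a package.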

Further code:

pkg_postinst_ontarget:${KERNEL_PACKAGE_NAME}-image:append () {
    if [ ! -e /boot/boot.img ] ; then
        # if the boot image is not available here something went wrong and we don't
        # continue with anything that can be dangerous
        exit 1
    fi

    BOOT_PARTITION_NAMES="LNX boot KERNEL"
    for i in $BOOT_PARTITION_NAMES; do
        path=$(find /dev -name "*$i*"|grep disk| head -n 1)
        [ -n "$path" ] && break
    done

    if [ -z "$path" ] ; then
        echo "Boot partition does not exist!"
        exit 1
    fi

    echo "Flashing the new kernel /boot/boot.img to $path"
    dd if=/boot/boot.img of=$path
}

Note that the pattern passed to find is *$i*, which matches "aboot" when $i is "boot".
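To make the pattern problem concrete, here is a minimal sketch that runs the same kind of `find` against a temporary directory instead of /dev. The directory and the file names in it (`abootbak`, `system`) are made up for illustration; only `aboot` and `boot` come from the discussion above.

```shell
#!/bin/sh
# Hypothetical stand-in for the partition label directory; names are illustrative.
demo_dir=$(mktemp -d)
touch "$demo_dir/aboot" "$demo_dir/boot" "$demo_dir/abootbak" "$demo_dir/system"

# Wildcard pattern as in the postinst script: any name containing "boot".
# -type f keeps the temporary directory itself out of the results.
wildcard_hits=$(find "$demo_dir" -type f -name "*boot*" | wc -l | tr -d ' ')

# Exact pattern: only a name equal to "boot".
exact_hits=$(find "$demo_dir" -type f -name "boot" | wc -l | tr -d ' ')

echo "wildcard matches: $wildcard_hits"
echo "exact matches: $exact_hits"
rm -rf "$demo_dir"
```

With these four files the wildcard pattern hits three names (aboot, boot, abootbak) and the exact pattern hits one, so `head -n 1` on the wildcard results depends entirely on the order in which find happens to list them.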

What happens in an ordinary build is this:

  1. linux-sparrow:do_package is called and generates an empty kernel-image package
  2. linux-sparrow:do_deploy is called and modifies the ${D} dir by putting a boot.img file in it
  3. during installation, the fastboot image is manually flashed to the boot partition
  4. after installation, there is a successful boot and the postinst handlers are run
  5. this specific postinst handler looks for /boot/boot.img, doesn't find it, and exits
  6. everything is fine

But what appears to happen sometimes is this:

  1. linux-sparrow:do_package is called and generates an empty kernel-image package
  2. linux-sparrow:do_deploy is called and modifies the ${D} dir by putting a boot.img file in it
  3. for some unknown reason, linux-sparrow:do_package is called again and generates a kernel-image package containing /boot/boot.img
  4. during installation, the fastboot image is manually flashed to the boot partition
  5. after installation, there is a successful boot and the postinst handlers are run
  6. this specific postinst handler looks for /boot/boot.img, finds it, and looks for a partition to install it to
  7. the find pattern matches both aboot and boot. Since busybox find doesn't sort its output alphabetically, aboot isn't guaranteed to be the first hit, but it can be
  8. dd if=/boot/boot.img of=/dev/disk/by-partlabel/aboot is called and overwrites the wrong partition
  9. the device is bricked
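The difference between the two sequences can be modelled with a toy script. This is not how BitBake works internally; it only assumes that "packaging" picks up whatever files are sitting in the staging directory at the time it runs. The `package` and `deploy` functions are hypothetical stand-ins for do_package and the do_deploy:append shown earlier.

```shell
#!/bin/sh
# Toy model only: temporary directories stand in for ${D} and ${B}.
D=$(mktemp -d)
B=$(mktemp -d)
echo "fake boot image" > "$B/boot.img"

# "Packaging" here just lists the files currently staged in $D.
package() { (cd "$D" && find . -type f | sort); }

# Mirrors the problematic append: writes into $D, not a deploy directory.
deploy() {
    install -d "$D/boot"
    install -m 0644 "$B/boot.img" "$D/boot"
}

first_package=$(package)    # runs before deploy: nothing staged yet
deploy
second_package=$(package)   # re-run after deploy: boot.img is now picked up

echo "first run:  [$first_package]"
echo "second run: [$second_package]"
rm -rf "$D" "$B"
```

The first run lists nothing, while the second lists ./boot/boot.img, which is the state in which the postinst handler would find /boot/boot.img on the device and go on to flash it.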

Again, the discovery isn't mine at all, I'm just trying to write this down since it's a bit long for the chat. Mistakes entirely mine.

The next step would be to figure out what the unknown reason is. BitBake's documentation isn't clear on this, but I think linux-sparrow:do_install empties ${D}, so a boot.img surviving in ${D} would suggest do_install is not rerun but linux-sparrow:do_package is. I'd like to know how that can happen, so we can fix the bugs above (use "$i" as an exact pattern, include boot_a in the list of potential partitions, don't install boot.img into ${D}) and verify the bug doesn't reappear.
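As a rough sketch of what the partition lookup could look like with exact matching, something along these lines might work. The function name `find_boot_partition`, the position of boot_a in the list, and the idea of checking a single label directory (such as /dev/disk/by-partlabel) are all assumptions for illustration, not the actual fix.

```shell
#!/bin/sh
# find_boot_partition DIR: print the first exact label match found in DIR.
# Hypothetical helper; the label order is an assumption.
find_boot_partition() {
    label_dir=$1
    for label in LNX boot boot_a KERNEL; do
        # Exact name test: "aboot" can never satisfy a lookup for "boot".
        if [ -e "$label_dir/$label" ]; then
            echo "$label_dir/$label"
            return 0
        fi
    done
    return 1
}

# Demo against a temporary directory instead of real device nodes.
demo_dir=$(mktemp -d)
touch "$demo_dir/aboot" "$demo_dir/boot"
found=$(find_boot_partition "$demo_dir")
echo "would flash: $found"

# With only aboot present, the lookup fails instead of picking aboot.
rm "$demo_dir/boot"
if find_boot_partition "$demo_dir" >/dev/null; then
    only_aboot="matched"
else
    only_aboot="no match"
fi
echo "only aboot present: $only_aboot"
rm -rf "$demo_dir"
```

The point of the design is that a missing boot partition leads to a refusal to flash, rather than to a fallback onto whatever name happens to contain the string.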

I hope this is somehow helpful, but if it isn't feel free to close the issue.

argosphil commented 1 year ago

One commit which triggers this (linux-sparrow:do_package is rerun without linux-sparrow:do_install being run first) is oe-core's https://github.com/openembedded/openembedded-core/commit/80839835ec9fcb63069289225a3c1af257ffdef7, which modifies the package script itself. Note that it is unlikely to be the specific cause in our case, because it didn't land at quite the right time.

ETA: I meant to say it's probably a similar commit which triggered this bug. I am convinced it is our bug and it was triggered by a similar commit.