delphix / appliance-build

This repository contains the code used to build the Ubuntu-based Delphix Appliance, leveraging open-source tools such as Debian's live-build, Docker, Ansible, OpenZFS, and others.
Apache License 2.0
19 stars 41 forks source link

DLPX-89392 Race between reboot and grub install #747

Closed prakashsurya closed 9 months ago

prakashsurya commented 9 months ago

Problem

With the help of @grwilson, we think the issue is a race between the upgrade script’s execute script calling systemctl reboot, and the virtualization’s upgrade logic calling rootfs-container set-bootfs, which modifies/updates GRUB.

Specifically, the virtualization logic on 16.0 and earlier, looks like this:

    try {
      ExecuteUtils.executeSudo(script, "-p", UpgradeManagerImpl.getPlatform());
    } catch (NonZeroExitException e) {
      throw new VersionSwitchFailed(e, e.getMessage());
    }
    this.trySetBootFs(UpgradeUtility.getMountedRootFSContainerDataset().lastElement());

i.e. run the execute script synchronously, and then update GRUB synchronously after that.

This was OK, up until the changes we made in 17.0 via DLPX-85893 .. in those changes, we modified the execute script to call exec systemctl reboot, when previously it would not.

Thus, I believe the problem is as follows:

  1. virtualization calls execute
  2. execute calls exec systemctl reboot
  3. systemctl reboot returns success, while reboot hasn’t happened yet
  4. execute returns control back to virtualization
  5. virtualization then calls trySetBootFs which runs rootfs-container set-bootfs container to update GRUB
  6. while grub is in the middle of being modified (from 5), the reboot occurs (from 3)

Since step 5 couldn’t run to completion, GRUB is left in a “corrupted” state, and the reboot isn’t able to boot up successfully.

Affected Versions

Since we didn’t add the systemctl reboot to the execute until 17.0, I think this issue is limited to upgrades TO 17.0 and greater, FROM 16.0 or less.

I don’t believe upgrades FROM 17.0 are affected (e.g. 17.0 to 18.0), because the virtualization upgrade code was modified to run the execute script asynchronously. Thus, while the virtualization logic will still run trySetBootFs to update GRUB, it’s unlikely those modifications to GRUB will conflict/race with the reboot triggered by the execute script. The execute script can take a long time to run, so in practice I’d expect the virtualization logic to finish its modifications to GRUB, prior to execute initiating the reboot.

I also don’t think upgrade FROM 18.0 and greater are affected (e.g. 18.0 to 19.0), because the execute script will not initiate a reboot when upgrading from these versions, due to changes made in DLPX-88573. The reboot will only occur on versions FROM 17.0 or less.

Workaround

The workaround for any affected versions, would be to perform a DEFERRED upgrade (perhaps immediately followed by a FINISH DEFERRED upgrade), and not use FULL upgrades.

Solution

I think the proper solution for this, would is to modify the execute script, such that it never returns, and instead waits for the reboot; this way, the virtualization logic is unable to call trySetBootFs.

Related Work

Testing