This repository contains the code used to build the Ubuntu-based Delphix Appliance, leveraging open-source tools such as Debian's live-build, Docker, Ansible, OpenZFS, and others.
Apache License 2.0
19
stars
41
forks
source link
DLPX-89392 Race between reboot and grub install #747
With the help of @grwilson, we think the issue is a race between the upgrade script’s execute script calling systemctl reboot, and the virtualization’s upgrade logic calling rootfs-container set-bootfs, which modifies/updates GRUB.
Specifically, the virtualization logic on 16.0 and earlier, looks like this:
i.e. run the execute script synchronously, and then update GRUB synchronously after that.
This was OK, up until the changes we made in 17.0 via DLPX-85893 .. in those changes, we modified the execute script to call exec systemctl reboot, when previously it would not.
Thus, I believe the problem is as follows:
virtualization calls execute
execute calls exec systemctl reboot
systemctl reboot returns success, while reboot hasn’t happened yet
execute returns control back to virtualization
virtualization then calls trySetBootFs which runs rootfs-container set-bootfs container to update GRUB
while grub is in the middle of being modified (from 5), the reboot occurs (from 3)
Since step 5 couldn’t run to completion, GRUB is left in a “corrupted” state, and the reboot isn’t able to boot up successfully.
Affected Versions
Since we didn’t add the systemctl reboot to the execute until 17.0, I think this issue is limited to upgrades TO 17.0 and greater, FROM 16.0 or less.
I don’t believe upgrades FROM 17.0 are affected (e.g. 17.0 to 18.0), because the virtualization upgrade code was modified to run the execute script asynchronously. Thus, while the virtualization logic will still run trySetBootFs to update GRUB, it’s unlikely those modifications to GRUB will conflict/race with the reboot triggered by the execute script. The execute script can take a long time to run, so in practice I’d expect the virtualization logic to finish its modifications to GRUB, prior to execute initiating the reboot.
I also don’t think upgrade FROM 18.0 and greater are affected (e.g. 18.0 to 19.0), because the execute script will not initiate a reboot when upgrading from these versions, due to changes made in DLPX-88573. The reboot will only occur on versions FROM 17.0 or less.
Workaround
The workaround for any affected versions, would be to perform a DEFERRED upgrade (perhaps immediately followed by a FINISH DEFERRED upgrade), and not use FULL upgrades.
Solution
I think the proper solution for this, would is to modify the execute script, such that it never returns, and instead waits for the reboot; this way, the virtualization logic is unable to call trySetBootFs.
Problem
With the help of @grwilson, we think the issue is a race between the upgrade script’s
execute
script callingsystemctl reboot
, and the virtualization’s upgrade logic callingrootfs-container set-bootfs
, which modifies/updates GRUB.Specifically, the virtualization logic on 16.0 and earlier, looks like this:
i.e. run the
execute
script synchronously, and then update GRUB synchronously after that.This was OK, up until the changes we made in 17.0 via DLPX-85893 .. in those changes, we modified the
execute
script to callexec systemctl reboot
, when previously it would not.Thus, I believe the problem is as follows:
execute
execute
callsexec systemctl reboot
systemctl reboot
returns success, while reboot hasn’t happened yetexecute
returns control back to virtualizationtrySetBootFs
which runsrootfs-container set-bootfs
container to update GRUBSince step 5 couldn’t run to completion, GRUB is left in a “corrupted” state, and the reboot isn’t able to boot up successfully.
Affected Versions
Since we didn’t add the
systemctl reboot
to theexecute
until 17.0, I think this issue is limited to upgrades TO 17.0 and greater, FROM 16.0 or less.I don’t believe upgrades FROM 17.0 are affected (e.g. 17.0 to 18.0), because the virtualization upgrade code was modified to run the
execute
script asynchronously. Thus, while the virtualization logic will still runtrySetBootFs
to update GRUB, it’s unlikely those modifications to GRUB will conflict/race with the reboot triggered by theexecute
script. Theexecute
script can take a long time to run, so in practice I’d expect the virtualization logic to finish its modifications to GRUB, prior toexecute
initiating the reboot.I also don’t think upgrade FROM 18.0 and greater are affected (e.g. 18.0 to 19.0), because the execute script will not initiate a reboot when upgrading from these versions, due to changes made in DLPX-88573. The reboot will only occur on versions FROM 17.0 or less.
Workaround
The workaround for any affected versions, would be to perform a DEFERRED upgrade (perhaps immediately followed by a FINISH DEFERRED upgrade), and not use FULL upgrades.
Solution
I think the proper solution for this, would is to modify the execute script, such that it never returns, and instead waits for the reboot; this way, the virtualization logic is unable to call
trySetBootFs
.Related Work
Testing
git-ab-pre-push
is here