This repository contains the code used to build the Ubuntu-based Delphix Appliance, leveraging open-source tools such as Debian's live-build, Docker, Ansible, OpenZFS, and others.
Apache License 2.0
DLPX-85893 run upgrade "execute" script from separate service #731
Problem
When an upgrade fails while running the execute script, we don't have
a mechanism for the product to automatically recover.
For example, we've had cases where the OOM killer kicked in and killed
the virtualization service while it was running the execute script,
causing the entire upgrade to fail and requiring support intervention on
the system to recover.
Solution
The solution we've discussed is to run the execute portion of the
upgrade in a standalone service, such that it wouldn't be killed when
the OOM killer targets the virtualization service, and such that it
could be automatically restarted in the event that it dies for some
other reason (e.g. a kernel panic).
While this change doesn't provide the full solution, it's a step in that
direction. This change extends the execute script such that it performs
the additional steps that the upgrade script and/or the virtualization
service currently perform after they run execute. This way, callers
of this script can now run execute asynchronously and be assured that
everything needed to complete the upgrade will be performed.
The intention is for execute to be used by the virtualization service
as it is today, but eventually be called by a yet-to-be-created upgrade
service. Until the upgrade service is created, execute can be run via
systemd-run, such that it can be started by the virtualization service,
but decoupled from the virtualization service's cgroup limits (which will
help prevent OOMs).
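As a rough illustration of the systemd-run approach described above, here is a minimal sketch. The helper name `run_execute_detached`, the unit name `upgrade-execute`, and the `DRY_RUN` switch are hypothetical and not taken from this change; `--unit` and `--no-block` are standard systemd-run options. Because systemd-run starts the command as a transient service unit of its own, the process runs in that unit's cgroup rather than in the caller's.

```shell
#!/bin/bash
# Sketch only: the function name, unit name, and DRY_RUN switch are
# assumptions made for illustration, not part of the actual change.

run_execute_detached() {
    local execute_path="$1"
    local unit_name="${2:-upgrade-execute}"

    # --unit names the transient service unit; --no-block makes
    # systemd-run return as soon as the start job is queued, so the
    # caller does not wait for the script to finish.
    set -- systemd-run --unit="$unit_name" --no-block "$execute_path"

    if [ "${DRY_RUN:-0}" = "1" ]; then
        # Print the command instead of running it.
        echo "$*"
    else
        "$@"
    fi
}
```

For example, `DRY_RUN=1 run_execute_detached /path/to/execute` prints the command that would be run, without starting anything; the path here is a placeholder.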
Why?
The execute script needed to be modified for a couple of reasons:
 1. The current execute script doesn't handle things like updating the
    GRUB bootloader, nor does it handle restarting services. So, if we
    wanted to have a new service to "execute" the upgrade, the service
    would need to handle these additional steps. IMO, it makes sense to
    encapsulate all of these steps in execute like I've done here, so
    that when the new service is created, it can simply call this
    script.
 2. Likewise, the virtualization service currently calls the execute
    script and needs to wait for it to complete, so that it can perform
    the additional tasks mentioned in the point above. By moving these
    tasks into this script, the virtualization service can now run the
    script asynchronously via systemd-run (since it no longer needs to
    wait for the script to complete), which allows us to solve the
    problem incrementally.
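The idea of encapsulating every post-upgrade step in one script can be sketched roughly as follows. This is not the actual execute script: the step names `apply_upgrade`, `update_bootloader`, and `restart_services` are hypothetical placeholders that only print a message, and `run_steps` is an illustrative helper that runs each step in order and stops at the first failure.

```shell
#!/bin/bash
# Sketch only: the step functions below are placeholders and do not
# reflect what the real execute script does internally.

apply_upgrade()     { echo "apply upgrade"; }
update_bootloader() { echo "update GRUB"; }
restart_services()  { echo "restart services"; }

# Run each named step in order; stop and return non-zero as soon as
# one step fails, so later steps never run on a half-applied upgrade.
run_steps() {
    local step
    for step in "$@"; do
        if ! "$step"; then
            echo "step failed: $step" >&2
            return 1
        fi
    done
}
```

A caller would then invoke `run_steps apply_upgrade update_bootloader restart_services` and only needs to check a single exit status, which is what makes it practical to run the whole sequence asynchronously.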
Testing
git-ab-pre-push is here
Related Work