delphix / appliance-build

This repository contains the code used to build the Ubuntu-based Delphix Appliance, leveraging open-source tools such as Debian's live-build, Docker, Ansible, OpenZFS, and others.
Apache License 2.0

DLPX-85893 run upgrade "execute" script from separate service #731

Closed prakashsurya closed 11 months ago

prakashsurya commented 1 year ago

Problem

When an upgrade fails while running the execute script, we don't have a mechanism for the product to automatically recover.

For example, we've had cases where the OOM killer will kick in, killing the virtualization service while it's running the execute script, causing the entire upgrade to fail and requiring support intervention on the system to recover.

Solution

The solution we've discussed is to run the execute portion of the upgrade in a standalone service, such that it wouldn't be killed when the OOM killer targets the virtualization service, and such that it could be automatically restarted in the event that it dies for some other reason (e.g. a kernel panic).

While this change doesn't provide the full solution, it's a step in that direction. This change extends the execute script so that it performs the additional steps that the upgrade script and/or the virtualization service currently perform after running execute. This way, users of this script can now run execute asynchronously and be assured that everything needed to complete the upgrade will be performed.

The intention is for execute to be used by the virtualization service as it is today, but eventually be called by a yet-to-be-created upgrade service. Until the upgrade service is created, execute can be run via systemd-run, such that it can be started by the virtualization service, but decoupled from the virtualization service's cgroup limits (which will help prevent OOMs).
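As a rough illustration of the systemd-run approach, here is a minimal sketch of how a caller might build and launch the command. The script path, unit name, and helper names are hypothetical and not taken from this change; only the `systemd-run` options shown (`--unit`, `--no-block`, `--collect`) are real flags.

```python
import subprocess
from typing import List, Sequence

# Hypothetical location of the execute script; the real path is not stated here.
EXECUTE_SCRIPT = "/path/to/upgrade/execute"


def build_systemd_run_command(
    script: str = EXECUTE_SCRIPT,
    script_args: Sequence[str] = (),
    unit: str = "upgrade-execute",  # hypothetical transient unit name
) -> List[str]:
    """Build a systemd-run command line that starts `script` as a transient service."""
    return [
        "systemd-run",
        f"--unit={unit}",  # name the transient unit so it can be inspected later
        "--no-block",      # return once the job is queued, without waiting for it
        "--collect",       # unload the unit after it finishes, even if it failed
        script,
        *script_args,
    ]


def launch_execute(script_args: Sequence[str] = ()) -> None:
    """Start execute asynchronously; raises if systemd-run itself fails."""
    subprocess.run(build_systemd_run_command(script_args=script_args), check=True)
```

When invoked against the system service manager without `--slice`, the transient unit should be placed in `system.slice` rather than in the caller's cgroup, which is the decoupling described above. Whether any extra resource properties are needed is left open here.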

Why?

The execute script needed to be modified for a couple of reasons:

  1. The current execute script doesn't handle things like updating the GRUB bootloader, nor does it handle restarting services. So, if we wanted to have a new service to "execute" the upgrade, the service would need to handle these additional steps. IMO, it makes sense to encapsulate all of these steps in execute like I've done here, so that when the new service is created, it can simply call this script.

  2. Likewise, the virtualization service currently calls the execute script and needs to wait for it to complete, so that it can perform the additional tasks mentioned in the point above. By moving these tasks into this script, the virtualization service can now run the script asynchronously via systemd-run (since it no longer needs to wait for the script to complete), which allows us to solve the problem incrementally.
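The encapsulation argued for in the two points above can be sketched as a single entry point that runs every step in order and stops at the first failure, so a caller can start it and walk away. This is an illustrative Python sketch only: the actual execute script is not shown here, and the step names and command paths below are placeholders.

```python
import subprocess
from typing import Callable, List, Sequence, Tuple

Step = Tuple[str, Sequence[str]]        # (step name, command line)
Runner = Callable[[Sequence[str]], int]  # runs a command, returns its exit code


def run_command(argv: Sequence[str]) -> int:
    """Default runner: execute the command and return its exit code."""
    return subprocess.run(list(argv)).returncode


def run_steps(steps: Sequence[Step], runner: Runner = run_command) -> List[str]:
    """Run each step in order; raise on the first non-zero exit code."""
    completed: List[str] = []
    for name, argv in steps:
        if runner(argv) != 0:
            raise RuntimeError(f"step {name!r} failed; completed so far: {completed}")
        completed.append(name)
    return completed


# Placeholder steps mirroring the tasks named above; the commands are hypothetical.
UPGRADE_STEPS: List[Step] = [
    ("apply-upgrade", ["/path/to/apply-upgrade"]),
    ("update-bootloader", ["/path/to/update-bootloader"]),
    ("restart-services", ["/path/to/restart-services"]),
]
```

Because the caller no longer performs any follow-up work itself, it has nothing to wait for, which is what makes the asynchronous launch possible.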

Testing

Related Work

prakashsurya commented 1 year ago

@seb @palashgandhi I think now that the min version is set to 6.0.17, we can re-integrate this change.

prakashsurya commented 1 year ago

Actually, never mind. The min version is still 6.0.15, so closing until we get to 6.0.17.