delphix / appliance-build

This repository contains the code used to build the Ubuntu-based Delphix Appliance, leveraging open-source tools such as Debian's live-build, Docker, Ansible, OpenZFS, and others.
Apache License 2.0

DLPX-88573 Perform deferred upgrade prior to VDB downtime for FULL upgrade #742

Closed prakashsurya closed 11 months ago

prakashsurya commented 12 months ago

Context

We intend to modify the upgrade logic so that a FULL upgrade does not reboot until after a stack restart. This lets us wait to quiesce VDBs until we're running the stack on the new version. For more context on the motivation for doing this, see CP-10570.

Problem

Currently, when the "execute" script is run for a FULL upgrade, it automatically reboots the system after packages are upgraded. This conflicts with the goals of CP-10570, as "execute" is run from the "old version", rather than the "new version".

Solution

This PR modifies "execute" to only restart the delphix services, regardless of whether a FULL or DEFERRED upgrade is requested. This means that, as far as the scripts are concerned, there is no longer any difference between a FULL and a DEFERRED upgrade. With the accompanying virtualization changes, however, the required reboot for a FULL upgrade will now be performed by the virtualization product's upgrade logic instead.

The intention is for the virtualization product's upgrade logic to run the "execute" script to perform the package upgrades as necessary, and restart the application. Then, when the application starts back up on the new version, it'll detect at stack startup time that a FULL upgrade was in progress, and automatically initiate the necessary logic to quiesce VDBs and perform the reboot.
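
To make the intended ordering concrete, here is a minimal Python sketch of the flow described above. It is an illustration only, not the actual scripts or the product's Java logic: the `Engine` class, the `pending_upgrade_type` field, and both function names are hypothetical. In particular, how the application detects that a FULL upgrade was in progress is not described here, so the sketch simply assumes a recorded flag.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Engine:
    """Hypothetical stand-in for the appliance; records actions instead of performing them."""
    # Assumption: the requested upgrade type is recorded somewhere the
    # application can read it at startup ("FULL", "DEFERRED", or None).
    pending_upgrade_type: Optional[str] = None
    actions: List[str] = field(default_factory=list)


def on_stack_startup(engine: Engine) -> None:
    """Runs on the NEW version, after the services restart."""
    if engine.pending_upgrade_type == "FULL":
        # Only a FULL upgrade needs VDBs quiesced and a reboot.
        engine.actions.append("quiesce VDBs")
        engine.actions.append("reboot")
    engine.pending_upgrade_type = None


def run_execute(engine: Engine, upgrade_type: str) -> None:
    """Models "execute" after this PR: same steps for FULL and DEFERRED."""
    engine.actions.append("upgrade packages")
    engine.pending_upgrade_type = upgrade_type
    # No reboot here any more; only the services are restarted.
    engine.actions.append("restart delphix services")
    on_stack_startup(engine)


full = Engine()
run_execute(full, "FULL")

deferred = Engine()
run_execute(deferred, "DEFERRED")
```

In this model, `full.actions` ends with the quiesce and reboot steps, while `deferred.actions` stops after the service restart. The script itself behaves identically in both cases.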

One caveat to the approach taken in this PR is that any consumers that happened to be using the upgrade scripts to perform a FULL upgrade will now need to manually reboot the system themselves. Arguably this fixes a bug, since previously a FULL upgrade via the scripts would not quiesce VDBs, and thus the reboot could cause problems for VDBs; i.e. it's now up to the user to quiesce VDBs after the upgrade, and then perform the reboot.

Related Work

prakashsurya commented 11 months ago

For my knowledge, how does one quiesce VDBs while the stack is down?

you can't.. by "user" in that statement, I meant whatever happens to be orchestrating the upgrade.. which is the product's upgrade Java logic in this case.

If a FULL upgrade fails for any reason right after the execute script finishes, what could the recovery look like? We'd have to presumably now depend on the stack to successfully start such that VDBs are quiesced before the engine is rebooted which may not always be dependable as we know from recent escalations.

it'll depend exactly where the failure occurs.. but generally, if execute is run, but doesn't complete.. the fix should be to re-run execute manually, as that script is idempotent.. VDBs will be quiesced and a reboot triggered on stack start up, after execute runs to completion, and restarts the mgmt service..

in the worst of cases, where the stack doesn't come up, a "hard" reboot should be fine.. it'd be no different than a kernel crash.. sure, perhaps VDBs might not behave properly due to not having been quiesced, but that can happen at any point outside of upgrade, via a kernel panic..
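
As a rough illustration of the recovery path described in the replies above, the following Python sketch models the decision order: re-run "execute" until it completes (safe because the script is idempotent), then rely on stack startup to quiesce VDBs and reboot, falling back to a "hard" reboot only if the stack does not come up. The function name, the callables, the retry limit, and the step labels are all hypothetical. What to do if "execute" never completes is not covered in this thread, so the sketch just reports it.

```python
from typing import Callable, List


def recover_full_upgrade(
    run_execute: Callable[[], bool],
    stack_came_up: Callable[[], bool],
    max_attempts: int = 3,  # assumption: an arbitrary retry limit
) -> List[str]:
    """Return the recovery steps taken, in order."""
    steps: List[str] = []
    for attempt in range(1, max_attempts + 1):
        # Re-running is safe because "execute" is idempotent.
        steps.append(f"run execute (attempt {attempt})")
        if run_execute():
            break
    else:
        # Not covered by the discussion; reported rather than guessed at.
        steps.append("execute did not complete")
        return steps

    if stack_came_up():
        steps.append("stack quiesces VDBs and reboots")
    else:
        # Worst case: comparable to a kernel crash, VDBs are not quiesced.
        steps.append("hard reboot")
    return steps


# Example: "execute" fails once, then completes, and the stack starts.
outcomes = iter([False, True])
steps_ok = recover_full_upgrade(lambda: next(outcomes), lambda: True)

# Example: "execute" completes but the stack never comes up.
steps_hard = recover_full_upgrade(lambda: True, lambda: False)
```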