coreos / fedora-coreos-tracker

Issue tracker for Fedora CoreOS
https://fedoraproject.org/coreos/

Determine how to handle automatic rollback #47

Open ajeddeloh opened 6 years ago

ajeddeloh commented 6 years ago

We want to bring forward Container Linux's automatic rollback model and probably extend it even further. Automatic rollbacks can't solve every problem (since in some cases it may mean downgrading something like docker, which is an unsupported operation), but they work well to protect against kernel issues and other such problems.

CL's model currently uses the GPT attribute bits to record whether a partition has been tried and whether it was successfully booted. On a successful boot, update_engine waits 45 seconds and then marks the boot as successful.
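For context, those GPT attribute bits on CL are manipulated with the `cgpt` tool; a rough illustration follows (the partition index and values here are made up for the example, not CL's actual update flow):

```sh
# Inspect the per-partition priority/tries/successful attributes.
cgpt show /dev/sda

# Hypothetical example: mark partition 3 as highest priority with one try
# remaining and not-yet-successful (-P priority, -T tries, -S successful).
cgpt add -i 3 -P 2 -T 1 -S 0 /dev/sda
```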

We're not using A/B partitions in FCOS, so we can't use the GPT priority bits (and I think we shouldn't regardless, but that's beside the point).

Ostree currently does not support automatic rollback (@cgwalters please correct me if I'm wrong), so we'll need to implement it.

Note: I'm going to use the term "install" to mean an ostree/kernel combo for FCOS (the equivalent for CL would be a kernel/USR-partition combo).

Essentially there are four states an install can be in (in order of boot preference):

1. Untested (just installed)
2. Successful (most recent)
3. Successful (fallback, mostly for manual recovery)
4. Failed

Goals I think we ought to strive for:

* Handling all boot failures, including kernel failures
* Allowing users to decide what a "successful" boot means (related: [greenboot](https://github.com/LorbusChris/greenboot))
* Avoid "flapping" where the OS gets stuck in a loop of "install bad image, reboot, fail, reboot"
* Never be in a state where it is unclear what the correct choice to boot is (even if power is lost mid-update)
* Allow manual selection of installs if necessary
* With the exception of the first boot, always try to keep two successful installs around (to avoid problems like [coreos/bugs#2457 (comment)](https://github.com/coreos/bugs/issues/2457#issuecomment-397469282))

My proposal:

So I think we should use flag files in /boot like we do for Ignition. When creating a new install, its kernel gets written to /boot along with two flag files: untested and failed. There should only ever be one install with both flags. Additionally, there should be a flag file "recent" which indicates which install to boot in the case of two successful installs.

Here is a table of what combinations of flags mean what:

| Untested | Failed | Recent | Meaning |
|----------|--------|--------|---------|
| N | N | N | Successful boot, fallback |
| N | N | Y | Successful boot, most recent |
| N | Y | N | Tried, unsuccessful boot, last resort only |
| N | Y | Y | Successful boot (probably), most recent; power most likely lost before the failed flag was removed |
| Y | N | N | Impossible. Machine should die. Something went horribly wrong |
| Y | N | Y | Impossible. Machine should die. Something went horribly wrong |
| Y | Y | N | Just installed, boot this if available |
| Y | Y | Y | Impossible. Machine should die. Something went horribly wrong |

The grub config should select installs in this order:

1. installs with both untested and failed flags
2. installs with the recent flag
3. installs with no flags
4. installs with just the failed flag

When grub selects one it immediately removes the untested flag. On a successful boot a systemd unit (tbd: integrate this with greenboot?) adds the recent flag, removes the recent flag from the old entry, then removes the failed flag.
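A rough sketch of the userspace half of that (the /boot layout, flag file names, and paths below are assumptions for illustration, not an existing tool):

```sh
#!/bin/bash
# Hypothetical success-marking script, run by a systemd unit once the boot
# is considered good. The layout under /boot/installs/ is assumed.
set -euo pipefail

booted=/boot/installs/current      # the install we just booted (assumed path)
previous=/boot/installs/previous   # the prior "recent" install (assumed path)

# Promote the booted install to "most recent successful".
touch "${booted}/recent"
rm -f "${previous}/recent"

# Clear "failed" last, so a crash part-way through still leaves a state
# the table above can interpret unambiguously.
sync
rm -f "${booted}/failed"
```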

This proposal does hinge on grub being able to delete files, which I haven't confirmed yet. It also means ostree wouldn't need to write out any grub configs at all, just empty files.

Edit: hrmmm. Grub doesn't seem to be able to write to or delete files. That makes the whole "recover from a bad kernel" bit hard.

Thoughts?

cc @cgwalters and @jlebon for the ostree bits and @bgilbert to keep me honest about how CL works.

dustymabe commented 6 years ago

cc @LorbusChris

jlebon commented 6 years ago

For more context on this, automatic rollback was one of the main objectives of the GSOC project that @LorbusChris took part in. Coincidentally, there was work at the same time in both grub2 and systemd to support boot counting (see https://github.com/systemd/systemd/pull/9437 and https://github.com/rhboot/grub2/pull/24).

One of the outcomes was that we could standardize on boot-complete.target as the target the system needs to reach to be considered a good boot regardless of the bootloader used. So hopefully we'd be able to exploit grub2's new grubenv boot counter support and greenboot here.
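As a sketch of how that could fit together (hedged: the unit name here is invented, and the boot_success/boot_counter variable names follow the grub2 boot-counting work rather than anything decided in this issue):

```ini
# /etc/systemd/system/mark-boot-successful.service  (hypothetical name)
[Unit]
Description=Record a successful boot in the GRUB environment block
Requires=boot-complete.target
After=boot-complete.target

[Service]
Type=oneshot
# Variable names assumed from the grub2 boot-counting patches.
ExecStart=/usr/bin/grub2-editenv - set boot_success=1
ExecStart=/usr/bin/grub2-editenv - unset boot_counter

[Install]
WantedBy=multi-user.target
```

Roughly speaking, the grub2 side of the boot-counting work decrements boot_counter on each boot attempt and falls back to a previous entry once it runs out.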

ajeddeloh commented 6 years ago

Huh, didn't know about grubenv. Looks very useful! I'm fine with boot-complete.target. My proposal could be amended to write out the state to the grubenv instead.

I'm still a fan of:

dustymabe commented 6 years ago

a lot of discussion about boot counting and success determination happened over in the greenboot GSOC project for fedora-iot: https://pagure.io/fedora-iot/issue/12

Specifically, I think we got it down to two variables for state that needed to be tracked. Here is a high-level state diagram we ended up with.

dustymabe commented 6 years ago

Goals I think we ought to strive for:

* Handling all boot failures, including kernel failures.

+1

* Allowing users to decide what a "successful" boot means (related: [greenboot](https://github.com/LorbusChris/greenboot))

+1, but we can ship some defaults I think that will help

* Avoid "flapping" where the OS gets stuck in a loop of "install bad image, reboot, fail, reboot"

+1 we'll need to add some logic so that the updater knows an update failed and it shouldn't retry it

* Never be in a state where it is unclear what the correct choice to boot is (even if power is lost mid-update)

I think in greenboot we decided to always try to boot the last known successful option

* Allow manual selection of installs if necessary

+1, we already have this today

* With the exception of the first boot, always try to keep two successful installs around (to avoid problems like [coreos/bugs#2457 (comment)](https://github.com/coreos/bugs/issues/2457#issuecomment-397469282))

we do keep two deployments around today, but they aren't guaranteed to be 'successful installs'. I.e. if you attempt an upgrade, it fails, and you roll back, then you'll only have two deployments, but one of them will be "bad". I'd say that's the one exception.

ajeddeloh commented 6 years ago

Regarding that state diagram: looks good (although I don't know if there's much point in a boot counter starting at 2; if it fails once, that's probably a good reason not to try again), but we need to figure out what to do when picking between multiple successful entries (both to make sure the correct one is gc'd and to know which one to boot). I also don't think we should ever be in a state where we have only one successful install (other than first boot), i.e. if an install fails, don't gc the successful one.

jlebon commented 6 years ago

I also don't think we should ever be in a state where we have only one successful install (other than first boot), i.e. if an install fails, don't gc the successful one.

I'll just make a note that we discussed this during the community meeting, and we concluded that the OSTree model is less susceptible to issues like coreos/bugs#2457 since we don't actually GC until we successfully prepare the next root (before reboot).

But of course, preparing the next root successfully != guaranteed successful boot in that root. There is still a risk that we have a deployment which successfully prepares the root (and cleans up the previous deployment), but borks it in a subtle enough way that it's not actually bootable.

cgwalters commented 6 years ago

guaranteed successful boot in that root

One other thing to keep in mind here is that today, rpm-ostree hardcodes running /bin/true in the new root before staging/deployment.

We could easily support generalizing this to running arbitrary code in the new root as a container before even trying to boot it for real.

Of course, if you're getting your ostree commits from any OS vendor that doesn't suck, they should have been tested server side. And if you're using package layering, you're going to end up running scripts in the new root which do more than /bin/true.

But - the capability is there for us to do something more sophisticated if we wanted to.

cgwalters commented 6 years ago

My opinion on this is that until we have a design that has automated tests and has been carefully audited for correctness, we shouldn't ship anything here. I have a short term PR to disable the current grub2 behavior that I think we should also apply to FAH29.

cgwalters commented 6 years ago

There was a thread on the XFS list about the grubenv approach: https://marc.info/?l=linux-xfs&m=153740791327073&w=2

Migrated to include linux-fsdevel: https://marc.info/?l=linux-fsdevel&m=153741350128439&w=2

TL;DR: The filesystem developers are against it.

ajeddeloh commented 6 years ago

My opinion on this is that until we have a design that has automated tests and has been carefully audited for correctness, we shouldn't ship anything here.

I'm inclined to agree.

The filesystem developers are against [grub-env]

That is... unfortunate. It looks like that's one of the only places grub can write to, and grub does need to write something if we want to handle failures where we can't get to (working) userspace. What really sucks is that in the worst case we only need 9 bits total: per install, grub only needs to write a tries count of 0, 1, or 2 (2 bits) plus a priority bit, and there can be at most 3 installs. Ugh.

I'm not sure what exactly to do about that. We're left with three options:

1. Throw out the ability to roll back in some cases (BOOOOO)
2. Use grub-env and hope it's OK
3. Roll our own grub-env-like thing (we can find 1k of space somewhere off-filesystem for it) and hack up grub to support that. Hell, we could stick it in the embedding area if we really wanted.

Option 1 is a non-starter for me; it defeats the point. Option 2 isn't great but might work as a stopgap; if we started with 2 and planned to move to 3, we'd also need a migration plan. Option 3 is also not great, because writing good bootloader code is hard and error-prone.

ajeddeloh commented 6 years ago

Proposal:

1. Use grub-env to maintain metadata about deployments. The grub-env will be the source of truth about the status of the deployments.
2. The metadata stored in the grub-env can be used to determine an ordering of boot preference. Both grub and ostree use the same logic to convert from metadata to ordering.
3. Instead of ostree keeping an ordered list of deployments, there are three "slots" that deployments can be installed to. The slots are unordered. When a new deployment happens (e.g. on upgrade) the "worst" (i.e. oldest or failed) slot is overwritten.
4. grub uses the metadata to pick the best slot to boot (a rough sketch of this selection logic follows below).
5. The grub menu lists an option to autoselect the best slot in addition to manual entries (like what CL does today).
6. ostree or some other userspace utility can edit grub-env to manually roll back.
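To make the grub side concrete, here is a minimal sketch of what the static selection logic might look like, assuming three slots and made-up `slot_*_status` grubenv variables (nothing here is shipped code):

```
# Hypothetical fragment of a static grub.cfg
load_env

# Start from the worst acceptable choice; better candidates override it.
set default="slot_c"

if [ "${slot_b_status}" = "successful_recent" ]; then
  set default="slot_b"
fi

if [ "${slot_a_status}" = "untested" ]; then
  # Record the attempt *before* booting, so a kernel hang or panic leaves
  # the slot marked failed and the next boot falls back automatically.
  set slot_a_status="failed"
  save_env slot_a_status
  set default="slot_a"
fi

# "default" would reference menuentry --id values defined elsewhere in the config.
```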

Next steps/problems to solve:

Misc notes:

dustymabe commented 6 years ago

Instead of ostree keeping an ordered list of deployments, there are three "slots" that deployments can be installed to.

Instead of using three "slots" could we possibly just track two entries that we care about (tracked via grubenv vars) and ignore any other entries? Or does the "slotting" give us other features?

jlebon commented 6 years ago

The grub-env will be the source of truth about the status of the deployments. ... Instead of ostree keeping an ordered list of deployments, there are three "slots" that deployments can be installed to.

Can you provide more motivation for these? Thinking more on it, I think I understand where it's coming from, but without the motivation explicitly written out, it's hard to provide useful feedback/improvements.

So IIUC, it essentially comes down to

(1) the only place we can write data to from GRUB is the env block, (2) the GRUB script that actually determines what to boot needs access to both deployment success and how new each deployment is (basically what was previously expressed through BLS entry ordering), and (3) building on top of the current OSTree logic would require keeping things in sync, which is prone to failure & race conditions.

Is that more or less correct?

ajeddeloh commented 6 years ago

Yeah, that's correct, plus a little extra. To summarize:

1. Only one source of truth (the grub env block).
2. Updates don't touch anything to do with other deployments, other than the one being replaced.
3. Makes it easy to have a static grub config (no need to ship any grub tools).
4. IMO it's simpler/cleaner than creating new grub entries, but that might be my CL background speaking.
5. As a bonus: this doesn't need symlinks, so FAT32 could work (really not sure if we want that; that's not a discussion for here).

cgwalters commented 6 years ago

The grub-env will be the source of truth about the status of the deployments.

This makes sense, though I am a little worried about the complexity of teaching ostree to maintain a static number of deployments. Or alternatively, to dynamically expand the grub env block.

Makes it easy to have a static grub config (no need to ship any grub tools)

Note that this is somewhat orthogonal, since grub has learned to parse the BLS fragments. I am not sure if there are any blockers to just turning that on.

dustymabe commented 6 years ago

Note that this is somewhat orthogonal, since grub has learned to parse the BLS fragments. I am not sure if there are any blockers to just turning that on.

Interesting. @ajeddeloh, @LorbusChris, could you include that in the investigation?

ajeddeloh commented 6 years ago

Ultimately we need to ensure that the ordering logic for grub and ostree is the same (so ostree overwrites the right deployments). I think we still want the grub env to be the source of truth, right? This is a shift away from ostree maintaining the source of truth, but it really should be something that both ostree and grub can access. I think it's critical we get this sorted out first, since it impacts everything else.

BLS fragments could be useful for pinning deployments. I need to dig into how they work under the hood (i.e. is it some fancy grub script or is it baked into grub itself), but I can imagine having the 2/3 deployments that are managed by the grub-env plus any number of pinned ones. Your boot menu could look like:

> FCOS default 
  FCOS A (new)
  FCOS B (empty)
  FCOS C (good)
  FCOS <hash> (pinned)

This assumes the BLS is implemented in a way where entries can be merged with a static config.

bgilbert commented 5 years ago

I'm generally +1 to a static (or mostly static) handwritten config. One caveat though: on CL, our kernel command line has changed over time, and we don't have any way to update old bootloader configs. This means that new OS releases have to work with old kernel command lines, forever. It'd be good to avoid that on FCOS. Maybe the command line could come from a GRUB fragment installed alongside each kernel?

ajeddeloh commented 5 years ago

+1 to updatable snippets, but I think it's important to note that these should be carefully chosen and not generated. Generated grub snippets tend to contain a lot of cruft that doesn't always apply, which makes it hard to determine what is and isn't needed, in addition to making them harder to read.

cgwalters commented 5 years ago

I like the idea of ostree commits containing the defaults: https://github.com/ostreedev/ostree/issues/479

(A grub fragment would mean the BLS configs are not truth)

ajeddeloh commented 5 years ago

BLS configs are grub specific. Does ostree expose any sort of bootloader agnostic source of truth with the same info that would be used to generate the BLS config / other bootloader's config? (i.e. deployment X has kernel X, initramfs(s) Y, bootcsum Z etc).

cgwalters commented 5 years ago

BLS configs are grub specific.

I'm confused - nothing in /boot/loader should be GRUB specific right?

ostree by default for upgrades uses the current deployment's BLS config as the kernel arguments for the new deployment. However, one can add/remove args when making new deployments. (You can also set the kargs for an existing deployment although I would tend to discourage this)

The idea with that ostree issue is that it'd be kind of like /usr/lib/boot/loader - kernel args we expect to be there, and if we drop one out, it should go away. (Although properly implementing this would probably want a way to override base arguments)

ajeddeloh commented 5 years ago

Arg, I'm mistaken. BLS configs are not grub specific. Correct me if I'm wrong but if ostree does not detect a bootloader then it doesn't write out the BLS configs, right? I'm looking for a way of querying ostree to say "what are the bits that would go into a BLS config" without actually creating one.

ostree by default for upgrades uses the current deployment's BLS config as the kernel arguments for the new deployment. However, one can add/remove args when making new deployments. (You can also set the kargs for an existing deployment although I would tend to discourage this)

That's not contained in the ostree commit then is it?

cgwalters commented 5 years ago

Correct me if I'm wrong but if ostree does not detect a bootloader then it doesn't write out the BLS configs, right?

ostree always writes out the BLS configs - the BLS configs are the source of truth for the list of deployments. If you had no configs, ostree admin cleanup would try to delete everything and our last ditch protection that avoids deleting the booted root filesystem would kick in.

If ostree doesn't detect your bootloader it won't e.g. regenerate grub.cfg. But ostree never reads grub.cfg.

Try booting fcos and do:

    mkdir /boot/loader.1
    ln -Tsfr /boot/loader{.1,}
    ostree admin status

It'll barf because it can't find your booted deployment anymore.

That's not contained in the ostree commit then is it?

Right, not today; the kernel args live in the BLS fragments.
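For reference, an ostree-written BLS fragment is just a small key/value file under /boot/loader/entries/; the file name, paths, and values below are illustrative only, not copied from a real deployment. The kernel arguments being discussed live on the `options` line:

```
# /boot/loader/entries/ostree-1-fedora-coreos.conf  (illustrative)
title Fedora CoreOS (ostree:0)
version 1
linux /ostree/fedora-coreos-<bootcsum>/vmlinuz-5.x.y
initrd /ostree/fedora-coreos-<bootcsum>/initramfs-5.x.y.img
options root=UUID=<uuid> rw ostree=/ostree/boot.1/fedora-coreos/<bootcsum>/0
```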

ajeddeloh commented 5 years ago

Right, not today; the kernel args live in the BLS fragments.

I think this should come from the commits. I'm not sure how I feel about user supplied args and how they should be managed. In my ideal world they'd be completely separate from the BLS config and get pulled in by the static grub config. Whether they are part of a deployment or exist outside of it (like the static grub config) is another question. I'm not sure if grub's current BLS implementation allows adding on extra bits to the menuentries it generates though, which would make separating them impossible.

dustymabe commented 5 years ago

I think this should come from the commits.

Yeah. I think Colin referenced this RFE already: https://github.com/ostreedev/ostree/issues/479 - I can maybe try to find someone to work on that.

I'm not sure how I feel about user supplied args and how they should be managed. In my ideal world they'd be completely separate from the BLS config and get pulled in by the static grub config.

we already manage user supplied args with rpm-ostree kargs, right? could we not just have the config get updated appropriately when someone uses that interface?
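(For concreteness, assuming today's rpm-ostree kargs interface; the argument values below are made up:)

```sh
# Show the kernel arguments for the current deployment.
rpm-ostree kargs

# Hypothetical edits: append one argument and delete another.
rpm-ostree kargs --append=console=ttyS0,115200 --delete=quiet
```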

ajeddeloh commented 5 years ago

we already manage user supplied args with rpm-ostree kargs, right? could we not just have the config get updated appropriately when someone uses that interface?

But then the args in the BLS config wouldn't be from the commit, they'd be from the commit + user specified. I suppose we could combine them at deploy time, but it'd be nice to have a clear separation of what is part of the ostree commit and what is not.

dustymabe commented 5 years ago

I suppose we could combine them at deploy time, but it'd be nice to have a clear separation of what is part of the ostree commit and what is not.

can we not have both? i.e. we can combine them at deploy time, but store them separately so that there is a clear separation (at least to someone investigating a problem).

ajeddeloh commented 5 years ago

That's better than not storing them separately, but it'd be better if ostree didn't need to combine them. One less thing to go wrong or to confuse users. In general the less merging/mangling/etc of configs the better (says the vocal supporter of ct). If we can keep them separate I don't see a reason not to.

cmurf commented 5 years ago

Since you're always creating an EFI system partition, and it's FAT, just use that for grubenv whether UEFI or BIOS. This way you're always using FAT for grubenv. And it's the kind of non-journaled, non-checksummed filesystem that grubenv was intended for.

It is a slightly dirty hack, because why would a BIOS system use an ESP? Well, that's bullet 4 in the Bootloaderspec - it says to use it as $BOOT. If you still think it's dirty, bullet 3 gives you a way out, change the partition type GUID of that partition from "EFI System" to "Extended Boot" - which you can do during first boot on non-UEFI systems as detected by a lack of efivars.

Also, I'm pretty convinced you can make changes on FAT atomic: https://github.com/ostreedev/ostree/pull/1873#issuecomment-504573588
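A minimal sketch of what keeping the env block on the ESP could look like, assuming the usual Fedora mount point (the path and variable values are illustrative):

```sh
# Create and edit an environment block on the FAT ESP instead of /boot.
grub2-editenv /boot/efi/EFI/fedora/grubenv create
grub2-editenv /boot/efi/EFI/fedora/grubenv set boot_success=0 boot_counter=3
grub2-editenv /boot/efi/EFI/fedora/grubenv list
```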

cgwalters commented 2 years ago

In discussion with the IoT folks we had some agreement that it was time to drive this functionality into ostree: https://github.com/ostreedev/ostree/issues/2725