ajeddeloh opened this issue 6 years ago
cc @LorbusChris
For more context on this, automatic rollback was one of the main objectives of the GSoC project in which @LorbusChris took part. Coincidentally, there was work at the same time in both grub2 and systemd to support boot counting (see https://github.com/systemd/systemd/pull/9437 and https://github.com/rhboot/grub2/pull/24).
One of the outcomes was that we could standardize on boot-complete.target as the target the system needs to reach to be considered a good boot regardless of the bootloader used. So hopefully we'd be able to exploit grub2's new grubenv boot counter support and greenboot here.
Huh, didn't know about grubenv. Looks very useful! I'm fine with boot-complete.target. My proposal could be amended to write out the state to the grubenv instead.
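For reference, the grubenv boot counting from the linked grub2 work revolves around two variables, boot_counter and boot_success, manipulated with grub2-editenv. A minimal sketch (the exact grub.cfg integration varies by distro):

```sh
# Arm the counter before attempting an update: allow 2 tries.
grub2-editenv - set boot_counter=2

# grub's config decrements boot_counter on each boot attempt. Once
# userspace decides the boot was good (e.g. boot-complete.target was
# reached), record success and disarm the counter:
grub2-editenv - set boot_success=1
grub2-editenv - unset boot_counter
```

(`grub2-editenv -` operates on the default environment block file.)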
I'm still a fan of:
a lot of discussion about boot counting and success determination happened over in the greenboot GSOC project for fedora-iot: https://pagure.io/fedora-iot/issue/12
Specifically I think we got it down to two variables for state that needed to be tracked. Here is a high-level state diagram we ended up with.
Goals I think we ought to strive for:
* Handling all boot failures, including kernel failures.
+1
* Allowing users to decide what a "successful" boot means (related: [greenboot](https://github.com/LorbusChris/greenboot))
+1, but we can ship some defaults I think that will help
* Avoid "flapping" where the OS gets stuck in a loop of "install bad image, reboot, fail, reboot"
+1 we'll need to add some logic so that the updater knows an update failed and it shouldn't retry it
* Never be in a state where it is unclear what the correct thing to boot is (even if power is lost mid-update)
I think in greenboot we decided to always try to boot the last known successful option
* Allow manual selection of installs if necessary
+1, we already have this today
* With the exception of the first boot, always try to keep two successful installs around (to avoid problems like [coreos/bugs#2457 (comment)](https://github.com/coreos/bugs/issues/2457#issuecomment-397469282))
we do keep two deployments around today, but they aren't guaranteed to be 'successful installs', i.e. if you attempt an upgrade and it fails and you roll back, then you'll only have two deployments, but one of them will be "bad". I'd say that's the one exception.
Regarding that state diagram: looks good (although I don't know if there's much point in a boot counter starting at 2; if it fails once, that's probably a good reason not to try again), but we need to figure out what to do when trying to pick between multiple successful entries (both to make sure the correct one is gc'd and to know which to boot). I also don't think we should ever be in a state where we have only 1 successful install (other than first boot), i.e. if an install fails, don't gc the successful one.
I'll just make a note that we discussed this during the community meeting, and we concluded that the OSTree model is less susceptible to issues like coreos/bugs#2457 since we don't actually GC until we successfully prepare the next root (before reboot).
But of course, preparing the next root successfully != guaranteed successful boot in that root. There is still a risk that we have a deployment which successfully prepares the root (and cleans up the previous deployment), but borks it in a subtle enough way that it's not actually bootable.
guaranteed successful boot in that root
One other thing to keep in mind here is that today, rpm-ostree hardcodes running /bin/true in the new root before staging/deployment.
We could easily support generalizing this to running arbitrary code in the new root as a container before even trying to boot it for real.
Of course, if you're getting your ostree commits from any OS vendor that doesn't suck, they should have been tested server side. And if you're using package layering, you're going to end up running scripts in the new root which do more than /bin/true.
But - the capability is there for us to do something more sophisticated if we wanted to.
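To make that concrete, a minimal sketch of what "run arbitrary code in the new root as a container" could look like, using bubblewrap (which rpm-ostree already uses to run scripts); the deployment path below is illustrative, not a real one:

```sh
# Hypothetical smoke test against a staged-but-not-yet-booted deployment.
deploy=/ostree/deploy/fedora-coreos/deploy/deadbeef.0

# Mount the deployment root read-only and run a check inside it;
# today rpm-ostree effectively does this with /bin/true.
bwrap --ro-bind "${deploy}" / --proc /proc --dev /dev /bin/true
```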
My opinion on this is that until we have a design that has automated tests and has been carefully audited for correctness, we shouldn't ship anything here. I have a short term PR to disable the current grub2 behavior that I think we should also apply to FAH29.
There was a thread on the XFS list about the grubenv approach: https://marc.info/?l=linux-xfs&m=153740791327073&w=2
Migrated to include linux-fsdevel: https://marc.info/?l=linux-fsdevel&m=153741350128439&w=2
TL;DR: The filesystem developers are against it.
My opinion on this is that until we have a design that has automated tests and has been carefully audited for correctness, we shouldn't ship anything here.
I'm inclined to agree.
The filesystem developers are against [grub-env]
That is... unfortunate. It looks like that's one of the only places grub can write to, and grub does need to write something if we want to handle failures where we can't get to (working) userspace. What really sucks is that worst case we need just 9 bits total: grub only needs to write the try count, which is 0, 1, or 2 (two bits), plus a priority bit, for each install, and there can be at most 3 installs. Ugh.
I'm not sure what exactly to do about that. We're left with three options: 1) Throw out the ability to rollback in some cases (BOOOOO) 2) Use grub-env and hope it's ok 3) Roll our own grub-env-like thing (we can find 1k of space somewhere off-filesystem for it) and hack up grub to support that. Hell, we could stick it in the embedding area if we really wanted.
Option 1 is a non-starter for me; it defeats the point. Option 2 isn't great but might work as a stopgap. If we started with 2 and planned to move to 3 we'd also need a migration plan. 3 is also not great because writing good bootloader code is hard and error prone.
1. Use grub-env to maintain metadata about deployments. The grub-env will be the source of truth about the status of the deployments (see the sketch after this list).
2. The metadata stored in the grub-env can be used to determine an ordering of boot preference. Both grub and ostree use the same logic to convert from metadata to ordering.
3. Instead of ostree keeping an ordered list of deployments, there are three "slots" that deployments can be installed to. The slots are unordered. When a new deployment happens (e.g. on upgrade) the "worst" (i.e. oldest or failed) slot is overwritten.
4. grub uses the metadata to pick the best slot to boot.
5. The grub menu lists an option to autoselect the best slot in addition to manual entries (like what CL does today).
6. ostree or some other userspace utility can edit grub-env to manually rollback.
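To make the proposal concrete, here's a hypothetical sketch of the slot metadata; the variable names (slot_*_status, slot_*_serial) are made up, but grub2-editenv is the real tool for manipulating the env block:

```sh
# One status plus a monotonically increasing serial per slot, so grub
# and ostree can derive the same boot-preference ordering from it.
grub2-editenv /boot/grub2/grubenv set slot_a_status=good     slot_a_serial=41
grub2-editenv /boot/grub2/grubenv set slot_b_status=untested slot_b_serial=42
grub2-editenv /boot/grub2/grubenv set slot_c_status=failed   slot_c_serial=40

# Inspect the current state.
grub2-editenv /boot/grub2/grubenv list
```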
Next steps/problems to solve:
Misc notes:
Instead of ostree keeping an ordered list of deployments, there are three "slots" that deployments can be installed to.
Instead of using three "slots", could we possibly just track two entries that we care about (tracked via grubenv vars) and ignore any other entries? Or does the "slotting" give us other features?
The grub-env will be the source of truth about the status of the deployments. ... Instead of ostree keeping an ordered list of deployments, there are three "slots" that deployments can be installed to.
Can you provide more motivation for these? Thinking more on it, I think I understand where it's coming from, but without the motivation explicitly written out, it's hard to provide useful feedback/improvements.
So IIUC, it essentially comes down to
(1) the only place we can write data to from GRUB is the env block, (2) the GRUB script that actually determines what to boot needs access to both deployment success and how new each deployment is (basically what was previously expressed through BLS entry ordering), and (3) building on top of the current OSTree logic would require keeping things in sync, which is prone to failure & race conditions.
Is that more or less correct?
Yeah, that's correct, plus a little extra. To summarize:
1. Only 1 source of truth (the grub env block).
2. Updates don't touch anything to do with other deployments, other than the one being replaced.
3. Makes it easy to have a static grub config (no need to ship any grub tools).
4. IMO it's simpler/cleaner than creating new grub entries, but that might be my CL background speaking.
5. As a bonus: this doesn't need symlinks, so fat32 could work (really not sure if we want that; that's not a discussion for here).
The grub-env will be the source of truth about the status of the deployments.
This makes sense, though I am a little worried about the complexity of teaching ostree to maintain a static number of deployments. Or alternatively, to dynamically expand the grub env block.
Makes it easy to have a static grub config (no need to ship any grub tools)
Note that this is somewhat orthogonal, since grub has learned to parse the BLS fragments. I am not sure if there are any blockers to just turning that on.
Note that this is somewhat orthogonal, since grub has learned to parse the BLS fragments. I am not sure if there are any blockers to just turning that on.
Interesting. @ajeddeloh, @LorbusChris, could you include that in the investigation?
Ultimately we need to ensure that the ordering logic for grub and ostree is the same (so ostree overwrites the right deployments). I think we still want the grub env to be the source of truth, right? This is a shift from ostree maintaining the source of truth, but it really should be something that both ostree and grub can access. I think it's critical we get this sorted out first, since it impacts everything else.
BLS fragments could be useful for pinning deployments. I need to dig into how they work under the hood (i.e. is it some fancy grub script or is it baked into grub itself) but I can imagine having the 2/3 deployments that are managed by the grub-env plus any number of pinned ones. Your boot menu could look like:
> FCOS default
FCOS A (new)
FCOS B (empty)
FCOS C (good)
FCOS <hash> (pinned)
This assumes the BLS is implemented in a way where entries can be merged with a static config.
I'm generally +1 to a static (or mostly static) handwritten config. One caveat though: on CL, our kernel command line has changed over time, and we don't have any way to update old bootloader configs. This means that new OS releases have to work with old kernel command lines, forever. It'd be good to avoid that on FCOS. Maybe the command line could come from a GRUB fragment installed alongside each kernel?
+1 to updatable snippets, but I think it's important to note that these should be carefully chosen and not generated. Generated grub snippets tend to contain a lot of cruft that doesn't always apply and makes determining what is needed/not needed hard in addition to making it harder to read.
I like the idea of ostree commits containing the defaults: https://github.com/ostreedev/ostree/issues/479
(A grub fragment would mean the BLS configs are not truth)
BLS configs are grub specific. Does ostree expose any sort of bootloader-agnostic source of truth with the same info that would be used to generate the BLS config / other bootloaders' configs? (i.e. deployment X has kernel X, initramfs(s) Y, bootcsum Z, etc.)
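(For readers unfamiliar with the format, an ostree-written BLS fragment looks roughly like this; the file name, hashes, versions, and paths below are illustrative, not from a real system:)

```sh
$ cat /boot/loader/entries/ostree-1-fedora-coreos.conf
title Fedora CoreOS (ostree)
version 1
options root=UUID=deadbeef rw ostree=/ostree/boot.1/fedora-coreos/abc123/0
linux /ostree/fedora-coreos-abc123/vmlinuz-5.0.0
initrd /ostree/fedora-coreos-abc123/initramfs-5.0.0.img
```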
BLS configs are grub specific.
I'm confused - nothing in /boot/loader should be GRUB specific, right?
ostree by default for upgrades uses the current deployment's BLS config as the kernel arguments for the new deployment. However, one can add/remove args when making new deployments. (You can also set the kargs for an existing deployment although I would tend to discourage this)
The idea with that ostree issue is that it'd be kind of like /usr/lib/boot/loader - kernel args we expect to be there, and if we drop one out, it should go away. (Although properly implementing this would probably want a way to override base arguments.)
Arg, I'm mistaken. BLS configs are not grub specific. Correct me if I'm wrong but if ostree does not detect a bootloader then it doesn't write out the BLS configs, right? I'm looking for a way of querying ostree to say "what are the bits that would go into a BLS config" without actually creating one.
ostree by default for upgrades uses the current deployment's BLS config as the kernel arguments for the new deployment. However, one can add/remove args when making new deployments. (You can also set the kargs for an existing deployment although I would tend to discourage this)
That's not contained in the ostree commit then is it?
Correct me if I'm wrong but if ostree does not detect a bootloader then it doesn't write out the BLS configs, right?
ostree always writes out the BLS configs - the BLS configs are the source of truth for the list of deployments. If you had no configs, ostree admin cleanup would try to delete everything, and our last-ditch protection that avoids deleting the booted root filesystem would kick in.
If ostree doesn't detect your bootloader it won't e.g. regenerate grub.cfg. But ostree never reads grub.cfg.
Try booting fcos and do:
mkdir /boot/loader.1
ln -Tsfr /boot/loader{.1,}
ostree admin status
It'll barf because it can't find your booted deployment anymore.
That's not contained in the ostree commit then is it?
Right, not today; the kernel args live in the BLS fragments.
Right, not today; the kernel args live in the BLS fragments.
I think this should come from the commits. I'm not sure how I feel about user supplied args and how they should be managed. In my ideal world they'd be completely separate from the BLS config and get pulled in by the static grub config. Whether they are part of a deployment or exist outside of it (like the static grub config) is another question. I'm not sure if grub's current BLS implementation allows adding on extra bits to the menuentries it generates though, which would make separating them impossible.
I think this should come from the commits.
Yeah. I think Colin referenced this RFE already: https://github.com/ostreedev/ostree/issues/479 - I can maybe try to find someone to work on that.
I'm not sure how I feel about user supplied args and how they should be managed. In my ideal world they'd be completely separate from the BLS config and get pulled in by the static grub config.
we already manage user supplied args with rpm-ostree kargs, right? could we not just have the config get updated appropriately when someone uses that interface?
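For reference, that interface today looks roughly like this (flags per rpm-ostree's kargs command; the example args are arbitrary):

```sh
# Show the kernel arguments of the current deployment.
rpm-ostree kargs

# Edit args; this creates a new deployment whose BLS fragment
# carries the modified command line.
rpm-ostree kargs --append=nosmt --delete=quiet
```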
we already manage user supplied args with rpm-ostree kargs, right? could we not just have the config get updated appropriately when someone uses that interface?
But then the args in the BLS config wouldn't be from the commit, they'd be from the commit + user specified. I suppose we could combine them at deploy time, but it'd be nice to have a clear separation of what is part of the ostree commit and what is not.
I suppose we could combine them at deploy time, but it'd be nice to have a clear separation of what is part of the ostree commit and what is not.
can we not have both? i.e. we can combine them at deploy time, but store them separately so that there is a clear separation (at least to someone investigating a problem).
That's better than not storing them separately, but it'd be better if ostree didn't need to combine them. One less thing to go wrong or to confuse users. In general the less merging/mangling/etc. of configs the better (says the vocal supporter of ct). If we can keep them separate I don't see a reason not to.
Since you're always creating an EFI system partition, and it's FAT, just use that for grubenv whether UEFI or BIOS. This way you're always using FAT for grubenv. And it's the kind of non-journaled, non-checksummed file system that grubenv was intended for.
It is a slightly dirty hack, because why would a BIOS system use an ESP? Well, that's bullet 4 in the Bootloaderspec - it says to use it as $BOOT. If you still think it's dirty, bullet 3 gives you a way out, change the partition type GUID of that partition from "EFI System" to "Extended Boot" - which you can do during first boot on non-UEFI systems as detected by a lack of efivars.
Also, I'm pretty convinced you can make changes on FAT atomic: https://github.com/ostreedev/ostree/pull/1873#issuecomment-504573588
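A rough sketch of that, assuming the ESP is mounted at /boot/efi (grub2-editenv and the env block format are real; the file location is the assumption):

```sh
# Create the fixed-size (1024-byte) env block on the FAT ESP and seed it.
grub2-editenv /boot/efi/grubenv create
grub2-editenv /boot/efi/grubenv set boot_success=0 boot_counter=2
```

On the grub side, a handwritten config would then read and update that file with load_env/save_env pointed at it.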
In discussion with the IoT folks we had some agreement that it was time to drive this functionality into ostree https://github.com/ostreedev/ostree/issues/2725
We want to bring forward Container Linux's automatic rollback model and probably extend it even further. Automatic rollbacks can't solve every problem (since in some cases it may mean downgrading something like docker which is an unsupported operation) but it works well to protect against kernel issues and other such problems.
CL's model currently uses the GPT attribute bits to record whether a partition has been tried and whether it was successfully booted. On a successful boot, update_engine waits 45 seconds then marks the boot as successful.
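For reference, those GPT attribute bits are manipulated on CL with cgpt; roughly like this (disk and partition number are illustrative):

```sh
# Mark partition 1 on /dev/sda as successfully booted, no tries remaining.
cgpt add -i 1 -S 1 -T 0 /dev/sda

# Inspect priority/tries/successful for that partition.
cgpt show -i 1 /dev/sda
```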
We're not using A/B partitions in FCOS, so we can't use the GPT priority bits (and I think we shouldn't regardless, but that's beside the point).
Ostree currently does not support automatic rollback (@cgwalters please correct me if I'm wrong), so we'll need to implement it.
note: I'm going to use the term "install" to mean an ostree/kernel combo for FCOS (the equivalent would be a kernel + usr-partition pair for CL).
Essentially there are four states an install can be in (in order of what should be chosen to boot):
1. Untested (just installed)
2. Successful (most recent)
3. Successful (fallback, mostly for manual recovery)
4. Failed
Goals I think we ought to strive for:
My proposal:
So I think we should use flag files in /boot like we do for Ignition. When creating a new install, its kernel gets written to /boot along with two flag files: untested and failed. There should only ever be one install with both flags. Additionally there should be a flag file "recent" which indicates which install to boot in the case of two successful installs. Here is a table of what combinations of flags mean what:

| untested | failed | recent | meaning |
| --- | --- | --- | --- |
| x | x | | untested (just installed) |
| | | x | successful (most recent) |
| | | | successful (fallback) |
| | x | | failed |

The grub config should select installs in this order (a sketch follows below):
1. installs with both untested and failed flags
2. installs with the recent flag
3. installs with no flags
4. installs with just the failed flag

When grub selects one it immediately removes the untested flag. On a successful boot a systemd unit (tbd: integrate this with greenboot?) adds the recent flag, removes the recent flag from the old entry, then removes the failed flag. This proposal does hinge on grub being able to delete files, which I haven't confirmed yet. It also means ostree wouldn't need to write out any grub configs at all, just empty files.
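A rough sketch of what the selection half could look like in a handwritten grub.cfg; the flag file layout (/flags/\<slot\>.\<flag\>) and slot names are made up, and this assumes menuentries created with matching --id values. Each pass overwrites `default`, so later (higher-priority) passes win:

```sh
# grub.cfg fragment (GRUB's sh-like script, not bash).
# Passes run lowest priority first:
# 4) failed only -> 3) no flags -> 2) recent -> 1) untested+failed.
for slot in A B C; do
    if [ -e /flags/${slot}.failed ]; then set default="${slot}"; fi
done
for slot in A B C; do
    if [ ! -e /flags/${slot}.untested ]; then
        if [ ! -e /flags/${slot}.failed ]; then
            if [ ! -e /flags/${slot}.recent ]; then set default="${slot}"; fi
        fi
    fi
done
for slot in A B C; do
    if [ -e /flags/${slot}.recent ]; then set default="${slot}"; fi
done
for slot in A B C; do
    if [ -e /flags/${slot}.untested ]; then
        if [ -e /flags/${slot}.failed ]; then set default="${slot}"; fi
    fi
done
# What this can't do is the other half: clearing the untested flag
# before booting, since grub can't delete or write files (see the edit
# below).
```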
Edit: hrmmm. Grub doesn't seem to be able to write to or delete files. That makes the whole "recover from a bad kernel" bit hard.
Thoughts?
cc @cgwalters and @jlebon for the ostree bits and @bgilbert to keep me honest about how CL works.