coreos / fedora-coreos-tracker

Issue tracker for Fedora CoreOS
https://fedoraproject.org/coreos/

Throttled update rollouts #83

Closed bgilbert closed 4 years ago

bgilbert commented 5 years ago

Build a system for gradually rolling out OS updates to Fedora CoreOS machines in a way that can be centrally managed by FCOS developers.

Background: Container Linux

Container Linux updates are rolled out gradually, typically over a 24 to 72 hour period. If a major bug is caught before the rollout is complete, we can suspend the rollout while we investigate.

On CL, rollouts are implemented in CoreUpdate by hashing the machine ID with the new OS version and comparing a few bits against a threshold which increases over time. If a machine automatically checks in but doesn't meet the current threshold, CoreUpdate responds that no update is available. If the user manually initiates an update request, the threshold is ignored and CoreUpdate provides the update.
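
For illustration, a minimal sketch of that kind of check, assuming a 16-bit bucket and a threshold that operators raise over the 24-72 hour rollout window (names and details are illustrative, not CoreUpdate's actual code):

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Deterministically map (machine_id, version) onto 0..=u16::MAX.
/// A real server would use a hash that is stable across releases;
/// DefaultHasher is only for this sketch.
fn rollout_bucket(machine_id: &str, version: &str) -> u16 {
    let mut hasher = DefaultHasher::new();
    machine_id.hash(&mut hasher);
    version.hash(&mut hasher);
    // Keep only a few bits of the hash, as described above.
    (hasher.finish() & 0xFFFF) as u16
}

/// `threshold` rises over the rollout window; a machine is offered
/// the update once its bucket falls below it.
fn update_available(machine_id: &str, version: &str, threshold: u16) -> bool {
    rollout_bucket(machine_id, version) < threshold
}

fn main() {
    // Example: a rollout that is roughly 25% complete.
    let threshold = (u16::MAX as u32 / 4) as u16;
    println!(
        "offered: {}",
        update_available("8e13a5d9c6b24b5f", "2135.6.0", threshold)
    );
}
```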

Major bugs can be caught in two ways. CoreUpdate has a status reporting mechanism, so we can notice if many machines are failing to update. The status reports are not very granular, however, and thus not very debuggable. More commonly, a user reports a problem to coreos/bugs and we manually triage the issue and pause rollout if the problem appears serious.

For each update group (~= channel), CoreUpdate only knows about one target OS version at any given time. This is awkward for several reasons. If a machine running version A is not yet eligible for a new version C currently being rolled out, or if the rollout of C is paused, the machine should still be able to update to the previous version B, but CoreUpdate doesn't know how to do that. In addition, there's no way to require that machines on versions < P first update to P before updating to any version > P. (We call that functionality an "update barrier".) As a result, compatibility hacks for updating from particular old versions have to be carried in the OS forever, since there can be no guarantee that all older machines have updated.

CoreUpdate collects metrics about each machine that checks in: its update channel, its state in the client state machine, what OS version is running, what version was originally installed, the OEM ID (platform) of the machine, and its checkin history. This works okay but gives us an incomplete picture of the installed base: we do not receive any information about machines behind private CoreUpdate servers, behind a third-party update server such as CoreRoller, or which have updates disabled.

Fedora CoreOS

CoreUpdate, update_engine, and the Omaha protocol will not be used in Fedora CoreOS. A successor update protocol, Cincinnati, is being developed for OpenShift, and it appears that we'll be able to adapt it for Fedora CoreOS. I believe this involves:

  1. Using the existing Cincinnati wire protocol and server code
  2. Writing a graph builder to generate an update graph from FCOS release metadata
  3. Writing an update client that queries a Cincinnati policy engine and invokes rpm-ostree
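
To make steps 2 and 3 a bit more concrete, here is a rough sketch of the graph shape the client would consume, assuming the Cincinnati-style node/edge layout (field names and version strings are illustrative, not a final FCOS schema):

```rust
/// One release in the update graph; `payload` would identify the
/// OSTree commit (or other artifact) for rpm-ostree to deploy.
struct Release {
    version: String,
    payload: String,
}

/// A Cincinnati-style graph: nodes plus directed edges given as
/// (from_index, to_index) pairs into the node list.
struct Graph {
    nodes: Vec<Release>,
    edges: Vec<(usize, usize)>,
}

impl Graph {
    /// Versions the policy engine says this machine may update to
    /// from `current`, i.e. the targets of its outgoing edges.
    fn next_hops(&self, current: &str) -> Vec<&Release> {
        let cur = match self.nodes.iter().position(|n| n.version == current) {
            Some(i) => i,
            None => return Vec::new(),
        };
        self.edges
            .iter()
            .filter(|(from, _)| *from == cur)
            .map(|(_, to)| &self.nodes[*to])
            .collect()
    }
}

fn main() {
    let graph = Graph {
        nodes: vec![
            Release { version: "A".into(), payload: "ostree-commit-a".into() },
            Release { version: "B".into(), payload: "ostree-commit-b".into() },
            Release { version: "C".into(), payload: "ostree-commit-c".into() },
        ],
        // A -> B -> C; if the rollout of C is paused, the policy engine
        // simply withholds the B -> C edge from clients.
        edges: vec![(0, 1), (1, 2)],
    };
    for r in graph.next_hops("A") {
        println!("can update to {} ({})", r.version, r.payload);
    }
}
```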

Server requirements

...and nice-to-haves, and reifications of the second-system effect:

Client requirements

Metrics

Metrics should be handled by a separate system. Coupling metrics to the update protocol would provide the same sort of incomplete picture as in CoreUpdate, and in any event Cincinnati is not designed to collect client metrics. This probably means that certain of the features above, such as automatic rollout suspension or insight into the client state machine, will need to be handled outside the update protocol.

cc @crawford for corrections and additional details.

crawford commented 5 years ago

@steveej and I have just started the process of defining how an admin provides a custom policy to the Policy Engine. FCOS will likely need a custom policy to implement channels and rate-limiting. We'll want to take into account your use-case as we finalize the design. Can you provide more specifics around your proposed update policy? It would be helpful to see things like inputs to the policy (e.g. the machine-id, admin-controlled rate limits), external services which must be consulted (e.g. the metrics collection server), and the criteria for offering a particular update payload.

lucab commented 5 years ago

I think for FCOS we can proceed top-to-bottom here, and design/prototype pieces starting from the graph-builder till we reach the update-client (and the reboot manager later on).

Regarding the graph-builder, I have the following points for which I'd like to see some inputs (from @bgilbert and @dustymabe especially):

  1. Source schema/format for release metadata (I assume online and air-gapped cases share the same metadata format)
  2. Metadata signing and key rotation
  3. Metadata storage (random options: a remote bucket, a web directory, a local mount of a remote volume, a list of tags on a docker-registry repo, a list of tags on a git repo, a database)
  4. Number of endpoints (possible dimensions for splitting: stream, oem, architecture, region)
  5. Related to the answer above, client request parameters (for shared endpoints, to provide client-specific release metadata). Also, should this include a machine-specific ID?
  6. Push vs pull based (former: release captain signals the server that a metadata change happened; latter: server periodically polls for metadata changes).
  7. Failure mode (if the metadata source is unreachable, should the server provide a stale graph or should it return an error to the client?)

Other assorted comments:

Rollouts are automatically paused based on client error reports

This indeed requires a metrics collector in place, which has to be queried by the policy engine.

Supports update barriers

This can be done either at the metadata source (by maintaining a linear chain as the update path, or by explicitly whitelisting/blacklisting all paths) or as a policy rule (by selecting specific edges to cut). The former is static at graph-ingestion time but verbose/cumbersome; the latter is dynamic and likely more flexible, but needs to be computed at request time. Not sure which one we prefer.
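
A rough sketch of the policy-rule variant, assuming a simple total ordering on versions (not the actual policy-engine API):

```rust
/// An update edge between two version strings.
struct Edge {
    from: String,
    to: String,
}

/// Drop every edge that would jump over the barrier version:
/// nodes below the barrier must update to the barrier itself first.
fn apply_barrier(edges: Vec<Edge>, barrier: &str) -> Vec<Edge> {
    edges
        .into_iter()
        .filter(|e| {
            let skips_barrier =
                version_lt(&e.from, barrier) && version_lt(barrier, &e.to);
            !skips_barrier
        })
        .collect()
}

/// Toy version comparison; a real policy would use proper
/// release-version ordering, not a plain string compare.
fn version_lt(a: &str, b: &str) -> bool {
    a < b
}

fn main() {
    let edges = vec![
        Edge { from: "30.1".into(), to: "30.2".into() },
        Edge { from: "30.1".into(), to: "30.3".into() }, // skips the barrier
        Edge { from: "30.2".into(), to: "30.3".into() },
    ];
    for e in apply_barrier(edges, "30.2") {
        println!("{} -> {}", e.from, e.to);
    }
}
```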

Provides insight into the client state machine

For anything outside of the "client request parameters" point above, I think this belongs to the "metrics" topic. Do you have a specific example of relevant state to be collected which may not already be directly in the set of request parameters?

dustymabe commented 5 years ago

Thanks @bgilbert for writing this up and for the context of how CoreUpdate worked in the past with CL.

I don't have any objections to the proposal and like a lot of what I hear. Update barriers in particular will be useful for letting us drop legacy hacks/workarounds.

@crawford Can you provide more specifics around your proposed update policy?

@bgilbert are we ready to answer this question yet? Should we brainstorm a bit soon in order to be able to provide more data?

dustymabe commented 5 years ago

@lucab Regarding the graph-builder, I have the following points for which I'd like to see some inputs (from @bgilbert and @dustymabe especially):

1. Source schema/format for release metadata (I assume online and air-gapped cases share the same metadata format)

seems reasonable to me that they would share the same metadata format

2. Metadata signing and key rotation

I think we need to collaborate with fedora infra on this.

3. Metadata storage (random options: a remote bucket, a web directory, a local mount of a remote volume, a list of tags on a docker-registry repo, a list of tags on a git repo, a database)

all possible options. a list of tags on a git repo could be the most transparent way

4. Number of endpoints (possible dimensions for splitting: stream, oem, architecture, region)

I don't have any input on this. I guess it would be nice if we don't have too many options.

5. Related to the answer above, client request parameters (for shared endpoints, to provide client-specific release metadata). Also, should this include a machine-specific ID?

maybe this is part of the discussion about metrics?

6. Push vs pull based (former: release captain signals the server that a metadata change happened; latter: server periodically polls for metadata changes).

No strong opinion, but I think I prefer pull based.

7. Failure mode (if the metadata source is unreachable, should the server provide a stale graph or should it return an error to the client?)

maybe something in between? stale graph if metadata source has been down < 1 day (i.e. intermittent failure) and maybe error if more than that (systemic outage).
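
The in-between behavior could look roughly like this on the graph-builder side (the one-day cutoff is just the example above; names are made up):

```rust
use std::time::{Duration, Instant};

/// Cached copy of the last successfully ingested graph.
struct GraphCache {
    serialized_graph: String,
    fetched_at: Instant,
}

/// Serve a stale graph through short outages of the metadata source,
/// but fail loudly once the outage looks systemic.
fn serve(cache: &Option<GraphCache>, max_staleness: Duration) -> Result<&str, &'static str> {
    match cache {
        Some(c) if c.fetched_at.elapsed() <= max_staleness => Ok(&c.serialized_graph),
        Some(_) => Err("metadata source unreachable for too long; refusing to serve a stale graph"),
        None => Err("no graph has been ingested yet"),
    }
}

fn main() {
    let cache = Some(GraphCache {
        serialized_graph: String::from("{\"nodes\":[],\"edges\":[]}"),
        fetched_at: Instant::now(),
    });
    // One day of tolerated staleness, per the suggestion above.
    match serve(&cache, Duration::from_secs(24 * 60 * 60)) {
        Ok(graph) => println!("serving: {}", graph),
        Err(e) => eprintln!("error: {}", e),
    }
}
```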

cgwalters commented 5 years ago

I am not opposed to any of this - there are some cool advantages - but the whole thing, particularly

Writing an update client that queries a Cincinnati policy engine and invokes rpm-ostree

makes me pause because in layering a totally new thing on top we're partially negating the testing and workflow that exists around ostree/rpm-ostree today.

Typing rpm-ostree upgrade will...error out? We're talking about a new CLI tool/config file to control a new daemon? Or would we teach rpm-ostree upgrade to handle this?

Big picture...it feels like the biggest win would be putting Cincinnati on top of both ostree and rpm-md (i.e. yum/dnf). Then things work more consistently. But that's clearly an even larger endeavor.

bgilbert commented 5 years ago

@cgwalters AIUI Cincinnati is designed to manage a single artifact (or meta-artifact) with a single version number, so I'm not sure how directly it'd map to rpm-md. Unless the version was the Pungi compose ID?

Do you think it'd make sense to support Cincinnati directly in rpm-ostree? I have no idea how relevant our profile of Cincinnati would be to other rpm-ostree-using distros.

The obvious use of rpm-ostree upgrade would be to force an upgrade to the tip of the current ref, bypassing Cincinnati. I'm not sure that that's a reasonable user operation, though. We should certainly provide a way to bypass rate limiting, but that should probably happen via a Cincinnati metadata item, since otherwise we'll bypass update barriers or improperly update EOL versions. So... erroring out might be the best move.

jlebon commented 5 years ago

Yeah, it would indeed be nice if it were possible to keep rpm-ostree as the "front-end" instead of something on top. E.g. rpm-ostree status is pretty nice today for displaying the current state of your OS. And there are all the other verbs too that we could expect users to use (kargs, initramfs, install, db, etc...)

The obvious use of rpm-ostree upgrade would be to force an upgrade to the tip of the current ref, bypassing Cincinnati. I'm not sure that that's a reasonable user operation, though. We should certainly provide a way to bypass rate limiting, but that should probably happen via a Cincinnati metadata item, since otherwise we'll bypass update barriers or improperly update EOL versions.

Re. update barriers, there's a related discussion in upstream libostree: https://github.com/ostreedev/ostree/issues/1228.

If we can get this working in a satisfying way for other libostree users, then we could do something like this:

That's a big if though.

But then it wouldn't be hard to add some integration between the Cincinnati client and rpm-ostree, e.g. making rpm-ostree status print the time of the last check, whether the service failed, or whether the timer is disabled, etc... (these are all checks we do today for rpm-ostreed-automatic.{service,timer}).

bgilbert commented 5 years ago

a manual rpm-ostree upgrade always forces an update, but it still respects update barriers

I'd be concerned about trying to duplicate the full generality of the Cincinnati graph inside rpm-ostree. Even if we use a well-defined subset of the graph functionality today, it would complicate expanding that subset in the future.

And there's all the other verbs too that we could expect users to use (kargs, initramfs, install, db, etc...)

Hmm. Could it make sense to let the distro configure rpm-ostree upgrade to call out to a hook script instead?

bgilbert commented 5 years ago

#98 discusses stream metadata, which might be one of the sources of truth for the graph builder.

bgilbert commented 5 years ago

If we decided to exclusively use ostree static deltas for distributing updates, the set of available deltas could be encoded in the Cincinnati graph.
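
Purely as an illustration of that idea, the per-edge annotation might look like this (field names are assumptions; the real mechanism would presumably be Cincinnati edge metadata):

```rust
/// An update edge annotated with an optional static delta.
/// If `static_delta_url` is None, the client falls back to a
/// regular ostree pull of the full commit.
struct DeltaEdge {
    from_checksum: String,
    to_checksum: String,
    static_delta_url: Option<String>,
}

fn main() {
    let edge = DeltaEdge {
        from_checksum: "abc123".into(),
        to_checksum: "def456".into(),
        static_delta_url: Some("https://example.com/deltas/abc123-def456".into()),
    };
    match &edge.static_delta_url {
        Some(url) => println!("fetch static delta from {}", url),
        None => println!("no delta available; pull full commit {}", edge.to_checksum),
    }
}
```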

cc @sinnykumari

cgwalters commented 5 years ago

Discussion here with @lucab @jlebon @ajeddeloh @cgwalters

General consensus seemed clear for a separate agent. Agent would be a service talking between Cincinnati and rpm-ostree; but does it have a CLI? We want rpm-ostree to be clear it's being driven by an external agent.

We know we clearly want to separate rebooting. Some discussion now of whether the finalize-delay thing is worth the UX complexity.

Q: Is FCOS agent a container or part of the host? Consensus is for the host; this is core functionality.

lucab: Problem with locksmith was direct req on etcd; required learning etcd. But for Kube want a separate thing. The new agent should support calling out to a generic HTTP service. Agent is configured to say which endpoint is used.

jlebon: This solves non-Kubernetes locking

andrew: Come out with a few policies for reboots; "whenever" "at 1am" (walters "wake up system")

walters: How about supporting scripting w/Ansible for non-cluster but multi-node setups.

lucab: push vs pull. Leans towards pull

andrew: For push we could encourage shipping static binaries (go/rust) to the host that implement site-specific dynamic policies.


On communication between rpm-ostree and agent; DBus. Can use API to download+stage. We add an API for "finalize and reboot" which is unlink("/etc/ostree-finalize.lock") - and possibly create it in the new deployment so the new thing is locked.
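
A minimal agent-side sketch of that "finalize and reboot" step as written in these notes (the lock-file path is the one from the notes, and a real agent would drive this through the rpm-ostree D-Bus API rather than touching files directly):

```rust
use std::fs;
use std::process::Command;

/// Finalize the staged deployment and reboot, per the scheme above:
/// removing the lock file lets the staged deployment be finalized on
/// shutdown, then we ask systemd to reboot.
fn finalize_and_reboot() -> std::io::Result<()> {
    // Path proposed in the meeting notes; purely illustrative here.
    fs::remove_file("/etc/ostree-finalize.lock")?;
    let status = Command::new("systemctl").arg("reboot").status()?;
    if !status.success() {
        eprintln!("systemctl reboot exited with {}", status);
    }
    Ok(())
}

fn main() {
    if let Err(e) = finalize_and_reboot() {
        eprintln!("finalize-and-reboot failed: {}", e);
    }
}
```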

cgwalters commented 5 years ago

[ C = Cincinnati ]

lucab: How much code can we share with existing code (locksmith and machine-config-operator/pivot). Who does the HTTP pulling with pivot? Answer: podman. jlebon: pivot is pretty intertwined with MCO.

walters: The version number is important here - what does C pass us?

lucab: C gives us a DAG. Metadata is opaque.

walters: if C gives us a commit hash, then rpm-ostree already knows it's not supposed to do anything on rpm-ostree upgrade

andrew: Can we expose C barriers to user? lucab: no, not part of the design. Want controlled logic on the server side.

walters: Windows AIUI supports the "force give me the update and bypass C" model

[some discussion of how this relates to streams for both FCOS and RHCOS, how this relates to installers, canary clusters]

Need to create agent now.

Partially depends on the C server; pushing payloads to it. Depends on infrastructure decisions (Fedora infra?). And testing what we're putting in the refs!

cgwalters commented 5 years ago

andrew: C coming as a container will make it easy to spin up for local dev.
lucab: protocol is known, we can mock things.

dustymabe commented 5 years ago

thanks for the updates from devconf discussions colin!

bgilbert commented 5 years ago

Thanks @cgwalters! Let me summarize my understanding of the notes and see if I got it right:

Questions

cgwalters commented 5 years ago

What's the goal of downloading an update without updating the bootloader?

It's nice to have things queued and ready to go. It helps avoid downtime, because you know the time is just rebooting basically. It's quite a different experience - I can say that for sure w/Silverblue, and I'd like to support it nicely for servers too.

crawford commented 5 years ago

Would forced updates still go through Cincinnati, or would they be direct rpm-ostree fetches from the ref?

I think it makes sense for your client and policy to use a parameter to implement this feature. The policy could then choose to ignore rate-limiting when it sees the parameter.

For what it's worth, OpenShift allows the admin to bypass Cincinnati entirely. At the end of the day, Cincinnati is just a hinting engine.
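
On the policy side that could be as simple as the following sketch, with a hypothetical force parameter (the parameter name and types are assumptions for illustration):

```rust
use std::collections::HashMap;

/// Decide whether to apply rollout throttling to this request.
/// A hypothetical `force` parameter set by a manual client invocation
/// skips rate-limiting, while barriers and EOL handling stay in effect
/// because the request still goes through the policy engine.
fn should_throttle(params: &HashMap<String, String>) -> bool {
    params.get("force").map(String::as_str) != Some("true")
}

fn main() {
    let mut params = HashMap::new();
    params.insert("stream".to_string(), "stable".to_string());
    assert!(should_throttle(&params));

    params.insert("force".to_string(), "true".to_string());
    assert!(!should_throttle(&params));
    println!("force parameter respected");
}
```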

bgilbert commented 5 years ago

What's the goal of downloading an update without updating the bootloader?

It's nice to have things queued and ready to go. It helps avoid downtime, because you know the time is just rebooting basically. It's quite a different experience - I can say that for sure w/Silverblue, and I'd like to support it nicely for servers too.

I guess I'm confused. CL also separates download/install and reboot, but it updates the "bootloader" (actually the installed kernels) at install time, rather than just before reboot. What's the advantage of deferring the bootloader update?

cgwalters commented 5 years ago

but it updates the "bootloader" (actually the installed kernels) at install time,

libostree does that too by default.

What's the advantage of deferring the bootloader update?

It means that rebooting the machine is a predictable operation; say one is rebooting because kernel memory is fragmented or something. Doing so doesn't, as a side effect, sometimes opt you in to an OS update.

(But if there is an OS update and you were going to reboot anyways, then it's highly likely it was already downloaded and so you can save the download time)

jlebon commented 5 years ago

Also, note that in OSTree systems, doing the final bootloader update is coupled with rebooting because of https://github.com/ostreedev/ostree/pull/1503 (which fixed https://github.com/projectatomic/rpm-ostree/issues/40).

bgilbert commented 5 years ago

Also, note that in OSTree systems, doing the final bootloader update is coupled with rebooting because of ostreedev/ostree#1503 (which fixed projectatomic/rpm-ostree#40).

Ah, that makes sense.

It means that rebooting the machine is a predictable operation; let's say one is rebooting because kernel memory is fragmented or something. Doing so doesn't sometimes as a side effect opt you in to an OS update.

I can see the advantage to that under specific circumstances, such as when debugging a problem. I'd be concerned about making it the default, though. One of our core premises is that Fedora CoreOS updates automatically, and users shouldn't have to think about what version they're running. (And also, any version older than the latest is unsupported.) Our defaults should be consistent with that.

lucab commented 5 years ago

@bgilbert your recap is in line with my understanding. Trying to answer your other questions:

cgwalters commented 5 years ago

One of our core premises is that Fedora CoreOS updates automatically, and users shouldn't have to think about what version they're running.

Yep. I think there's a spectrum here though. One thing I've heard that makes a ton of sense is people want to run explicit "canary/dev" clusters that might update automatically, and gate their prod upgrades on the success of their dev clusters. There are a variety of ways to implement this, but what I want to argue here is that the prod cluster here can (and should) still be downloading so that the update is there and ready.

jlebon commented 5 years ago

Was there a conclusion about having a CLI for the agent?

Related to this, we were discussing possibly having the status of the agent being injected in rpm-ostree status directly. The reason being that rpm-ostree status is awesome and already conveys an almost complete picture of the system state. Also, since rpm-ostree will be used directly for other interactions (e.g. rollback/deploy/rebase/install/etc...), it'd be nice if users didn't have to interact with two separate CLIs.

OK, I opened https://github.com/projectatomic/rpm-ostree/issues/1747 for this.

bgilbert commented 5 years ago

@lucab:

splitting "download" and "finalize/reboot" is meant to avoid the case where an unrelated reboot results in an OS update. On the followup concern, applying the update and rebooting is still handled by the agent automatically.

@cgwalters:

One thing I've heard that makes a ton of sense is people want to run explicit "canary/dev" clusters that might update automatically, and gate their prod upgrades on the success of their dev clusters. There are a variety of ways to implement this, but what I want to argue here is that the prod cluster here can (and should) still be downloading so that the update is there and ready.

Okay, I think I understand now. The premise is that the system downloads an update at 1 PM but is configured not to apply it until 11 PM. At 5 PM, the system crashes and reboots. We want the system to boot the old OS version, and then reboot again at 11 PM to apply the update. Is that right?

I was thinking of the case where the user has disabled automatic reboots entirely, which seems to be fairly common on CL. If I have such a system, how will I reboot if I want an update activated, and how will I reboot if I don't? If the user reboots for an unrelated reason, I figured it'd be better to opportunistically apply the update, rather than never applying updates until explicitly requested.

Relatedly, how will the system work if a second update is available before the first one is applied? Will it download that update as well? On CL, nothing further will happen until the machine reboots into the first update, which is not ideal.

@lucab:

the agent would not directly support etcd (nor any other specific DB). The rationale is in trying to design the agent as a client that asks for reboot permissions, instead of being a semaphore manager itself. I would say this is corroborated by the historical tickets on locksmith and it was already hinted at before. We can re-consider adding etcd (v3?) support in the future though, but I think an off-node containerized lock manager would fit better for the cluster scenario.

If this were purely new development, I'd agree. But we'll also want a migration path for users running locksmith + etcd. In principle that could be a new distributed service that runs in the cluster, but I wonder if that wouldn't make CL migration too complex.

forced (i.e. manual) updates would go directly via rpm-ostree, bypassing Cincinnati.

That would make it harder for #98 metadata to serve as the single source of truth. To pin upgrades at an older version, we'd need to update the stream metadata, rebuild the graph, and also update the ostree ref. (I'm not sure how much of a problem that is. In principle we might want to do it anyway.)

if the agent doesn't handle the locks itself, there isn't an urgent need for a CLI client.

Sure. I was thinking more of the force-update case, if done via Cincinnati.

I don't have an answer on refs, I think we punted that here as the cincinnati payload is opaque. Do you think there is a blocker that should be considered and addressed in this context?

Nope, just curious.

bgilbert commented 5 years ago

@lucab and I discussed further OOB. I see the point about keeping etcd support out of the agent. For users migrating from CL, we could provide a containerized lock manager which synchronizes with etcd, and a migration guide which recommends running that container on every node. New clusters wouldn't be deployed with that model, but it'd provide a way for existing non-Kubernetes clusters to get migrated quickly.
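
For that migration path, the node side could stay as thin as "ask an HTTP endpoint before rebooting". A crude sketch, shelling out to curl (the endpoint path, JSON shape, and group name are assumptions for illustration, not the eventual lock-manager protocol):

```rust
use std::process::Command;

/// Ask an off-node lock manager for permission to reboot.
/// Returns true only if the manager answers with an HTTP success code.
fn reboot_permitted(lock_manager: &str, node_id: &str) -> bool {
    // Hypothetical request body identifying this node and its reboot group.
    let body = format!(r#"{{"id":"{}","group":"default"}}"#, node_id);
    let status = Command::new("curl")
        .args(["--fail", "--silent", "--show-error", "-X", "POST", "-d"])
        .arg(&body)
        // Hypothetical endpoint; the real protocol is up to the lock manager.
        .arg(format!("{}/v1/pre-reboot", lock_manager))
        .status();
    matches!(status, Ok(s) if s.success())
}

fn main() {
    if reboot_permitted("http://lock-manager.example.com", "node-a") {
        println!("slot acquired; safe to finalize and reboot");
    } else {
        println!("no slot; retry later");
    }
}
```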

lucab commented 5 years ago

I started experimenting with such a containerized lock manager, findings at https://github.com/coreos/fedora-coreos-tracker/issues/3#issuecomment-465953659.

I also started experimenting with a minimal on-host agent, trying to understand what the configuration axes, the state machine, and the introspectable parameters are. On the last topic, the agent needs to discover the following details about the OS at runtime and in an offline way:

lucab commented 5 years ago

My current experiment is at https://github.com/lucab/exp-zincati. I think we have tickets in place (and in progress) for all the high level dependencies I saw while sketching that.

In order to progress toward stabilizing that, I'd like to get a repo somewhere under the coreos org and polish+move the code there. For this, we need to come up with a proper name for this component that sits between airlock, cincinnati and rpm-ostree (my current placeholder is "zincati" for no interesting reason).

/cc @ajeddeloh @arithx @LorbusChris as my usual go-to folks for name bikeshedding

arithx commented 5 years ago

praeceptor, houston, lcm

jtligon commented 5 years ago

road-trip, minicar

lucab commented 5 years ago

Related to the rockets / engines / jet theme: flameout (I'm not sure if it has an intrinsic negative connotation, though)

bgilbert commented 5 years ago

@lucab It does. :grin:

arithx commented 5 years ago

From fcos IRC meeting: caro

LorbusChris commented 5 years ago

coast, coaster

lucab commented 5 years ago

Earth-related: fallower (or similar variations), as it applies crop rotation to nodes in a cluster.

Otherwise, from the other engine-related thread we still have scramjet, pilotlight and bridgewire, which could fit.

crawford commented 5 years ago

update-agent? (Is that what this thing is?)

dustymabe commented 5 years ago

spinbridge

zincati (current name) seems cool. fallower and bridgewire are interesting too

jlebon commented 5 years ago

I like zincati too. Has some resemblance with Cincinnati (I'm guessing that was the purpose?) but pronounced differently enough.

dustymabe commented 5 years ago

zincati it is! https://github.com/coreos/zincati - go forth!

lucab commented 4 years ago

This has been fully implemented in Zincati and the FCOS Cincinnati backend at this point, so I'm closing the ticket.

There are further specific use cases and docs tasks which are ongoing and tracked by dedicated tickets, e.g. https://github.com/coreos/zincati/pull/116 and https://github.com/coreos/fedora-coreos-tracker/issues/240.