Kubernetes Update Operator

mitchellmaler commented 5 years ago

Currently on CoreOS Container Linux we make use of the Container Linux Update Operator to orchestrate the updates (restarts) of our Kubernetes cluster nodes, based on its configuration and its agent integrating with locksmith. Will there be an equivalent for Fedora CoreOS that can be deployed to a Kubernetes cluster and work with Zincati to orchestrate updates?

I noticed the airlock project, which can run as a container and needs to connect to an etcd3 server (cluster). Under Kubernetes we already have etcd nodes, but policy forbids us from giving access to them. Does this mean we are required to run another etcd cluster just for updates, or is it possible to make use of Kubernetes objects to orchestrate the updates using an operator?

lucab commented 5 years ago

All questions are really spot on, but many pieces are still moving, so I'll try to give an overview of the current state (which may change soonish). Please note that this could fit into a larger discussion around k8s-updates / config-management / machine-config-operator, but I'll purposely keep the scope of this ticket to "FCOS update-reboot orchestration on k8s" only.

For reference, the historical decisions behind this are recorded at https://github.com/coreos/fedora-coreos-tracker/issues/3.

Does this mean we are required to run another etcd cluster just for updates or is it possible to make use of kubernetes objects to orchestrate the updates using an operator?

That isn't the intended usage, no. The scope of airlock is just to replace the same logic in locksmith, which only supported etcd as a distributed backend. The use case is for machines that already have direct access to an etcd cluster, likely without any access to the objects of a higher-level orchestrator. If you have to deploy an etcd cluster just for airlock, then there are better options to consider.

Will there be an equivalent for Fedora Coreos that can be deployed to a Kubernetes cluster and work with zincati to orchestrate updates?

That's the idea, yes. But we don't plan to write orchestrators for each possible backend on our own, nor shove all of those into airlock. Instead, the plan is to stabilize the HTTPS-based protocol that Zincati uses, so that the reboot-manager can run in a separate container and its implementation can be swapped to support other backends. Within this context, each community with a common interest can maintain its own containerized manager, decoupled from the OS and from other backends/implementations.

As of this date, we are still stabilizing the basics of auto-updates, so fleet-wide orchestration is still on the development radar. The protocol is currently drafted at https://github.com/coreos/airlock/pull/1, while the client-support in Zincati is tracked at https://github.com/coreos/zincati/issues/37.

mitchellmaler commented 5 years ago

@lucab Thanks for the overview! I am glad there will be similar functionality in the future.

LorbusChris commented 5 years ago

Right now in Red Hat OpenShift we have the machine-config-operator (MCO) for this. In the initial release of OKD4 it will do the FCOS updates instead of the airlock/zincati duo that usually does it in FCOS, using a slightly different update payload delivery mechanism (i.e. os-container, aka container-embedded ostree, vs. the usual rpm-ostree commit). We will do our best to abstract away the interfaces for those controllers and make them replaceable/pluggable (in a way that would allow Zincati/Airlock to control how the MCO/the cluster does things).

MPV commented 4 years ago

@lucab https://github.com/coreos/zincati/issues/37 now seems closed, would you be open to sharing what the current state is? 😍

lucab commented 4 years ago

@MPV I've left a few cross-links in place, so if you want to explore more feel free to click through. Below is a quick summary of the current status.

- client-side logic is done, see https://github.com/coreos/zincati/blob/master/docs/usage/updates-strategy.md#lock-based-strategy
- the server-side logic to replace the locksmith etcd strategy is done, see https://github.com/coreos/airlock
- OKD4 decided not to use FCOS auto-updates, orchestrating everything via https://github.com/openshift/machine-config-operator instead
- I am not aware of any k8s-native reboot-lock-manager implementation at this point. And to the best of my knowledge there isn't any plan on our (@coreos) side to write such an operator.
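To make the client side of the list above more concrete, here is a minimal sketch of the kind of request a node sends to a lock manager. It assumes the endpoint paths (`/v1/pre-reboot`, `/v1/steady-state`), the `fleet-lock-protocol: true` header and the `client_params` JSON body as I understand them from Zincati's protocol docs; please check those docs before relying on any of it. The helper name, the example URL and the node id are made up for illustration, and the request is only built, never sent.

```python
import json
import urllib.request


def build_fleet_lock_request(base_url, action, node_id, group="default"):
    """Build (but do not send) a lock-manager request.

    Hypothetical helper. `action` is "pre-reboot" (ask for a reboot slot)
    or "steady-state" (release it after the node is back up).
    """
    if action not in ("pre-reboot", "steady-state"):
        raise ValueError(f"unknown action: {action}")
    # Assumed body shape: the node identifies itself and its reboot group.
    body = {"client_params": {"id": node_id, "group": group}}
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/v1/{action}",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "fleet-lock-protocol": "true",  # assumed mandatory protocol header
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Example URL and node id are placeholders, not real endpoints.
req = build_fleet_lock_request("http://lock-manager.example.com", "pre-reboot", "node-a")
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
```

Any service that answers these two calls consistently could act as the reboot manager, which is what makes the backend swappable.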

Circling back to my original reply, now we are basically at this point:

[...] the reboot-manager can run in a separate container and its implementation can be swapped to support other backends. Within this context, each community with a common interest can maintain its own containerized manager, decoupled from the OS and from other backends/implementations.

schmitch commented 4 years ago

I do not get the point why airlock was done with etcd instead of k8s as a backing store. I think airlock should actually be configurable to use k8s locking mechanisms.

Edit: the question is also what happens if airlock is only installed on 1 node and that node restarts: does the lock still stand, or does the node retry until the airlock server is up again? If the latter is the case, it will probably be really simple to create a good k8s integration.

lucab commented 4 years ago

I do not get the point why airlock was done with etcd instead of k8s as a backing store.

This is recorded with actual historical details and technical discussions at https://github.com/coreos/fedora-coreos-tracker/issues/3, feel free to go through it. The TLDR is "because it replaces locksmith etcd strategy".

Also, please be aware that the k8s API does not model a database with strongly consistent primitives (e.g. old HA clusters without "etcd quorum read" do return stale reads).
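As a toy illustration of why stale reads matter for a reboot lock, the sketch below shows two nodes that both read "lock is free" and then both try to take it. The `VersionedStore` class is entirely hypothetical and is not the k8s or etcd API; it only demonstrates the compare-and-swap style primitive that lets a backend reject a write based on an outdated view.

```python
class VersionedStore:
    """Toy single-key store; every successful write bumps a version counter."""

    def __init__(self):
        self.value = None  # current lock holder, or None when the lock is free
        self.version = 0

    def read(self):
        return self.value, self.version

    def compare_and_swap(self, expected_version, new_value):
        """Write only if nobody else wrote since `expected_version` was read."""
        if self.version != expected_version:
            return False
        self.value = new_value
        self.version += 1
        return True


store = VersionedStore()
_, seen_by_a = store.read()
_, seen_by_b = store.read()  # node-b's view is about to become stale

a_won = store.compare_and_swap(seen_by_a, "node-a")
b_won = store.compare_and_swap(seen_by_b, "node-b")  # rejected: based on a stale read
print(a_won, b_won, store.value)
```

Without the version check, node-b's write would silently overwrite node-a's, and both nodes would believe they hold the lock.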

I think airlock should actually be configurable to use k8s locking mechanisms.

That's understandable, but its design scope explicitly does not cover it. There are plenty of details to figure out (authentication, consistency, hooks, tolerations, draining, etc.), enough to warrant its own project by somebody intimately familiar with k8s. See the rest of the discussion about having dedicated containerized lock-managers.

The client->server protocol itself is documented at https://github.com/coreos/airlock/pull/1/files and designed to be easy to implement as a small web service on top of any consistent database.
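To give a feeling for how small the server-side logic can be, here is a rough sketch of the core of a lock manager: a counting semaphore per group, keyed by node id. It is not airlock's implementation. The class and method names are hypothetical, the state lives in process memory (a real manager would keep it in a consistent database), and the 409 status for "no slot available" is an arbitrary choice for this sketch.

```python
import threading


class InMemoryLockManager:
    """Counting semaphore per group; slot holders are tracked by node id."""

    def __init__(self, slots=1):
        self._slots = slots
        self._holders = {}  # group name -> set of node ids holding a slot
        self._mutex = threading.Lock()  # stands in for a database transaction

    def pre_reboot(self, group, node_id):
        """Try to take a reboot slot; return an HTTP-like status code."""
        with self._mutex:
            holders = self._holders.setdefault(group, set())
            if node_id in holders:
                return 200  # idempotent: this node already holds a slot
            if len(holders) >= self._slots:
                return 409  # all slots taken (status code chosen for this sketch)
            holders.add(node_id)
            return 200

    def steady_state(self, group, node_id):
        """Release the slot, if any; releasing twice is harmless."""
        with self._mutex:
            self._holders.get(group, set()).discard(node_id)
            return 200


manager = InMemoryLockManager(slots=1)
print(manager.pre_reboot("default", "node-a"))  # node-a takes the only slot
print(manager.pre_reboot("default", "node-b"))  # node-b has to wait
manager.steady_state("default", "node-a")       # node-a is back up
print(manager.pre_reboot("default", "node-b"))  # now node-b can reboot
```

Both operations are idempotent on purpose, so a client can safely repeat a request after a timeout or a restart.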

schmitch commented 4 years ago

the pr actually points to a rough explanation. not to a "protocol documentation".

mitchellmaler commented 4 years ago

Just saw this new project being worked on by Rancher, meant to be a more generic upgrade operator, not just Rancher-specific. I wonder if it could be enhanced to work with FCOS upgrades. It might even be able to work as it is; I need to dig into it more.

https://github.com/rancher/system-upgrade-controller

lukasmrtvy commented 4 years ago

Wait, what https://docs.fedoraproject.org/en-US/fedora-coreos/faq/#_how_do_i_coordinate_cluster_wide_os_updates_is_locksmith_or_the_container_linux_update_operator_available_for_fedora_coreos ?

bgilbert commented 4 years ago

@lukasmrtvy Excellent question! @lucab, do you know what that text is about?

dustymabe commented 4 years ago

Looks like that text was part of our announcement launch FAQ posted in June of 2018, so it may have been a little misguided or incorrect in retrospect.

lucab commented 4 years ago

Bunch of updates:

- the FAQ entry above is stale, it references the first step of the MCO. https://github.com/coreos/fedora-coreos-docs/pull/158 updates it to current state.
- third-party lock-manager implementations are starting to appear: https://github.com/opencounter/terraform-fleet-lock-dynamodb
- integration with Rancher's system-upgrade-controller is tracked at https://github.com/rancher/system-upgrade-controller/issues/87

dghubble commented 4 years ago

https://github.com/poseidon/fleetlock implements Zincati's FleetLock protocol on Kubernetes. It's small, nothing fancy (no drain).

curantes commented 2 years ago

https://github.com/poseidon/fleetlock implements Zincati's FleetLock protocol on Kubernetes. Its small, nothing fancy (no drain).

It actually has drain support now.

coreos / fedora-coreos-tracker

Kubernetes Update Operator #241