coreos / fedora-coreos-tracker

Issue tracker for Fedora CoreOS
https://fedoraproject.org/coreos/

Develop Fedora CoreOS layering user stories #1219

Open jlebon opened 2 years ago

jlebon commented 2 years ago

In this ticket, let's come up with the various use cases that CoreOS layering enables for Fedora CoreOS. This will allow us to have more targeted discussions about each of them and evaluate (1) whether they're worth pursuing, (2) what the UX would look like, and (3) how it would be implemented.

Proposed documentation for use cases/explanation:

CoreOS Layering Use Cases

The technology we are referring to as "CoreOS Layering" allows users/projects to easily build derivative works on top of Fedora CoreOS. The build experience re-uses the container build workflow (i.e. Dockerfile or Containerfile), which is pervasive across the industry. The output of this build is a container image that can be hosted in a registry. Existing systems can be rebased to this container image and follow updates from there.

This is a significant change to the tooling around building new OSTree commits based on CoreOS. This is a net new set of features/tooling. It doesn't change any current tooling users may use and does not require users to make any changes to their current setup.

Here are some current use cases that are made possible or easier with CoreOS Layering:

As an End User

Currently Fedora CoreOS allows users to modify the system through Ignition, by writing to any read/write directories, or via RPM package layering on the client side. These approaches have served us well, but some use cases can be improved.

First, delivering configuration/files/software via Ignition works well, but can get heavy and also requires the user to re-provision if the configuration ever changes; Ignition is only supported for the first boot of a machine. Secondly, client-side package layering can take a lot of time to run on first boot, makes upgrades less reliable, and isn't tightly controlled (software in package repositories changes often).

It may be more desirable to deliver all of the changes as a derivative layer built as a container image, and delivered on first boot. We think this will be attractive in the following uses:

Detailed Scenario: A user does a bare metal install of 10 systems in a datacenter. The user later discovers they should have deployed the systems with a bonded networking setup. With CoreOS Layering this change can be made to the Dockerfile definition, rebuilt, and delivered as an update. Without CoreOS Layering the recommended way would be to re-install/re-provision the machines, which would represent a significant waste of time for this user.

Detailed Scenario: One compelling example is the case of third party kernel modules. A user can do a new multi-stage container build based on a recently delivered Fedora CoreOS base container. This multi-stage build can detect the delivered kernel in the base container, build the kernel module from source, and copy the results into the target container image. This committed container can now be pushed to a registry and clients can target that image for updates.

Detailed Scenario: To illustrate this use case further we can walk through a client-side package layering scenario. Machine A and Machine B exist and are following the stable stream. Both have the NetworkManager-wifi package layered. A new stable update is released on Monday. Machine A updates on Monday, early in the rollout window; the client pulls in the new update and pulls NetworkManager-wifi from the package repo, making a new client-side commit. On Tuesday Machine B attempts to update. At this point a few things could go wrong:

  1. The package repo is unavailable. In this case the client-side package layering operation will fail. Machine B stays on the old commit and keeps retrying.
  2. The package layering is successful, but pulls in a different version of NetworkManager-wifi than Machine A.

In both of these cases Machine A and Machine B, which are expected to be running more or less the exact same software, have diverged. Further, in case 1, if the package repository never comes back (e.g. when using a repo from a third party), that machine is stuck forever.
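The Machine A / Machine B walkthrough above can be modelled in a few lines of Python. This is only an illustrative sketch: the `PackageRepo` and `Machine` classes, the stream names `stable-1`/`stable-2`, and the package version strings are made up for the example and do not correspond to any real rpm-ostree API or release. It contrasts resolving the package at update time (client-side layering) with resolving it once at image build time.

```python
from dataclasses import dataclass, field


@dataclass
class PackageRepo:
    """Hypothetical package repo: maps a day to the version it serves that day.

    A missing day or a None value models the repo being unavailable.
    """
    versions_by_day: dict

    def latest(self, day):
        version = self.versions_by_day.get(day)
        if version is None:
            raise ConnectionError(f"package repo unavailable on {day}")
        return version


@dataclass
class Machine:
    name: str
    base: str
    layered: dict = field(default_factory=dict)

    def update_client_side(self, new_base, package, repo, day):
        """Client-side layering: the package is resolved when the update runs."""
        try:
            version = repo.latest(day)
        except ConnectionError:
            return False  # stay on the old commit and retry later
        self.base = new_base
        self.layered[package] = version
        return True

    def update_from_image(self, image):
        """Image-based update: base and packages were resolved once, at build time."""
        self.base = image["base"]
        self.layered = dict(image["packages"])
        return True

    def software(self):
        return (self.base, tuple(sorted(self.layered.items())))


def simulate():
    pkg = "NetworkManager-wifi"

    def fresh(name):
        # Both machines start out identical (made-up versions).
        return Machine(name, "stable-1", {pkg: "1.0"})

    # Case 2: the repo serves a newer package on Tuesday than on Monday.
    a, b = fresh("A"), fresh("B")
    drifting = PackageRepo({"Mon": "1.1", "Tue": "1.2"})
    a.update_client_side("stable-2", pkg, drifting, "Mon")
    b.update_client_side("stable-2", pkg, drifting, "Tue")
    version_drift = a.software() != b.software()

    # Case 1: the repo is unavailable on Tuesday, so B cannot update.
    a, b = fresh("A"), fresh("B")
    outage = PackageRepo({"Mon": "1.1", "Tue": None})
    a.update_client_side("stable-2", pkg, outage, "Mon")
    updated = b.update_client_side("stable-2", pkg, outage, "Tue")
    stuck_on_outage = (not updated) and b.base == "stable-1"

    # Layered image: the package was pinned when the image was built.
    a, b = fresh("A"), fresh("B")
    image = {"base": "stable-2", "packages": {pkg: "1.1"}}
    a.update_from_image(image)
    b.update_from_image(image)
    image_based_identical = a.software() == b.software()

    return {
        "version_drift": version_drift,
        "stuck_on_outage": stuck_on_outage,
        "image_based_identical": image_based_identical,
    }


if __name__ == "__main__":
    print(simulate())
```

In this model the two client-side failure modes come from the same place: the package set is decided on each machine at update time, so it depends on the state of the repo on that day. With a prebuilt image that decision is made once, by whoever builds the image.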

As a Layered Project

Fedora CoreOS provides a nice stable base for other projects to build on top of; however, not every decision Fedora CoreOS makes is right for layered projects. Currently a layered project needs to either encode every change into an Ignition configuration that runs on boot of every instance, or rebuild a brand new OSTree completely from scratch.

CoreOS Layering offers the opportunity for layered projects to easily make tweaks to Fedora CoreOS. Some examples of layered projects as of July 2022 that take different approaches:

With CoreOS Layering these projects can provide a more polished solution for end users:

Potential Drawbacks of CoreOS Layering

The CoreOS Layering technology is still under active development. There are currently some workflows that haven't been fully fleshed out. Here is a summary:

dustymabe commented 2 years ago

@jlebon and I got together to try to flesh out the CoreOS Layering use cases.

I have updated the description of this issue (2022-07-20) with some proposed use cases where the value of CoreOS layering is illustrated. I have also described current limitations of CoreOS Layering that we'll be working to address in the coming months. Please take a look and let us know how this could be improved and if any corrections need to be made.

cgwalters commented 2 years ago

Looked through this; seems sane. Thanks so much for writing this up!

The technology we are referring to as "CoreOS Layering"

That said, https://fedoraproject.org/wiki/Changes/OstreeNativeContainer currently proposes "ostree native container". About 90% of all the stuff written here applies outside of Fedora CoreOS too. Something to keep in mind.

cgwalters commented 2 years ago

Edit (jlebon): comment moved to https://github.com/coreos/fedora-coreos-tracker/issues/1263#issuecomment-1191702805

But regarding zincati specifically:

half-baked strawman: embed barriers in the container image

We encode "epochs"/barriers like this:

We embed metadata in the container images that says that quay.io/coreos-assembler/fcos:stable-v1 is the successor to quay.io/coreos-assembler/fcos:stable-v0.

This is related to https://github.com/ostreedev/ostree/pull/874
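One possible reading of the barrier strawman above, sketched in Python. Everything here beyond the two image references is an assumption: the thread does not say what the metadata key is called or how a client would look it up, so `successor-of`, `find_successor`, `upgrade_path`, and the in-memory `registry` dict are hypothetical stand-ins rather than anything Zincati or ostree implements.

```python
# Hypothetical metadata key; the thread does not name one.
SUCCESSOR_OF = "successor-of"


def find_successor(current_ref, registry):
    """Return the ref whose metadata declares it the successor of current_ref."""
    for ref, metadata in registry.items():
        if metadata.get(SUCCESSOR_OF) == current_ref:
            return ref
    return None


def upgrade_path(start_ref, registry):
    """Follow successor links from start_ref, crossing one barrier at a time."""
    path = [start_ref]
    seen = {start_ref}
    while True:
        nxt = find_successor(path[-1], registry)
        if nxt is None:
            return path
        if nxt in seen:
            raise ValueError(f"successor cycle at {nxt}")
        path.append(nxt)
        seen.add(nxt)


# Stand-in for image metadata fetched from a registry.
registry = {
    "quay.io/coreos-assembler/fcos:stable-v0": {},
    "quay.io/coreos-assembler/fcos:stable-v1": {
        SUCCESSOR_OF: "quay.io/coreos-assembler/fcos:stable-v0",
    },
}

if __name__ == "__main__":
    for ref in upgrade_path("quay.io/coreos-assembler/fcos:stable-v0", registry):
        print(ref)
```

The point of walking the chain link by link, rather than jumping to the newest tag, is that a client would pass through every epoch in order, which is what a barrier is meant to guarantee.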

cgwalters commented 1 year ago

I think chunks of this are now covered by the existence of https://github.com/coreos/layering-examples right?

If it's about trying to just explain the benefits and drawbacks, we could just move that into https://github.com/coreos/fedora-coreos-docs/pull/540 ?

cgwalters commented 1 year ago

I think what this issue is trying to get at really is the baseline question of when should users:

I think for 90% of this there's really nothing FCOS specific in the end (this is why the term "coreos layering" is misleading as a technology descriptor). IOW all of these tradeoffs also apply to other rpm-ostree based systems. This is particularly relevant to, but not limited to, desktop ones.

So my instinct here is to:

?

bgilbert commented 1 year ago

Fedora CoreOS docs describe lots of things that technically repeat documentation elsewhere. We should aim to be maximally helpful to new users, rather than asking them to assemble the opinions of various upstreams.

But also, FCOS is opinionated in various ways, and "when should you use Ignition vs. layering" seems like an important thing to have an opinion about. It affects not only the advice we give users, but our priorities for the functionality we build, and indeed how we think about the distro as a whole.

cgwalters commented 1 year ago

Yes, I have some! I added some bits from this in https://github.com/coreos/fedora-coreos-docs/pull/540#issuecomment-1534694065

But I'd hope you have specific opinions on this too that could be expressed here in the doc directly - would love to make this feel more collaborative.

cgwalters commented 1 year ago

I made this point in an OpenShift meeting, but I wanted to write it down here; going back one level, when I was saying this isn't FCOS specific in that it also applies to other rpm-ostree systems - in fact even if rpm-ostree (and coreos etc.) didn't exist, this problem also exists today in RHEL.

The day that RHEL introduced Image Builder, suddenly there were two ways to set up that PostgreSQL server in Azure: start from a stock cloud image (maybe using cloud-init/ansible/whatever and yum install postgresql), or build golden disk images with IB and do updates by instance teardown/spinup. These things have the same very fundamental tradeoffs around systems management that we're introducing here.

I think indeed, it is on us to provide guidance. But I don't think there's any real way to not support these two paths ("configure" vs "build/own").

Now, what I actually hope is that this fundamental change in mindset and technology eventually leads us to a place where we have a more "seamless" spectrum into what is today e.g. Fedora Cloud and Fedora Server, instead of having harder barriers.