Migration API for breaking changes

bgilbert commented 3 years ago

At present, when we want to make a breaking change, we:

Announce the change on the coreos-status mailing list
Document a mechanism for selecting the old or new behavior with a Butane config, if possible
Change the OS to warn if the old behavior is enabled
Wait for the end of the deprecation period
Ratchet the change through the streams, perhaps involving further docs updates and coreos-status messages

This is all handled ad-hoc and is a non-trivial amount of work. Also, none of it is machine-readable. Cluster administrators and developers of layered software (e.g. a Kubernetes distribution) need to manually monitor coreos-status and push any needed code/config updates by the migration deadline, or things will break. Now that we're discussing optimizing our defaults for the non-Kubernetes case (#880) this is even more relevant for layered software.

Consider adding a structured mechanism for enabling alternative OS behavior. This might be e.g. a series of flag files in /etc/fedora-coreos, or an /etc/fcos-features.conf. In cases where a behavior change is expected to break existing deployments (e.g. #292, #840, #859), we'd add a flag, and FCOS would default to the old behavior unless the flag is set. We'd update our getting-started docs to enable the new behaviors by default, and add a documentation page listing the flags and recommended values. We could also add Butane sugar:

features:
  - oomd
  - zram

The underlying implementation could be as simple as ConditionPathExists=/etc/fedora-coreos/oomd. Kernel arguments are more awkward, but there could be initrd code to extract the configured features from the Ignition config and inject the corresponding kargs.

This mechanism would allow us to be more aggressive about changing behavior, and would give administrators and layered software more assurance that breaking changes won't occur at inconvenient times. It doesn't need to handle every possible breaking change, and doesn't prevent us from eventually removing the old behavior; we can still perform a manual deprecation as before. It just provides a cleaner way to ship incompatible changes that's more flexible for both developers and users.

cc @dghubble for comments.

dghubble commented 3 years ago

I like the flag files as a consistent way to gate and modify migratory defaults over time. Often we know a change (e.g. cgroups v2) is being made to channels over time and need to choose when to take action in a higher level distro (e.g. do we do what is best for the stable channel or testing at a point in time). The flags could help declare well in advance of the migration taking effect on a channel.

travier commented 3 years ago

I'm afraid that having feature toggles could get out of hands quickly and significantly increase the test matrix as we would have to test all combinations of all features. I think we should have something similar to Rust editions instead:

edition: 2021-06-21

When you set an edition in you Butane config you get all the defaults features that were enabled at that date and not later. When we introduce new major changes, they are gated behind a new edition value.

This keeps the deployment behavior consistent across releases while making it easy to adapt / update to the latest one. Butane should then warn about configs without editions or assume it is the first one by default. We could even have different sugar per edition (although that might not be a good idea).

travier commented 3 years ago

The issue with this behavior is that the configuration will keep growing forever has we introduce changes.

The other alternative is to have either support for editions in Ignition or in another binary that would perform first boot setup depending on the edition (i.e. do the work to move an image from latest to the previous edition). But this is getting much more complicated and would duplicate Ignition logic which might not be a good idea.

travier commented 3 years ago

I guess having a list of empty files such as /etc/fedora-coreos-editions/2021-06-01 would be harmless and would simplify the logic for units like you suggested.

bgilbert commented 3 years ago

Yeah, I think the edition idea is worth pursuing. As you say, a single edition value is a lot easier to write in a config than a growing list of flags.

I'm imagining that the edition would be configured by writing a single file such as /etc/fedora-coreos/edition; there would be Butane sugar for this. We could then have some early boot code (systemd generator?) that would convert that file into a set of feature flags in /run for the convenience of other systemd units. That way, Butane wouldn't need to know the semantics of different edition values, preserving its forward compatibility with new OS releases.

Editions don't let us avoid testing combinations of features, though, because there must be a way to override the individual behaviors implied by an edition. For example, our docs would need to have a table like this fictional one:

Edition	New with this edition	How to revert
2021-01-01	Enable cgroupsv2 by default	Add `xyz` to `kernel_arguments.should_exist`
2021-02-15	Enable systemd-oomd by default	Mask `systemd-oomd.service`
2021-05-01	Enable zram swap by default	Write `memory.compression=0` to `/etc/sysctl.d/no-zram.conf`

If we didn't allow this, and a particular user needed to e.g. stay on cgroupsv1 for reasons beyond their control, they could be stuck on an old edition for months or years. Then, when they were ready to update to the current edition, they'd need to adapt to several behavior changes at once, and they'd need to do it in exactly the order we specify. That would limit our ability to deliver new functionality into widespread use, and to eventually remove support for old editions.

darkmuggle commented 3 years ago

Editions don't let us avoid testing combinations of features, though, because there must be a way to override the individual behaviors implied by an edition

This is my chief criticism on the editions idea. In principal, I think it fine, but to the sys admin, it abstracts the feature into a collection that may be too rigid; that could effectively render the feature unused as it would be easier to do the masking themselves. The developer in me loves the idea of editions. The former sys admin wants something that's clear in meaning and doesn't require a reference look up.

bgilbert commented 3 years ago

could effectively render the feature unused as it would be easier to do the masking themselves.

I wonder if that's too strong. The goal is to configure our defaults; hopefully those defaults will be useful for many use cases without further tweaking. I agree that users wouldn't necessarily seek out this feature, but we'd strongly encourage its use in getting-started guides and similar.

dghubble commented 3 years ago

I'd have direct use for the feature gates. Usually its known (or becomes clear) which features need to be disabled for clusters or whatever reason. Having "editions" as sugar over the features, would require finding docs that map an edition to what that means for features and then using a combination of edition + feature overrides, which seems more complicated imo. Would rather just enumerate (e.g. cgroupsv2 yes, oomd no) divergences from defaults, especially if users are able to add those flags in advance of them taking effect (e.g. oomd default may change, but I'd like to go ahead and say disable).

Agreed that the testing story seems independent. As long as there are configurables, you'll want bug reports to include the set of feature toggles in use, if any.

dustymabe commented 3 years ago

In the meeting yesterday there was a lot of discussion on this topic (see notes and video). A couple of noteworthy threads:

We could have both feature flags and editions (an edition just maps to a set of enabled/disabled feature flags).
- The feature flag is the primitive and the edition is the grouping.
Will feature flags be dynamically applied on every boot, or just on first boot?
- much easier to just apply on first boot
- much harder to generically do for every feature on every boot
Should we limit this feature toggling scope just at the kubernetes use case?

One final thought towards the end of the meeting was:

Is implementing this functionality required right now?

This issue was created in the context of https://github.com/coreos/fedora-coreos-tracker/issues/880 and that issue was created because we have a few specific changes we want to make. We do need to address the underlying problem right now. A rough way to do that is to create documentation for Kubernetes Distributions on the particular configurations they might want to employ (alongside their provisioning instructions for deploying k8s).

The documentation would allow us some time to refine this feature flags proposal and determine what we need longer term, while unblocking the few features we have in the pipeline.

travier commented 3 years ago

I see several issues with feature flags:

If we say that a feature is only enabled if it is specified in a Butane config, then by default no features will be enabled which is a bad UX, going against the purpose of single node defaults. The list of features to tell users to add to their config will thus either grow infinitely or will have to be regularly trimmed.
If we say that by default all Butane known features are enabled unless explicitly disabled, this means that you will get different features enabled depending on the Butane version used to build the config. This is also surprising and bad UX.
If we say that a feature is enabled by default unless explicitly disabled, then this does not solve our problem as this won't enable provisioning new nodes with an older config while not enabling new features.

This is why I suggested the edition gate as this will enable us to disable unknown future features by default while enabling a specific set of features known at a given point in time. This decouples the time the config is created and transpiled from the edition time. Butane can also error out on unknown editions.

But in the end, having a table with correspondence between editions and features and how to disable / choose them ends up being really close to having a doc page for the major changes that we make and the edition feature would probably require us to do more work for testing that it actually work.

Edit: Thus I'm in favor of documentation only for now :)

coreos / fedora-coreos-tracker

Migration API for breaking changes #892