Open bgilbert opened 3 years ago
I like the flag files as a consistent way to gate and modify migratory defaults over time. Often we know a change (e.g. cgroups v2) is being made to channels over time and need to choose when to take action in a higher level distro (e.g. do we do what is best for the stable channel or testing at a point in time). The flags could help us declare that choice well in advance of the migration taking effect on a channel.
I'm afraid that having feature toggles could get out of hand quickly and significantly increase the test matrix, as we would have to test all combinations of all features. I think we should have something similar to Rust editions instead:
edition: 2021-06-21
When you set an edition in your Butane config, you get all the default features that were enabled at that date and none added later. When we introduce new major changes, they are gated behind a new edition value.
This keeps the deployment behavior consistent across releases while making it easy to adapt / update to the latest one. Butane should then warn about configs without editions or assume it is the first one by default. We could even have different sugar per edition (although that might not be a good idea).
The issue with this behavior is that the configuration will keep growing forever as we introduce changes.
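To make the edition semantics concrete, here is a minimal Python sketch of "all defaults enabled at that date and not later". Nothing here is an existing Butane or FCOS interface: the feature names and dates are borrowed from the fictional table later in this thread, and `FIRST_EDITION` and `features_for_edition` are hypothetical names.

```python
import warnings
from datetime import date

# Hypothetical catalogue: feature name -> date it became a default.
# The entries are fictional and only illustrate the lookup.
FEATURE_INTRODUCED = {
    "cgroupsv2": date(2021, 1, 1),
    "systemd-oomd": date(2021, 2, 15),
    "zram-swap": date(2021, 5, 1),
}

# Assumed baseline used when a config does not name an edition.
FIRST_EDITION = date(2021, 1, 1)


def features_for_edition(edition=None):
    """Return the features that were defaults on or before the edition date."""
    if edition is None:
        # Warn, then fall back to the first edition, as suggested above.
        warnings.warn("no edition set; assuming the first edition")
        cutoff = FIRST_EDITION
    else:
        cutoff = date.fromisoformat(edition)
    return {name for name, since in FEATURE_INTRODUCED.items() if since <= cutoff}
```

A config written with `edition: 2021-02-20` would then keep resolving to the same two features no matter how many entries are added to the catalogue afterwards.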
The other alternative is to have either support for editions in Ignition or in another binary that would perform first boot setup depending on the edition (i.e. do the work to move an image from latest to the previous edition). But this is getting much more complicated and would duplicate Ignition logic which might not be a good idea.
I guess having a list of empty files such as /etc/fedora-coreos-editions/2021-06-01 would be harmless and would simplify the logic for units like you suggested.
Yeah, I think the edition idea is worth pursuing. As you say, a single edition value is a lot easier to write in a config than a growing list of flags.
I'm imagining that the edition would be configured by writing a single file such as /etc/fedora-coreos/edition; there would be Butane sugar for this. We could then have some early boot code (systemd generator?) that would convert that file into a set of feature flags in /run for the convenience of other systemd units. That way, Butane wouldn't need to know the semantics of different edition values, preserving its forward compatibility with new OS releases.
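As a rough illustration of that conversion step, the sketch below reads an edition file and creates one empty flag file per implied feature. It is only the conversion logic, not a real systemd generator; the function name, the catalogue, and the directory layout are assumptions made for the example, and the catalogue would ship with the OS rather than with Butane.

```python
from datetime import date
from pathlib import Path

# Hypothetical catalogue shipped with the OS: feature -> first edition including it.
INTRODUCED = {
    "cgroupsv2": date(2021, 1, 1),
    "systemd-oomd": date(2021, 2, 15),
    "zram-swap": date(2021, 5, 1),
}


def write_feature_flags(edition_file, flag_dir, introduced=INTRODUCED):
    """Create an empty flag file for each feature implied by the edition file."""
    # The edition file is assumed to hold a single ISO date, e.g. "2021-06-21".
    edition = date.fromisoformat(Path(edition_file).read_text().strip())
    flag_dir = Path(flag_dir)
    flag_dir.mkdir(parents=True, exist_ok=True)
    enabled = sorted(name for name, since in introduced.items() if since <= edition)
    for name in enabled:
        # Empty file; other units only need to test for its existence.
        (flag_dir / name).touch()
    return enabled
```

Because the mapping lives on the OS side, a newer OS release can extend the catalogue without any change to how the config was transpiled.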
Editions don't let us avoid testing combinations of features, though, because there must be a way to override the individual behaviors implied by an edition. For example, our docs would need to have a table like this fictional one:
| Edition | New with this edition | How to revert |
|---|---|---|
| 2021-01-01 | Enable cgroupsv2 by default | Add xyz to kernel_arguments.should_exist |
| 2021-02-15 | Enable systemd-oomd by default | Mask systemd-oomd.service |
| 2021-05-01 | Enable zram swap by default | Write memory.compression=0 to /etc/sysctl.d/no-zram.conf |
If we didn't allow this, and a particular user needed to e.g. stay on cgroupsv1 for reasons beyond their control, they could be stuck on an old edition for months or years. Then, when they were ready to update to the current edition, they'd need to adapt to several behavior changes at once, and they'd need to do it in exactly the order we specify. That would limit our ability to deliver new functionality into widespread use, and to eventually remove support for old editions.
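To show how the "several behavior changes at once" problem accumulates, here is a small Python sketch that lists what a user picks up when jumping between two editions. The data is the fictional table above; `changes_to_adopt` is a hypothetical helper, not an existing tool.

```python
from datetime import date

# The fictional edition table from this comment, in edition order.
EDITION_CHANGES = [
    (date(2021, 1, 1), "Enable cgroupsv2 by default"),
    (date(2021, 2, 15), "Enable systemd-oomd by default"),
    (date(2021, 5, 1), "Enable zram swap by default"),
]


def changes_to_adopt(current, target):
    """List behavior changes picked up when moving from `current` to `target`."""
    current_date = date.fromisoformat(current)
    target_date = date.fromisoformat(target)
    # Changes already covered by the current edition are excluded.
    return [
        description
        for introduced, description in EDITION_CHANGES
        if current_date < introduced <= target_date
    ]
```

A user pinned to 2021-01-01 because of cgroups would face both later changes together when moving to 2021-05-01, which is the situation per-behavior overrides are meant to avoid.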
> Editions don't let us avoid testing combinations of features, though, because there must be a way to override the individual behaviors implied by an edition

This is my chief criticism of the editions idea. In principle, I think it's fine, but to the sys admin, it abstracts the feature into a collection that may be too rigid; that could effectively render the feature unused as it would be easier to do the masking themselves. The developer in me loves the idea of editions. The former sys admin wants something that's clear in meaning and doesn't require a reference lookup.
> could effectively render the feature unused as it would be easier to do the masking themselves.

I wonder if that's too strong. The goal is to configure our defaults; hopefully those defaults will be useful for many use cases without further tweaking. I agree that users wouldn't necessarily seek out this feature, but we'd strongly encourage its use in getting-started guides and similar.
I'd have direct use for the feature gates. Usually it's known (or becomes clear) which features need to be disabled for clusters or for whatever reason. Having "editions" as sugar over the features would require finding docs that map an edition to what that means for features, and then using a combination of edition + feature overrides, which seems more complicated imo. I would rather just enumerate divergences from defaults (e.g. cgroupsv2 yes, oomd no), especially if users are able to add those flags in advance of them taking effect (e.g. the oomd default may change, but I'd like to go ahead and say disable).
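A minimal Python sketch of that "enumerate divergences" model, under the assumption that flags are simple booleans keyed by feature name (the names and helper functions are hypothetical):

```python
def effective_settings(defaults, overrides):
    """Layer explicit user flags over the OS defaults.

    Overrides for features the OS does not know about yet are kept, so a
    flag can be declared before the corresponding default changes.
    """
    return {**defaults, **overrides}


def divergences(defaults, overrides):
    """Return only the overrides that currently differ from the defaults."""
    return {
        name: value
        for name, value in overrides.items()
        if defaults.get(name) != value
    }
```

With `overrides = {"oomd": False}`, the effective value of `oomd` stays `False` both before and after the OS default flips to `True`; `divergences` is empty before the flip and `{"oomd": False}` after it, which is also the kind of summary a bug report could include.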
Agreed that the testing story seems independent. As long as there are configurables, you'll want bug reports to include the set of feature toggles in use, if any.
In the meeting yesterday there was a lot of discussion on this topic (see notes and video). A couple of noteworthy threads:
One final thought towards the end of the meeting was:
This issue was created in the context of https://github.com/coreos/fedora-coreos-tracker/issues/880 and that issue was created because we have a few specific changes we want to make. We do need to address the underlying problem right now. A rough way to do that is to create documentation for Kubernetes Distributions on the particular configurations they might want to employ (alongside their provisioning instructions for deploying k8s).
The documentation would allow us some time to refine this feature flags proposal and determine what we need longer term, while unblocking the few features we have in the pipeline.
I see several issues with feature flags:
This is why I suggested the edition gate as this will enable us to disable unknown future features by default while enabling a specific set of features known at a given point in time. This decouples the time the config is created and transpiled from the edition time. Butane can also error out on unknown editions.
But in the end, having a table with the correspondence between editions and features, and how to disable / choose them, ends up being really close to having a doc page for the major changes that we make. The edition feature would also probably require us to do more work to test that it actually works.
Edit: Thus I'm in favor of documentation only for now :)
At present, when we want to make a breaking change, we:
This is all handled ad-hoc and is a non-trivial amount of work. Also, none of it is machine-readable. Cluster administrators and developers of layered software (e.g. a Kubernetes distribution) need to manually monitor coreos-status and push any needed code/config updates by the migration deadline, or things will break. Now that we're discussing optimizing our defaults for the non-Kubernetes case (#880) this is even more relevant for layered software.
Consider adding a structured mechanism for enabling alternative OS behavior. This might be e.g. a series of flag files in /etc/fedora-coreos, or an /etc/fcos-features.conf. In cases where a behavior change is expected to break existing deployments (e.g. #292, #840, #859), we'd add a flag, and FCOS would default to the old behavior unless the flag is set. We'd update our getting-started docs to enable the new behaviors by default, and add a documentation page listing the flags and recommended values. We could also add Butane sugar for this.

The underlying implementation could be as simple as ConditionPathExists=/etc/fedora-coreos/oomd. Kernel arguments are more awkward, but there could be initrd code to extract the configured features from the Ignition config and inject the corresponding kargs.

This mechanism would allow us to be more aggressive about changing behavior, and would give administrators and layered software more assurance that breaking changes won't occur at inconvenient times. It doesn't need to handle every possible breaking change, and doesn't prevent us from eventually removing the old behavior; we can still perform a manual deprecation as before. It just provides a cleaner way to ship incompatible changes that's more flexible for both developers and users.
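For the kernel-argument case, the extraction step might look roughly like the Python sketch below. It assumes flags are written as files directly under /etc/fedora-coreos via the `storage.files` section of the Ignition config. The flag name `cgroupsv2` and the flag-to-karg mapping are illustrative assumptions, not something FCOS defines, and real initrd code would not be written in Python.

```python
import json
from pathlib import PurePosixPath

# Directory proposed above for flag files.
FLAG_DIR = PurePosixPath("/etc/fedora-coreos")

# Hypothetical mapping from a flag name to the karg it implies.
FEATURE_KARGS = {
    "cgroupsv2": "systemd.unified_cgroup_hierarchy=1",
}


def configured_features(ignition_json):
    """Return names of flag files that the Ignition config writes into FLAG_DIR."""
    config = json.loads(ignition_json)
    files = config.get("storage", {}).get("files", [])
    return sorted(
        PurePosixPath(entry["path"]).name
        for entry in files
        if PurePosixPath(entry["path"]).parent == FLAG_DIR
    )


def kargs_for(features):
    """Translate enabled features into kernel arguments; unmapped flags add none."""
    return [FEATURE_KARGS[name] for name in features if name in FEATURE_KARGS]
```

Flags with no kernel-argument counterpart, such as the oomd example, simply produce no kargs and remain plain files for units to test with ConditionPathExists.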
cc @dghubble for comments.