coreos / fedora-coreos-tracker

Issue tracker for Fedora CoreOS
https://fedoraproject.org/coreos/
cgroups v2 strategy #292

Closed bgilbert closed 3 years ago

bgilbert commented 4 years ago

Fedora 31 is planning to switch to cgroups v2 by default. The explicit goal is for Fedora to drive the leading edge, encouraging components that don't fully support v2 to do so. However, this may not yet be the best choice for Fedora CoreOS, which aims to provide broad and reliable container support for production environments.

We have the ability to make a different decision here than the rest of Fedora. However, if we ship the stable release with cgroups v1, we'll need to think about compatibility constraints when/if we later switch to v2.

We should define a strategy for the cgroups v2 transition.

bgilbert commented 4 years ago

As a data point, Docker in F31 doesn't run on cgroups v2 and the problem has been rejected as a Fedora release blocker.

jlebon commented 4 years ago

@rhatdan We discussed this during the last community meeting and agreed that shipping with v1 for now is the safer course of action. One question we had was: is there anything we can do now to make migration to v2 easier later?

rhatdan commented 4 years ago

Switching to crun by default would help. But I guess you should shift on Fedora 32. Since we have made the change, we are witnessing a lot of movement in runc and k8s towards cgroup v2.

dustymabe commented 4 years ago

We found an issue with the current approach specifically when it comes to handling the F30->F31 update. We'll proceed with trying the workaround described in that issue comment.

dustymabe commented 4 years ago

In the meeting from a few weeks ago we said:

this ticket is mostly done but we'll try to create docs and use fcct kargs support to show how to set it to v1 or v2

I opened https://github.com/coreos/fcct/issues/57 for the remaining items.

lucab commented 4 years ago

Before stepping into docs and fcct, it is unclear to me how we plan to tackle kargs configuration at the Ignition/initramfs level.

Right now we can arrange things (via Ignition) so that the BLS is changed after the first boot, but that requires a reboot to apply (thus it doesn't affect the first boot). Did I miss anything?
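To make the BLS change mentioned above concrete, here is a minimal sketch of editing the `options` line of a Boot Loader Specification entry. The function name `set_karg` and the sample entry text are hypothetical, and the sketch only transforms text: it does not locate or write the real entry file, and it does nothing about the reboot needed for the change to take effect.

```python
def set_karg(bls_entry: str, karg: str, present: bool = True) -> str:
    """Return BLS entry text with `karg` added to or removed from `options`.

    Hypothetical helper for illustration; it works on the entry text only.
    """
    out = []
    seen_options = False
    for line in bls_entry.splitlines():
        parts = line.split(None, 1)  # key, then the rest of the line
        if parts and parts[0] == "options":
            seen_options = True
            rest = parts[1] if len(parts) > 1 else ""
            # Drop any existing copy first so repeated calls are idempotent.
            args = [a for a in rest.split() if a != karg]
            if present:
                args.append(karg)
            line = " ".join(["options"] + args)
        out.append(line)
    if present and not seen_options:
        # No options line at all: add one carrying only the new argument.
        out.append("options " + karg)
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    entry = "title Fedora CoreOS\noptions root=UUID=abc rw\n"
    print(set_karg(entry, "systemd.unified_cgroup_hierarchy=0"), end="")
```

Removing the argument first and then re-adding it keeps the edit safe to run more than once, which matters if a first-boot unit might be retried.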

jlebon commented 4 years ago

Before stepping into docs and fcct, it is unclear to me how we plan to tackle kargs configuration at the Ignition/initramfs level.

Yup, that's https://github.com/coreos/ignition-dracut/issues/81. The status on that is essentially:

  1. fix https://github.com/ostreedev/ostree/issues/479 (https://github.com/ostreedev/ostree/pull/1836)
  2. get ostree working in the initrd
  3. add ignition-dracut service that detects kargs.d dropins

jtligon commented 4 years ago

with https://github.com/containerd/containerd/issues/3726 merging, do we need to look at this again?

dustymabe commented 4 years ago

Yeah. It needs to land in upstream releases and that needs to hit Fedora so we can pick it up; not sure if it will happen for Fedora 33.

Anybody know what the kubernetes cgroups v2 status is now?

giuseppe commented 4 years ago

the plan is to have cgroup v2 support in Kubernetes 1.19

miabbott commented 3 years ago

Per a message[1] to the Fedora CoreOS mailing list from @Conan-Kudo, Docker/Moby 20.10 now has cgroups v2 support.

[1] https://lists.fedoraproject.org/archives/list/coreos@lists.fedoraproject.org/thread/MIZ3FTYH46AW3TRTZ44S5CTW2CNIRLZ4/

LorbusChris commented 3 years ago

@olivierlemasle seems to be Fedora RPM maintainer.

Hi Olivier, could you give us an ETA for the new Moby release landing in Fedora? :innocent:

dustymabe commented 3 years ago

I don't anticipate the new version landing before Fedora 34. I think we should put this on our radar for switching FCOS to cgroups v2 around the time Fedora 34 is released.

olivierlemasle commented 3 years ago

Hi, I'm working on it. It will be packaged in Rawhide, so for Fedora 34.

jlebon commented 3 years ago

@giuseppe Can you confirm what the latest state of cgroupsv2 in Kubernetes is? I see that https://v1-19.docs.kubernetes.io/docs/setup/release/notes/ says:

Support for running on a host that uses cgroups v2 unified mode

But there's also https://github.com/kubernetes/enhancements/issues/2254 which you recently filed implying there's still work left to do there.

giuseppe commented 3 years ago

But there's also kubernetes/enhancements#2254 which you recently filed implying there's still work left to do there.

Support for cgroup v2 has been present since 1.19.

The additional work left to do is to support cgroup v2 new features that were not present in cgroup v1.

travier commented 3 years ago

During the meeting, we decided to move forward with cgroups v2 by default for new installations in F34. If we also want to automatically update existing systems to cgroups v2, we have to confirm that this will work with existing containers.

  • AGREED: now that kubernetes 1.19+ and the docker available in f34+ support cgroups v2 we will switch FCOS to default to cgroups v2. Currently our plan is to default to it on new nodes and leave existing upgraded nodes on whatever cgroups version they have been running on. If we consult cgroups experts and they tell us that is not necessary and migrating everyone should be fine, then we'll consider migrating nodes. (dustymabe, 17:08:28)

rhatdan commented 3 years ago

I would leave existing nodes with cgroups V1. There is a chance that existing containers can break. The biggest breakage I know of is RHEL7/CentOS 7 init containers that use systemd. The issue is that systemd in RHEL7 does not understand cgroups V2.
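For anyone checking which hierarchy an existing node is actually on before deciding whether to migrate it, a minimal sketch follows. It relies on the fact that a cgroup v2 (unified) mount exposes a `cgroup.controllers` file at its root; the function name `cgroup_mode` is hypothetical, and the sketch deliberately lumps legacy and hybrid setups together as "v1".

```python
from pathlib import Path


def cgroup_mode(root: str = "/sys/fs/cgroup") -> str:
    """Return "v2" if `root` looks like a unified cgroup mount, else "v1".

    Hypothetical helper: "v1" here covers both legacy and hybrid layouts.
    """
    # The root of a cgroup v2 hierarchy always has cgroup.controllers.
    if (Path(root) / "cgroup.controllers").is_file():
        return "v2"
    return "v1"


if __name__ == "__main__":
    print(cgroup_mode())
```

The `root` parameter exists only so the check can be pointed at a test directory; on a real node the default path is the one to use.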

rhatdan commented 3 years ago

It would also be nice to move to crun by default rather than runc; this would shrink the size of your image and use less memory when containers are starting. Getting rid of Docker and using podman-docker would be nice, but that might be a bridge too far. But it would save you a lot of disk space as well.

jtligon commented 3 years ago

If we're at an inflection point where we have to tell everyone to rebuild their containers, it would be great to rip off a bunch of band-aids at once as opposed to a new band-aid every release. What other defaults do we want to address that are listed in the Issue Tracker?

Conan-Kudo commented 3 years ago

Hi, I'm working on it. It will be packaged in Rawhide, so for Fedora 34.

@olivierlemasle Any progress on getting moby-engine updated to 20.x version?

travier commented 3 years ago

moby-engine has not been updated to 20.x releases yet so I don't know if this will happen for Fedora 34. I think we should re-discuss our plans here at the next community meeting.

As we cannot keep waiting forever for Moby to catch up, I'm in favor of switching to cgroupsv2 by default for new installations, keeping existing installations as is and making sure we have documentation for all cases (updating to v2 for existing installs, staying on v1 for new moby users).

rhatdan commented 3 years ago

I would suggest you drop it, and go with Podman 3.0 providing Docker API through APIV2. Sticking with cgroups V2 is just a bad decision.

You can shrink the size of Fedora CoreOS by eliminating moby-engine/runc and replacing them with podman-docker/crun. Move to cgroupsV2 and we have a Win/Win.

Conan-Kudo commented 3 years ago

Sticking with cgroups V2 is just a bad decision.

Freudian slip? I guess you mean cgroups V1.

Win/Win

It's only a win if you're not breaking people in the process. I don't have a problem with Fedora CoreOS making that change, but it needs to be communicated and compatibility stuff must be in place at both the CLI and API levels.

jtligon commented 3 years ago

One of the reasons quay.io chose fcos over rhcos was because they wanted moby for builds. They could package layer it but they’d rather not.

Their condition for switching to buildah was that every dockerfile built with buildah had the same output as docker build. I have no idea what our gap is there but I’d like to know more about it.

Since we’d be causing work for them, I’d like to make sure they are not hit by this out of the blue and would prefer them to agree to it.

rhatdan commented 3 years ago

Yes I meant the Cgroups V1.

I believe quay.io is no longer using Docker for builds; they are using Buildah. And even docker build can talk to podman listening on docker.sock to do the builds.

Fedora-CoreOS should be about the future, but now its defaults are behind Fedora's defaults, because it is locked to a somewhat poorly maintained/supported moby-engine.

If it is the upstream of RHEL-CoreOS, it would be nice if it was pioneering the next generation of OpenShift which is going to be on CgroupsV2.

Podman on a MAC is looking at using Fedora CoreOS as its preferred platform, but it is too large and does not default to cgroups V2...

rhatdan commented 3 years ago

If we break workloads of users of Fedora CoreOS, the Podman team is very responsive and can quickly get out fixes. Podman is focused on Fedora Workloads and does almost all of its development there, so it is much more likely to work well on that platform, and be a lot more responsive to customer issues than the moby-engine project.

Conan-Kudo commented 3 years ago

Podman on a MAC is looking at using Fedora CoreOS as its preferred platform, but it is too large and

What's your size target?

does not default to cgroups V2...

This I want to see change, for sure!

rhatdan commented 3 years ago

I don't have a size target, but I want smaller.

Carrying around huge Go binaries (docker, dockerd, containerd, runc) that are unlikely to be used, and have viable replacements, makes little sense.

I am not sure if Fedora CoreOS defaults to cri-o and kubelet, but those I might have to live with.

Conan-Kudo commented 3 years ago

cri-o and kubernetes-node are not in FCOS, and instead have to be layered on using rpm-ostree.

rhatdan commented 3 years ago

Great, exactly what I would want. I would like to move moby-engine to the same.

If we wanted Cgroups V2 on by default with just Podman, is there a Rawhide for Fedora CoreOS to prove this is viable? Then people could test and we could fix any issues that users found.

Conan-Kudo commented 3 years ago

I believe @jlebon was working on making this a reality...

rhatdan commented 3 years ago

This issue is 1.5 years old, and we have seen little movement for the same reason it took years to move any distro to use CGroups V2. Docker/MobyEngine was blocking the way forward. Once Fedora forced the issue, people started working on CgroupsV2 support and now Kubernetes is ready to start running on it, with a few more fixes.

Companies like Facebook are doing awesome things with CgroupsV2, Podman does some great things with rootless users with cgroups V2. But the Container Operating system looking to the future is stuck on V1?

Conan-Kudo commented 3 years ago

But the Container Operating system looking to the future is stuck on V1?

This is the downside of a container OS that's often considered deeply tied to Kubernetes use-cases. Also notably, OKD/OpenShift isn't even yet on a kubernetes version with cgroup v2 support. Has @LorbusChris had a chance to weigh in about the state of things on that front?

rhatdan commented 3 years ago

Yup, but I see these as a Chicken & Egg. Kubernetes support will lag because underlying platform does not default. Underlying OS Stays on old stuff because Kubernetes support is not there.

Like we did with Fedora, we need Carrot and Stick to force OKD/OpenShift forward. Otherwise we will be doing this downstream first.

RHEL9/Fedora 31 have switched their defaults. The rest of the distro world is stuck waiting for Kubernetes, which is 90% ready and will only get the last 10% done when people actually start using CgroupsV2 by default.

Conan-Kudo commented 3 years ago

Just in case there's any misunderstanding: I completely agree with you and I think we should do this now. I'd even go so far as to say that we should do this for the Fedora CoreOS 34 release.

rhatdan commented 3 years ago

Awesome, now I got to find right group to Whine at.

jlebon commented 3 years ago

To be clear, we've already decided that we will move to defaulting to cgroup v2 in the f34 timeline.

Whether to drop moby-engine is a separate (but related) question. But I don't think it's something we should do quickly (so definitely not for f34 GA).

LorbusChris commented 3 years ago

As long as we go through with the move to cgroupsv2 in F34, I'm fine with moby staying in the base for now. However, pushing out the transition to cgroupsv2 for another release cycle just because the moby RPM isn't ready yet is something I really want to avoid.

dghubble commented 3 years ago

To share a perspective from an outside use case for Kubernetes on FCOS, clarity on the container runtime(s) timeline and recommendations is the most looming need. We use podman for system daemons (etcd, etc.) fine. That isn't my main concern. The container runtime for use by Kubernetes is the concern and a critical one. Docker is still used only because of the gap in clarity, but we need to shift soon. We're shipping the latest Kubernetes (I don't agree we're waiting on old versions) and could quickly adopt the favored runtime if it works reliably and has a clear future on FCOS (Flatcar is going with containerd before k8s v1.22, FCOS is going with ??).

Conan-Kudo commented 3 years ago

Fedora's preferred Kubernetes CRI implementation is CRI-O, which is shipped as a module.

LorbusChris commented 3 years ago

It's not super ergonomic to install cri-o on FCOS today as the RPM is provided as a module so multiple version streams can be made available simultaneously. And unfortunately rpm-ostree doesn't support installing from modular repos yet.

In OKD's machine-os-content OSTree, the RPM is downloaded, unpacked and layered onto FCOS in a somewhat awkward manner (https://github.com/openshift/okd-machine-os/blob/master/entrypoint.sh#L134-L135).

Downloading the cri-o RPM manually and then installing it with rpm-ostree install /path/to/cri-o.rpm should also work I believe.

dghubble commented 3 years ago

Yeah, those details have made cri-o seem like not a ready option. These are real production clusters used by a lot of people. I can set aside some time to try to make that palatable in an Ignition flow. Presumably it also needs a reboot which affects provision times and other flows folks have.

Conan-Kudo commented 3 years ago

It shouldn't require a reboot anymore, since you can install and have that apply live in the latest rpm-ostree versions.

dghubble commented 3 years ago

I think the container runtime topic is bigger, moving to #767 to avoid distracting from the specific issue of cgroups v1/v2

travier commented 3 years ago

https://bugzilla.redhat.com/show_bug.cgi?id=1903426 > Moby 20.10 has been pushed to Fedora so hopefully we can get that ready for F34 which will make the v2 by default case complete.

cgwalters commented 3 years ago

I'm confused about the history/plan here - I found https://github.com/coreos/fedora-coreos-config/pull/238 but it seems to not actually exist in testing-devel? It looks like the last word on this is 73cac9faba419afc0dbed1dde66a6b9987cf02ee

Was it discussed here for the plan to be basically we don't change the kargs across upgrades; it's just newly provisioned F34COS nodes that use cgroupsv2 (by default because we stop injecting the karg into new images)?

jlebon commented 3 years ago

I'm confused about the history/plan here - I found coreos/fedora-coreos-config#238 but it seems to not actually exist in testing-devel? It looks like the last word on this is 73cac9faba419afc0dbed1dde66a6b9987cf02ee

Yeah, that service just lived in the streams temporarily because it was a one-time migration thing (since new nodes already shipped with needed karg to stay on v1).

Was it discussed here for the plan to be basically we don't change the kargs across upgrades; it's just newly provisioned F34COS nodes that use cgroupsv2 (by default because we stop injecting the karg into new images)?

Yeah exactly, see https://github.com/coreos/fedora-coreos-tracker/issues/292#issuecomment-768430721 and https://github.com/coreos/fedora-coreos-tracker/issues/292#issuecomment-768478811.
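Since the plan hinges on whether a node's kernel command line carries the v1 karg, here is a minimal sketch of reading that setting out of a command line string. It assumes the karg in question is `systemd.unified_cgroup_hierarchy`, which systemd documents as a boolean switch; the function name `requested_unified`, the accepted boolean spellings, and the choice to let the last occurrence win are assumptions of this sketch rather than a statement of how systemd resolves duplicates.

```python
from typing import Optional

KEY = "systemd.unified_cgroup_hierarchy"
_TRUE = {"1", "yes", "true", "on"}
_FALSE = {"0", "no", "false", "off"}


def requested_unified(cmdline: str) -> Optional[bool]:
    """Return True/False if the cmdline asks for cgroup v2/v1, else None.

    None means the argument is absent (or unparseable), so the distribution
    default applies. Hypothetical helper for illustration only.
    """
    result = None
    for arg in cmdline.split():
        key, sep, value = arg.partition("=")
        if key != KEY:
            continue
        if not sep:
            result = True  # bare switch with no value enables unified mode
        elif value.lower() in _TRUE:
            result = True
        elif value.lower() in _FALSE:
            result = False
    return result


if __name__ == "__main__":
    with open("/proc/cmdline") as f:
        print(requested_unified(f.read()))
```

Under the approach described above, an upgraded node would keep returning `False` here, while a newly provisioned node with no injected karg would return `None` and follow the default.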

Now, I'd like to go ahead with this on the next stream on top of the f34 rebase (https://github.com/coreos/fedora-coreos-config/pull/902). Though because this is defined in image.yaml, which is shared between next-devel and testing-devel, that makes it a little harder to do. Hmm, I think we'll want to do something similar to the manifests, where we have a shared image-base.yaml and each branch has an image.yaml which is not synced and which includes the base plus any stream-specific tweaks (edit: opened https://github.com/coreos/coreos-assembler/pull/2096 and https://github.com/coreos/fedora-coreos-config/pull/908).

jlebon commented 3 years ago

So timing-wise, is anyone opposed to scheduling changing next over to cgroupsv2 in the next next release (i.e. not the one on Monday which includes the f34 rebase)? We should send an email to the list with instructions on how to provision nodes into v1 mode (basically the inverse of https://docs.fedoraproject.org/en-US/fedora-coreos/kernel-args/).
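As a rough illustration of what such instructions could boil down to on an already running node, the sketch below only builds the `rpm-ostree kargs` invocation for each direction and does not run it. It assumes the v1 karg is `systemd.unified_cgroup_hierarchy=0` and uses the `--append` and `--delete` options of `rpm-ostree kargs`; the function name `kargs_command` is hypothetical, and this is not the content of the linked documentation page.

```python
from typing import List

# Assumed karg that keeps a node on cgroups v1.
KARG_V1 = "systemd.unified_cgroup_hierarchy=0"


def kargs_command(target: str) -> List[str]:
    """Build (but do not execute) an rpm-ostree kargs command line.

    target="v1" appends the karg, target="v2" deletes it.
    """
    if target == "v1":
        return ["rpm-ostree", "kargs", "--append=" + KARG_V1]
    if target == "v2":
        return ["rpm-ostree", "kargs", "--delete=" + KARG_V1]
    raise ValueError("target must be 'v1' or 'v2', got %r" % (target,))


if __name__ == "__main__":
    print(" ".join(kargs_command("v1")))
```

Either change only applies to the next boot, so a reboot is still needed afterwards.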

jlebon commented 3 years ago

https://github.com/coreos/fedora-coreos-config/pull/910 moves next-devel to cgroupsv2.