
Podman: A tool for managing OCI containers and pods.
https://podman.io
Apache License 2.0

[rootless] question: plan for supporting cgroups? #1429

Closed AkihiroSuda closed 5 years ago

AkihiroSuda commented 6 years ago

Rootless mode could support cgroups when pam_cgfs.so is available (https://github.com/opencontainers/runc/issues/1839, cc @cyphar), but it is not available on Fedora (AFAIK).

Is there a plan for supporting pam_cgfs.so or any equivalent of it?

(This question is not specific to podman, and I'm not sure this repo is the right place to ask this question :p)

rhatdan commented 6 years ago

Is there a way to handle this via communication with systemd? It would seem that systemd provides a mechanism for user apps to manipulate the cgroups available to the user.

brauner commented 6 years ago

systemd won't do unprivileged cgroup delegation on v1 hierarchies since there's no way to do it safely, so it needs to be up to the administrator to switch this on. On v1 hierarchies there is no way to get this going without pam_cgfs.so; that's why we @lxc wrote it in the first place. The trick is to limit this to the cgroups you really, really care about for your runtime, which is why the pam module takes arguments that the administrator needs to set explicitly.

Now, the story is different for cgroup v2. You can talk to systemd via dbus if you feel like linking against a bunch of xml. Or you request the delegated property in a service file, or - if you have a daemon - the daemon requests the delegation, creates two parallel cgroups on the same level of the hierarchy, and moves itself into one and the container into the other. In any case this requires that the runtime never escapes to the root cgroup.
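
For the scope-unit route, systemd-run can request a transient scope with a delegated subtree without writing any D-Bus code; a minimal sketch, with bash as a placeholder payload:

 $ systemd-run --user --scope -p Delegate=yes bash
 # the scope's subtree under the user manager is now ours to manage
 $ cat /proc/self/cgroup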

cyphar commented 6 years ago

@brauner Does systemd remount /sys/fs/cgroup with nsdelegate now? Or is this something that systemd has yet to come up with a proper setup for -- since it would technically allow cgroupv2 delegation without systemd integration by just using a cgroup namespace.
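
For reference, nsdelegate is a mount option of the unified hierarchy, so on a host where it isn't already set it can in principle be enabled with a remount from the initial namespace; a sketch, assuming the unified hierarchy is mounted at /sys/fs/cgroup/unified:

 # mount -o remount,nsdelegate /sys/fs/cgroup/unified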

rhatdan commented 6 years ago

From the devel list on Fedora, this is @poettering's response.

I am not sure what pam_cgfs.so precisely does, but do note that on systemd systems (which includes Fedora) systemd is the owner of the cgroup tree, and the only means by which other components may manage their own subtrees is through cgroup delegation (which includes delegation to less privileged users), which you can request from systemd.

You can request cgroup delegation from the system service manager, for which you need to be privileged (but you can request it on behalf of an unprivileged user).

You can also request cgroup delegation from your private user service manager instance, for which you do not need to be privileged.

The APIs for requesting cgroup delegation from the system service manager or your user service manager are the same; the only difference is whether you do so through the system or the user dbus bus.

Note that on cgroupsv1, delegation of *controllers* (i.e. "cpu", "cpuset", "memory", "blkio", …) to unprivileged processes is not safe (this is a kernel limitation), and hence systemd won't do it. On cgroupsv2 it is safe, however, and hence you will get "memory" and "tasks" delegated by default (though not "cpu" by default, as the runtime impact of that is still too high).

Do note, however, that Docker is blocking us from switching Fedora over to cgroupsv2, as there is still no working support for cgroupsv2 in Docker, nor support for requesting cgroup tree delegation from systemd. It's a shame that Docker is hindering us from making the switch, but it is how it is. Docker currently bypasses systemd entirely when it comes to cgroups and considers itself to be owner of the cgroup tree, which is a mess on cgroupsv1 (though you have a chance of getting away with it) and doesn't work at all on cgroupsv2.

Or in other words: if you are looking for a way to get your own per-user delegated cgroup subtree, simply ask systemd for it by setting Delegate=yes in your service unit file, or by asking for a scope unit to be registered, also with Delegate=yes set. Nothing else is supported.
Lennart
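
For illustration, the Delegate= setting Lennart mentions lives in the [Service] section of a unit file; a minimal sketch (unit name and command are hypothetical):

 # mymanager.service (hypothetical)
 [Service]
 ExecStart=/usr/local/bin/mymanager
 Delegate=yes
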
rhatdan commented 6 years ago

Of course, if Podman is successful we could add better support for cgroups v2, to allow distributions to have the option.

cyphar commented 6 years ago

I thought you guys used the systemd cgroup driver on Fedora/RHEL -- that does exactly what Lennart is referring to as "hindering [him]". In addition, it's untrue that Docker (or runc) is entirely responsible for blocking cgroupv2 adoption -- the lack of a freezer cgroup (and of a usable devices cgroup that doesn't depend entirely on eBPF) is blocking a wholesale switch to cgroupv2.

Most of the points Lennart made are just longer versions of what @brauner said earlier as well (though his disdain for pam_cgfs.so, even though he doesn't know what it does, ignores that @brauner explicitly said you need to be careful when using it for precisely that reason -- not all cgroupv1 controllers are safe for unprivileged use).

But I guess that answers my question on whether systemd intends to support nsdelegate -- it appears the answer is "no" given that he is discussing asking systemd for permission to delegate cgroupv2 cgroups even though the kernel has an explicit feature to allow this (nsdelegate and CLONE_NEWCGROUP).

jerboaa commented 6 years ago

How do people feel about mentioning this rootless issue in the man page, or providing some warning when --memory is known to NOT work? Currently I see this in podman-run's man page:

   -m, --memory=""

   Memory limit (format: <number>[<unit>], where unit = b, k, m or g)

   Allows you to constrain the memory available to a container. If the host supports swap memory, then the -m memory setting can be larger than physical RAM. If a limit of 0 is specified (not using
   -m), the container's memory is not limited. The actual limit may be rounded up to a multiple of the operating system's page size (the value would be very large, that's millions of trillions).

It doesn't say that one needs to run podman as root in order for it to be effective. Nor does the command itself mention any limitation :-( Something like this would have been helpful:

 $ whoami
 someuser
 $ podman run [...] --memory=10m fedora:28 /bin/bash
 Warning: --memory as unprivileged user has no effect.
 bash-4.4#
brauner commented 6 years ago

Why not simply fail if a user requests memory constraints, or any other cgroup constraints, that runC can't fulfill? Printing a simple warning can easily be overlooked.

mheon commented 6 years ago

I agree that failing is probably the appropriate course of action.

On the manpages - they really need a thorough overhaul to show what can and cannot be done with rootless.

cyphar commented 6 years ago

@brauner runc should already do that (we have tests to make sure and everything :wink:), but I have the feeling that podman is removing cgroup entries or doing something fishy?

rhatdan commented 6 years ago

@giuseppe PTAL

rhatdan commented 6 years ago

@brauner Please open a PR for the updated man page. --memory should definitely error out.

giuseppe commented 6 years ago

@rhatdan I've opened a PR here: https://github.com/containers/libpod/pull/1547

rhatdan commented 5 years ago

@AkihiroSuda @giuseppe Where are we on this issue?

AkihiroSuda commented 5 years ago

@rhatdan It looks like Fedora is not going to adopt pam_cgfs.so (or any equivalent of it for cgroups v1), so we probably need to implement cgroups v2 support in runc instead: https://github.com/opencontainers/runc/issues/654 (needs consideration for the device & freezer stuff)

rhatdan commented 5 years ago

Well, I would love to go to cgroups v2, but we always seem to be stuck in the v1 world.

cyphar commented 5 years ago

Not to mention that switching to cgroupv2 is effectively a userspace regression for containers, because now programs that understand cgroupv1 cannot work with cgroupv2 (and if you switch partially to cgroupv2 then programs in containers cannot use cgroupv1 and thus would be broken by the switch).

File capabilities v3 was handled by silently doing conversions (in quite a few ways) specifically to avoid this problem. But cgroupv2 has no such work, and as a result there will probably be a split like this for a very long time...

poettering commented 5 years ago

Not to mention that switching to cgroupv2 is effectively a userspace regression for containers, because now programs that understand cgroupv1 cannot work with cgroupv2 (and if you switch partially to cgroupv2 then programs in containers cannot use cgroupv1 and thus would be broken by the switch).

Note that it's possible to set up a cgroupsv1 compatible environment for container payloads on a cgroupsv2 host and vice versa (if the kernel supports both APIs). systemd-nspawn supports that, for example. It's a bit restricted though, since it means you can't reasonably delegate any controllers to the container, but quite frankly controller delegation on cgroupsv1 is unsafe anyway, and hence not desirable, regardless of whether the host runs cgroupsv2 or cgroupsv1.

Or in other words: whether the host runs a cgroupsv2 or a cgroupsv1 setup does not necessarily affect what the container payloads see.

cyphar commented 5 years ago

Note that it's possible to set up a cgroupsv1 compatible environment for container payloads on a cgroupsv2 host and vice versa (if the kernel supports both APIs).

How does this work? Last I checked, you have to enable an entire controller on either cgroupv1 or cgroupv2 and you can't use them in parallel. So if the host is using cgroupv2 controllers, then the container cannot use the cgroupv1 equivalent of the same controller simultaneously. This is what I was referring to.

poettering commented 5 years ago

How does this work? Last I checked, you have to enable an entire controller on either cgroupv1 or cgroupv2 and you can't use them in parallel. So if the host is using cgroupv2 controllers, then the container cannot use the cgroupv1 equivalent of the same controller simultaneously. This is what I was referring to.

As I wrote above, it's a bit limited, as delegating controllers is not really doable, since the host generally owns them all. But that's not really a limitation, as I wrote above, since cgroupsv1 controller delegation is not safe anyway and hence shouldn't be done, regardless of whether your host runs cgroupsv1 or cgroupsv2.

Controller delegation is only safe and secure in a full cgroupsv2 environment, i.e. where host and container run cgroupsv2. On a cgroupsv2 host, it's still OK and safe to delegate a named hierarchy via cgroupsv1 to a container though (e.g. the name=systemd hierarchy), which is sufficient to run systemd inside the container.

rhatdan commented 5 years ago

I don't think this is a question of people wanting to run systemd inside of a container; this is more about cgroup controllers missing from v2 that the container engines and runtimes want to use, like the freezer cgroup.

cyphar commented 5 years ago

@poettering

On a cgroupsv2 host, it's still OK and safe to delegate a named hierarchy via cgroupsv1 to a container though (e.g. the name=systemd hierarchy), which is sufficient to run systemd inside the container.

Sure, that's fine -- you don't have named hierarchies in cgroupv2 anyway :wink:.

But the point is about all of the other controllers that can only be enabled on cgroupv1 or cgroupv2 (pids, cpu*, memory, freezer soon, etc). For instance, newer versions of Java currently support cgroupv1 memory and cpu* -- but if we switch to cgroupv2 memory and cpu* on the host then the same container will likely not understand the hierarchy anymore and will regress back to the old behaviour (OOMing and thrashing regularly).

I get that systemd understands both and can handle both, but not everyone uses the systemd APIs to get cgroup information (nor do they need to -- cgroups are a kernel interface after all). And not everyone runs systemd inside their containers.

AkihiroSuda commented 5 years ago

The motivation of this issue is to control resource limits of rootless runc containers (that are launched by rootless podman and other container engines), not to run systemd in a container.

For discussion of cgroups v2 integration into runc, https://github.com/opencontainers/runc/issues/654 seems a better place.

poettering commented 5 years ago

But the point is about all of the other controllers that can only be enabled on cgroupv1 or cgroupv2 (pids, cpu*, memory, freezer soon, etc). For instance, newer versions of Java currently support cgroupv1 memory and cpu* -- but if we switch to cgroupv2 memory and cpu* on the host then the same container will likely not understand the hierarchy anymore and will regress back to the old behaviour (OOMing and thrashing regularly).

Yeah, on the host it's all cgroupv1 XOR all cgroupv2.

Java does cgroups manipulation? Why would it do that? That smells seriously fishy and broken. cgroupsv1 isn't really an API for unprivileged user processes, and I wasn't aware that Java was now used for privileged stuff. But yeah, if it does use cgroupv1 then this would stop working on cgroupv2 systems. But I am not sure why runc should care really... I mean, right now we are at the point that kernel and systemd work fine with cgroupv2, just the docker/runc/kubernetes kerfuffle is blocking switching over distros. I really wish this were resolved soon. And I don't see why runc/… would have to care about Java in this regard, or use it as an excuse not to do the cgroupsv2 work.

I get that systemd understands both and can handle both, but not everyone uses the systemd APIs to get cgroup information (nor do they need to -- cgroups are a kernel interface after all). And not everyone runs systemd inside their containers.

Nobody is asking anyone to adopt systemd APIs for anything. Please have a look at how delegation works in systemd (both for cgroupsv1 and cgroupsv2), and you'll see you never actually have to call into a single systemd API function. You just need to be able to deal properly with the subtree delegated to you. This is documented here:

https://systemd.io/CGROUP_DELEGATION#delegation

poettering commented 5 years ago

I don't think this is a question of people wanting to run systemd inside of a container; this is more about cgroup controllers missing from v2 that the container engines and runtimes want to use, like the freezer cgroup.

The freezer controller might be useful for some cases (though awfully broken API-wise), and really shouldn't hold you back. I understand the container managers use it for fork bomb protection during container shutdown. Given how many holes the swiss cheese that is containers has I doubt it's really that big of a loss if the fork bomb protection is not available initially on cgroupsv2. Moreover you can easily implement a more efficient one with the "pids" controller (just lower the limits so that nothing can fork anymore), and maybe SIGSTOP thrown in. Finally, the cgroupsv2 folks upstream at facebook are mostly done with making freezer-like functionality available for cgroupsv2 too.
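
A rough sketch of that pids-based approach, assuming the runtime owns a delegated cgroup at a hypothetical path $CG:

 # forbid any further forks in the container's cgroup, then stop what's left
 $ echo 0 > "$CG/pids.max"
 $ for pid in $(cat "$CG/cgroup.procs"); do kill -STOP "$pid"; done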

cyphar commented 5 years ago

@poettering

Java does cgroups manipulation? why would it do that? that smells seriously fishy and broken.

It doesn't manipulate cgroups, but it does read memory.limit_in_bytes and cpuset.cpus (et al) so that it can figure out what limits are actually in place (since /proc/meminfo and /proc/cpuinfo aren't cgroup-aware).
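
Concretely, these are plain file reads on the cgroupv1 hierarchy; assuming the conventional mount layout, with illustrative output:

 $ cat /sys/fs/cgroup/memory/memory.limit_in_bytes
 10485760
 $ cat /sys/fs/cgroup/cpuset/cpuset.cpus
 0-3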

But I am not sure why runc should care really... I mean, right now we are at the point that kernel and systemd work fine with cgroupv2, just the docker/runc/kubernetes kerfuffle is blocking switching over distros.

Well, because switching would very likely break existing containers that have Java in them and work today. I think that's a reasonable thing to worry about.

Please have a look at how delegation works in systemd (both for cgroupsv1 and cgroupsv2), and you'll see you never actually have to call into a single systemd API function.

I'm aware of Delegate=, in fact runc and Docker depend on it quite heavily when users use the "systemd cgroup driver" rather than the default/native one. I do really wish systemd supported nsdelegate (which would allow for cgroup namespaces to actually be used as delegation boundaries under systemd without needing to modify systemd-specific files or have systemd-specific code).

I understand the container managers use it for fork bomb protection during container shutdown. Given how many holes the swiss cheese that is containers has I doubt it's really that big of a loss if the fork bomb protection is not available initially on cgroupsv2.

Virtually all container runtimes have a "pause" operation, which uses freezer.

Moreover you can easily implement a more efficient one with the "pids" controller (just lower the limits so that nothing can fork anymore), and maybe SIGSTOP thrown in.

I wrote the pids controller. :wink:

Finally, the cgroupsv2 folks upstream at facebook are mostly done with making freezer-like functionality available for cgroupsv2 too.

I saw this, and am super excited to see it get merged.

poettering commented 5 years ago

It doesn't manipulate cgroups, but it does read memory.limit_in_bytes and cpuset.cpus (et al) so that it can figure out what limits are actually in place (since /proc/meminfo and /proc/cpuinfo aren't cgroup-aware).

Given that cpuset.cpus on cgroupsv1 is actually the same thing as the normal process affinity mask (yes, they propagate towards each other, it's fucked), there's really no benefit in using cpuset.cpus. If they have a fallback to the affinity mask, that's totally sufficient...

I am not sure I grok why java wants to read that and what for. I mean, does it assume it's the only thing running inside a cgroup? What good is a memory measurement for yourself when you don't know if it's actually you or something else too that is accounted into it? Sounds all very fishy to me...

Either way, this sounds like no big issue to me. A patch fixing this should be pretty straightforward, and it doesn't actually "break" stuff I guess anyway, except some stats...

I'm aware of Delegate=, in fact runc and Docker depend on it quite heavily when users use the "systemd cgroup driver" rather than the default/native one. I do really wish systemd supported nsdelegate (which would allow for cgroup namespaces to actually be used as delegation boundaries under systemd without needing to modify systemd-specific files or have systemd-specific code).

As I wrote countless times elsewhere and here: if you follow those guidelines, your program doesn't need anything systemd-specific, really: the whole delegation doc just says you asked for delegation, you got it, now stay within the subtree you got, and you are fine.

Also, systemd insists on nsdelegate when it's available; it's not even an option to opt out of it. This has been the case for quite a while now.
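
Whether the unified hierarchy is mounted with nsdelegate is visible in the mount options; a quick check, with illustrative output from a hybrid setup:

 $ grep cgroup2 /proc/self/mounts
 cgroup2 /sys/fs/cgroup/unified cgroup2 rw,nosuid,nodev,noexec,relatime,nsdelegate 0 0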

jerboaa commented 5 years ago

It doesn't manipulate cgroups, but it does read memory.limit_in_bytes and cpuset.cpus (et al) so that it can figure out what limits are actually in place (since /proc/meminfo and /proc/cpuinfo aren't cgroup-aware).

Given that cpuset.cpus on cgroupsv1 is actually the same thing as the normal process affinity mask (yes, they propagate towards each other, it's fucked), there's really no benefit in using cpuset.cpus. If they have a fallback to the affinity mask, that's totally sufficient...

Speaking for OpenJDK, yes, it uses sched_getaffinity: http://hg.openjdk.java.net/jdk/jdk/file/571f12d51db5/src/hotspot/os/linux/osContainer_linux.cpp#l526

Having said that, cpuset.cpus isn't commonly used in cloud frameworks. E.g. Kubernetes uses cpu shares and cpu quotas.

OpenJDK takes cpu shares and cpu quotas into account. In doing so it makes some assumptions about the higher level cloud frameworks, like kubernetes, and how they set up and run containers. Example: http://hg.openjdk.java.net/jdk/jdk/file/571f12d51db5/src/hotspot/os/linux/osContainer_linux.cpp#l35

I am not sure I grok why java wants to read that and what for.

OpenJDK HotSpot has its own memory management. If run in a container with memory limits, it needs to know them so as to not run afoul of the OOM killer. It would otherwise size its heap too big and eventually an OOM kill would happen.

As for the CPU limits, it does that so it can make a guesstimate of the available CPUs. It's never going to be accurate, but since the JVM sizes some of its thread pools (JIT threads, GC threads, etc.) based on the CPUs it thinks it has available, it works better if it takes cgroup limits into account.
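
A back-of-the-envelope version of that guesstimate from the cgroupv1 cpu quota files (conventional paths assumed; a quota of -1 means unlimited):

 $ quota=$(cat /sys/fs/cgroup/cpu/cpu.cfs_quota_us)
 $ period=$(cat /sys/fs/cgroup/cpu/cpu.cfs_period_us)
 $ if [ "$quota" -gt 0 ]; then echo $(( (quota + period - 1) / period )); else nproc; fi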

I mean, does it assume it's the only thing running inside a cgroup?

It doesn't. However, that's actually a fairly common thing in cloud containers. Anyhow, it's better off considering the container limits than the actual host values.

What good is a memory measurement for yourself when you don't know if it's actually you or something else too that is accounted into it? Sounds all very fishy to me...

Agreed. There is no perfect answer for this. But given that there is a container limit, it can be assumed that the user wanted the entire container (cgroup) to not go beyond that limit, be it one process or more.

Either way, this sounds like no big issue to me. A patch fixing this should be pretty straightforward, and it doesn't actually "break" stuff I guess anyway, except some stats...

Not sure if it's related, but we've discovered that with kernel 4.18 and above, container detection breaks with systemd slices. The last working kernel was 4.15. See: https://bugs.openjdk.java.net/browse/JDK-8217338

AkihiroSuda commented 5 years ago

https://fedoraproject.org/wiki/Changes/CGroupsV2

Is Red Hat working on cgroup2 support for runc?

@rhatdan @giuseppe @vbatts

giuseppe commented 5 years ago

@AkihiroSuda:

Filipe (@filbranden) is working on it: https://github.com/containers/conmon/issues/8

rhatdan commented 5 years ago

We are trying to support his efforts and are making changes in Podman and Conmon to move his testing along. We are also working with the systemd team to make sure that they work with @filbranden.

Bottom line: this is a high priority for us, and anything we can do to help this along, we shall do. @AkihiroSuda If you can also help, that would be great.

filbranden commented 5 years ago

@AkihiroSuda Just cc'd you on https://github.com/opencontainers/runc/pull/1991 where I'm starting to fix libcontainer's systemd cgroup driver to actually always go through systemd (using the D-Bus interface) for all the writes.

That first PR is trying to establish an interface for the subsystems to translate their OCI configuration into systemd D-Bus properties, and it implements it for the "devices" controller (as a proof of concept). Once the interface is approved/merged, we can convert the other cgroups (memory, cpu, etc.) and get them all going through systemd.

Once that's in, I already have some code to gather the stats from the cgroupv2 tree (it's a fairly simple patch).

So... progress! Watch that PR and pitch in if you like!

Cheers, Filipe

AkihiroSuda commented 5 years ago

Thanks. Just to confirm: is non-systemd cgroup2 also going to be supported?

filbranden commented 5 years ago

No, only through the systemd cgroupdriver.

That's the thing: doing it through systemd gets it for free; we only go through D-Bus, and systemd abstracts all of that from us. The only remaining piece is that, when getting statistics directly from the tree (memory.stat, cpu.stat, etc.), we need to find them in the proper place, but that's a small detail, a tiny commit; I already have a draft for it.

Frankly, I don't see cgroupv2 on the cgroupfs cgroup driver ever happening, since some controllers (such as "devices") were discontinued in cgroupv2, so systemd actually installs an eBPF program to implement device restrictions there. I really don't see libcontainer duplicating that effort... (But I might be wrong about it.)
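
If bpftool is available, the attached device filter can be inspected on the unit's cgroup; a sketch (unit name hypothetical, needs privileges):

 # bpftool cgroup list /sys/fs/cgroup/system.slice/foo.service

This lists the BPF programs attached to that cgroup, including the device filter systemd installed.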

In any case, I'd say 99% of the systems I care about are running on systemd anyway, so going through it makes sense to me.

AkihiroSuda commented 5 years ago

So it won't work with nested containers or on Alpine hosts?

filbranden commented 5 years ago

I, at least, am only looking at fixing the systemd path to support cgroupv2. So I think it won't work nested, unless you're running systemd inside your container (e.g. like KIND does). I believe systemd can run in a fairly unprivileged container (at least in the nspawn world...), but I haven't looked a lot into this, so you'd need to double-check that...

vbatts commented 5 years ago

On 19/02/19 20:54 -0800, Filipe Brandenburger wrote:

I, at least, am only looking at fixing the systemd path to support cgroupv2. So I think it won't work nested, unless you're running systemd inside your container (e.g. like KIND does). I believe systemd can run in a fairly unprivileged container (at least in the nspawn world...), but I haven't looked a lot into this, so you'd need to double-check that...

I suppose nothing is stopping a hook from mounting cgroup v1 for the container. It sounds gross, and I'm not sure how manageable it would be.

poettering commented 5 years ago

On Wed, 20.02.19 14:05, Vincent Batts (notifications@github.com) wrote:

I suppose nothing is stopping a hook from mounting cgroup v1 for the container. It sounds gross, and I'm not sure how manageable it would be.

Note that nspawn actually supports running cgroupsv1 container payloads on a cgroupsv2 host. It does so by mounting the old hierarchies internally and using those, replicating the minimal hierarchy from the cgroupsv2 tree as necessary. But this is pretty messy, since nobody maintains that tree and cleans it up afterwards.

rhatdan commented 5 years ago

I would like to see this get in, but in the rootless case, where we want to modify the cgroups of a container, will this work? Will runc be able to talk to systemd to set up a cgroup for the container?

poettering commented 5 years ago

I would like to see this get in, but in the rootless case, where we want to modify the cgroups of a container, will this work? Will runc be able to talk to systemd to set up a cgroup for the container?

I am not sure how unpriv runc precisely works. But note that PID 1 (i.e. the system instance of systemd) will deny delegation of cgroup subtrees to unprivileged clients if they have already dropped privs. However, it's fine to delegate cgroup subtrees to programs that start privileged and drop privs later, as well as to service payloads that use systemd's User= and thus let systemd drop privs for them.

Also note that each regular user generally has their own systemd --user instance. Unpriv users can ask their instance for a delegated subtree too, and this is then permitted. The APIs are exactly the same as for the system instance, except that you ask for delegation on the user rather than the system bus.
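
Roughly what such a request looks like on the user bus, sketched with busctl (scope name hypothetical; the properties delegate the subtree and move the calling shell into the scope):

 $ busctl --user call org.freedesktop.systemd1 /org/freedesktop/systemd1 \
     org.freedesktop.systemd1.Manager StartTransientUnit 'ssa(sv)a(sa(sv))' \
     my.scope fail 2 PIDs au 1 $$ Delegate b true 0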

rhatdan commented 5 years ago

This sounds like exactly what we need. If a user is allocated X% of a resource, then we want them to be able to further subdivide that X% among their containers.

giuseppe commented 5 years ago

I've written this message privately to some of you, but I'll report it here as well:

Something I've noticed, and that will block its adoption for rootless containers, is that D-Bus doesn't work from a user namespace if euid_in_the_namespace != euid_on_the_host.

We create the user namespace to manage the storage and the networking before we call the OCI runtime. The OCI runtime for rootless containers can create a nested userns if different mappings are used, but it already runs within a userns with euid=0.

A simple test:

$ bwrap --unshare-user --uid $(id -u) --bind / / dbus-send --session --dest=org.freedesktop.DBus --type=method_call --print-reply /org/freedesktop/DBus org.freedesktop.DBus.ListNames
$ bwrap --unshare-user --uid 0 --bind / / dbus-send --session --dest=org.freedesktop.DBus --type=method_call --print-reply /org/freedesktop/DBus org.freedesktop.DBus.ListNames
Failed to open connection to "session" message bus: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.

I think it depends on D-Bus including the euid in the AUTH EXTERNAL request:

https://github.com/systemd/systemd/blob/master/src/libsystemd/sd-bus/bus-socket.c#L620
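
For context: in the EXTERNAL mechanism the client sends its uid as hex-encoded ASCII decimal, so a process that believes it is uid 0 opens with

 AUTH EXTERNAL 30

("30" being hex for the ASCII character "0"), while on the host side the bus sees the real uid on the socket; presumably that mismatch is what fails authentication.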

giuseppe commented 5 years ago

I think it depends on D-Bus including the euid in the AUTH EXTERNAL request:

https://github.com/systemd/systemd/blob/master/src/libsystemd/sd-bus/bus-socket.c#L620

being addressed by: https://github.com/systemd/systemd/pull/11785

rhatdan commented 5 years ago

There continues to be progress made on cgroupsv2.

vbatts commented 5 years ago

as you have gaps identified, please report them to upstream tracker https://github.com/opencontainers/runtime-spec/issues/1002

On Fri, Mar 8, 2019 at 11:36 AM Daniel J Walsh notifications@github.com wrote:

There continues to be progress made on cgroupsv2.


rhatdan commented 5 years ago

@filbranden Any update on the cgroupsv2 work?

filbranden commented 5 years ago

Hi @rhatdan

I just added an update to opencontainers/runc#2007 with a proposed approach.

I think we still need more work on the underlying components to ensure everything is in place. In particular, we'll need freezer support in cgroup2 in the kernel (last I looked it was planned for 5.2, but I'm not sure if it's still on schedule), and systemd needs to export more cgroup2 interfaces to userspace via D-Bus (such as the freezer, as mentioned, and also cpuset, which I believe made it into kernel 5.0).

Cheers! Filipe

rhatdan commented 5 years ago

Thanks for keeping us up to date. I am watching the runc PRs and keeping up with them as best I can. @filbranden Keep up the good work. Eventually we will get there.

rhatdan commented 5 years ago

@giuseppe Since we now have cgroupsv2 support, can we close this issue?

giuseppe commented 5 years ago

@giuseppe Since we now have cgroupsv2 support, can we close this issue?

Yes, I think we can close the issue here and address any future issues separately.