bottlerocket-os / bottlerocket

An operating system designed for hosting containers
https://bottlerocket.dev

Support for multi user containers with podman on kubernetes #2163

Open · a-username-1 opened this issue 2 years ago

a-username-1 commented 2 years ago

What I'd like: We are running Bottlerocket in EKS, and we have a use case to support multi-user containers with podman inside unprivileged Kubernetes containers. It fails when setting the uid/gid maps; single-user containers run fine with podman on this setup.

Any alternatives you've considered: We have discussed using a different OS on the EKS nodes to get this working, but most of them would require disabling SELinux, which we don't want to do.

jpculp commented 2 years ago

Hi @a-username-1, thanks for reaching out! We're taking a look at this and will get back to you soon.

bcressey commented 2 years ago

Some notes from my initial investigation.

uidMappings and gidMappings are part of the OCI spec, and can be set through POSIX platform mounts and user namespace mappings.

This spec gets passed to runc as "config.json". There's nothing in containerd-cri or kubelet that would populate those fields, which means by default runc won’t do anything.

For a quick proof-of-concept, we can take the output of ctr oci spec (which is the default base spec, more or less) and identify the edits we'd want to make so that runc does the right thing; roughly the fields sketched below. The revised spec could be baked into the image so that all containers inherit that setting. That would probably still fail, because runc would try to invoke newuidmap to apply the change, and newuidmap isn't in Bottlerocket.
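For illustration only, here is a minimal sketch of those fields using the runtime-spec Go types; the mapping values are placeholders, not a proposed Bottlerocket default:

```go
package main

import (
	"encoding/json"
	"fmt"

	specs "github.com/opencontainers/runtime-spec/specs-go"
)

func main() {
	// The pieces a revised base spec would need so that runc creates a
	// user namespace and applies UID/GID mappings. Values are illustrative.
	spec := specs.Spec{
		Version: specs.Version,
		Linux: &specs.Linux{
			Namespaces: []specs.LinuxNamespace{
				{Type: specs.UserNamespace},
			},
			UIDMappings: []specs.LinuxIDMapping{
				{ContainerID: 0, HostID: 100000, Size: 65536},
			},
			GIDMappings: []specs.LinuxIDMapping{
				{ContainerID: 0, HostID: 100000, Size: 65536},
			},
		},
	}
	out, _ := json.MarshalIndent(spec, "", "  ")
	fmt.Println(string(out)) // the JSON fragment to merge into config.json
}
```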

To get to the point where this actually worked in Bottlerocket, we’d need to:

Ideally the subuid and subgid files would have only a single user with a large range of UIDs and GIDs (see the illustration below), but that might depend on the right way to set up the OCI spec, which isn't clear to me at this stage.
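For reference, entries in those files take the form name:start:count per subuid(5); a hypothetical single-user layout (user name and range chosen arbitrarily) could look like:

```
# /etc/subuid and /etc/subgid -- illustrative values only
root:1000000:65536
```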

jpmcb commented 2 years ago

Beginning the initial discovery work here - I'll first document what other operating systems might be doing to accomplish this with their uid/gid mappings. Then, like Ben laid out, we can start to look at implementing some first-party code that mimics what newuidmap and newgidmap are doing (to avoid importing all of shadow-utils); a rough sketch of that core operation follows.
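To make the scope concrete, here is a rough Go sketch of the core operation those helpers perform after validating a request against /etc/subuid and /etc/subgid. The real tools are setuid binaries with privilege and range checks that are omitted here; the pid and values in main are purely hypothetical:

```go
package main

import (
	"fmt"
	"os"
)

// writeIDMap writes an "inside outside count" triple into the target
// process's ID-map file ("uid_map" or "gid_map"). The map file can be
// written only once, so all entries must go in a single write.
func writeIDMap(pid int, file string, inside, outside, count uint32) error {
	path := fmt.Sprintf("/proc/%d/%s", pid, file)
	line := fmt.Sprintf("%d %d %d\n", inside, outside, count)
	return os.WriteFile(path, []byte(line), 0)
}

func main() {
	// Hypothetical usage: map a 65536-ID range for process 1234.
	if err := writeIDMap(1234, "uid_map", 0, 1000000, 65536); err != nil {
		fmt.Fprintln(os.Stderr, err)
	}
}
```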

Pushing this out to v1.11.0 to give some breathing room for the investigative work, determining what needs to happen in bottlerocket, and implementing any new utilities.

jpmcb commented 2 years ago

Hi @a-username-1 - Would you be able to provide some yaml I can deploy to EKS to reproduce this issue? This will also be useful when deploying to non-bottlerocket nodes to see what the diff might be. Thanks much!

a-username-1 commented 2 years ago

Yeah, give me some time, should be able to come up with the config to reproduce this.

jpmcb commented 2 years ago

Hi @a-username-1 - touching base here: any luck getting a working test case?

jpmcb commented 2 years ago

Hi all - I recently discovered that UID and GID mappings in runc only occur when new user namespaces are created. This tracks with additional background information I found on the topic. See this chunk from the Linux man pages for user namespaces:

When a user namespace is created, it starts out without a mapping of user IDs (group IDs) to the parent user namespace. The /proc/[pid]/uid_map and /proc/[pid]/gid_map files (available since Linux 3.5) expose the mappings for user and group IDs inside the user namespace for the process pid. These files can be read to view the mappings in a user namespace and written to (once) to define the mappings.

...

The initial user namespace has no parent namespace, but, for consistency, the kernel provides dummy user and group ID mapping files for this namespace.

Ref: https://man7.org/linux/man-pages/man7/user_namespaces.7.html
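As a self-contained illustration (not Bottlerocket or runc code), Go's standard library can set up exactly this kind of single-entry mapping when it creates a new user namespace, which is the case that already works. Mapping a range of IDs, which is what the multi-user podman case needs from an unprivileged process, is where the privileged newuidmap/newgidmap helpers come in:

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
	"syscall"
)

// Minimal sketch: run `id` in a new user namespace with a single-ID mapping.
// An unprivileged process may write exactly one entry mapping its own
// UID/GID; mapping a range of IDs requires privileged helpers.
func main() {
	cmd := exec.Command("id")
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	cmd.SysProcAttr = &syscall.SysProcAttr{
		Cloneflags: syscall.CLONE_NEWUSER,
		UidMappings: []syscall.SysProcIDMap{
			// Map root inside the namespace to the current user outside.
			{ContainerID: 0, HostID: os.Getuid(), Size: 1},
		},
		GidMappings: []syscall.SysProcIDMap{
			{ContainerID: 0, HostID: os.Getgid(), Size: 1},
		},
	}
	if err := cmd.Run(); err != nil {
		fmt.Fprintln(os.Stderr, "run:", err)
	}
}
```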

Creation of user namespaces from within Kubernetes is partially supported as of the 1.25 release. However, it is behind an alpha feature gate and the graduation criteria are not yet fully defined, so it is hard to say when it would be available for managed clusters like EKS.

Can you clarify whether the phase 1 limitations would work for your use case? In particular, pods would be restricted to only a few volume types where files are not shared with other pods.
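For context, under the 1.25 alpha a pod opts into a user namespace with the hostUsers field. A hypothetical manifest (name and image are placeholders) would look like this, assuming a cluster with the alpha feature gate enabled:

```yaml
# Hypothetical test pod; requires the Kubernetes 1.25 alpha user namespaces
# feature gate, which managed clusters may not enable.
apiVersion: v1
kind: Pod
metadata:
  name: userns-test
spec:
  hostUsers: false        # request a new user namespace for this pod
  containers:
    - name: test
      image: public.ecr.aws/amazonlinux/amazonlinux:2
      command: ["sleep", "infinity"]
```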