Open a-username-1 opened 2 years ago
Hi @a-username-1, thanks for reaching out! We're taking a look at this and will get back to you soon.
Some notes from my initial investigation.
uidMappings and gidMappings are part of the OCI spec, and can be set through POSIX platform mounts and user namespace mappings.
This spec gets passed to runc as "config.json". There's nothing in containerd-cri or kubelet that would populate those fields, which means by default runc won’t do anything.
For a quick proof-of-concept, we can take the output of ctr oci spec (which is the default base spec, more or less) and identify the edits we’d want to make so that runc does the right thing. The revised spec could be baked into the image so that all containers inherit that setting. That would probably fail because runc would try to invoke newuidmap to apply the change, which isn’t in Bottlerocket.
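As a rough illustration of the kind of spec edit discussed above, here is a minimal Python sketch that adds a user namespace plus uidMappings and gidMappings to an OCI runtime spec. The field names (containerID, hostID, size) come from the OCI runtime spec; the trimmed base spec, the host ID 100000, the range size 65536, and the helper name add_userns_mappings are assumptions for illustration only, not what Bottlerocket or containerd actually use.

```python
import copy
import json


def add_userns_mappings(spec, host_start, size):
    """Return a copy of an OCI runtime spec with a user namespace and ID mappings.

    Hypothetical helper: maps container IDs 0..size-1 onto host IDs
    host_start..host_start+size-1 for both users and groups.
    """
    new_spec = copy.deepcopy(spec)
    linux = new_spec.setdefault("linux", {})
    namespaces = linux.setdefault("namespaces", [])
    # Request a new user namespace only if the spec does not already have one.
    if not any(ns.get("type") == "user" for ns in namespaces):
        namespaces.append({"type": "user"})
    mapping = {"containerID": 0, "hostID": host_start, "size": size}
    linux["uidMappings"] = [dict(mapping)]
    linux["gidMappings"] = [dict(mapping)]
    return new_spec


# Stand-in for the JSON printed by `ctr oci spec`, trimmed to the relevant part.
base_spec = json.loads(
    '{"ociVersion": "1.0.2",'
    ' "linux": {"namespaces": [{"type": "pid"}, {"type": "mount"}]}}'
)
# 100000 and 65536 are assumed example values.
patched = add_userns_mappings(base_spec, host_start=100000, size=65536)
print(json.dumps(patched["linux"], indent=2))
```

The helper copies the spec rather than mutating it, so the base spec can still be compared against the patched one when working out which edits are needed.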
To get to the point where this actually worked in Bottlerocket, we’d need to:
- newuidmap is available for runc
- settings.oci-hooks.delegate-uid-gid-range.enabled = true to toggle that hook
- /etc/subuid + /etc/subgid files

Ideally the subuid and subgid files would only have a single user and a large range of UIDs and GIDs, but that might depend on what the right way to set up the OCI spec is, which isn’t clear to me at this stage.
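For reference, each line of /etc/subuid and /etc/subgid has the form owner:start:count. The following Python sketch parses that format; the owner name "containers" and the range values are made up for illustration and do not reflect any actual Bottlerocket default.

```python
from typing import List, NamedTuple


class SubIdRange(NamedTuple):
    owner: str   # login name (or numeric ID) the range is delegated to
    start: int   # first subordinate ID in the range
    count: int   # number of IDs in the range


def parse_subid(text: str) -> List[SubIdRange]:
    """Parse the contents of a subuid/subgid file into delegated ranges."""
    ranges = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        owner, start, count = line.split(":")
        ranges.append(SubIdRange(owner, int(start), int(count)))
    return ranges


# Hypothetical file content: a single user with one large range.
sample = "containers:100000:65536000\n"
ranges = parse_subid(sample)
print(ranges)
```

A file with one owner and one large range, as suggested above, parses to a single entry, which keeps any later range checks simple.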
Beginning the initial discovery work here - I'll first document what other operating systems might be doing to accomplish this with their uid/gid mappings. Then, like Ben laid out, we can start to look at implementing some first-party code that mimics what newuidmap and newgidmap are doing (to avoid importing all of shadow-utils).
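To give a feel for what such a first-party utility would need to do, here is a simplified Python sketch of the core logic: check that every requested host range falls inside a delegated range, then produce the text that would be written to /proc/[pid]/uid_map (one "inside outside count" triple per line). This is an assumption-laden sketch, not a description of the real newuidmap; it skips privilege handling, the actual write to /proc, and other checks, and the function name format_id_map is hypothetical.

```python
def format_id_map(mappings, allowed):
    """Validate requested mappings and render them in uid_map/gid_map format.

    mappings: list of (inside_id, outside_id, count) tuples requested by the caller.
    allowed:  list of (start, count) host ranges delegated to the caller,
              e.g. taken from a subuid or subgid file.
    """
    lines = []
    for inside, outside, count in mappings:
        # The requested host range must sit entirely within one delegated range.
        permitted = any(
            start <= outside and outside + count <= start + size
            for start, size in allowed
        )
        if not permitted:
            raise PermissionError(
                f"host range {outside}..{outside + count - 1} is not delegated"
            )
        lines.append(f"{inside} {outside} {count}")
    return "\n".join(lines) + "\n"


# Assumed example: container IDs 0..65535 mapped onto a delegated host range.
content = format_id_map([(0, 100000, 65536)], allowed=[(100000, 65536)])
print(content, end="")
```

Keeping validation and formatting in a pure function like this would make the logic testable without root or a live process to map.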
Pushing this out to v1.11.0 to give some breathing room for the investigative work, determining what needs to happen in Bottlerocket, and implementing any new utilities.
Hi @a-username-1 - Would you be able to provide some YAML I can deploy to EKS to reproduce this issue? This will also be useful when deploying to non-Bottlerocket nodes to see what the diff might be. Thanks much!
Yeah, give me some time, should be able to come up with the config to reproduce this.
Hi @a-username-1 - touching base here: any luck getting a working test case?
Hi all - I recently discovered that UID and GID mappings in runc only occur when new user namespaces are created. This tracks with additional background information I found on the topic. See this chunk from the Linux man pages for user namespaces:
When a user namespace is created, it starts out without a mapping
of user IDs (group IDs) to the parent user namespace. The
/proc/[pid]/uid_map and /proc/[pid]/gid_map files (available
since Linux 3.5) expose the mappings for user and group IDs
inside the user namespace for the process pid. These files can
be read to view the mappings in a user namespace and written to
(once) to define the mappings.
...
The initial user namespace has no parent namespace, but, for
consistency, the kernel provides dummy user and group ID mapping
files for this namespace.
Ref: https://man7.org/linux/man-pages/man7/user_namespaces.7.html
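The mapping files quoted above are plain text with three whitespace-separated numbers per line: the first ID inside the namespace, the first ID outside it, and the length of the range. A small Python sketch for inspecting them follows; it parses a sample string rather than reading /proc directly so it runs anywhere, and the helper names are hypothetical. The line "0 0 4294967295" is the mapping the man page shows for a process in the initial user namespace.

```python
def parse_id_map(text):
    """Parse uid_map/gid_map content into (inside, outside, length) tuples."""
    entries = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 3:
            continue  # ignore blank or malformed lines
        inside, outside, length = (int(f) for f in fields)
        entries.append((inside, outside, length))
    return entries


def is_initial_namespace_map(entries):
    """True if the map is the full identity map seen in the initial user namespace."""
    return entries == [(0, 0, 4294967295)]


# Sample content as seen from a process in the initial user namespace.
host_map = parse_id_map("         0          0 4294967295\n")
# Sample content for a container mapped onto an assumed host range.
container_map = parse_id_map("0 100000 65536\n")

print(is_initial_namespace_map(host_map))
print(is_initial_namespace_map(container_map))
```

On a real node, passing the contents of /proc/[pid]/uid_map for a container process to parse_id_map would show whether runc created a user namespace with a non-identity mapping for it.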
Creation of user namespaces from within Kubernetes is partially supported as of the 1.25 release. However, it is behind an alpha feature gate and the graduation criteria are not yet fully defined, so it is hard to say when it would be available for managed clusters like EKS.
Can you clarify whether the phase 1 limitations would work for your use case? In particular, pods would be restricted to only a few volume types where files are not shared with other pods.
What I'd like: We are running Bottlerocket in EKS, and we have a use case to support multi-user containers with podman in unprivileged Kubernetes containers. It's failing when setting the uid/gid maps. Single-user containers run fine with podman on this setup.
Any alternatives you've considered: We have discussed using a different OS on the EKS nodes to get this working, but most of them would require disabling SELinux which we don't want to do.