containers / podman

Podman: A tool for managing OCI containers and pods.
https://podman.io
Apache License 2.0

Group mapping in rootless #13090

Open quentin9696 opened 2 years ago

quentin9696 commented 2 years ago

/kind bug

Description

User group mappings are not kept when using --annotation run.oci.keep_original_groups=1

On the host:

$ id
uid=2001(test) gid=2001(test) groups=2001(test),1001(group1),2000(group2) context=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023

When I run the container:

$ podman run -it --rm --userns=keep-id --annotation run.oci.keep_original_groups=1 docker.io/library/bash
bash-5.1$ id
uid=2001(test) gid=2001(test) groups=65534(nobody),65534(nobody),2001(test)

I'm not sure I understand why my group1 and group2 are mapped to nobody.

Steps to reproduce the issue:

  1. Create a user

  2. Create 2 groups and add them to the user

  3. Run a container with --userns=keep-id and the annotation run.oci.keep_original_groups=1, then check your groups. They should be mapped as on the host

Describe the results you received:

$ id
uid=2001(test) gid=2001(test) groups=65534(nobody),65534(nobody),2001(test)

Describe the results you expected:

$ id
uid=2001(test) gid=2001(test) groups=1001(group1),2000(group2),2001(test)

Additional information you deem important (e.g. issue happens only occasionally):

Output of podman version:

Version:      3.4.4
API Version:  3.4.4
Go Version:   go1.16.8
Built:        Wed Dec  8 21:45:07 2021
OS/Arch:      linux/amd64

Output of podman info --debug:

host:
  arch: amd64
  buildahVersion: 1.23.1
  cgroupControllers:
  - memory
  - hugetlb
  - pids
  - misc
  cgroupManager: cgroupfs
  cgroupVersion: v2
  conmon:
    package: conmon-2.0.30-2.fc35.x86_64
    path: /usr/bin/conmon
    version: 'conmon version 2.0.30, commit: '
  cpus: 2
  distribution:
    distribution: fedora
    variant: coreos
    version: "35"
  eventLogger: file
  hostname: ip-10-124-2-41
  idMappings:
    gidmap:
    - container_id: 0
      host_id: 2001
      size: 1
    - container_id: 1
      host_id: 493216
      size: 65536
    uidmap:
    - container_id: 0
      host_id: 2001
      size: 1
    - container_id: 1
      host_id: 493216
      size: 65536
  kernel: 5.15.7-200.fc35.x86_64
  linkmode: dynamic
  logDriver: journald
  memFree: 4070068224
  memTotal: 8241754112
  ociRuntime:
    name: crun
    package: crun-1.4-1.fc35.x86_64
    path: /usr/bin/crun
    version: |-
      crun version 1.4
      commit: 3daded072ef008ef0840e8eccb0b52a7efbd165d
      spec: 1.0.0
      +SYSTEMD +SELINUX +APPARMOR +CAP +SECCOMP +EBPF +CRIU +YAJL
  os: linux
  remoteSocket:
    path: /run/user/2001/podman/podman.sock
  security:
    apparmorEnabled: false
    capabilities: CAP_CHOWN,CAP_DAC_OVERRIDE,CAP_FOWNER,CAP_FSETID,CAP_KILL,CAP_NET_BIND_SERVICE,CAP_SETFCAP,CAP_SETGID,CAP_SETPCAP,CAP_SETUID,CAP_SYS_CHROOT
    rootless: true
    seccompEnabled: true
    seccompProfilePath: /usr/share/containers/seccomp.json
    selinuxEnabled: true
  serviceIsRemote: false
  slirp4netns:
    executable: /usr/bin/slirp4netns
    package: slirp4netns-1.1.12-2.fc35.x86_64
    version: |-
      slirp4netns version 1.1.12
      commit: 7a104a101aa3278a2152351a082a6df71f57c9a3
      libslirp: 4.6.1
      SLIRP_CONFIG_VERSION_MAX: 3
      libseccomp: 2.5.3
  swapFree: 0
  swapTotal: 0
  uptime: 1h 10m 56.9s (Approximately 0.04 days)
plugins:
  log:
  - k8s-file
  - none
  - journald
  network:
  - bridge
  - macvlan
  volume:
  - local
registries:
  search:
  - registry.fedoraproject.org
  - registry.access.redhat.com
  - docker.io
  - quay.io
store:
  configFile: /var/mnt/home/test/.config/containers/storage.conf
  containerStore:
    number: 0
    paused: 0
    running: 0
    stopped: 0
  graphDriverName: overlay
  graphOptions: {}
  graphRoot: /var/tmp/podman/user/2001/containers
  graphStatus:
    Backing Filesystem: xfs
    Native Overlay Diff: "true"
    Supports d_type: "true"
    Using metacopy: "false"
  imageStore:
    number: 3
  runRoot: /run/user/2001
  volumePath: /var/tmp/podman/user/2001/containers/volumes
version:
  APIVersion: 3.4.4
  Built: 1638999907
  BuiltTime: Wed Dec  8 21:45:07 2021
  GitCommit: ""
  GoVersion: go1.16.8
  OsArch: linux/amd64
  Version: 3.4.4

Package info (e.g. output of rpm -q podman or apt list podman):

I use the Fedora CoreOS AWS AMI

Have you tested with the latest version of Podman and have you checked the Podman Troubleshooting Guide? (https://github.com/containers/podman/blob/main/troubleshooting.md)

Yes

Additional environment details (AWS, VirtualBox, physical, etc.):

Run on the official Fedora CoreOS AWS image

rhatdan commented 2 years ago

Because those group IDs are not mapped inside the container's user namespace. If you run podman top CID hgroups you will see the GIDs leaked into the container.

The user namespace maps all UIDs and GIDs not mapped into it to 65534(nobody)
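This translation can be sketched with a simplified Python model (the function name and the model itself are illustrative assumptions; the real logic lives in the kernel's ID-mapping code):

```python
# Simplified model of how a user namespace translates host GIDs.
# Unmapped IDs fall back to the overflow GID, shown as "nobody".
OVERFLOW_GID = 65534  # default /proc/sys/kernel/overflowgid

def map_gid_into_ns(host_gid, gid_map):
    """gid_map: list of (container_id, host_id, size) tuples, the same
    shape as the gidmap printed by podman info above."""
    for container_id, host_id, size in gid_map:
        if host_id <= host_gid < host_id + size:
            return container_id + (host_gid - host_id)
    return OVERFLOW_GID  # not mapped -> 65534 (nobody)

# The reporter's mapping from podman info: host 2001 -> 0, 493216.. -> 1..
gid_map = [(0, 2001, 1), (1, 493216, 65536)]

# Host groups 1001 and 2000 fall outside both ranges, so inside the
# namespace they both appear as 65534 (nobody):
print([map_gid_into_ns(g, gid_map) for g in (1001, 2000, 2001)])  # [65534, 65534, 0]
```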

quentin9696 commented 2 years ago

Hi @rhatdan

I'm a bit confused. When I run podman top CID hgroups, I got

HGROUPS
558749,558749,2001

Do I need to create group1, group2 inside the container?

Thanks

rhatdan commented 2 years ago

No, did you run your container with --group-add keep-groups?

rhatdan commented 2 years ago
$ podman run -it --rm --userns=keep-id --annotation run.oci.keep_original_groups=1 docker.io/library/bash
bash-5.1$ id
uid=2001(test) gid=2001(test) groups=65534(nobody),65534(nobody),2001(test)

Now, in a different terminal, run podman top CID hgroups and it should show all 5 groups.

quentin9696 commented 2 years ago

Yes, that's what I did:

$ podman run -it --rm --userns=keep-id --annotation run.oci.keep_original_groups=1 docker.io/library/bash
bash-5.1$ id
uid=2001(test) gid=2001(xxxxxxx) groups=65534(nobody),65534(nobody),2001(test)
$ podman ps
CONTAINER ID  IMAGE                          COMMAND     CREATED        STATUS            PORTS       NAMES
070992ab2903  docker.io/library/bash:latest  bash        5 seconds ago  Up 5 seconds ago              quirky_nas

$ podman top 070992ab2903 hgroups
HGROUPS
427677,427677,2001
rhatdan commented 2 years ago

That looks correct, although podman top might have a bug here, since it printed out the first leaked group twice. @vrothberg PTAL, it looks like we might have a bug in podman top.

rhatdan commented 2 years ago

@giuseppe PTAL I am not sure we are leaking groups in podman 4.0

rhatdan commented 2 years ago
$ podman -v
podman version 4.0.0-dev
$ groups
dwalsh wheel users
$ podman run -d --group-add keep-groups  alpine top
16fe1fbbdd9ebc0c49760b54c62ef81e5ad480e694492d05223e6f43ccb84a34
$ podman top -l hgroups
HGROUPS
165533,165533,3267
$ podman top -l groups
GROUPS
nobody,nobody,root
giuseppe commented 2 years ago

it seems to work for me.

What groups do you have on the host?

Can you check grep ^Groups /proc/$CONTAINER_PID/status ?

quentin9696 commented 2 years ago

Just to make sure I understand what's happening.

If I run podman rootless and add the --group-add keep-groups flag, I should have the same groups in the container as on the host. In my case, I should see my 2 other group IDs?

giuseppe commented 2 years ago

The Linux kernel maps gids that are not part of the user namespace mapping to the overflow gid.

vrothberg commented 2 years ago

Yes, an example would be:

~ $ groups; podman unshare groups
vrothberg wheel
root nobody
quentin9696 commented 2 years ago

What solutions could allow also mapping GIDs that are not part of the user namespace?

quentin9696 commented 2 years ago

In my case:

grep ^Groups /proc/2410/status
Groups: 1001 2000 2001 
vrothberg commented 2 years ago

podman top needs to run inside podman's user NS in order to join the container's PID NS.

So I think we'd have to find a way to "leak" the host process' groups (e.g., export HOSTS_GROUPS=$(groups)) into podman's user namespace. @giuseppe WDYT?

rhatdan commented 2 years ago

Is the HOSTS_GROUPS available inside of the container, or just to podman top?

vrothberg commented 2 years ago

Is the HOSTS_GROUPS available inside of the container, or just to podman top?

It does not exist yet, but I would leak it before re-execing into Podman's user NS. groups(1) would not be sufficient though, since we'd need both the ID and the name. I don't think we should leak it into the container for security reasons; any info about the host could theoretically be exploited.

rhatdan commented 2 years ago

Right, I think you could set this in the user namespace by default; then top could find it. I think the GIDs are all you need, since the user namespace still has access to the host's /etc/group.

$ grep Group /proc/self/status
Groups: 10 100 3267 
$ podman unshare grep Group /proc/self/status
Groups: 65534 65534 0 

@giuseppe should we always leak this into the user namespace, or only when running top? We could force this to happen in rootless.c.

giuseppe commented 2 years ago

would that work though?

We are injecting the groups of the current process, but we should read the /proc/$CONTAINER_PID/status file instead since, in theory, they could differ (e.g. the user was added to a new group and ran newgrp).
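Extracting the supplementary groups from that status file is straightforward; a small Python sketch (parse_status_groups is a hypothetical helper, not Podman code) of parsing the Groups line:

```python
def parse_status_groups(status_text):
    """Extract the GIDs from the Groups: line of /proc/<pid>/status."""
    for line in status_text.splitlines():
        if line.startswith("Groups:"):
            return [int(g) for g in line.split()[1:]]
    return []

# Using the reporter's output from earlier in the thread:
status = "Name:\tbash\nGroups:\t1001 2000 2001 \n"
print(parse_status_groups(status))  # [1001, 2000, 2001]
```

The values in that file are translated into the reader's namespace, so the read would have to happen from outside the container's user namespace to see the host GIDs.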

rhatdan commented 2 years ago

Yes, this kind of sucks. Is there a way to first look at the process outside of the user namespace and then enter the user namespace to continue into the PID namespace?

giuseppe commented 2 years ago

Yes, this kind of sucks. Is there a way to first look at the process outside of the user namespace and then enter the user namespace to continue into the PID namespace?

I am still looking into whether we can leak /proc somehow, but the IDs are always converted based on the reader:

$ podman run --rm -v /proc:/proc-host --uidmap 0:1000:10000 alpine grep ^[UG]id /proc-host/1/status

The only way so far seems to be doing it in two steps: do not join the user namespace directly, read this information from the host, then re-exec a helper process to read everything else.

It looks like a corner case though; is it even worth supporting in podman top? Could we just mark these IDs so that it is clear they are injected from the host?

vrothberg commented 2 years ago

It looks like a corner case though, is it even worth to support in podman top?

I agree. It looks like a substantial massaging of the code for a corner case.

Could we just mark these IDs so that it is clear they are injected from the host?

Can you elaborate on what you mean by "marking"?

giuseppe commented 2 years ago

Just convert the overflow ID to something clearer like "Not Mapped", something people can understand more easily.

rhatdan commented 2 years ago

Well, that is the issue: everyone who has hit this error is already complaining about seeing the

$ podman run --group-add=keep-groups alpine groups
root nobody nobody

Couldn't we just leak in a list of groups via an environment variable for podman top, and then substitute the nobody entries with IDs from the list other than the primary group? If there are no matches for the nobody group, we just drop it, assuming there is no leak.

github-actions[bot] commented 2 years ago

A friendly reminder that this issue had no activity for 30 days.

rhatdan commented 2 years ago

@vrothberg @giuseppe Lets talk about this at Watercooler tomorrow.

vrothberg commented 1 year ago

Couldn't we just leak in a list of groups via an environment variable for podman top, and then substitute the nobody entries with IDs from the list other than the primary group? If there are no matches for the nobody group, we just drop it, assuming there is no leak.

@rhatdan what would that env variable look like? Wouldn't we need to inject the entire mapping? That would make me nervous for security reasons.

giuseppe commented 1 year ago

it could be the output of grep ^Groups /proc/self/status.

The problem I see is that this information may differ from what the container process is using. It rarely changes, but if it does, it will be difficult to find out what happened and why podman top returns the wrong information.

rhatdan commented 1 year ago

Well, podman top returns the wrong information now.

The issue is that we cannot get the actual GIDs of the leaked groups. If we just leaked the GIDs in as the current list and found a matching list of nobody entries, we would be 99% sure that they are the leaked GIDs.

rhatdan commented 1 year ago

Actually I think we would need to record the output of grep ^Groups /proc/self/status into the container info, so we could record that these groups were leaked. Then podman top could look this information up when it sees multiple nobody groups.
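The substitution idea can be sketched in Python (illustrative only; substitute_leaked is not Podman code, and it assumes the leaked host GIDs were recorded at container creation as proposed):

```python
OVERFLOW_GID = 65534  # what "nobody" resolves to inside the namespace

def substitute_leaked(gids, recorded_leaked):
    """Replace overflow entries, in order, with the recorded leaked
    host GIDs; leave mapped GIDs untouched."""
    leaked = iter(recorded_leaked)
    return [next(leaked, g) if g == OVERFLOW_GID else g for g in gids]

# Inside the user namespace podman top sees [65534, 65534, 2001]; with
# the recorded "Groups: 1001 2000" it can reconstruct the host view:
print(substitute_leaked([65534, 65534, 2001], [1001, 2000]))  # [1001, 2000, 2001]
```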

zeehio commented 1 year ago

Update

Apparently what I want is called rootless idmapped mounts, and it is not yet supported in the kernel due to security concerns in the design.

My "solution" here is a proposal for (1) a permission system for rootless id mounts and (2) an idea of not only mapping "container uids to high uids at the host" (/etc/subgid) but also the opposite, mapping "low uids at the host to high uids inside the container". With both the permission system and the gid inversion (low->high & high-> low) rootless mapping of secondary groups should not be a problem.

However I guess the following applies:

If it was that easy it would have been done already.

Thanks anyway for reading. And apologies for probably wasting your time, I'm learning.

Context

When using rootless containers, for instance with podman, podman creates a user namespace following the settings defined in /etc/subuid and /etc/subgid.

These settings allow mapping users and groups in the user namespace (inside the container) to a reserved range (if done correctly, unique for each user) in the host/parent namespace.

This correspondence lets us create files with different user/group ownership inside the namespace that do not collide with any other user in the host namespace. Specifically, UID 0 and GID 0 in the user namespace are mapped to the default user ID and group ID, so it's easy for user-namespace processes to make files owned by the parent user: just assign them to root inside the user namespace.

Problem

I do not know of an easy way to configure the opposite: I would like to map groups on the host to a reserved range inside the namespace (you have called this "group leaking"). For instance, if I have an "engineering" group on my host system, e.g. with gid 1000, then as system administrator I would like the default user namespaces in rootless podman to see mounted host files belonging to the "engineering" group (and ideally not other random files in the container) as belonging to the "engineering" group inside the user namespace as well.

Solution?

I believe it would make sense to have a /etc/revsubgid file specifying a list of groups that should leak into the user namespaces by default.

This list could be given in the following format:

<gid_host>:<uids_filter>:<gids_filter>

For instance:

engineering::engineering

This would automatically map, for all users in the engineering group (as given by the last field), the engineering group (first field).

This would be convenient for rootless containers that are expected to access directories mounted as volumes owned by secondary groups.
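A parser for the proposed format might look like this (a sketch only; /etc/revsubgid and its format are the proposal above, not an existing Podman or shadow-utils feature):

```python
from typing import NamedTuple

class RevSubgidEntry(NamedTuple):
    gid_host: str     # host group to leak into user namespaces
    uids_filter: str  # users the rule applies to ("" = any)
    gids_filter: str  # groups whose members get the mapping ("" = any)

def parse_revsubgid(text):
    """Parse <gid_host>:<uids_filter>:<gids_filter> lines."""
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        gid_host, uids_filter, gids_filter = line.split(":")
        entries.append(RevSubgidEntry(gid_host, uids_filter, gids_filter))
    return entries

print(parse_revsubgid("engineering::engineering"))
```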

podman (via crun) can now use --group-add keep-groups to preserve group access. However (correct me otherwise), I understand the kernel maps those groups to overflow IDs. Seeing all those nobody entries is unintuitive to me.

Besides leaking the groups into the namespace, podman could additionally append the leaked groups to the container's /etc/group file, adding the root user as a member, so the root user in the container would have transparent access to the leaked groups and the group names would appear with the same names as on the host.
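That /etc/group rewriting could be sketched as follows (illustrative only; append_leaked_groups is not an existing Podman feature, and the leaked (name, gid) pairs are assumed to have been read from the host):

```python
def append_leaked_groups(group_text, leaked, member="root"):
    """Append host (name, gid) pairs to an /etc/group-style file,
    adding `member` to each leaked group; skip GIDs already present."""
    existing = {int(l.split(":")[2]) for l in group_text.splitlines() if l}
    lines = group_text.splitlines()
    for name, gid in leaked:
        if gid not in existing:
            lines.append(f"{name}:x:{gid}:{member}")
    return "\n".join(lines) + "\n"

print(append_leaked_groups("root:x:0:\n", [("group1", 1001), ("group2", 2000)]), end="")
```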

Final words

If that's already doable with some setting and I have missed it, I apologize.

I would appreciate your feedback. I am not sure if I can contribute to this, since this is far from my field of knowledge, but for sure I'd love to use this feature.

Thank you for your time reading this and your work in podman.

codonell commented 1 year ago

Adding my +1 here as an upstream glibc developer.

Developers are using distrobox and toolbox to develop glibc, and one of the limitations they run into is that the glibc testsuite uses secondary groups for testing the POSIX identity management APIs. Often we require just one additional supplementary group, and we need to be able to validly find the group via getgrouplist and then use fchown.

Having a straightforward way to map at least some host groups into the container would be useful.

We've worked around this for now and mark a subset of tests as unsupported in container configurations that lack the requisite setup. This isn't new; there are some tests we can't run in containers at all (like tests which themselves use namespace isolation to test things).

giuseppe commented 1 year ago

Developers are using distrobox and toolbox to develop glibc, and one of the limitations they run into is that the glibc testsuite uses secondary groups for testing the POSIX identity management APIs. Often we require just one additional supplementary group, and we need to be able to validly find the group via getgrouplist and then use fchown.

This won't work even if we solve the issue above. The group will show as the overflow ID inside the user namespace; the kernel controls that, and we have no way to change this behavior. I think that for your use case you need a correct mapping for the groups, so that setgroups works fine inside the container without the keep-id workaround.

For a rootless user, you need to make sure these additional GIDs are added through /etc/subgid and then run podman system migrate to recreate the user namespace.
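Whether a GID is covered by a user's /etc/subgid allocation can be checked with a short sketch (the helper names are illustrative; the user:start:count line format is the real /etc/subgid format):

```python
def subgid_ranges(subgid_text, user):
    """Yield (start, count) ranges from /etc/subgid-style user:start:count lines."""
    for line in subgid_text.splitlines():
        name, start, count = line.strip().split(":")
        if name == user:
            yield int(start), int(count)

def gid_is_mappable(gid, subgid_text, user):
    return any(start <= gid < start + count
               for start, count in subgid_ranges(subgid_text, user))

# After adding group1 (1001) and group2 (2000) to the test user's
# allocation, they become mappable; podman system migrate then
# recreates the user namespace with the new ranges.
subgid = "test:493216:65536\ntest:1001:1\ntest:2000:1\n"
print(gid_is_mappable(1001, subgid, "test"), gid_is_mappable(3000, subgid, "test"))  # True False
```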

rhatdan commented 1 year ago

It would be great if user groups could be added to the new user namespace via newgidmap, but I guess the risk is that DAC_OVERRIDE might allow users to modify group files.