containers / podman

Podman: A tool for managing OCI containers and pods.
https://podman.io
Apache License 2.0

Add ability to mask and unmask #7801

Closed TristanCacqueray closed 3 years ago

TristanCacqueray commented 3 years ago

Is this a BUG REPORT or FEATURE REQUEST? (leave only one on its own line)

/kind bug

Description

When running a GPU application inside a rootless podman container started with --device /dev/dri, libGL fails to initialize.

Steps to reproduce the issue:

$ podman run --security-opt label=disable -it --rm --device /dev/dri -v /tmp/.X11-unix:/tmp/.X11-unix registry.fedoraproject.org/fedora:32
[root@acf745c2e272 /]# dnf install -y mesa-dri-drivers glx-utils
[...]
[root@acf745c2e272 /]# DISPLAY=:0 glxinfo > /dev/null
libGL error: MESA-LOADER: failed to retrieve device information
libGL error: Version 4 or later of flush extension not found
libGL error: failed to load driver: i915
libGL error: MESA-LOADER: failed to retrieve device information

Describe the results you received:

Graphical applications like glxgears fail to start.

Describe the results you expected:

libGL works and the application starts.

Additional information you deem important (e.g. issue happens only occasionally):

This seems to be a regression introduced by https://github.com/containers/podman/pull/6957. Starting the container with --privileged makes /sys/dev available, but then for some reason the device file /dev/dri/card0 is not available.

rhatdan commented 3 years ago

If you run in privileged mode does this work? @giuseppe WDYT?

TristanCacqueray commented 3 years ago

Running in privileged mode does not seem to be enough as the device is not available.

giuseppe commented 3 years ago

I think we could add a --security-opt option to specify the list of paths that must be masked and override the default list.

Something like:

--security-opt masked-paths=/foo/bar:/baz

rhatdan commented 3 years ago

How about --security-opt unmask-path=/sys/dev

mheon commented 3 years ago

We could also add a --security-opt mask-path=$PATH to add masked paths - seems useful.

I would like the ability to --security-opt unmask-path=ALL as well.

rhatdan commented 3 years ago

I think having unmask and mask would be sufficient.

rhatdan commented 3 years ago

Would love to get someone from the community to grab this.

paravz commented 3 years ago

--device functionality might need special handling to unmask device entries in "/sys/dev", i.e. if I start a container with "--device /dev/dri/renderD128", the device's entries in /sys/dev/char and /sys/devices should be unmasked. libGL looks up the device in /sys/dev/char according to my straces; see the 226:128 example below:

# ls -la /dev/dri/renderD128
crw-rw-rw-. 1 root render 226, 128 Nov  2 02:32 /dev/dri/renderD128
# ls -lah  /sys/dev/char/226:128
lrwxrwxrwx. 1 root root 0 Nov  2 02:32 /sys/dev/char/226:128 -> ../../devices/pci0000:00/0000:00:02.0/drm/renderD128
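
The lookup described above can be sketched in Go. This is only an illustration of how a device node's major:minor pair maps to its /sys/dev/char entry, not code from podman or Mesa. The function names (linuxMajor, linuxMinor, sysfsCharPath, sysfsPathForDevice) are hypothetical, and the bit layout assumes the Linux/glibc dev_t encoding.

```go
package main

import (
	"fmt"
	"os"
	"syscall"
)

// linuxMajor extracts the major number from a Linux dev_t (glibc layout).
func linuxMajor(dev uint64) uint64 {
	return ((dev >> 8) & 0xfff) | ((dev >> 32) &^ 0xfff)
}

// linuxMinor extracts the minor number from a Linux dev_t (glibc layout).
func linuxMinor(dev uint64) uint64 {
	return (dev & 0xff) | ((dev >> 12) &^ 0xff)
}

// sysfsCharPath builds the /sys/dev/char entry for a character device.
func sysfsCharPath(major, minor uint64) string {
	return fmt.Sprintf("/sys/dev/char/%d:%d", major, minor)
}

// sysfsPathForDevice stats a device node and returns its /sys/dev/char path.
// Hypothetical helper for illustration; Linux only.
func sysfsPathForDevice(devPath string) (string, error) {
	info, err := os.Stat(devPath)
	if err != nil {
		return "", err
	}
	if info.Mode()&os.ModeCharDevice == 0 {
		return "", fmt.Errorf("%s is not a character device", devPath)
	}
	st, ok := info.Sys().(*syscall.Stat_t)
	if !ok {
		return "", fmt.Errorf("no stat data available for %s", devPath)
	}
	rdev := uint64(st.Rdev)
	return sysfsCharPath(linuxMajor(rdev), linuxMinor(rdev)), nil
}

func main() {
	// The device path is taken from the example above.
	p, err := sysfsPathForDevice("/dev/dri/renderD128")
	if err != nil {
		fmt.Println("lookup failed:", err)
		return
	}
	fmt.Println(p)
}
```

For the device shown above (226, 128), sysfsCharPath returns /sys/dev/char/226:128, which is the symlink that becomes unreachable when /sys/dev is masked.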

awerlang commented 3 years ago

Running in privileged mode does not seem to be enough as the device is not available.

Intel? It works with --privileged. See https://github.com/mviereck/x11docker/issues/293

With AMD you might want to use --volume instead of --device. Not sure why though.

--device functionality might need special handling to unmask device entries in "/sys/dev", i.e. if I start a container with "--device /dev/dri/renderD128", the device's entries in /sys/dev/char and /sys/devices should be unmasked. libGL looks up the device in /sys/dev/char according to my straces; see the 226:128 example below:

# ls -la /dev/dri/renderD128
crw-rw-rw-. 1 root render 226, 128 Nov  2 02:32 /dev/dri/renderD128
# ls -lah  /sys/dev/char/226:128
lrwxrwxrwx. 1 root root 0 Nov  2 02:32 /sys/dev/char/226:128 -> ../../devices/pci0000:00/0000:00:02.0/drm/renderD128

More simply, if --device is used, podman should know to not mask /sys/dev. Knowing what is needed under /sys/dev might prove problematic, and the end user would end up with just --privileged instead, which wouldn't otherwise be necessary.

I think that mask/unmask paths could be made generally available for finer-grained privileges, but I don't see the default masked paths documented. Updating the default masked paths should be treated as a breaking change.

awerlang commented 3 years ago

I also noticed that masking out /sys/dev is not enough to prevent tools like lshw, lspci, and lsusb from extracting information from the host system. Not sure if this is the reason it was masked in the first place.

rhatdan commented 3 years ago

These are not listed anywhere but in the code, but we will document them once we have the ability to manipulate them:

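// BlockAccessToKernelFilesystems adds the default masked and read-only paths
// to the OCI spec generator g (generate is the opencontainers/runtime-tools
// generate package). Privileged containers get neither list.
func BlockAccessToKernelFilesystems(privileged, pidModeIsHost bool, g *generate.Generator) {
    if !privileged {
        // Masked paths are made inaccessible inside the container.
        for _, mp := range []string{
            "/proc/acpi",
            "/proc/kcore",
            "/proc/keys",
            "/proc/latency_stats",
            "/proc/timer_list",
            "/proc/timer_stats",
            "/proc/sched_debug",
            "/proc/scsi",
            "/sys/firmware",
            "/sys/fs/selinux",
            "/sys/dev",
        } {
            g.AddLinuxMaskedPaths(mp)
        }

        // Rootless containers that share the host PID namespace skip the
        // read-only list below.
        if pidModeIsHost && rootless.IsRootless() {
            return
        }

        // Read-only paths stay visible but cannot be written to.
        for _, rp := range []string{
            "/proc/asound",
            "/proc/bus",
            "/proc/fs",
            "/proc/irq",
            "/proc/sys",
            "/proc/sysrq-trigger",
        } {
            g.AddLinuxReadonlyPaths(rp)
        }
    }
}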

paravz commented 3 years ago

Would it be reasonable to implement the following (I can make a pull request, here or in a separate issue): if --device is used, treat it similarly to --privileged, exclude '/sys/dev' from masking, and mount it read-only?

TristanCacqueray commented 3 years ago

@awerlang yes the issue is happening with Intel GPU and unprivileged rootless podman. This combo used to work before #6957.

awerlang commented 3 years ago

@awerlang yes the issue is happening with Intel GPU and unprivileged rootless podman. This combo used to work before #6957.

Unprivileged mode is under discussion. I quoted you:

Running in privileged mode does not seem to be enough as the device is not available.

If it doesn't work with --privileged, then it's a different issue, not affected by #6957. Refer to the discussion I linked above.

TristanCacqueray commented 3 years ago

That is a different issue, but it is important additional information, as privileged mode is not even an option to work around the absence of /sys/dev. Thus it seems that podman 2.1.1 can no longer run GPU workloads.

umohnani8 commented 3 years ago

I am working on adding a mask and unmask option to --security-opt which you can use to specify additional paths you want to mask or any paths that you want to unmask. That should work with --device when you specify that you want to unmask /sys/dev. I will have a PR open later today.
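
A rough sketch of how such mask and unmask options could be combined with the default list is shown below. It is not the code from the eventual PR: applyMaskOptions and splitPaths are hypothetical names, the colon-separated syntax follows the masked-paths=/foo/bar:/baz proposal earlier in this thread, and the special value ALL follows the unmask-path=ALL request. The precedence rule (an explicit mask wins over an unmask of the same path) is an assumption made for the example.

```go
package main

import (
	"fmt"
	"strings"
)

// defaultMasked mirrors the masked-path list quoted earlier in this thread.
var defaultMasked = []string{
	"/proc/acpi", "/proc/kcore", "/proc/keys", "/proc/latency_stats",
	"/proc/timer_list", "/proc/timer_stats", "/proc/sched_debug",
	"/proc/scsi", "/sys/firmware", "/sys/fs/selinux", "/sys/dev",
}

// splitPaths splits a colon-separated option value, dropping empty entries.
func splitPaths(s string) []string {
	var out []string
	for _, p := range strings.Split(s, ":") {
		if p != "" {
			out = append(out, p)
		}
	}
	return out
}

// applyMaskOptions returns the effective masked paths (hypothetical helper).
// unmask removes entries from defaults ("ALL" removes every default);
// mask appends extra paths and takes precedence over unmask.
func applyMaskOptions(defaults []string, mask, unmask string) []string {
	removed := map[string]bool{}
	unmaskAll := false
	for _, p := range splitPaths(unmask) {
		if p == "ALL" {
			unmaskAll = true
		}
		removed[p] = true
	}

	var result []string
	seen := map[string]bool{}
	if !unmaskAll {
		for _, p := range defaults {
			if !removed[p] && !seen[p] {
				result = append(result, p)
				seen[p] = true
			}
		}
	}
	for _, p := range splitPaths(mask) {
		if !seen[p] {
			result = append(result, p)
			seen[p] = true
		}
	}
	return result
}

func main() {
	// Unmask /sys/dev for a GPU container and mask two extra paths.
	fmt.Println(applyMaskOptions(defaultMasked, "/foo/bar:/baz", "/sys/dev"))
}
```

Keeping the defaults in a single list and applying both options in one pass keeps the resulting order predictable, which makes the effective list easy to document.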

rhatdan commented 3 years ago

Can you give us a specific device you are trying to add?

I did notice that we are masking /sys/dev but not /sys/devices, which perhaps we should mask. We could remove these masks when users add an additional device, but if we added these masks for security reasons, then it seems fairly risky to unmask them for any device, --device /dev/fuse for example.

rhatdan commented 3 years ago

Sadly, I added the mask for /sys/dev and cannot find what prompted me to add it. I am sure it was a bugzilla or issue that asked us to mask it. But it looks like it is not masked in Moby at this point.

rhatdan commented 3 years ago

Here is the bugzilla that triggered this masking. https://bugzilla.redhat.com/show_bug.cgi?id=1772993

awerlang commented 3 years ago

@TristanCacqueray

That is a different issue, but it is important additional information, as privileged mode is not even an option to work around the absence of /sys/dev. Thus it seems that podman 2.1.1 can no longer run GPU workloads.

It seems that the host display (e.g. :0) doesn't work for some reason with open-source drivers, this would be interesting to track down I guess. It does work if you use a nested server (e.g. Weston) though. See the discussion I posted before: https://github.com/mviereck/x11docker/issues/293

Also, unprivileged rootless podman runs GPU workloads for nvidia just fine; it doesn't use /dev/dri but /dev/nvidia* instead.

paravz commented 3 years ago

@rhatdan that change broke multiple podman scenarios on a "developer workstation", scenarios that worked before and still work in docker. And these scenarios require podman to be started with --device (or equivalent), compromising security to begin with.

rhatdan commented 3 years ago

@paravz can you give me an example of a container that this broke?

rhatdan commented 3 years ago

@mrunalp Masking /sys/dev seems to be causing us issues. Perhaps we should just mask block devices to fix the problem?

TristanCacqueray commented 3 years ago

@rhatdan any container running the VSCode GUI is likely broken as it seems to require GPU rendering.

paravz commented 3 years ago

Any container with a GPU-accelerated GUI or X Window application, e.g. chrome/puppeteer, etc. See the x11docker examples listed here too.

The DRI/render device needs to be forwarded into the container to achieve this, e.g.: podman run --device '/dev/dri':'/dev/dri':rw ... or podman run --device /dev/dri/renderD128 ...

hrittich commented 3 years ago

I ran into the same issue today. To figure out the problem, I was putting together an MWE until I found this issue. I thought sharing the example could help solve the issue. Here is the Dockerfile:

FROM debian:buster
# Set DEBIAN_FRONTEND on the install command itself so it applies there
RUN apt-get update && \
  DEBIAN_FRONTEND=noninteractive \
  apt-get -y install mesa-utils xterm x11-apps xauth
CMD glxgears

Running the following commands should show you the turning-gears demo.

sudo podman build -t glx .
xhost +local:
sudo podman run --rm -ti --volume=$XAUTHORITY:/tmp/.Xauthority --volume=/tmp/.X11-unix:/tmp/.X11-unix --env=DISPLAY=$DISPLAY --env=XAUTHORITY=/tmp/.Xauthority glx

However, if I add the --device /dev/dri option, i.e.,

sudo podman run --rm -ti --volume=$XAUTHORITY:/tmp/.Xauthority --volume=/tmp/.X11-unix:/tmp/.X11-unix --env=DISPLAY=$DISPLAY --env=XAUTHORITY=/tmp/.Xauthority --device /dev/dri glx

my entire X server freezes for a couple of seconds until glxgears crashes. Running with the --privileged=true option works fine for me.

@mrunalp Masking /sys/dev seems to be causing us issues. Perhaps we should just mask block devices to fix the problem?

If you are asking me, the cleanest solution would be to only mask devices which are not mapped into the container.

TristanCacqueray commented 3 years ago

Thanks @umohnani8!

paravz commented 3 years ago

#8408 also changed the default mask from "/sys/dev" to "/sys/dev/block". This by itself (without using mask/unmask) makes /dev/dri usable again and unblocks the previously broken X11/GPU use cases. Thanks @umohnani8!
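
To see why narrowing the mask is enough, here is a small model of path masking as a prefix check. It is a simplification for illustration only (the container runtime hides masked paths with mounts, not string comparisons), and isMasked is a hypothetical helper, not a podman function.

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

// isMasked reports whether p equals, or lies underneath, one of the masked
// paths. Comparison is done on whole path components, so "/sys/devices" is
// not considered to be under "/sys/dev".
func isMasked(p string, masked []string) bool {
	clean := path.Clean(p)
	for _, m := range masked {
		m = path.Clean(m)
		if clean == m || strings.HasPrefix(clean, m+"/") {
			return true
		}
	}
	return false
}

func main() {
	gpuEntry := "/sys/dev/char/226:128" // entry from the renderD128 example

	// Old default: all of /sys/dev is masked, so the GPU entry is hidden.
	fmt.Println(isMasked(gpuEntry, []string{"/sys/dev"})) // true

	// New default: only block devices are masked, so the GPU entry is visible.
	fmt.Println(isMasked(gpuEntry, []string{"/sys/dev/block"})) // false

	// /sys/devices was never covered by the /sys/dev mask.
	fmt.Println(isMasked("/sys/devices", []string{"/sys/dev"})) // false
}
```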