NVIDIA / nvidia-container-runtime


SELinux Module for NVIDIA containers #42

Closed zvonkok closed 4 years ago

zvonkok commented 5 years ago

When we run NVIDIA containers on an SELinux-enabled distribution, we need a separate SELinux module to run the container confined. Without such a module we have to run the container privileged, as this is the only way to allow specific SELinux contexts to interact (read, write, chattr, ...) with the files mounted into the container.

A container running privileged gets the spc_t label, which is allowed to read/write and chattr base types. The base types (device_t, bin_t, proc_t, ...) are introduced by the bind mounts of the hook. A bind mount cannot have two different SELinux contexts, as SELinux operates at the inode level.
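For anyone who wants to observe this, a container's own context can be read from /proc/self/attr/current (a quick sketch; the fedora image is just an example, and the MCS category pair varies per container):

    # privileged containers run as spc_t
    podman run --rm --privileged fedora cat /proc/self/attr/current
    # system_u:system_r:spc_t:s0

    # unprivileged containers get container_t plus MCS categories
    podman run --rm fedora cat /proc/self/attr/current
    # system_u:system_r:container_t:s0:c123,c456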

I have created the following SELinux policy module, nvidia-container.te, that works with podman/cri-o/docker.
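For reference, building and loading such a module on an EL system looks roughly like this (a sketch; it assumes the policy development files from the selinux-policy-devel package are installed):

    # build nvidia-container.te into a binary policy package and load it
    make -f /usr/share/selinux/devel/Makefile nvidia-container.pp
    semodule -i nvidia-container.pp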

A prerequisite for the SELinux module to work correctly is that the labels on the mounted files are correct. I have therefore added an additional line to the oci-nvidia-hook where I run:

nvidia-container-cli -k list | restorecon -v -f -

With this, every time a container is started, the files to be mounted get the correct SELinux label and the SELinux module works.
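To illustrate, the resulting hook wrapper could look roughly like this (a sketch only; the actual hook path and hook binary name depend on the packaging):

    #!/bin/sh
    # restore the SELinux labels of the files the hook is about to bind-mount
    nvidia-container-cli -k list | restorecon -v -f -
    # then hand over to the real prestart hook
    exec /usr/bin/nvidia-container-runtime-hook "$@"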

Now I can run NVIDIA containers without --privileged, with --cap-drop=ALL, and with --security-opt=no-new-privileges:

podman run  --security-opt=no-new-privileges --cap-drop=ALL --security-opt label=type:nvidia_container_t \
            --rm -it docker.io/mirrorgooglecontainers/cuda-vector-add:v0.1
docker run  --security-opt=no-new-privileges --cap-drop=ALL --security-opt label=type:nvidia_container_t \
            --rm -it docker.io/mirrorgooglecontainers/cuda-vector-add:v0.1

podman run  --user 1000:1000 --security-opt=no-new-privileges --cap-drop=ALL --security-opt label=type:nvidia_container_t \
            --rm -it docker.io/mirrorgooglecontainers/cuda-vector-add:v0.1
docker run  --user 1000:1000 --security-opt=no-new-privileges --cap-drop=ALL --security-opt label=type:nvidia_container_t \
            --rm -it docker.io/mirrorgooglecontainers/cuda-vector-add:v0.1
zvonkok commented 5 years ago

Besides restoring the context of the NVIDIA files to be mounted, one crucial part of the story is the correct label of /var/lib/kubelet/.* The label has to be container_file_t, since the device plugin reads/communicates with kubelet.sock and kubelet_internal_checkpoint.
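A persistent way to get that label in place is the standard SELinux file-context tooling (a sketch; adjust the path pattern to your node layout):

    # record the context for the kubelet directory and apply it
    semanage fcontext -a -t container_file_t '/var/lib/kubelet(/.*)?'
    restorecon -R -v /var/lib/kubelet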

With the above-mentioned module it is possible to run the device plugin with a restricted SCC and with:

        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop: ["ALL"]
          seLinuxOptions:
            type: nvidia_container_t

There is no need to run the device plugin or the GPU workload privileged in the SELinux context.

3XX0 commented 5 years ago

I don't understand why you need more than dev_rw_xserver_misc(nvidia_container_t). The runtime hook shouldn't be affected by the container policy, so why does your policy need rules for the runtime?

zvonkok commented 5 years ago
# /usr/share/selinux/devel/include

policy_module(nvidia-container, 0.1)

gen_require(`
        type container_runtime_tmpfs_t;
        type xserver_exec_t;
')

I am basing nvidia_container_t on container_t, and the next rule allows the NVIDIA container to exit cleanly.

container_domain_template(nvidia_container)
allow nvidia_container_t container_runtime_t:process sigchld;

The hook mounts /proc/driver/nvidia/gpus/0000:00:1d as a tmpfs in the container, which gets a container_runtime_tmpfs_t label, so nvidia_container_t has to be allowed to read and list directories and files that carry this label.

# --- podman/docker
getattr_dirs_pattern(nvidia_container_t, container_runtime_tmpfs_t, container_runtime_tmpfs_t)
list_dirs_pattern(nvidia_container_t, container_runtime_tmpfs_t, container_runtime_tmpfs_t)
read_files_pattern(nvidia_container_t, container_runtime_tmpfs_t, container_runtime_tmpfs_t)

The bin files mounted by the hook have the xserver_exec_t label; the next rule allows nvidia_container_t to access files with this label and execute them.

# --- running nvidia-smi
allow nvidia_container_t xserver_exec_t:file { entrypoint execute getattr };

This should be clear: the devices have xserver_misc_device_t, so again we allow nvidia_container_t to access the devices.

# --- allow nvidia_container_t xserver_misc_device_t:chr_file { getattr ioctl open read write };
# --- alloc mem, ... /dev/nvidia*
dev_rw_xserver_misc(nvidia_container_t)

There is currently no problem with the libraries: container_t and nvidia_container_t can read lib_t and container_file_t.

The hook creates a symlink for each library, and the symlinks get the correct label (container_file_t), inherited from the parent folder.

A symlink has its own inode and hence gets its own SELinux label.

root@e67d1214d198:/usr/lib/x86_64-linux-gnu# ls -lZ libcuda.so.1 
lrwxrwxrwx. 1 root root system_u:object_r:container_file_t:s0:c301,c422 17 Oct 24 18:25 libcuda.so.1 -> libcuda.so.410.48

This does not mean that you can create correctly labeled symlinks to a file that you're not able to read; the type reading the symlinks must have permission to read both the symlink source and its destination.

flx42 commented 5 years ago

I believe @3XX0 expected the hook to run with the context unconfined_u:unconfined_r:unconfined_t. But I just checked, and it's indeed system_u:system_r:container_runtime_t.
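For reference, one quick way to verify this is to have the hook record its own context while a container starts (a sketch; the log path is arbitrary):

    # temporarily added to the hook script
    cat /proc/self/attr/current >> /tmp/nvidia-hook-context.log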

zvonkok commented 5 years ago

Beware: if you're running with podman, you will need at least container-selinux >= 2.73. Prior to that version, podman runs the hook with the context unconfined_u:unconfined_r:xserver_t; this is fixed by recent container-selinux packages.

This will mount e.g. /etc/nvidia/nvidia-application-profiles-rc.d/ as unconfined_u:object_r:xserver_tmpfs_t:s0, so the policy will not work.
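On an RPM-based host the installed version can be checked directly:

    rpm -q container-selinux
    # must report 2.73 or later for the hook to run as container_runtime_t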

3XX0 commented 5 years ago

No, what I would like to know is why we need the container_runtime_* (i.e. all the non-xserver_*) rules in the first place. I mean, tmpfs/sigchld handling is pretty basic and should have been inherited from the container domain.

But really, the problem I have with this patch is that you are assuming that the host driver files carry xserver contexts, and we can't really be opinionated about that. So I see 3 options:

3XX0 commented 5 years ago

So I just looked into it, and I think this is due to the fact that your policy doesn't have the svirt_sandbox_domain attribute:

   container_t
      corenet_unlabeled_type
      domain
      kernel_system_state_reader
      mcs_constrained_type
      process_user_target
      container_domain
      container_net_domain
      syslog_client_type
      pcmcia_typeattr_7
      pcmcia_typeattr_6
      pcmcia_typeattr_5
      pcmcia_typeattr_4
      pcmcia_typeattr_3
      pcmcia_typeattr_2
      pcmcia_typeattr_1
      sandbox_net_domain
      sandbox_caps_domain
      svirt_sandbox_domain
   Aliases
      svirt_lxc_net_t
zvonkok commented 5 years ago

You're right. I've updated the policy with the missing attributes; see the updated nvidia-container.te.

But the problem with tmpfs remains, because container_domain is only allowed dir read:

 container_domain container_runtime_tmpfs_t:dir read;

But the container is doing more than just dir read; that's why I have added the other rules. container_runtime_t can do getattr, list_dirs, and read_files_pattern, but container_domain cannot.

Does /proc/driver/nvidia/gpus/... need to be mounted as a tmpfs at all? proc_t can be read by container_t just fine.

The host labels are another point of discussion. The default selinux-policy labels the bin files as xserver_exec_t and the devices as xserver_misc_device_t. The reason I rely on these types is that I do not want to give the NVIDIA container access to base types. I want to contain it with the default rules and only allow access to xserver_*.

3XX0 commented 5 years ago

Oh right, I assumed that was already the case, but thinking more about it, runc always bind-mounts on top of its container_runtime_tmpfs_t files, so I guess it never needs read access like we do. That's what I have on my end:

   allow container_domain container_runtime_tmpfs_t : sock_file { write getattr append open } ; 
   allow container_domain container_runtime_tmpfs_t : lnk_file { read getattr } ; 
   allow svirt_sandbox_domain file_type : file entrypoint ; 
   allow svirt_sandbox_domain file_type : dir { getattr search open } ; 
   allow container_domain file_type : filesystem getattr ; 
   allow container_domain container_runtime_tmpfs_t : dir { getattr search open } ; 
   allow svirt_sandbox_domain file_type : filesystem getattr ; 

So we might just be missing file { open read } and dir { read }, unless search is sufficient. It would be nice to have that in container-selinux by default, though.
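If search turns out not to be sufficient, the delta would be roughly these rules (a sketch of the candidate additions, not a tested patch):

   allow container_domain container_runtime_tmpfs_t : file { open read } ; 
   allow container_domain container_runtime_tmpfs_t : dir read ; 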

I understand why you do it, but coming back to our options, I would prefer that we provide better file contexts than these in the first place (e.g. nvidia_device_t, nvidia_exec_t):

/usr/(.*/)?nvidia/.+\.so(\..*)?                    regular file       system_u:object_r:textrel_shlib_t:s0 
/usr/(.*/)?lib(64)?(/.*)?/nvidia/.+\.so(\..*)?     regular file       system_u:object_r:textrel_shlib_t:s0 
/usr/lib.*/libnvidia\.so(\.[^/]*)*                 regular file       system_u:object_r:textrel_shlib_t:s0 
/usr/lib(/.*)?/nvidia/.+\.so(\..*)?                regular file       system_u:object_r:textrel_shlib_t:s0 
/usr/lib(/.*)?/nvidia_drv.*\.so(\.[^/]*)*          regular file       system_u:object_r:textrel_shlib_t:s0 
/dev/nvidia.*                                      character device   system_u:object_r:xserver_misc_device_t:s0 
/usr/bin/nvidia.*                                  regular file       system_u:object_r:xserver_exec_t:s0 
/usr/lib/nvidia.*\.so(\.[^/]*)*                    regular file       system_u:object_r:textrel_shlib_t:s0 
/usr/lib/libnvidia\.so(\.[^/]*)*                   regular file       system_u:object_r:textrel_shlib_t:s0 
/usr/lib/nvidia-graphics(-[^/]*/)?libXvMCNVIDIA\.so.* regular file       system_u:object_r:textrel_shlib_t:s0 
/usr/lib/nvidia-graphics(-[^/]*/)?libnvidia.*\.so(\.[^/]*)* regular file       system_u:object_r:textrel_shlib_t:s0 
/usr/lib/nvidia-graphics(-[^/]*/)?libGL(core)?\.so(\.[^/]*)* regular file       system_u:object_r:textrel_shlib_t:s0 
/var/log/nvidia-installer\.log.*                   regular file       system_u:object_r:xserver_log_t:s0 
/usr/lib/vdpau/libvdpau_nvidia\.so.*               regular file       system_u:object_r:textrel_shlib_t:s0 
/usr/lib/xorg/modules/extensions/nvidia(-[^/]*)?/libglx\.so(\.[^/]*)* regular file       system_u:object_r:textrel_shlib_t:s0 
/usr/lib/xorg/modules/drivers/nvidia_drv\.o        regular file       system_u:object_r:textrel_shlib_t:s0 

If we were to rely on the default xserver ones, I suggest we write an extension to container-selinux, which may prove useful for other people as well (e.g. forwarding X inside containers).
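As an illustration of the first alternative, a dedicated file-contexts file could introduce NVIDIA-specific types (a sketch only; nvidia_device_t and nvidia_exec_t would still need to be defined and given rules in the accompanying .te module):

    # nvidia-container.fc (hypothetical)
    /dev/nvidia.*      -c  gen_context(system_u:object_r:nvidia_device_t,s0)
    /usr/bin/nvidia.*  --  gen_context(system_u:object_r:nvidia_exec_t,s0)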

zvonkok commented 5 years ago

I'm totally on your side: I do not like the xserver_* types either; it was just a first step to have something "working".

We're currently in a position to create an example workflow for enabling hardware accelerators in general on a system with SELinux. If other hardware vendors follow this path with a similar method of providing the needed libraries (prestart hook, bind mounts), we could create "generic" rules for labeling and a policy for accelerators:

  1. We need to take care of the "correct" labelling on the host
  2. On top of that, create a policy that enables containers to interact with these labels (whether they are nvidia_* or xserver_* does not matter)
wzhanw commented 5 years ago

@3XX0 and @zvonkok, I am trying to set up an OpenShift + NVIDIA docker hook + SELinux environment for AI training jobs. I found that some AI training frameworks (PyTorch) want to write to /dev/shm in the GPU container, but after I run the container with "container_t" or @zvonkok's "nvidia_container_t", /dev/shm in the container is not accessible to the training code. I'm new to SELinux; do you know how to configure the rule? Thank you.

RenaudWasTaken commented 4 years ago

I think this is done; there is an example SELinux policy for DGX available here: https://github.com/NVIDIA/dgx-selinux

qhaas commented 4 years ago

> I think this is done; there is an example SELinux policy for DGX available here: https://github.com/NVIDIA/dgx-selinux

That should likely be generalized to non-DGX EL7/EL8 environments and made part of this project's packages.