Closed zvonkok closed 4 years ago
Besides restoring the context of NVIDIA files for mounting, one crucial part of the story is the correct label of `/var/lib/kubelet/.*`. The label has to be `container_file_t`: the device plugin reads from and communicates with `kubelet.sock` and `kubelet_internal_checkpoint`.
With the above-mentioned module it is possible to run the device plugin with a restricted SCC and with:

```yaml
securityContext:
  allowPrivilegeEscalation: false
  capabilities:
    drop: ["ALL"]
  seLinuxOptions:
    type: nvidia_container_t
```
There is no need to run the device plugin or the GPU workload privileged in an SELinux context.
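For context, a minimal sketch of where this securityContext would sit in a device-plugin pod spec. This is a hypothetical illustration: the image name and volume layout are assumptions, not taken from an actual plugin manifest.

```yaml
# Hypothetical pod spec sketch; image name and volume layout are assumptions.
apiVersion: v1
kind: Pod
metadata:
  name: nvidia-device-plugin
spec:
  containers:
  - name: nvidia-device-plugin
    image: nvidia/k8s-device-plugin   # assumed image name
    securityContext:
      allowPrivilegeEscalation: false
      capabilities:
        drop: ["ALL"]
      seLinuxOptions:
        type: nvidia_container_t
    volumeMounts:
    - name: device-plugin
      mountPath: /var/lib/kubelet/device-plugins
  volumes:
  - name: device-plugin
    hostPath:
      path: /var/lib/kubelet/device-plugins
```

The `hostPath` mount is why the `container_file_t` label on `/var/lib/kubelet/.*` matters: the plugin must reach `kubelet.sock` under that path from inside its confined domain.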
I don't understand why you need more than `dev_rw_xserver_misc(nvidia_container_t)`; the runtime hook shouldn't be affected by the container policy, so why does your policy need rules for the runtime?
```
# /usr/share/selinux/devel/include
policy_module(nvidia-container, 0.1)

gen_require(`
    type container_runtime_tmpfs_t;
    type xserver_exec_t;
')
```
I am basing `nvidia_container_t` on `container_t`, and the next rule allows the nvidia container to exit cleanly.
```
container_domain_template(nvidia_container)
allow nvidia_container_t container_runtime_t:process sigchld;
```
The hook mounts `/proc/driver/nvidia/gpus/0000:00:1d` as a tmpfs in the container, which gets a `container_runtime_tmpfs_t` label, so one has to allow `nvidia_container_t` to getattr, list, and read directories and files that have this label.
```
# --- podman/docker
getattr_dirs_pattern(nvidia_container_t, container_runtime_tmpfs_t, container_runtime_tmpfs_t)
list_dirs_pattern(nvidia_container_t, container_runtime_tmpfs_t, container_runtime_tmpfs_t)
read_files_pattern(nvidia_container_t, container_runtime_tmpfs_t, container_runtime_tmpfs_t)
```
The bin files mounted by the hook have the `xserver_exec_t` label; the next rule allows `nvidia_container_t` to access files with this label and execute them.
```
# --- running nvidia-smi
allow nvidia_container_t xserver_exec_t:file { entrypoint execute getattr };
```
This should be clear: the devices have `xserver_misc_device_t`, so again we allow `nvidia_container_t` to access the devices.
```
# --- allow nvidia_container_t xserver_misc_device_t:chr_file { getattr ioctl open read write };
# --- alloc mem, ... /dev/nvidia*
dev_rw_xserver_misc(nvidia_container_t)
```
There is currently no problem with the libraries: `container_t` or `nvidia_container_t` can read `lib_t` or `container_file_t`.
The hook creates a symlink for each library, and the symlinks inherit the correct label (`container_file_t`) from the parent folder. A symlink has its own inode and hence gets its own SELinux label.
```
root@e67d1214d198:/usr/lib/x86_64-linux-gnu# ls -lZ libcuda.so.1
lrwxrwxrwx. 1 root root system_u:object_r:container_file_t:s0:c301,c422 17 Oct 24 18:25 libcuda.so.1 -> libcuda.so.410.48
```
This does not mean that you can create correctly labeled symlinks for a file you're not able to read; the type reading the symlink must have permission to read both the symlink's source and its destination.
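The "own inode" point can be demonstrated without SELinux at all. This is a small sketch with made-up file names; on an SELinux system the same separation is what lets `ls -Z` show a distinct context on the link.

```shell
# Sketch: a symlink occupies its own inode, separate from its target,
# which is why it can carry its own SELinux label (file names are made up).
tmp=$(mktemp -d)
touch "$tmp/libcuda.so.410.48"
ln -s libcuda.so.410.48 "$tmp/libcuda.so.1"
# Without -L, stat reports the symlink's own inode, not the target's
stat -c '%i %N' "$tmp/libcuda.so.1"
stat -c '%i %N' "$tmp/libcuda.so.410.48"
rm -r "$tmp"
```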
I believe @3XX0 expected the hook to run with context `unconfined_u:unconfined_r:unconfined_t`. But I just checked, and it's indeed `system_u:system_r:container_runtime_t`.
Beware: if you're running with podman, you will need at least container-selinux >= 2.73. Prior to that version, podman will run the hook with the context `unconfined_u:unconfined_r:xserver_t`; that is fixed by recent container-selinux packages.
This will mount e.g. `/etc/nvidia/nvidia-application-profiles-rc.d/` as `unconfined_u:object_r:xserver_tmpfs_t:s0`, so the policy will not work.
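A quick way to check whether the installed container-selinux is new enough. This is a sketch for RPM-based systems (the `rpm` query is an assumption about the host packaging):

```shell
# Sketch: verify container-selinux >= 2.73 (needed so podman runs the
# hook as container_runtime_t). Assumes an RPM-based host.
required="2.73"
installed=$(rpm -q --qf '%{VERSION}' container-selinux 2>/dev/null || echo "0")
# sort -V puts the smaller version first; if that is $required,
# the installed version is at least as new
if [ "$(printf '%s\n%s\n' "$required" "$installed" | sort -V | head -n1)" = "$required" ]; then
  echo "container-selinux is new enough"
else
  echo "container-selinux is too old (< $required)"
fi
```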
Now, what I would like to know is why we need the `container_runtime_*` (i.e. all the non-`xserver_*`) rules in the first place. I mean, tmpfs/sigchld handling is pretty basic and should have been inherited from the container domain.
But really, the problem I have with this patch is that you are assuming the host driver files carry xserver contexts, and we can't really be opinionated about that. So I see 3 options:
So I just looked into it, and I think this is due to the fact that your policy doesn't have the `svirt_sandbox_domain` attribute:

```
container_t
   corenet_unlabeled_type
   domain
   kernel_system_state_reader
   mcs_constrained_type
   process_user_target
   container_domain
   container_net_domain
   syslog_client_type
   pcmcia_typeattr_7
   pcmcia_typeattr_6
   pcmcia_typeattr_5
   pcmcia_typeattr_4
   pcmcia_typeattr_3
   pcmcia_typeattr_2
   pcmcia_typeattr_1
   sandbox_net_domain
   sandbox_caps_domain
   svirt_sandbox_domain
Aliases
   svirt_lxc_net_t
```
You're right. I've updated the policy for the missing attributes; see the updated nvidia-container.te.
But the problem with tmpfs remains, because `container_domain` is only allowed dir read:

```
container_domain container_runtime_tmpfs_t:dir read;
```

But the container is doing more than just dir read; that's why I have added the other rules.
`container_runtime_t` can do getattr, list_dirs, read_files_pattern, but `container_domain` cannot.
Does `/proc/driver/nvidia/gpus/...` need to be mounted as a tmpfs? `proc_t` can be read by `container_t` just fine.
The host labels are another point of discussion. The default selinux-policy labels the bin files as `xserver_exec_t` and the devices as `xserver_misc_device_t`. The reason why I rely on these types is: I do not want to give the nvidia container access to base types. I want to contain it by the default rules and only allow access to `xserver_*`.
Oh right, I assumed that was already the case, but thinking more about it, runc always bind-mounts on top of its `container_runtime_tmpfs_t` files, so I guess it never needs read access like we do.
That's what I have on my end:
```
allow container_domain container_runtime_tmpfs_t : sock_file { write getattr append open } ;
allow container_domain container_runtime_tmpfs_t : lnk_file { read getattr } ;
allow svirt_sandbox_domain file_type : file entrypoint ;
allow svirt_sandbox_domain file_type : dir { getattr search open } ;
allow container_domain file_type : filesystem getattr ;
allow container_domain container_runtime_tmpfs_t : dir { getattr search open } ;
allow svirt_sandbox_domain file_type : filesystem getattr ;
```
So we might just be missing `file { open read }` and `dir { read }`, unless `search` is sufficient. It would be nice to have that by default in container-selinux, though.
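A minimal sketch of what that container-selinux addition might look like. These rules are hypothetical; the exact permission set is precisely what is being debated above.

```
# Hypothetical extension to container-selinux: let container domains
# read container_runtime_tmpfs_t content mounted by the runtime/hook.
allow container_domain container_runtime_tmpfs_t:file { open read getattr };
allow container_domain container_runtime_tmpfs_t:dir { read search open getattr };
```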
I understand why you do it, but coming back to our options, I would prefer we provide better file contexts than these in the first place (e.g. `nvidia_device_t`, `nvidia_exec_t`):
```
/usr/(.*/)?nvidia/.+\.so(\..*)?                                          regular file  system_u:object_r:textrel_shlib_t:s0
/usr/(.*/)?lib(64)?(/.*)?/nvidia/.+\.so(\..*)?                           regular file  system_u:object_r:textrel_shlib_t:s0
/usr/lib.*/libnvidia\.so(\.[^/]*)*                                       regular file  system_u:object_r:textrel_shlib_t:s0
/usr/lib(/.*)?/nvidia/.+\.so(\..*)?                                      regular file  system_u:object_r:textrel_shlib_t:s0
/usr/lib(/.*)?/nvidia_drv.*\.so(\.[^/]*)*                                regular file  system_u:object_r:textrel_shlib_t:s0
/dev/nvidia.*                                                            character device  system_u:object_r:xserver_misc_device_t:s0
/usr/bin/nvidia.*                                                        regular file  system_u:object_r:xserver_exec_t:s0
/usr/lib/nvidia.*\.so(\.[^/]*)*                                          regular file  system_u:object_r:textrel_shlib_t:s0
/usr/lib/libnvidia\.so(\.[^/]*)*                                         regular file  system_u:object_r:textrel_shlib_t:s0
/usr/lib/nvidia-graphics(-[^/]*/)?libXvMCNVIDIA\.so.*                    regular file  system_u:object_r:textrel_shlib_t:s0
/usr/lib/nvidia-graphics(-[^/]*/)?libnvidia.*\.so(\.[^/]*)*              regular file  system_u:object_r:textrel_shlib_t:s0
/usr/lib/nvidia-graphics(-[^/]*/)?libGL(core)?\.so(\.[^/]*)*             regular file  system_u:object_r:textrel_shlib_t:s0
/var/log/nvidia-installer\.log.*                                         regular file  system_u:object_r:xserver_log_t:s0
/usr/lib/vdpau/libvdpau_nvidia\.so.*                                     regular file  system_u:object_r:textrel_shlib_t:s0
/usr/lib/xorg/modules/extensions/nvidia(-[^/]*)?/libglx\.so(\.[^/]*)*    regular file  system_u:object_r:textrel_shlib_t:s0
/usr/lib/xorg/modules/drivers/nvidia_drv\.o                              regular file  system_u:object_r:textrel_shlib_t:s0
```
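A sketch of what dedicated file contexts could look like with the proposed types. This is hypothetical: `nvidia_device_t` and `nvidia_exec_t` do not exist in the stock policy and would have to be declared in the module before these `.fc` entries could apply.

```
# Hypothetical .fc entries using dedicated NVIDIA types instead of the
# xserver ones; both types would need to be declared in the .te file.
/dev/nvidia.*        -c  gen_context(system_u:object_r:nvidia_device_t,s0)
/usr/bin/nvidia.*    --  gen_context(system_u:object_r:nvidia_exec_t,s0)
```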
If we were to rely on the default `xserver` ones, I suggest we write an extension to container-selinux, which may prove useful for other people (e.g. forwarding X inside containers).
Totally on your side: I do not like the `xserver_*` things either; it was just a first step to have something "working".
We're currently in a position to create an example workflow here for how to enable hardware accelerators in general on a system with SELinux. If other hardware vendors follow this path with a similar method of providing the needed libraries (prestart hook, bind mounts), we could create "generic" rules for labeling and a policy for accelerators.
@3XX0 and @zvonkok, I'm trying to set up an OpenShift + nvidia docker hook + SELinux environment for AI training jobs. I found that some AI training frameworks (PyTorch) want to write to `/dev/shm` in the GPU container, but after I run the container with `container_t` or @zvonkok's `nvidia_container_t`, `/dev/shm` in the container is not accessible by the training code. I'm new to SELinux; do you know how to configure the rule? Thank you.
I think this is done, there is an example selinux policy for DGX available here: https://github.com/NVIDIA/dgx-selinux
That should likely be generalized to non-DGX EL7 / EL8 environments and made part of this project's packages.
When we run NVIDIA containers on an SELinux-enabled distribution, we need a separate SELinux module to run the container contained. Without an SELinux module we have to run the container privileged, as this is the only way to allow specific SELinux contexts to interact (read, write, chattr, ...) with the files mounted into the container. A container running privileged will get the `spc_t` label, which is allowed to read/write and chattr base types. The base types (`device_t`, `bin_t`, `proc_t`, ...) are introduced by the bind mounts of the hook. A bind mount cannot have two different SELinux contexts, as SELinux operates on the inode level. I have created the following SELinux nvidia-container.te that works with podman/cri-o/docker.
A prerequisite for the SELinux module to work correctly is to ensure that the labels are correct for the mounted files. Therefore I have added an additional line to the oci-nvidia-hook where I am running a
With this, every time a container is started, the files to be mounted will have the correct SELinux label and the SELinux module will work.
Now I can run NVIDIA containers without `privileged`, can `cap-drop=ALL` capabilities, and use `security-opt=no-new-privileges`.
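With the module loaded, a run along these lines should work. This is a sketch: the image name and the `nvidia-smi` command are assumptions for illustration.

```shell
# Sketch: run a GPU container unprivileged, applying the policy's type
# via --security-opt (image name and command are assumptions)
docker run --rm \
  --security-opt label=type:nvidia_container_t \
  --security-opt no-new-privileges \
  --cap-drop=ALL \
  nvidia/cuda nvidia-smi
```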