apptainer / singularity

Singularity has been renamed to Apptainer as part of moving the project to the Linux Foundation. This repository has been preserved as a snapshot taken right before those changes.
https://github.com/apptainer/apptainer

packaging cuda drivers #1872

Closed · mforde84 closed this issue 6 years ago

mforde84 commented 6 years ago

Version of Singularity:

2.6.0

Expected behavior

Built the container as a sandbox on an NFS mount point shared across multiple GPU compute nodes. I'd like to include an updated driver that's compatible with the particular version of the CUDA runtime that's packaged in the container, as the host driver is old and incompatible. How should I go about installing the latest CUDA drivers into the sandbox installation so that the container environment uses the updated driver rather than the older, incompatible host drivers? Or do I have to update the host drivers? Can I simply install the drivers from source to a prefix within the sandbox? Or can I modify the build so that it installs its own CUDA dependencies and uses those to interface with the card instead of the host drivers? Sorry, I know this is probably a fairly basic question; I just don't have much experience with containers, so it's new territory for me in terms of best/appropriate practices.

Actual behavior

The container is drawing drivers from the host, which is what is expected with the --nv flag. However, my system drivers are incompatible with the packaged CUDA runtime.

Steps to reproduce behavior

```
$ cat /etc/system-release
Red Hat Enterprise Linux Server release 6.7 (Santiago)
$ uname -r
2.6.32-573.12.1.el6.x86_64
$ # install nvidia driver v352.39
$ sudo singularity build --sandbox /path/to/sandbox docker://tensorflow/tensorflow:1.10.0-devel-gpu-py3
$ singularity shell --nv /path/to/sandbox
Singularity tensorflow:1.10.0-devel-gpu-py3:~> nvidia-smi
Thu Aug 23 00:24:41 2018
+------------------------------------------------------+
| NVIDIA-SMI 352.39     Driver Version: 352.39         |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           Off  | 0000:84:00.0     Off |                    0 |
| N/A   39C    P0    58W / 149W |     22MiB / 11519MiB |      0%   E. Process |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
Singularity tensorflow:1.10.0-devel-gpu-py3:~> python3
Python 3.5.2 (default, Nov 23 2017, 16:37:01)
[GCC 5.4.0 20160609] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from tensorflow.python.client import device_lib
>>> print(device_lib.list_local_devices())
2018-08-23 00:26:35.424225: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2018-08-23 00:26:38.208490: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1405] Found device 0 with properties:
name: Tesla K80 major: 3 minor: 7 memoryClockRate(GHz): 0.8235
pciBusID: 0000:84:00.0
totalMemory: 11.25GiB freeMemory: 11.16GiB
2018-08-23 00:26:38.208576: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1484] Adding visible gpu devices: 0
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/device_lib.py", line 41, in list_local_devices
    for s in pywrap_tensorflow.list_devices(session_config=session_config)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/pywrap_tensorflow_internal.py", line 1679, in list_devices
    return ListDevices(status)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/errors_impl.py", line 519, in __exit__
    c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.InternalError: cudaGetDevice() failed. Status: CUDA driver version is insufficient for CUDA runtime version
```
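A rough way to confirm the mismatch, assuming the CUDA 9.0 runtime these TensorFlow 1.10 images are built against and the illustrative sandbox path from above:

```bash
# host driver version, as reported by the loaded kernel module
cat /proc/driver/nvidia/version

# CUDA runtime version packaged in the image (version.txt location assumed from CUDA 9.x images)
singularity exec /path/to/sandbox cat /usr/local/cuda/version.txt

# CUDA 9.0 needs a 384.xx-or-newer driver on Linux, so the host's 352.39 is too old
```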

dtrudg commented 6 years ago

Hi @mforde84 - aspects of the NVIDIA/CUDA drivers are associated with the host, not the container I'm afraid. The critical pieces of the drivers are some kernel modules, which need to be loaded into the kernel running on the host. You have to update the driver on the host, as root. This is not something that Singularity can do.
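For example, the pieces that matter live on the host - exact paths and tools may vary by distribution, so treat this as a sketch:

```bash
# the driver's kernel modules are loaded into the host kernel, outside any container
lsmod | grep '^nvidia'
cat /proc/driver/nvidia/version

# nothing inside the sandbox can replace these; a newer driver has to be installed
# on the host by root (e.g. distro packages or the NVIDIA .run installer), and the
# new modules loaded (typically via a reboot)
```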

The --nv flag binds the runtime library portions of the driver installation from the host into the container, so that the CUDA software in the container uses the libraries that match the kernel modules on the host. Unfortunately this doesn't help if the host driver, and therefore the loaded nvidia kernel modules, are too old.
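Roughly, you can see what --nv is working with like this - the nvliblist.conf location depends on the install prefix, and the sandbox path is just the one from your example:

```bash
# the libraries/binaries that --nv tries to bind are listed in nvliblist.conf
grep -v '^#' /etc/singularity/nvliblist.conf | head

# the actual files are resolved from the host's linker cache, i.e. the host driver
ldconfig -p | grep -E 'libcuda\.so|libnvidia-ml\.so'

# inside the container it's those host libraries that the CUDA runtime loads,
# which is why they must be new enough for the runtime packaged in the image
singularity exec --nv /path/to/sandbox nvidia-smi
```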

mforde84 commented 6 years ago

Yeah, I read this elsewhere the other day. Thanks for following up though. Quick question: I hear there might be some experimental support for running a different kernel than the host's from a container? Is this at all true, or on the roadmap in any sense? If you could build a kernel from within a container, then you could hypothetically interface with the card's block device but with different kernel modules than the host, right? Only a hypothetical thought really. Probably wrong for a lot of reasons :)

dtrudg commented 6 years ago

Hi @mforde84 - a container definitely uses the host kernel. There have been a few things out there on the web about 'lightweight VMs', or about putting a sandboxing syscall-interception layer into a container (which appears as a kernel to the container), but all device access has to go through the host kernel, which controls the devices - unless those devices have been passed through to a virtual machine running another kernel (at which point they can't be used by the host any more). That kind of device pass-through into VMs is quite hard to orchestrate in the usage model that containers are great for, and it requires a lot of privilege and trust - plus you lose e.g. direct access to the high-speed cluster networks that you enjoy with Singularity containers.
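You can see this directly - the sandbox path is just illustrative:

```bash
# both commands report the same kernel release (2.6.32-573.12.1.el6.x86_64 in your case),
# because the container runs on the host's kernel
uname -r
singularity exec /path/to/sandbox uname -r
```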

mforde84 commented 6 years ago

Very interesting. Thanks for helping explain some of the details. +1