NVIDIA / nvidia-container-toolkit

Build and run containers leveraging NVIDIA GPUs
Apache License 2.0

Updating CPU quota causes NVML unknown error #138

Open dvenza opened 6 years ago

dvenza commented 6 years ago

I'm testing nvidia-docker 2, starting containers through Zoe Analytics, which uses the Docker API over the network. Zoe dynamically adjusts CPU quotas to redistribute spare capacity, but doing so makes nvidia-docker break down:

Start a container (the nvidia plugin is set as default in daemon.json):

$ docker run -d -e NVIDIA_VISIBLE_DEVICES=all -p 8888 gcr.io/tensorflow/tensorflow:1.3.0-gpu-py3

Test with nvidia-smi (it works):

$ docker exec -it 9e nvidia-smi
Thu Nov  2 08:03:25 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.81                 Driver Version: 384.81                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla P100-PCIE...  Off  | 00000000:04:00.0 Off |                    0 |
| N/A   26C    P0    31W / 250W |      0MiB / 16276MiB |      0%      Default |

[...]

Change the CPU quota:

$ docker update --cpu-quota 640000 9e

Test with nvidia-smi (it breaks):

$ docker exec -it 9e nvidia-smi
Failed to initialize NVML: Unknown Error
3XX0 commented 6 years ago

Good catch, it looks like Docker is resetting all of the cgroups when it only needs to update one (the CPU quota in this case). Not sure how we can work around that, though.
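
One way to observe the reset is to watch the container's devices cgroup across a docker update. A rough sketch, assuming cgroup v1 with the cgroupfs driver (the path differs under the systemd driver), reusing the container ID 9e from above:

$ CID=$(docker inspect --format '{{.Id}}' 9e)
$ # Before the update: the allow-list includes the injected nvidia character devices
$ cat /sys/fs/cgroup/devices/docker/$CID/devices.list
$ docker update --cpu-quota 640000 9e
$ # After the update: if the bug is present, the nvidia entries have been dropped
$ cat /sys/fs/cgroup/devices/docker/$CID/devices.list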

mrjackbo commented 5 years ago

Has there been any progress on this? It seems I ran into the same problem while trying to set up the Kubernetes cpu-manager with the "static" policy. (https://kubernetes.io/docs/tasks/administer-cluster/cpu-management-policies/)
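
(For reference, the static policy that triggers this code path is enabled on the kubelet roughly as follows; a sketch only, with example values, per the linked docs. The static policy also requires a CPU reservation via --kube-reserved or --system-reserved, and an existing cpu_manager_state file must be removed when switching policies.)

$ kubelet --cpu-manager-policy=static \
          --kube-reserved=cpu=500m,memory=1Gi \
          ...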

klueska commented 5 years ago

@3XX0 I think it is unlikely that this will ever be addressed upstream.

From docker's perspective, it owns and controls all of the cgroup/device settings for the containers it launches. If something comes along (in this case, libnvidia-container) and changes those cgroup/device settings outside of docker, then docker should be free to resolve these discrepancies in order to keep its state in sync.

The long-term solution should probably involve making libnvidia-container "docker-aware" in some way, so that it can push the necessary state changes through:

https://docs.docker.com/engine/api/v1.25/#operation/ContainerUpdate

I know this goes against the current design (i.e. making libnvidia-container container runtime agnostic), but I don't see any other way around this.

For example, if you do a docker inspect on a functioning GPU-enabled container today, you will see that its device list is empty, even though it clearly has the nvidia devices injected into it and the cgroup access to those devices is set up properly. However, once some external entity hits docker's ContainerUpdate API (whether directly via the CLI or through an API call like the one the CPUManager makes in Kubernetes), docker flushes this empty device list to disk, essentially "undoing" what libnvidia-container had set up for these devices.
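
You can see this directly with docker inspect (a sketch, reusing the container ID 9e from the reproduction above):

$ docker inspect --format '{{json .HostConfig.Devices}}' 9e   # prints an empty device list
$ docker exec 9e ls /dev/nvidia0 /dev/nvidiactl               # yet the device nodes are present inside the container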

@RenaudWasTaken How does the new --gpus flag for docker handle the fact that libnvidia-container is messing with cgroups/devices outside of docker's control?

klueska commented 5 years ago

@mrjackbo if your setup is constrained such that GPUs will only ever be used by containers that have CPUsets assigned to them via the static allocation policy, then the following patch to Kubernetes will avoid having docker update its cgroups after these containers are initially launched.

diff --git a/pkg/kubelet/cm/cpumanager/cpu_manager.go b/pkg/kubelet/cm/cpumanager/cpu_manager.go
index 4ccddd5..ff3fbdf 100644
--- a/pkg/kubelet/cm/cpumanager/cpu_manager.go
+++ b/pkg/kubelet/cm/cpumanager/cpu_manager.go
@@ -242,7 +242,8 @@ func (m *manager) reconcileState() (success []reconciledContainer, failure []rec
                        // - policy does not want to track the container
                        // - kubelet has just been restarted - and there is no previous state file
                        // - container has been removed from state by RemoveContainer call (DeletionTimestamp is set)
-                       if _, ok := m.state.GetCPUSet(containerID); !ok {
+                       cset, ok := m.state.GetCPUSet(containerID)
+                       if !ok {
                                if status.Phase == v1.PodRunning && pod.DeletionTimestamp == nil {
                                        klog.V(4).Infof("[cpumanager] reconcileState: container is not present in state - trying to add (pod: %s, container: %s, container id: %s)", pod.Name, container.Name, containerID)
                                        err := m.AddContainer(pod, &container, containerID)
@@ -258,7 +259,13 @@ func (m *manager) reconcileState() (success []reconciledContainer, failure []rec
                                }
                        }

-                       cset := m.state.GetCPUSetOrDefault(containerID)
+                       if !cset.IsEmpty() && m.policy.Name() == string(PolicyStatic) {
+                               klog.V(4).Infof("[cpumanager] reconcileState: skipping container; assigned cpuset unchanged (pod: %s, container: %s, container id: %s, cpuset: \"%v\")", pod.Name, container.Name, containerID, cset)
+                               success = append(success, reconciledContainer{pod.Name, container.Name, containerID})
+                               continue
+                       }
+
+                       cset = m.state.GetDefaultCPUSet()
                        if cset.IsEmpty() {
                                // NOTE: This should not happen outside of tests.
                                klog.Infof("[cpumanager] reconcileState: skipping container; assigned cpuset is empty (pod: %s, container: %s)", pod.Name, container.Name)
arlofaria commented 5 years ago

Here's a workaround that might be helpful:

$ docker run --device /dev/nvidiactl:/dev/nvidiactl --device /dev/nvidia-uvm:/dev/nvidia-uvm --device /dev/nvidia0:/dev/nvidia0 ...

(Replace/repeat nvidia0 with other/more devices as needed.)

This seems to fix the problem with both --runtime=nvidia and the newer --gpus option.
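
To confirm the workaround, the original reproduction can be repeated with the devices passed explicitly (a sketch, reusing the image and quota from the first comment; adjust the device nodes and container ID to your system):

$ docker run -d -e NVIDIA_VISIBLE_DEVICES=all -p 8888 \
    --device /dev/nvidiactl --device /dev/nvidia-uvm --device /dev/nvidia0 \
    gcr.io/tensorflow/tensorflow:1.3.0-gpu-py3
$ docker update --cpu-quota 640000 <container-id>
$ docker exec -it <container-id> nvidia-smi   # should still work: docker now tracks the devices itself and re-applies them on update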

lucidyan commented 7 months ago

@elezar @klueska

The official guide on how to deal with this problem says:

You can use the following steps to confirm that your system is affected. After you implement one of the workarounds (mentioned in the next section), you can repeat the steps to confirm that the error is no longer reproducible.

For Docker environments Run a test container:

$ docker run -d --rm --runtime=nvidia --gpus all \
    --device=/dev/nvidia-uvm \
    --device=/dev/nvidia-uvm-tools \
    --device=/dev/nvidia-modeset \
    --device=/dev/nvidiactl \
    --device=/dev/nvidia0 \
    nvcr.io/nvidia/cuda:12.0.0-base-ubuntu20.04 bash -c "while [ true ]; do nvidia-smi -L; sleep 5; done"

But logically, and in my experience, we should not use the --device flags, because they fix the problem rather than make the bug reproducible. Am I missing something?

klueska commented 7 months ago

For the issue related to missing /dev/char/* devices, the bug would occur even if you added the --device nodes as above.

lucidyan commented 7 months ago

For the issue related to missing /dev/char/* devices, the bug would occur even if you added the --device nodes as above.

It's interesting, because I was able to reproduce the NVML bug only without the --device arguments, as stated above. And it was also fixed by switching the cgroup manager from systemd to cgroupfs in the docker daemon config.
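
For reference, switching Docker to the cgroupfs cgroup driver is done via exec-opts in /etc/docker/daemon.json; a minimal sketch (merge with your existing config, e.g. the nvidia runtime entry, and restart the daemon afterwards):

$ cat /etc/docker/daemon.json
{
    "exec-opts": ["native.cgroupdriver=cgroupfs"]
}
$ sudo systemctl restart docker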

klueska commented 7 months ago

Don't get me wrong -- it will definitely happen if you don't pass --device. But without the /dev/char symlinks it will also happen even if you do pass --device.

zlianzhuang commented 4 months ago

I set cgroup-driver=cgroupfs on Docker and k8s to fix my cluster.
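
On the Kubernetes side, the kubelet's cgroup driver has to match Docker's. A sketch of the corresponding kubelet setting, assuming the kubeadm default config path:

$ grep cgroupDriver /var/lib/kubelet/config.yaml
cgroupDriver: cgroupfs
$ sudo systemctl restart kubelet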