NVIDIA / nvidia-container-toolkit

Build and run containers leveraging NVIDIA GPUs
Apache License 2.0

Upgrade NVidia driver without the need to restart docker daemon #169

Open remoteweb opened 2 years ago

remoteweb commented 2 years ago

1. Issue or feature description

We need to upgrade the NVIDIA driver for the host and containers without restarting the docker daemon.

This applies to containers consuming GPU capabilities with the Docker NVIDIA runtime.

After the host NVIDIA driver is updated, and while all containers using the GPU are stopped, trying to start the containers again fails with the following error message:

stderr: nvidia-container-cli: mount error: mount operation failed: /var/lib/docker/overlay2/307038625cd555791f9de4ea47596a7a5815ca21a7e9b6a783368637a2fb24cd/merged/proc/driver/nvidia/params/version/registry: no such file or directory: unknown"

If we restart the whole docker daemon, the container comes back online properly.

2. Steps to reproduce the issue

First, make sure you are using a previous driver version and have at least one docker container consuming the GPU (e.g. running nvidia-smi from within the container).

Requirements:
- Docker
- NVIDIA Docker installed
- NVIDIA driver installed
- Working nvidia-smi command

docker run --name=nvidia-test --runtime=nvidia -e NVIDIA_VISIBLE_DEVICES=all nginx nvidia-smi

nvidia-smi should print GPU information, marking the test as passed.

Now let's reproduce the behavior in question

# stop container
docker stop nvidia-test

# unload kernel module
modprobe -r nvidia_drm

# download and install any newer Linux driver
./Driver.run

# then restart container
docker start nvidia-test 

You should get an error message similar to:

stderr: nvidia-container-cli: mount error: mount operation failed: /var/lib/docker/overlay2/307038625cd555791f9de4ea47596a7a5815ca21a7e9b6a783368637a2fb24cd/merged/proc/driver/nvidia/params/version/registry: no such file or directory: unknown"

If you restart the docker daemon (systemctl restart docker), the container can then be brought back online with docker start nvidia-test.

Information to attach (optional if deemed irrelevant)

Device Index: 0
Device Minor: 0
Model: NVIDIA GeForce GTX 1650
Brand: GeForce
GPU UUID: GPU-7878ba12-9b30-8f49-3da8-7930824af120
Bus Location: 00000000:82:00.0


 - [ ] Kernel Version

Linux 3.10.0-1160.11.1.el7.x86_64 #1 SMP Fri Dec 18 16:34:56 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux


 - [ ] Driver information from `nvidia-smi -a`

$nvidia-smi -a

==============NVSMI LOG==============

Timestamp : Wed May 4 10:57:14 2022
Driver Version : 510.68.02
CUDA Version : 11.6

Attached GPUs : 1 GPU 00000000:82:00.0 Product Name : NVIDIA GeForce GTX 1650 Product Brand : GeForce Product Architecture : Turing Display Mode : Disabled Display Active : Disabled Persistence Mode : Disabled MIG Mode Current : N/A Pending : N/A Accounting Mode : Disabled Accounting Mode Buffer Size : 4000 Driver Model Current : N/A Pending : N/A Serial Number : N/A GPU UUID : GPU-7878ba12-9b30-8f49-3da8-7930824af120 Minor Number : 0 VBIOS Version : 90.17.3D.00.4E MultiGPU Board : No Board ID : 0x8200 GPU Part Number : N/A Module ID : 0 Inforom Version Image Version : G001.0000.02.04 OEM Object : 1.1 ECC Object : N/A Power Management Object : N/A GPU Operation Mode Current : N/A Pending : N/A GSP Firmware Version : N/A GPU Virtualization Mode Virtualization Mode : None Host VGPU Mode : N/A IBMNPU Relaxed Ordering Mode : N/A PCI Bus : 0x82 Device : 0x00 Domain : 0x0000 Device Id : 0x1F8210DE Bus Id : 00000000:82:00.0 Sub System Id : 0x8D921462 GPU Link Info PCIe Generation Max : 3 Current : 3 Link Width Max : 16x Current : 16x Bridge Chip Type : N/A Firmware : N/A Replays Since Reset : 0 Replay Number Rollovers : 0 Tx Throughput : 0 KB/s Rx Throughput : 0 KB/s Fan Speed : 40 % Performance State : P0 Clocks Throttle Reasons Idle : Not Active Applications Clocks Setting : Not Active SW Power Cap : Not Active HW Slowdown : Not Active HW Thermal Slowdown : Not Active HW Power Brake Slowdown : Not Active Sync Boost : Not Active SW Thermal Slowdown : Not Active Display Clock Setting : Not Active FB Memory Usage Total : 4096 MiB Reserved : 184 MiB Used : 0 MiB Free : 3911 MiB BAR1 Memory Usage Total : 256 MiB Used : 2 MiB Free : 254 MiB Compute Mode : Default Utilization Gpu : 0 % Memory : 0 % Encoder : 0 % Decoder : 0 % Encoder Stats Active Sessions : 0 Average FPS : 0 Average Latency : 0 FBC Stats Active Sessions : 0 Average FPS : 0 Average Latency : 0 Ecc Mode Current : N/A Pending : N/A ECC Errors Volatile SRAM Correctable : N/A SRAM Uncorrectable : N/A DRAM Correctable : N/A DRAM Uncorrectable : N/A Aggregate SRAM Correctable : N/A SRAM Uncorrectable : N/A DRAM Correctable : N/A DRAM Uncorrectable : N/A Retired Pages Single Bit ECC : N/A Double Bit ECC : N/A Pending Page Blacklist : N/A Remapped Rows : N/A Temperature GPU Current Temp : 49 C GPU Shutdown Temp : 97 C GPU Slowdown Temp : 94 C GPU Max Operating Temp : 92 C GPU Target Temperature : 83 C Memory Current Temp : N/A Memory Max Operating Temp : N/A Power Readings Power Management : Supported Power Draw : 18.87 W Power Limit : 75.00 W Default Power Limit : 75.00 W Enforced Power Limit : 75.00 W Min Power Limit : 45.00 W Max Power Limit : 75.00 W Clocks Graphics : 1485 MHz SM : 1485 MHz Memory : 4001 MHz Video : 1380 MHz Applications Clocks Graphics : N/A Memory : N/A Default Applications Clocks Graphics : N/A Memory : N/A Max Clocks Graphics : 2130 MHz SM : 2130 MHz Memory : 4001 MHz Video : 1950 MHz Max Customer Boost Clocks Graphics : N/A Clock Policy Auto Boost : N/A Auto Boost Default : N/A Voltage Graphics : N/A Processes : None


 - [ ] Docker version from `docker version`

docker version
Client: Docker Engine - Community
 Version:           20.10.2
 API version:       1.41
 Go version:        go1.13.15
 Git commit:        2291f61
 Built:             Mon Dec 28 16:17:48 2020
 OS/Arch:           linux/amd64
 Context:           default
 Experimental:      true

Server: Docker Engine - Community
 Engine:
  Version:          20.10.2
  API version:      1.41 (minimum version 1.12)
  Go version:       go1.13.15
  Git commit:       8891c58
  Built:            Mon Dec 28 16:16:13 2020
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.4.3
  GitCommit:        269548fa27e0089a8b8278fc4fc781d7f65a939b
 runc:
  Version:          1.0.0-rc92
  GitCommit:        ff819c7e9184c13b7c2607fe6c30ae19403a7aff
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0


 - [ ] NVIDIA container library version from `nvidia-container-cli -V`

nvidia-container-cli -V
version: 1.3.1
build date: 2020-12-14T14:18+0000
build revision: ac02636a318fe7dcc71eaeb3cc55d0c8541c1072
build compiler: gcc 4.8.5 20150623 (Red Hat 4.8.5-44)
build platform: x86_64
build flags: -D_GNU_SOURCE -D_FORTIFY_SOURCE=2 -DNDEBUG -std=gnu11 -O2 -g -fdata-sections -ffunction-sections -fstack-protector -fno-strict-aliasing -fvisibility=hidden -Wall -Wextra -Wcast-align -Wpointer-arith -Wmissing-prototypes -Wnonnull -Wwrite-strings -Wlogical-op -Wformat=2 -Wmissing-format-attribute -Winit-self -Wshadow -Wstrict-prototypes -Wunreachable-code -Wconversion -Wsign-conversion -Wno-unknown-warning-option -Wno-format-extra-args -Wno-gnu-alignof-expression -Wl,-zrelro -Wl,-znow -Wl,-zdefs -Wl,--gc-sections

elezar commented 2 years ago

@remoteweb when a create command is intercepted, the NVIDIA Container Library performs mount operations in the container's namespace. Among these are tmpfs mounts over the following three files:

/proc/driver/nvidia/params
/proc/driver/nvidia/version
/proc/driver/nvidia/registry

The error you are seeing seems to indicate that the /proc/driver/nvidia folder does not exist on the host. Can you confirm that it does exist?
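As a quick way to check this (a minimal sketch; the container name nvidia-test is only an illustration and assumes a running GPU container):

# on the host: the files the toolkit mounts over should be present
ls -l /proc/driver/nvidia/params /proc/driver/nvidia/version /proc/driver/nvidia/registry

# inside a running GPU container: the corresponding mounts should be visible
docker exec nvidia-test grep nvidia /proc/mounts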

Note that there was a fix released in NVIDIA Container Toolkit 1.6.0 that addresses the wording of the mount error that you are seeing.

remoteweb commented 2 years ago

To be more specific: we updated from driver version 440 to 510.

So the 440 driver looks like this

~ $ls -al /proc/driver/nvidia
total 0
dr-xr-xr-x. 5 root root 0 Oct  2  2021 .
dr-xr-xr-x. 6 root root 0 Oct  2  2021 ..
dr-xr-xr-x. 3 root root 0 Oct  2  2021 gpus
-r--r--r--. 1 root root 0 May 18 08:13 params
dr-xr-xr-x. 2 root root 0 May 18 08:13 patches
-rw-r--r--. 1 root root 0 May 18 08:13 registry
-rw-r--r--. 1 root root 0 May 18 08:13 suspend
-rw-r--r--. 1 root root 0 May 18 08:13 suspend_depth
-r--r--r--. 1 root root 0 May 18 08:13 version
dr-xr-xr-x. 2 root root 0 May 18 08:13 warnings

and 510 looks like this

~ #ls -la /proc/driver/nvidia
total 0
dr-xr-xr-x 6 root root 0 Dec 21 22:46 .
dr-xr-xr-x 7 root root 0 Dec 21 22:46 ..
dr-xr-xr-x 4 root root 0 May 18 08:12 capabilities
dr-xr-xr-x 3 root root 0 Dec 21 22:46 gpus
-r--r--r-- 1 root root 0 May 18 08:12 params
dr-xr-xr-x 2 root root 0 May 18 08:12 patches
-rw-r--r-- 1 root root 0 May 18 08:12 registry
-rw-r--r-- 1 root root 0 May 18 08:12 suspend
-rw-r--r-- 1 root root 0 May 18 08:12 suspend_depth
-r--r--r-- 1 root root 0 May 18 08:12 version
dr-xr-xr-x 2 root root 0 May 18 08:12 warnings

These folders do exist after the upgrade.

leoheck commented 2 years ago

I am interested in this too. Every time the NVIDIA driver is upgraded, my users complain that Docker does not work. Their workaround was to disable NVIDIA updates, which is the worst possible solution. Did you figure out how to make Docker work after an NVIDIA driver upgrade without having to reboot the entire system?

elezar commented 2 years ago

> I am interested in this too. Every time the NVIDIA driver is upgraded, my users complain that Docker does not work. Their workaround was to disable NVIDIA updates, which is the worst possible solution. Did you figure out how to make Docker work after an NVIDIA driver upgrade without having to reboot the entire system?

When you say that "docker does not work", does this mean that new containers cannot be started, or that existing containers stop working? When upgrading the driver, the libraries that are mounted into running containers are removed and replaced by updated versions. With this in mind, keeping docker containers that use the driver running through an upgrade is not a supported use case, and neither is stopping and then restarting them, since they may still reference the old libraries.

Restarting the docker containers (once terminated) should pick up the new libraries and binaries and mount these into the container instead.
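As a minimal illustration, using the nvidia-test container name from the reproduction steps above:

# stop the GPU container before the driver upgrade
docker stop nvidia-test

# ... upgrade the driver ...

# starting the container again should re-run the NVIDIA hook and mount the
# updated driver libraries into the container
docker start nvidia-test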

remoteweb commented 2 years ago

Just thought this might help others as well.

The behaviour we reported in the main issue (the need for a docker daemon restart after an NVIDIA driver upgrade) does not occur when upgrading from 470 to 515, on systems identical to the ones in our initial report.

For us the following worked as expected.

1. Stop all containers using nvidia drivers

2. Unload Nvidia Kernel modules

# Unload Nvidia kernel modules
modprobe -r nvidia_drm
modprobe -r nvidia_uvm
modprobe -r nvidia_modeset
modprobe -r nvidia

3. Install new Nvidia drivers

4. Start the previously stopped containers (this time the start-up does not throw errors)

5. Confirm the new driver by running nvidia-smi (via docker exec) within the containers (see the sketch below).
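Putting those steps together, a minimal sketch of the procedure (the container name nvidia-test and the installer file name are only placeholders):

# 1. stop all containers using the NVIDIA driver
docker stop nvidia-test

# 2. unload the NVIDIA kernel modules
modprobe -r nvidia_drm
modprobe -r nvidia_uvm
modprobe -r nvidia_modeset
modprobe -r nvidia

# 3. install the new driver (installer name is a placeholder)
./NVIDIA-Linux-x86_64-515.xx.run

# 4. start the previously stopped containers
docker start nvidia-test

# 5. confirm the new driver from inside the container
docker exec nvidia-test nvidia-smi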

leoheck commented 2 years ago

> When you say that "docker does not work", does this mean that new containers cannot be started, or that existing containers stop working? When upgrading the driver, the libraries that are mounted into running containers are removed and replaced by updated versions. With this in mind, keeping docker containers that use the driver running through an upgrade is not a supported use case, and neither is stopping and then restarting them, since they may still reference the old libraries.
>
> Restarting the docker containers (once terminated) should pick up the new libraries and binaries and mount these into the container instead.

@elezar unfortunately I don't know for sure, since this is being handled by my co-workers without them telling me what they are doing, but I believe the existing containers no longer work after an upgrade. They used to fix this by rebooting the server, which is something I would like to avoid. But you are saying that just restarting the container may fix the issue, so I will have to monitor whether it happens again and check.

@remoteweb thanks, but this procedure does not work for me. I update the NVIDIA drivers with regular system updates, and I am not going to check first whether there are drivers to be updated before going through this whole process... I just want to keep the system updated, and that should not break running things. Maybe docker could have a rule somewhere that does something like this automatically when the drivers are upgraded.
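Something along those lines could in principle be wired into a package manager's post-upgrade hook (the exact hook mechanism depends on the distribution); a very rough, hypothetical sketch, assuming GPU containers are labelled gpu=true:

#!/bin/sh
# hypothetical post-driver-upgrade hook: restart labelled GPU containers so
# they pick up the freshly installed driver libraries on their next start
GPU_CONTAINERS=$(docker ps -q --filter "label=gpu=true")
[ -n "$GPU_CONTAINERS" ] && docker restart $GPU_CONTAINERS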

remoteweb commented 2 years ago

@leoheck if uptime is required, you need to redesign how your architecture works. For example, you could deploy a new container on the new driver release and kill the old one once the new one is up. Containers should normally be stateless.
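As a rough illustration of that swap (all names here are placeholders, not anything from this thread):

# start a replacement container against the upgraded driver
docker run -d --name my-gpu-app-new --runtime=nvidia -e NVIDIA_VISIBLE_DEVICES=all my-gpu-image

# once the new container is confirmed healthy, retire the old one
docker stop my-gpu-app-old
docker rm my-gpu-app-old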

leoheck commented 2 years ago

This makes sense; it is what I came here to understand. Unfortunately, I have not seen the issue myself yet!

MtDalPizzol commented 1 year ago

Hi, folks... I'm facing a related issue here. I'm using Docker Desktop on Windows 11 to run some containers integrated with the VS Code Dev Containers feature.

After manually installing NVIDIA's drivers, I'm unable to start any of my existing containers and I'm receiving the following error:

Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?

In the system tray, it says that Docker Desktop is running, but nothing is working anymore.

I'm clueless about where I should even start debugging this.