NVIDIA / libnvidia-container

NVIDIA container runtime library
Apache License 2.0

nvidia-container-runtime did not terminate successfully (after upgrade) #186

Open PriamX opened 2 years ago

PriamX commented 2 years ago

I'm not sure if this is the right place to report this. I also reported it over at https://github.com/NVIDIA/nvidia-container-toolkit as issue #34.

It took a little while to run this down, but after an upgrade four days ago my containers using the nvidia runtime get:

docker: Error response from daemon: failed to create shim task: OCI runtime create failed: nvidia-container-runtime did not terminate successfully: exit status 1: unknown.

None of them would run (3 out of my 30 containers use the nvidia runtime). I'm using the centos8 libnvidia-container repo on Fedora 35.

I found these in my dnf history:

    Upgrade  libnvidia-container-devel-1.11.0-1.x86_64            @libnvidia-container
    Upgraded libnvidia-container-devel-1.10.0-1.x86_64            @@System
    Upgrade  libnvidia-container-static-1.11.0-1.x86_64           @libnvidia-container
    Upgraded libnvidia-container-static-1.10.0-1.x86_64           @@System
    Upgrade  libnvidia-container-tools-1.11.0-1.x86_64            @libnvidia-container
    Upgraded libnvidia-container-tools-1.10.0-1.x86_64            @@System
    Upgrade  libnvidia-container1-1.11.0-1.x86_64                 @libnvidia-container
    Upgraded libnvidia-container1-1.10.0-1.x86_64                 @@System
    Upgrade  libnvidia-container1-debuginfo-1.11.0-1.x86_64       @libnvidia-container
    Upgraded libnvidia-container1-debuginfo-1.10.0-1.x86_64       @@System
    Upgrade  nvidia-container-toolkit-1.11.0-1.x86_64             @libnvidia-container
    Upgraded nvidia-container-toolkit-1.10.0-1.x86_64             @@System

Rolling back from 1.11.0 to 1.10.0 and reinstalling nvidia-docker2 fixed this issue.
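For anyone who wants to reproduce the rollback, here is a rough sketch of how the downgrade targets can be derived from the dnf history lines above. It is illustrative only: the helper names (`previous_packages`, `downgrade_command`) are made up for this example, the history excerpt is shortened to two of the six package pairs, and the script only prints a `dnf downgrade` command rather than running it. It assumes dnf accepts `name-version-release` package specs and that the 1.10.0 packages are still available in the repo. The exact commands used for the original rollback are not recorded here.

```python
import re

# Excerpt of the dnf history output quoted above (two of the six package pairs).
HISTORY = """\
    Upgrade  libnvidia-container-tools-1.11.0-1.x86_64            @libnvidia-container
    Upgraded libnvidia-container-tools-1.10.0-1.x86_64            @@System
    Upgrade  nvidia-container-toolkit-1.11.0-1.x86_64             @libnvidia-container
    Upgraded nvidia-container-toolkit-1.10.0-1.x86_64             @@System
"""

# "Upgraded" lines name the version that was installed before the transaction.
# Group 1 is name-version-release; the trailing ".arch" is captured separately.
UPGRADED_LINE = re.compile(r"^\s*Upgraded\s+(\S+)\.(\w+)\s+@")


def previous_packages(history):
    """Return name-version-release strings for the packages that were replaced."""
    found = []
    for line in history.splitlines():
        match = UPGRADED_LINE.match(line)
        if match:
            found.append(match.group(1))  # keep NVR, drop the architecture suffix
    return found


def downgrade_command(history):
    """Build (but do not run) a dnf downgrade command for the previous versions."""
    return "dnf downgrade " + " ".join(previous_packages(history))


if __name__ == "__main__":
    print(downgrade_command(HISTORY))
```

Run the printed command as root only after checking the package list. `dnf history undo` with the ID of the upgrade transaction may be a simpler route if the old packages are still in the repo.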

Some other info that may be helpful:

[root@mediaserv ~]# nvidia-docker version
NVIDIA Docker: 2.11.0
Client: Docker Engine - Community
 Version:           20.10.18
 API version:       1.41
 Go version:        go1.18.6
 Git commit:        b40c2f6
 Built:             Thu Sep  8 23:12:57 2022
 OS/Arch:           linux/amd64
 Context:           default
 Experimental:      true

Server: Docker Engine - Community
 Engine:
  Version:          20.10.18
  API version:      1.41 (minimum version 1.12)
  Go version:       go1.18.6
  Git commit:       e42327a
  Built:            Thu Sep  8 23:10:39 2022
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.6.8
  GitCommit:        9cd3357b7fd7218e4aec3eae239db1f68a5a6ec6
 runc:
  Version:          1.1.4
  GitCommit:        v1.1.4-0-g5fd4c4d
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0
[root@mediaserv ~]#

And...

[root@mediaserv ~]# nvidia-smi
Sun Sep 18 11:27:21 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.65.01    Driver Version: 515.65.01    CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  On   | 00000000:01:00.0  On |                  N/A |
| 29%   48C    P0    21W / 120W |     92MiB /  6144MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A    561805      C   ...diaserver/Plex Transcoder       88MiB |
+-----------------------------------------------------------------------------+
[root@mediaserv ~]#