NVIDIA / nvidia-container-runtime

NVIDIA container runtime
Apache License 2.0

Support for RHEL8.4 (ppc64le) #146

Closed by mgiessing 2 years ago

mgiessing commented 3 years ago

Hi, is there an estimated date for when RHEL8.4 support will be available?

Thanks!

elezar commented 3 years ago

Hi @mgiessing. Do you mean from a packaging perspective? Have you tried to use the RHEL8.3 (or centos8) packages?

mgiessing commented 3 years ago

Yes, I mean from a packaging as well as a functional perspective. If I replace $distribution (which is rhel8.4) with rhel8.3 here:

distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-container-runtime/$distribution/nvidia-container-runtime.repo | sudo tee /etc/yum.repos.d/nvidia-container-runtime.repo
sudo yum install -y nvidia-container-runtime-hook

I can install the runtime hook, but I encounter this error when starting a container:

[root@p630-met1 ~]# docker run --rm docker.io/nvidia/cuda-ppc64le:11.3.1-runtime-ubi8 nvidia-smi
Emulate Docker CLI using podman. Create /etc/containers/nodocker to quiet msg.
Error: OCI runtime error: container_linux.go:367: starting container process caused: process_linux.go:495: container init caused: Running hook #0:: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: ldcache error: process /sbin/ldconfig terminated with signal 4

Using a RHEL8.3 (bare-metal) distribution works fine.
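
For readers following along, the manual substitution described above (using the rhel8.3 repository on a rhel8.4 host) can be sketched as below. This is only an illustration under stated assumptions: the helper names `repo_distribution` and `repo_url` are hypothetical and not part of any NVIDIA tooling, and the snippet only prints the repository URL rather than fetching it.

```shell
# Hypothetical helper: map a detected distribution string to the one used
# in the repository URL. The rhel8.4 -> rhel8.3 mapping mirrors the manual
# substitution described in this comment; it is not an official mapping.
repo_distribution() {
    case "$1" in
        rhel8.4) echo "rhel8.3" ;;
        *)       echo "$1" ;;
    esac
}

# Hypothetical helper: build the .repo URL for a given distribution string.
repo_url() {
    echo "https://nvidia.github.io/nvidia-container-runtime/$(repo_distribution "$1")/nvidia-container-runtime.repo"
}

# Detect the local distribution the same way as the commands above.
# The subshell keeps the sourced variables out of the current shell.
if [ -r /etc/os-release ]; then
    detected=$(. /etc/os-release; echo "$ID$VERSION_ID")
    echo "detected: $detected"
    echo "repo URL: $(repo_url "$detected")"
fi
```

The printed URL could then be passed to the curl command from the original instructions in place of the one built from $distribution.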

Here is some further information about CUDA, the driver, and the system:

[root@p630-met1 ~]# uname -a
Linux p630-met1 4.18.0-305.10.2.el8_4.ppc64le #1 SMP Mon Jul 12 04:35:57 EDT 2021 ppc64le ppc64le ppc64le GNU/Linux

[root@p630-met1 ~]# cat /etc/os-release
NAME="Red Hat Enterprise Linux"
VERSION="8.4 (Ootpa)"
ID="rhel"
ID_LIKE="fedora"
VERSION_ID="8.4"
PLATFORM_ID="platform:el8"
PRETTY_NAME="Red Hat Enterprise Linux 8.4 (Ootpa)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:redhat:enterprise_linux:8.4:GA"
HOME_URL="https://www.redhat.com/"
DOCUMENTATION_URL="https://access.redhat.com/documentation/red_hat_enterprise_linux/8/"
BUG_REPORT_URL="https://bugzilla.redhat.com/"

REDHAT_BUGZILLA_PRODUCT="Red Hat Enterprise Linux 8"
REDHAT_BUGZILLA_PRODUCT_VERSION=8.4
REDHAT_SUPPORT_PRODUCT="Red Hat Enterprise Linux"
REDHAT_SUPPORT_PRODUCT_VERSION="8.4"

[root@p630-met1 ~]# nvidia-smi
Tue Jul 27 12:22:47 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 465.19.01    Driver Version: 465.19.01    CUDA Version: 11.3     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA Tesla V1...  On   | 00000004:04:00.0 Off |                    0 |
| N/A   28C    P0    40W / 300W |      0MiB / 32510MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA Tesla V1...  On   | 00000004:05:00.0 Off |                    0 |
| N/A   32C    P0    40W / 300W |      0MiB / 32510MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA Tesla V1...  On   | 00000035:03:00.0 Off |                    0 |
| N/A   28C    P0    37W / 300W |      0MiB / 32510MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA Tesla V1...  On   | 00000035:04:00.0 Off |                    0 |
| N/A   31C    P0    41W / 300W |      0MiB / 32510MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Thanks!

dllehr81 commented 3 years ago

I've hit the same issue. It's related to the changes in RHEL8.4. The NVIDIA toolkit stack installs correctly; however, we receive a SIGILL when attempting to start a container. I have straces etc. if anyone wants to take a look. I'll try to recompile the toolkit on 8.4 itself in case there's a library or linking issue.
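
The SIGILL mentioned here corresponds to the "terminated with signal 4" part of the original error message. As a quick check on a Linux host, the shell can translate the signal number into its name:

```shell
# Translate the signal number from the error message into its name.
# On Linux, signal 4 is SIGILL (illegal instruction); kill -l prints
# the name without the SIG prefix.
kill -l 4
```

This only confirms how the signal number maps to a name; it says nothing about why /sbin/ldconfig executed an illegal instruction.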

dllehr81 commented 3 years ago

@mgiessing It looks like https://github.com/NVIDIA/libnvidia-container/pull/143 will address this issue.

AlrinXNI commented 3 years ago

I've hit the same issue too, but I'm using CentOS8.4 with CUDA 11.4.

dllehr81 commented 3 years ago

Looks like they finally merged the fix a few hours ago. Not sure why the delay there, but hopefully it'll be in the upcoming release of libnvidia-container.

elezar commented 3 years ago

Hi @dllehr81 and @mgiessing

We have published libnvidia-container 1.5.1~rc.1 with this change to our experimental repositories. Let us know if this addresses the problems that you are seeing. We expect to promote this to stable in the near future.

dllehr81 commented 3 years ago

Thanks Evan! We appreciate it! I built a one-off libnvidia with the proposed solution and didn't have a problem. I'll try your rc.1 and see how it looks!

klueska commented 2 years ago

The full libnvidia-container 1.5.1 release is now out as well.
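
To tell whether a given libnvidia-container version is at or above the 1.5.1 release named in this thread, a version comparison along the following lines may help. This is a minimal sketch: `version_at_least` is a hypothetical helper, it assumes GNU sort with the -V (version sort) option, and the candidate versions in the loop are illustrative values rather than output from a real system.

```shell
# Hypothetical helper: succeed if version $1 is at least version $2.
# Relies on GNU sort's -V (version sort) option: after sorting, the
# smaller version comes first, so $2 must be the first line.
version_at_least() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# 1.5.1 is the release named in this thread as containing the change.
required="1.5.1"

# Illustrative candidate versions, not read from an installed package.
for candidate in 1.5.0 1.5.1 1.6.0; do
    if version_at_least "$candidate" "$required"; then
        echo "$candidate: at or above $required"
    else
        echo "$candidate: older than $required"
    fi
done
```

On a real host, the candidate value would come from the package manager's report of the installed libnvidia-container version instead of the hard-coded list.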

elezar commented 2 years ago

@mgiessing / @dllehr81 have you been able to test the new releases? We have also added symlinks to centos8 for rhel8.4 so that this can be accessed without manually specifying the distribution as centos8 or rhel8.3.

Please close this issue if the error has been resolved.

mgiessing commented 2 years ago

Sorry, I missed this one. As mentioned by Doug, the issue has been resolved with NVIDIA/libnvidia-container#143, and the symlinks for RHEL8.4 work now. Thanks!