NVIDIA / gpu-operator

NVIDIA GPU Operator creates, configures, and manages GPUs in Kubernetes
https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/index.html
Apache License 2.0

failure to install on Rocky Linux 8.4 (CentOS 8 clone) #264

Open FlorinAndrei opened 3 years ago

FlorinAndrei commented 3 years ago

CentOS 8 is going end-of-life very soon. We are migrating to Rocky Linux 8 instead for the k8s hosts. I am trying to install the GPU operator on Rocky Linux 8.4 but one of the components fails.

Kubernetes 1.20.5

Docker 20.10.3

1. Issue or feature description

helm repo update
helm install gpu-operator nvidia/gpu-operator -n robinio
# wait a few minutes...
kubectl get pods -A
kubectl describe pod nvidia-driver-daemonset-tx5w5 -n gpu-operator-resources

Output:

gpu-operator-resources   nvidia-driver-daemonset-tx5w5                                 0/1     ImagePullBackOff   0          2m55s

Events:
  Type     Reason     Age                  From               Message
  ----     ------     ----                 ----               -------
  Normal   Scheduled  3m12s                default-scheduler  Successfully assigned gpu-operator-resources/nvidia-driver-daemonset-tx5w5 to florin-rocky-linux
  Normal   Pulled     3m11s                kubelet            Container image "nvcr.io/nvidia/cloud-native/k8s-driver-manager:v0.1.0" already present on machine
  Normal   Created    3m10s                kubelet            Created container k8s-driver-manager
  Normal   Started    3m10s                kubelet            Started container k8s-driver-manager
  Normal   BackOff    109s (x5 over 3m7s)  kubelet            Back-off pulling image "nvcr.io/nvidia/driver:470.57.02-rocky8.4"
  Warning  Failed     109s (x5 over 3m7s)  kubelet            Error: ImagePullBackOff
  Normal   Pulling    97s (x4 over 3m9s)   kubelet            Pulling image "nvcr.io/nvidia/driver:470.57.02-rocky8.4"
  Warning  Failed     95s (x4 over 3m8s)   kubelet            Failed to pull image "nvcr.io/nvidia/driver:470.57.02-rocky8.4": rpc error: code = Unknown desc = Exception calling application: ErrorUnknown:StatusCode.UNKNOWN:Error response from daemon: manifest for nvcr.io/nvidia/driver:470.57.02-rocky8.4 not found: manifest unknown: manifest unknown
  Warning  Failed     95s (x4 over 3m8s)   kubelet            Error: ErrImagePull

3. Information to attach (optional if deemed irrelevant)

Attached file: info.txt

cdesiniotis commented 3 years ago

We do not support Rocky Linux. Please refer to our platform support page for all the operating systems we currently support: https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/platform-support.html#linux-distributions

FlorinAndrei commented 3 years ago

@cdesiniotis A few observations:

You still support CentOS 8. This is a dying OS: it is scheduled for end-of-life in a couple of months. Most of its users are already migrating to Rocky Linux, as we are.

The operator insists on constructing its own image tags, so this cannot be fixed with an override. For example, I tried this:

helm install gpu-operator nvidia/gpu-operator --set driver.version=470.57.02-centos8 -n robinio

...but I got this:

Back-off pulling image "nvcr.io/nvidia/driver:470.57.02-centos8-rocky8.4"

shivamerla commented 3 years ago

@FlorinAndrei for this use case you would need to use a private image (which can just be a re-tag of the existing centos8 image). ATM there is no plan to continue support for CentOS 8 or Rocky Linux.
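
Something along these lines should work (a sketch; registry.example.com is just a placeholder, and the exact chart keys may differ between operator versions):

# re-tag the existing centos8 driver image and push it to a private registry
docker pull nvcr.io/nvidia/driver:470.57.02-centos8
docker tag nvcr.io/nvidia/driver:470.57.02-centos8 registry.example.com/nvidia/driver:470.57.02-rocky8.4
docker push registry.example.com/nvidia/driver:470.57.02-rocky8.4

# point the chart at the private repository; the operator itself appends
# the node's OS suffix ("-rocky8.4") when it resolves the tag
helm install gpu-operator nvidia/gpu-operator -n robinio \
    --set driver.repository=registry.example.com/nvidia \
    --set driver.version=470.57.02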

FlorinAndrei commented 3 years ago

I retagged your 470.57.02-centos8 driver and pushed it to my repo. I patched your chart to use my repo, then installed it on my Rocky Linux 8.4 cluster. The image was loaded just fine after that. But what I get in the end is this:

NAMESPACE                NAME                                                          READY   STATUS                  RESTARTS   AGE
gpu-operator-resources   gpu-feature-discovery-mqvt5                                   0/1     Init:0/1                0          11m
gpu-operator-resources   nvidia-container-toolkit-daemonset-bl4kg                      1/1     Running                 0          11m
gpu-operator-resources   nvidia-dcgm-exporter-dmvp9                                    0/1     Init:0/1                0          11m
gpu-operator-resources   nvidia-dcgm-wv76q                                             0/1     Init:0/1                0          11m
gpu-operator-resources   nvidia-device-plugin-daemonset-zkm68                          0/1     Init:0/1                0          11m
gpu-operator-resources   nvidia-driver-daemonset-j9xqr                                 1/1     Running                 0          11m
gpu-operator-resources   nvidia-operator-validator-jvb5q                               0/1     Init:CrashLoopBackOff   6          11m
kube-system              calico-kube-controllers-857dfc7bbb-sfq9r                      1/1     Running                 0          42h
kube-system              calico-node-wttld                                             1/1     Running                 0          42h
kube-system              coredns-78545b7666-6465f                                      1/1     Running                 0          42h
kube-system              coredns-78545b7666-psjvz                                      0/1     Pending                 0          42h
kube-system              etcd-florin-rocky-linux                                       1/1     Running                 0          42h
kube-system              kube-apiserver-florin-rocky-linux                             1/1     Running                 0          42h
kube-system              kube-controller-manager-florin-rocky-linux                    1/1     Running                 1          42h
kube-system              kube-multus-ds-amd64-ctcv4                                    1/1     Running                 0          42h
kube-system              kube-proxy-7lmkp                                              1/1     Running                 0          42h
kube-system              kube-scheduler-florin-rocky-linux                             1/1     Running                 1          42h
kube-system              kube-sriov-device-plugin-amd64-j2lll                          1/1     Running                 0          42h
robinio                  csi-attacher-robin-6967696898-xb4jt                           3/3     Running                 0          42h
robinio                  csi-nodeplugin-robin-qrn24                                    3/3     Running                 0          42h
robinio                  csi-provisioner-robin-86b576699-k7gpx                         3/3     Running                 0          42h
robinio                  csi-resizer-robin-77f6745589-mqtqj                            3/3     Running                 0          42h
robinio                  csi-snapshotter-robin-0                                       3/3     Running                 0          42h
robinio                  gpu-operator-599764446c-lzgkj                                 1/1     Running                 0          11m
robinio                  gpu-operator-node-feature-discovery-master-58d884d5cc-7fh8t   1/1     Running                 0          11m
robinio                  gpu-operator-node-feature-discovery-worker-6r6sv              1/1     Running                 0          11m
robinio                  robin-master-hmb44                                            0/1     CrashLoopBackOff        7          42h
robinio                  snapshot-controller-0                                         1/1     Running                 0          42h

And the error from the crashing robin-master pod is:

/bin/bash: relocation error: /var/robinrun/nvidia/driver/lib64/libc.so.6: symbol _dl_fatal_printf, version GLIBC_PRIVATE not defined in file ld-linux-x86-64.so.2 with link time reference

It's a bit puzzling because Rocky Linux 8 and CentOS 8 should be binary compatible.
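
If anyone wants to check for glibc skew between the host and the driver container, something like this should show it (a sketch; the pod name is taken from the listing above):

# glibc build on the Rocky Linux host
rpm -q glibc
# glibc build inside the re-tagged centos8 driver container
kubectl exec -n gpu-operator-resources nvidia-driver-daemonset-j9xqr -- rpm -q glibc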

> ATM there is no plan to continue support for CentOS 8 or Rocky Linux.

This is a little concerning for those running k8s in VMs that they manage. CentOS 7 is very old and many users are looking for ways to migrate away from it. CentOS 8 will be EoL in 2 months. Rocky Linux 8 is not supported. Looks like it's the end of the road.

FlorinAndrei commented 3 years ago

Trying again, this time with the 450.80.02-centos8 driver instead; the nvidia-driver-daemonset crashes while compiling the kernel modules, apparently because the el8_4 kernel carries a newer drm_prime_pages_to_sg signature than the 450.80.02 driver expects:

========== NVIDIA Software Installer ==========

Starting installation of NVIDIA driver version 450.80.02 for Linux kernel version 4.18.0-305.19.1.el8_4.x86_64

Stopping NVIDIA persistence daemon...
Unloading NVIDIA driver kernel modules...
Unmounting NVIDIA driver rootfs...
Checking NVIDIA driver packages...
Updating the package cache...
Resolving Linux kernel version...
Proceeding with Linux kernel version 4.18.0-305.19.1.el8_4.x86_64
Installing elfutils...
Installing Linux kernel headers...
Installing Linux kernel module files...
Generating Linux kernel version string for 4.18.0-305.19.1.el8_4.x86_64...
Compiling NVIDIA driver kernel modules...
Unable to open the file '/lib/modules/4.18.0-305.19.1.el8_4.x86_64/proc/version' (No such file or directory).
/usr/src/nvidia-450.80.02/kernel/nvidia-drm/nvidia-drm-gem-user-memory.c: In function '__nv_drm_gem_user_memory_prime_get_sg_table':
/usr/src/nvidia-450.80.02/kernel/nvidia-drm/nvidia-drm-gem-user-memory.c:63:48: error: passing argument 1 of 'drm_prime_pages_to_sg' from incompatible pointer type [-Werror=incompatible-pointer-types]
     return drm_prime_pages_to_sg(nv_user_memory->pages,
                                  ~~~~~~~~~~~~~~^~~~~~~
In file included from /usr/src/nvidia-450.80.02/kernel/nvidia-drm/nvidia-drm-gem-user-memory.c:28:
./include/drm/drm_prime.h:91:18: note: expected 'struct drm_device *' but argument is of type 'struct page **'
 struct sg_table *drm_prime_pages_to_sg(struct drm_device *dev,
                  ^~~~~~~~~~~~~~~~~~~~~
/usr/src/nvidia-450.80.02/kernel/nvidia-drm/nvidia-drm-gem-user-memory.c:64:48: warning: passing argument 2 of 'drm_prime_pages_to_sg' makes pointer from integer without a cast [-Wint-conversion]
                                  nv_user_memory->pages_count);
                                  ~~~~~~~~~~~~~~^~~~~~~~~~~~~
In file included from /usr/src/nvidia-450.80.02/kernel/nvidia-drm/nvidia-drm-gem-user-memory.c:28:
./include/drm/drm_prime.h:91:18: note: expected 'struct page **' but argument is of type 'long unsigned int'
 struct sg_table *drm_prime_pages_to_sg(struct drm_device *dev,
                  ^~~~~~~~~~~~~~~~~~~~~
/usr/src/nvidia-450.80.02/kernel/nvidia-drm/nvidia-drm-gem-user-memory.c:63:12: error: too few arguments to function 'drm_prime_pages_to_sg'
     return drm_prime_pages_to_sg(nv_user_memory->pages,
            ^~~~~~~~~~~~~~~~~~~~~
In file included from /usr/src/nvidia-450.80.02/kernel/nvidia-drm/nvidia-drm-gem-user-memory.c:28:
./include/drm/drm_prime.h:91:18: note: declared here
 struct sg_table *drm_prime_pages_to_sg(struct drm_device *dev,
                  ^~~~~~~~~~~~~~~~~~~~~
/usr/src/nvidia-450.80.02/kernel/nvidia-drm/nvidia-drm-gem-user-memory.c:65:1: warning: control reaches end of non-void function [-Wreturn-type]
 }
 ^
cc1: some warnings being treated as errors
make[2]: *** [scripts/Makefile.build:315: /usr/src/nvidia-450.80.02/kernel/nvidia-drm/nvidia-drm-gem-user-memory.o] Error 1
make[2]: *** Waiting for unfinished jobs....
make[1]: *** [Makefile:1563: _module_/usr/src/nvidia-450.80.02/kernel] Error 2
make: *** [Makefile:81: modules] Error 2
Stopping NVIDIA persistence daemon...
Unloading NVIDIA driver kernel modules...
Unmounting NVIDIA driver rootfs...

FlorinAndrei commented 3 years ago

I've installed the driver on the Rocky Linux 8 host using the instructions for RHEL 8; there were some bugs there as well, but I've fixed them.

Then I did this:

helm install gpu-operator nvidia/gpu-operator --set driver.enabled=false -n robinio

I had to create /sbin/ldconfig.real as a symlink to /sbin/ldconfig or else nvidia-operator-validator was crashing.
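
In shell terms, the whole workaround amounts to this (a sketch of the steps above; the driver itself was already installed on the host):

# the validator expects ldconfig.real, which RHEL-family hosts don't ship
ln -s /sbin/ldconfig /sbin/ldconfig.real

# deploy the operator with the containerized driver disabled
helm install gpu-operator nvidia/gpu-operator --set driver.enabled=false -n robinio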

But now nvidia-dcgm-exporter is crashing, and the error is:

time="2021-09-30T20:03:44Z" level=info msg="Starting dcgm-exporter"
time="2021-09-30T20:03:44Z" level=info msg="Attemping to connect to remote hostengine at 10.128.0.33:5555"
time="2021-09-30T20:03:49Z" level=fatal msg="Error connecting to nv-hostengine: Host engine connection invalid/disconnected"

All pods status:

NAMESPACE                NAME                                                          READY   STATUS             RESTARTS   AGE
gpu-operator-resources   gpu-feature-discovery-tnkt2                                   1/1     Running            0          5m38s
gpu-operator-resources   nvidia-container-toolkit-daemonset-jrmq4                      1/1     Running            0          5m38s
gpu-operator-resources   nvidia-cuda-validator-gwlst                                   0/1     Completed          0          5m32s
gpu-operator-resources   nvidia-dcgm-exporter-jxpss                                    0/1     CrashLoopBackOff   5          5m38s
gpu-operator-resources   nvidia-dcgm-kfqch                                             1/1     Running            0          5m38s
gpu-operator-resources   nvidia-device-plugin-daemonset-2h4tv                          1/1     Running            0          5m38s
gpu-operator-resources   nvidia-device-plugin-validator-xwmp2                          0/1     Completed          0          5m26s
gpu-operator-resources   nvidia-operator-validator-995p6                               1/1     Running            0          5m38s
kube-system              calico-kube-controllers-857dfc7bbb-sfq9r                      1/1     Running            0          44h
kube-system              calico-node-wttld                                             1/1     Running            0          44h
kube-system              coredns-78545b7666-6465f                                      1/1     Running            0          44h
kube-system              coredns-78545b7666-psjvz                                      0/1     Pending            0          44h
kube-system              etcd-florin-rocky-linux                                       1/1     Running            0          44h
kube-system              kube-apiserver-florin-rocky-linux                             1/1     Running            0          43h
kube-system              kube-controller-manager-florin-rocky-linux                    1/1     Running            1          44h
kube-system              kube-multus-ds-amd64-ctcv4                                    1/1     Running            0          44h
kube-system              kube-proxy-7lmkp                                              1/1     Running            0          44h
kube-system              kube-scheduler-florin-rocky-linux                             1/1     Running            1          43h
kube-system              kube-sriov-device-plugin-amd64-j2lll                          1/1     Running            0          44h
robinio                  csi-attacher-robin-6967696898-xb4jt                           3/3     Running            0          43h
robinio                  csi-nodeplugin-robin-qrn24                                    3/3     Running            0          43h
robinio                  csi-provisioner-robin-86b576699-k7gpx                         3/3     Running            0          43h
robinio                  csi-resizer-robin-77f6745589-mqtqj                            3/3     Running            0          43h
robinio                  csi-snapshotter-robin-0                                       3/3     Running            0          43h
robinio                  gpu-operator-599764446c-ftc6d                                 1/1     Running            0          5m59s
robinio                  gpu-operator-node-feature-discovery-master-58d884d5cc-m9v5c   1/1     Running            0          5m59s
robinio                  gpu-operator-node-feature-discovery-worker-rffgk              1/1     Running            0          5m59s
robinio                  robin-master-hmb44                                            1/1     Running            11         44h
robinio                  snapshot-controller-0                                         1/1     Running            0          43h
cdesiniotis commented 3 years ago

Could you provide logs from the dcgm pod? Also, can you try deploying the operator again with dcgm disabled: --set dcgm.enabled=false
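
i.e. something like this, adapting the earlier install command (a sketch):

helm install gpu-operator nvidia/gpu-operator -n robinio \
    --set driver.enabled=false \
    --set dcgm.enabled=false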

FlorinAndrei commented 3 years ago

The system got rebooted a few hours ago, not by me. Now all pods are either Running or Completed. These are the dcgm-exporter logs:

# kubectl logs nvidia-dcgm-exporter-jxpss -n gpu-operator-resources
time="2021-10-04T04:56:14Z" level=info msg="Starting dcgm-exporter"
time="2021-10-04T04:56:14Z" level=info msg="Attemping to connect to remote hostengine at 10.128.0.33:5555"
time="2021-10-04T04:56:14Z" level=info msg="DCGM successfully initialized!"
time="2021-10-04T04:56:14Z" level=info msg="Not collecting DCP metrics: Error getting supported metrics: Profiling is not supported for this group of GPUs or GPU"
time="2021-10-04T04:56:14Z" level=warning msg="Skipping line 55 ('DCGM_FI_PROF_GR_ENGINE_ACTIVE'): DCP metrics not enabled"
time="2021-10-04T04:56:14Z" level=warning msg="Skipping line 58 ('DCGM_FI_PROF_PIPE_TENSOR_ACTIVE'): DCP metrics not enabled"
time="2021-10-04T04:56:14Z" level=warning msg="Skipping line 59 ('DCGM_FI_PROF_DRAM_ACTIVE'): DCP metrics not enabled"
time="2021-10-04T04:56:14Z" level=warning msg="Skipping line 63 ('DCGM_FI_PROF_PCIE_TX_BYTES'): DCP metrics not enabled"
time="2021-10-04T04:56:14Z" level=warning msg="Skipping line 64 ('DCGM_FI_PROF_PCIE_RX_BYTES'): DCP metrics not enabled"
time="2021-10-04T04:56:15Z" level=info msg="Kubernetes metrics collection enabled!"
time="2021-10-04T04:56:15Z" level=info msg="Pipeline starting"
time="2021-10-04T04:56:15Z" level=info msg="Starting webserver"

What is the hostengine on port 5555? Which component starts it?

cdesiniotis commented 3 years ago

The dcgm pod nvidia-dcgm-kfqch starts it, and the dcgm-exporter pod gets GPU metrics from it and exports them to Prometheus.
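
To check whether the hostengine is actually up and reachable, something like this should work (a sketch; the pod name and address are from your output above):

# any startup errors from the hostengine would show up here
kubectl logs nvidia-dcgm-kfqch -n gpu-operator-resources

# probe the hostengine port from the node (nc is just one way to do this)
nc -vz 10.128.0.33 5555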

FlorinAndrei commented 3 years ago

I've uninstalled and reinstalled the operator; now the DCGM exporter is again in CrashLoopBackOff, and the dcgm pod shows no logs at all:

# kubectl logs nvidia-dcgm-exporter-ld4zj -n gpu-operator-resources
time="2021-10-04T17:22:26Z" level=info msg="Starting dcgm-exporter"
time="2021-10-04T17:22:26Z" level=info msg="Attemping to connect to remote hostengine at 10.128.0.33:5555"
time="2021-10-04T17:22:31Z" level=fatal msg="Error connecting to nv-hostengine: Host engine connection invalid/disconnected"
# kubectl logs nvidia-dcgm-dk9j5 -n gpu-operator-resources
#

danlenar commented 2 years ago

For anyone using Rocky, this Dockerfile should do the trick. It converts the current CentOS image into Rocky:

FROM nvcr.io/nvidia/driver:470.82.01-centos8

# Edit yum configs to use vault repos
RUN sed -ri 's|^mirrorlist|#mirrorlist|' /etc/yum.repos.d/*.repo \
    && sed -ri 's|^#baseurl|baseurl|' /etc/yum.repos.d/*.repo \
    && sed -ri 's|^metalink|#metalink|' /etc/yum.repos.d/*.repo \
    && sed -ri 's|mirror.centos.org|vault.centos.org|g' /etc/yum.repos.d/*.repo

# Convert from CentOS to RockyLinux
RUN yum -y install \
        ncurses \
    && curl https://raw.githubusercontent.com/rocky-linux/rocky-tools/main/migrate2rocky/migrate2rocky.sh -o migrate2rocky.sh \
    && chmod +x migrate2rocky.sh \
    && sed -ri '/^efi_check$/d' migrate2rocky.sh \
    && ./migrate2rocky.sh -r \
    && rm migrate2rocky.sh \
    && yum -y remove \
        ncurses \
    && yum clean all

# Add older kernels
RUN set -x; \
    ROCKYLINUX_VERSIONS=($(curl -sL https://dl.rockylinux.org/vault/rocky | grep -Eo "8\.[0-9]+" | sort -u)); \
    for v in ${ROCKYLINUX_VERSIONS[@]}; do \
        yum-config-manager --add-repo=http://dl.rockylinux.org/vault/rocky/${v}/BaseOS/x86_64/os; \
    done

# Upgrade gcc
RUN yum -y upgrade \
        gcc \
    && yum clean all

erictarrence commented 2 years ago

> For anyone using Rocky, this Dockerfile should do the trick. It converts the current CentOS image into Rocky:

Your script was not successful; the following error is reported (my Rocky Linux version is 8.6):

podman build -t nvcr.io/nvidia/driver:470.82.01-rocky8 .

tput: No value for $TERM and no -T specified
tput: No value for $TERM and no -T specified
tput: No value for $TERM and no -T specified
migrate2rocky - Begin logging at Wed Jun  1 09:17:24 2022.

df: /boot: No such file or directory

Removing dnf cache
Preparing to migrate CentOS Linux 8 to Rocky Linux 8.

warning: /var/cache/dnf/cuda-f1d7a46f058da57c/packages/cuda-compat-11-4-470.129.06-1.x86_64.rpm: Header V4 RSA/SHA512 Signature, key ID d42d0685: NOKEY
cuda                                            1.5 MB/s | 1.6 kB     00:00    
The GPG keys listed for the "cuda" repository are already installed but they are not correct for this package.
Check that the correct key URLs are configured for this repository.. Failing package is: cuda-compat-11-4-1:470.129.06-1.x86_64
 GPG Keys are configured as: file:///etc/pki/rpm-gpg/RPM-GPG-KEY-NVIDIA
Public key for cuda-cudart-11-4-11.4.148-1.x86_64.rpm is not installed. Failing package is: cuda-cudart-11-4-11.4.148-1.x86_64
 GPG Keys are configured as: file:///etc/pki/rpm-gpg/RPM-GPG-KEY-NVIDIA
Public key for cuda-toolkit-11-4-config-common-11.4.148-1.noarch.rpm is not installed. Failing package is: cuda-toolkit-11-4-config-common-11.4.148-1.noarch
 GPG Keys are configured as: file:///etc/pki/rpm-gpg/RPM-GPG-KEY-NVIDIA
Public key for cuda-toolkit-11-config-common-11.7.60-1.noarch.rpm is not installed. Failing package is: cuda-toolkit-11-config-common-11.7.60-1.noarch
 GPG Keys are configured as: file:///etc/pki/rpm-gpg/RPM-GPG-KEY-NVIDIA
Public key for cuda-toolkit-config-common-11.7.60-1.noarch.rpm is not installed. Failing package is: cuda-toolkit-config-common-11.7.60-1.noarch
 GPG Keys are configured as: file:///etc/pki/rpm-gpg/RPM-GPG-KEY-NVIDIA
GPG key at file:///etc/pki/rpm-gpg/RPM-GPG-KEY-NVIDIA (0x7FA2AF80) is already installed
The downloaded packages were saved in cache until the next successful transaction.
You can remove cached packages by executing 'dnf clean packages'.
Error: GPG check FAILED

Error running pre-update.  Stopping now to avoid putting the system in an
unstable state.  Please correct the issues shown here and try again.

An error occurred while we were attempting to convert your system to Rocky Linux. Your system may be unstable. Script will now exit to prevent possible damage.

 A log of this installation can be found at /var/log/migrate2rocky.log
Error: error building at STEP "RUN yum -y install         ncurses     && curl https://raw.githubusercontent.com/rocky-linux/rocky-tools/main/migrate2rocky/migrate2rocky.sh -o migrate2rocky.sh     && chmod +x migrate2rocky.sh     && sed -ri '/^efi_check$/d' migrate2rocky.sh     && ./migrate2rocky.sh -r     && rm migrate2rocky.sh     && yum -y remove         ncurses     && yum clean all": error while running runtime: exit status 1
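
One possible workaround (an untested sketch; the repo id "cuda" comes from the error above) is to take the cuda repo out of the picture before running migrate2rocky, e.g. by adding this ahead of the migration step in the Dockerfile:

# disable the cuda repo so migrate2rocky's pre-update does not trip over its GPG check
RUN yum-config-manager --disable cuda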