FlorinAndrei opened this issue 3 years ago
We do not support Rocky Linux. Please refer to our platform support page for all the operating systems we currently support: https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/platform-support.html#linux-distributions
@cdesiniotis A few observations:
You still support CentOS 8. This is a dying OS - it is scheduled for end-of-life in a couple months. Most of its users are already migrating to Rocky Linux, as we do.
The operator code insists on constructing its own image tags, so this cannot be worked around from the command line. For example, I tried this:
helm install gpu-operator nvidia/gpu-operator --set driver.version=470.57.02-centos8 -n robinio
...but I got this:
Back-off pulling image "nvcr.io/nvidia/driver:470.57.02-centos8-rocky8.4"
@FlorinAndrei for this use case you would need to use a private image (which can just be a re-tag of the existing centos8 image). ATM there is no plan to continue support for CentOS 8, or to add support for Rocky Linux.
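A minimal sketch of the re-tag approach suggested above, assuming a private registry (the registry name here is a placeholder) and using the chart's `driver.repository` value to point the operator at it:

```shell
# Pull the existing CentOS 8 driver image.
docker pull nvcr.io/nvidia/driver:470.57.02-centos8

# Re-tag it with the OS suffix the operator generates for the node
# ("my-registry.example.com" is a placeholder for your private registry).
docker tag nvcr.io/nvidia/driver:470.57.02-centos8 \
    my-registry.example.com/nvidia/driver:470.57.02-rocky8.4
docker push my-registry.example.com/nvidia/driver:470.57.02-rocky8.4

# Point the chart at the private repository instead of nvcr.io/nvidia.
helm install gpu-operator nvidia/gpu-operator \
    --set driver.repository=my-registry.example.com/nvidia \
    --set driver.version=470.57.02 \
    -n robinio
```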
I retagged your 470.57.02-centos8 driver and pushed it to my repo. I patched your chart to use my repo, then installed it on my Rocky Linux 8.4 cluster. The image was loaded just fine after that. But what I get in the end is this:
NAMESPACE NAME READY STATUS RESTARTS AGE
gpu-operator-resources gpu-feature-discovery-mqvt5 0/1 Init:0/1 0 11m
gpu-operator-resources nvidia-container-toolkit-daemonset-bl4kg 1/1 Running 0 11m
gpu-operator-resources nvidia-dcgm-exporter-dmvp9 0/1 Init:0/1 0 11m
gpu-operator-resources nvidia-dcgm-wv76q 0/1 Init:0/1 0 11m
gpu-operator-resources nvidia-device-plugin-daemonset-zkm68 0/1 Init:0/1 0 11m
gpu-operator-resources nvidia-driver-daemonset-j9xqr 1/1 Running 0 11m
gpu-operator-resources nvidia-operator-validator-jvb5q 0/1 Init:CrashLoopBackOff 6 11m
kube-system calico-kube-controllers-857dfc7bbb-sfq9r 1/1 Running 0 42h
kube-system calico-node-wttld 1/1 Running 0 42h
kube-system coredns-78545b7666-6465f 1/1 Running 0 42h
kube-system coredns-78545b7666-psjvz 0/1 Pending 0 42h
kube-system etcd-florin-rocky-linux 1/1 Running 0 42h
kube-system kube-apiserver-florin-rocky-linux 1/1 Running 0 42h
kube-system kube-controller-manager-florin-rocky-linux 1/1 Running 1 42h
kube-system kube-multus-ds-amd64-ctcv4 1/1 Running 0 42h
kube-system kube-proxy-7lmkp 1/1 Running 0 42h
kube-system kube-scheduler-florin-rocky-linux 1/1 Running 1 42h
kube-system kube-sriov-device-plugin-amd64-j2lll 1/1 Running 0 42h
robinio csi-attacher-robin-6967696898-xb4jt 3/3 Running 0 42h
robinio csi-nodeplugin-robin-qrn24 3/3 Running 0 42h
robinio csi-provisioner-robin-86b576699-k7gpx 3/3 Running 0 42h
robinio csi-resizer-robin-77f6745589-mqtqj 3/3 Running 0 42h
robinio csi-snapshotter-robin-0 3/3 Running 0 42h
robinio gpu-operator-599764446c-lzgkj 1/1 Running 0 11m
robinio gpu-operator-node-feature-discovery-master-58d884d5cc-7fh8t 1/1 Running 0 11m
robinio gpu-operator-node-feature-discovery-worker-6r6sv 1/1 Running 0 11m
robinio robin-master-hmb44 0/1 CrashLoopBackOff 7 42h
robinio snapshot-controller-0 1/1 Running 0 42h
And the error from the crashing robin-master pod is:
/bin/bash: relocation error: /var/robinrun/nvidia/driver/lib64/libc.so.6: symbol _dl_fatal_printf, version GLIBC_PRIVATE not defined in file ld-linux-x86-64.so.2 with link time reference
It's a bit puzzling because Rocky Linux 8 and CentOS 8 should be binary compatible.
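The relocation error points at a mismatched ld-linux / libc pair, so one way to narrow this down is to compare the glibc builds on the host and inside the driver container. A quick sketch (the pod name will differ on your cluster):

```shell
# glibc build on the Rocky Linux host
rpm -q glibc

# glibc build inside the driver container
kubectl exec -n gpu-operator-resources nvidia-driver-daemonset-j9xqr -- rpm -q glibc
```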
ATM there is no plan to continue support for CentOS 8 or Rocky Linux.
This is a little concerning for those running k8s in VMs that they manage. CentOS 7 is very old and many users are looking for ways to migrate away from it. CentOS 8 will be EoL in 2 months. Rocky Linux 8 is not supported. Looks like it's end of the road.
Trying again, this time with the 450.80.02-centos8 driver instead, and the nvidia-driver-daemonset crashes with:
========== NVIDIA Software Installer ==========
Starting installation of NVIDIA driver version 450.80.02 for Linux kernel version 4.18.0-305.19.1.el8_4.x86_64
Stopping NVIDIA persistence daemon...
Unloading NVIDIA driver kernel modules...
Unmounting NVIDIA driver rootfs...
Checking NVIDIA driver packages...
Updating the package cache...
Resolving Linux kernel version...
Proceeding with Linux kernel version 4.18.0-305.19.1.el8_4.x86_64
Installing elfutils...
Installing Linux kernel headers...
Installing Linux kernel module files...
Generating Linux kernel version string for 4.18.0-305.19.1.el8_4.x86_64...
Compiling NVIDIA driver kernel modules...
Unable to open the file '/lib/modules/4.18.0-305.19.1.el8_4.x86_64/proc/version' (No such file or directory)./usr/src/nvidia-450.80.02/kernel/nvidia-drm/nvidia-drm-gem-user-memory.c: In function '__nv_drm_gem_user_memory_prime_get_sg_table':
/usr/src/nvidia-450.80.02/kernel/nvidia-drm/nvidia-drm-gem-user-memory.c:63:48: error: passing argument 1 of 'drm_prime_pages_to_sg' from incompatible pointer type [-Werror=incompatible-pointer-types]
return drm_prime_pages_to_sg(nv_user_memory->pages,
~~~~~~~~~~~~~~^~~~~~~
In file included from /usr/src/nvidia-450.80.02/kernel/nvidia-drm/nvidia-drm-gem-user-memory.c:28:
./include/drm/drm_prime.h:91:18: note: expected 'struct drm_device *' but argument is of type 'struct page **'
struct sg_table *drm_prime_pages_to_sg(struct drm_device *dev,
^~~~~~~~~~~~~~~~~~~~~
/usr/src/nvidia-450.80.02/kernel/nvidia-drm/nvidia-drm-gem-user-memory.c:64:48: warning: passing argument 2 of 'drm_prime_pages_to_sg' makes pointer from integer without a cast [-Wint-conversion]
nv_user_memory->pages_count);
~~~~~~~~~~~~~~^~~~~~~~~~~~~
In file included from /usr/src/nvidia-450.80.02/kernel/nvidia-drm/nvidia-drm-gem-user-memory.c:28:
./include/drm/drm_prime.h:91:18: note: expected 'struct page **' but argument is of type 'long unsigned int'
struct sg_table *drm_prime_pages_to_sg(struct drm_device *dev,
^~~~~~~~~~~~~~~~~~~~~
/usr/src/nvidia-450.80.02/kernel/nvidia-drm/nvidia-drm-gem-user-memory.c:63:12: error: too few arguments to function 'drm_prime_pages_to_sg'
return drm_prime_pages_to_sg(nv_user_memory->pages,
^~~~~~~~~~~~~~~~~~~~~
In file included from /usr/src/nvidia-450.80.02/kernel/nvidia-drm/nvidia-drm-gem-user-memory.c:28:
./include/drm/drm_prime.h:91:18: note: declared here
struct sg_table *drm_prime_pages_to_sg(struct drm_device *dev,
^~~~~~~~~~~~~~~~~~~~~
/usr/src/nvidia-450.80.02/kernel/nvidia-drm/nvidia-drm-gem-user-memory.c:65:1: warning: control reaches end of non-void function [-Wreturn-type]
}
^
cc1: some warnings being treated as errors
make[2]: *** [scripts/Makefile.build:315: /usr/src/nvidia-450.80.02/kernel/nvidia-drm/nvidia-drm-gem-user-memory.o] Error 1
make[2]: *** Waiting for unfinished jobs....
make[1]: *** [Makefile:1563: _module_/usr/src/nvidia-450.80.02/kernel] Error 2
make: *** [Makefile:81: modules] Error 2
Stopping NVIDIA persistence daemon...
Unloading NVIDIA driver kernel modules...
Unmounting NVIDIA driver rootfs...
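For context on the compile errors above: the RHEL 8.4 kernel carries a backported drm_prime_pages_to_sg() that takes a struct drm_device * as its first argument, which the older 450.80.02 driver predates. You can confirm which signature the running kernel's headers declare (path assumes the kernel-devel package is installed):

```shell
# Show the drm_prime_pages_to_sg() prototype shipped with the running kernel's headers
grep -A2 'drm_prime_pages_to_sg' \
    "/usr/src/kernels/$(uname -r)/include/drm/drm_prime.h"
```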
I've installed the driver on the Rocky Linux 8 host, using the instructions for RHEL 8. There were some bugs there as well, but I fixed them.
Then I did this:
helm install gpu-operator nvidia/gpu-operator --set driver.enabled=false -n robinio
I had to create /sbin/ldconfig.real as a symlink to /sbin/ldconfig or else nvidia-operator-validator was crashing.
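For reference, that workaround is a one-liner, run as root on the host:

```shell
# nvidia-operator-validator expects /sbin/ldconfig.real (an Ubuntu convention);
# on Rocky/RHEL only /sbin/ldconfig exists, so create a symlink.
ln -s /sbin/ldconfig /sbin/ldconfig.real
```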
But now nvidia-dcgm-exporter is crashing, and the error is:
time="2021-09-30T20:03:44Z" level=info msg="Starting dcgm-exporter"
time="2021-09-30T20:03:44Z" level=info msg="Attemping to connect to remote hostengine at 10.128.0.33:5555"
time="2021-09-30T20:03:49Z" level=fatal msg="Error connecting to nv-hostengine: Host engine connection invalid/disconnected"
All pods status:
NAMESPACE NAME READY STATUS RESTARTS AGE
gpu-operator-resources gpu-feature-discovery-tnkt2 1/1 Running 0 5m38s
gpu-operator-resources nvidia-container-toolkit-daemonset-jrmq4 1/1 Running 0 5m38s
gpu-operator-resources nvidia-cuda-validator-gwlst 0/1 Completed 0 5m32s
gpu-operator-resources nvidia-dcgm-exporter-jxpss 0/1 CrashLoopBackOff 5 5m38s
gpu-operator-resources nvidia-dcgm-kfqch 1/1 Running 0 5m38s
gpu-operator-resources nvidia-device-plugin-daemonset-2h4tv 1/1 Running 0 5m38s
gpu-operator-resources nvidia-device-plugin-validator-xwmp2 0/1 Completed 0 5m26s
gpu-operator-resources nvidia-operator-validator-995p6 1/1 Running 0 5m38s
kube-system calico-kube-controllers-857dfc7bbb-sfq9r 1/1 Running 0 44h
kube-system calico-node-wttld 1/1 Running 0 44h
kube-system coredns-78545b7666-6465f 1/1 Running 0 44h
kube-system coredns-78545b7666-psjvz 0/1 Pending 0 44h
kube-system etcd-florin-rocky-linux 1/1 Running 0 44h
kube-system kube-apiserver-florin-rocky-linux 1/1 Running 0 43h
kube-system kube-controller-manager-florin-rocky-linux 1/1 Running 1 44h
kube-system kube-multus-ds-amd64-ctcv4 1/1 Running 0 44h
kube-system kube-proxy-7lmkp 1/1 Running 0 44h
kube-system kube-scheduler-florin-rocky-linux 1/1 Running 1 43h
kube-system kube-sriov-device-plugin-amd64-j2lll 1/1 Running 0 44h
robinio csi-attacher-robin-6967696898-xb4jt 3/3 Running 0 43h
robinio csi-nodeplugin-robin-qrn24 3/3 Running 0 43h
robinio csi-provisioner-robin-86b576699-k7gpx 3/3 Running 0 43h
robinio csi-resizer-robin-77f6745589-mqtqj 3/3 Running 0 43h
robinio csi-snapshotter-robin-0 3/3 Running 0 43h
robinio gpu-operator-599764446c-ftc6d 1/1 Running 0 5m59s
robinio gpu-operator-node-feature-discovery-master-58d884d5cc-m9v5c 1/1 Running 0 5m59s
robinio gpu-operator-node-feature-discovery-worker-rffgk 1/1 Running 0 5m59s
robinio robin-master-hmb44 1/1 Running 11 44h
robinio snapshot-controller-0 1/1 Running 0 43h
Could you provide the logs of the dcgm pod? Also, can you try deploying the operator again with dcgm disabled: --set dcgm.enabled=false
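The suggested test deployment would look like this (keeping the driver disabled, as in the previous install on this thread):

```shell
helm install gpu-operator nvidia/gpu-operator \
    --set driver.enabled=false \
    --set dcgm.enabled=false \
    -n robinio
```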
The system got rebooted a few hours ago, not by me. Now all pods are either Running or Completed. These are the dcgm logs:
# kubectl logs nvidia-dcgm-exporter-jxpss -n gpu-operator-resources
time="2021-10-04T04:56:14Z" level=info msg="Starting dcgm-exporter"
time="2021-10-04T04:56:14Z" level=info msg="Attemping to connect to remote hostengine at 10.128.0.33:5555"
time="2021-10-04T04:56:14Z" level=info msg="DCGM successfully initialized!"
time="2021-10-04T04:56:14Z" level=info msg="Not collecting DCP metrics: Error getting supported metrics: Profiling is not supported for this group of GPUs or GPU"
time="2021-10-04T04:56:14Z" level=warning msg="Skipping line 55 ('DCGM_FI_PROF_GR_ENGINE_ACTIVE'): DCP metrics not enabled"
time="2021-10-04T04:56:14Z" level=warning msg="Skipping line 58 ('DCGM_FI_PROF_PIPE_TENSOR_ACTIVE'): DCP metrics not enabled"
time="2021-10-04T04:56:14Z" level=warning msg="Skipping line 59 ('DCGM_FI_PROF_DRAM_ACTIVE'): DCP metrics not enabled"
time="2021-10-04T04:56:14Z" level=warning msg="Skipping line 63 ('DCGM_FI_PROF_PCIE_TX_BYTES'): DCP metrics not enabled"
time="2021-10-04T04:56:14Z" level=warning msg="Skipping line 64 ('DCGM_FI_PROF_PCIE_RX_BYTES'): DCP metrics not enabled"
time="2021-10-04T04:56:15Z" level=info msg="Kubernetes metrics collection enabled!"
time="2021-10-04T04:56:15Z" level=info msg="Pipeline starting"
time="2021-10-04T04:56:15Z" level=info msg="Starting webserver"
What is the hostengine on port 5555? Which component starts it?
The dcgm pod nvidia-dcgm-kfqch starts it, and the dcgm-exporter pod gets GPU metrics from it and exports them to Prometheus.
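In other words, the standalone dcgm pod runs nv-hostengine listening on port 5555, and dcgm-exporter connects to it over the network instead of starting an embedded engine. Roughly (flags shown are the upstream defaults, not taken from this thread):

```shell
# In the nvidia-dcgm pod: start the host engine, bound to all interfaces.
nv-hostengine -b ALL

# In the nvidia-dcgm-exporter pod: connect to the remote host engine
# instead of embedding one.
dcgm-exporter -r 10.128.0.33:5555
```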
I've uninstalled and reinstalled the operator; now the DCGM exporter is in CrashLoopBackOff again, and the dcgm pod shows no logs at all:
# kubectl logs nvidia-dcgm-exporter-ld4zj -n gpu-operator-resources
time="2021-10-04T17:22:26Z" level=info msg="Starting dcgm-exporter"
time="2021-10-04T17:22:26Z" level=info msg="Attemping to connect to remote hostengine at 10.128.0.33:5555"
time="2021-10-04T17:22:31Z" level=fatal msg="Error connecting to nv-hostengine: Host engine connection invalid/disconnected"
# kubectl logs nvidia-dcgm-dk9j5 -n gpu-operator-resources
#
For anyone using Rocky, this Dockerfile should do the trick. It converts the current CentOS image into Rocky:
FROM nvcr.io/nvidia/driver:470.82.01-centos8
# Edit yum configs to use vault repos
RUN sed -ri 's|^mirrorlist|#mirrorlist|' /etc/yum.repos.d/*.repo \
&& sed -ri 's|^#baseurl|baseurl|' /etc/yum.repos.d/*.repo \
&& sed -ri 's|^metalink|#metalink|' /etc/yum.repos.d/*.repo \
&& sed -ri 's|mirror.centos.org|vault.centos.org|g' /etc/yum.repos.d/*.repo
# Convert from CentOS to RockyLinux
RUN yum -y install \
ncurses \
&& curl https://raw.githubusercontent.com/rocky-linux/rocky-tools/main/migrate2rocky/migrate2rocky.sh -o migrate2rocky.sh \
&& chmod +x migrate2rocky.sh \
&& sed -ri '/^efi_check$/d' migrate2rocky.sh \
&& ./migrate2rocky.sh -r \
&& rm migrate2rocky.sh \
&& yum -y remove \
ncurses \
&& yum clean all
# Add older kernels
RUN set -x; \
ROCKYLINUX_VERSIONS=($(curl -sL https://dl.rockylinux.org/vault/rocky | egrep -o "8\.[0-9]+" | sort -u)); \
for v in ${ROCKYLINUX_VERSIONS[@]}; do \
yum-config-manager --add-repo=http://dl.rockylinux.org/vault/rocky/${v}/BaseOS/x86_64/os; \
done
# Upgrade gcc
RUN yum -y upgrade \
gcc \
&& yum clean all
Your script is not successful for me; the following error is reported. My Rocky Linux version is 8.6.
podman build -t nvcr.io/nvidia/driver:470.82.01-rocky8 .
tput: No value for $TERM and no -T specified
tput: No value for $TERM and no -T specified
tput: No value for $TERM and no -T specified
migrate2rocky - Begin logging at Wed Jun 1 09:17:24 2022.
df: /boot: No such file or directory
Removing dnf cache
Preparing to migrate CentOS Linux 8 to Rocky Linux 8.
warning: /var/cache/dnf/cuda-f1d7a46f058da57c/packages/cuda-compat-11-4-470.129.06-1.x86_64.rpm: Header V4 RSA/SHA512 Signature, key ID d42d0685: NOKEY
cuda 1.5 MB/s | 1.6 kB 00:00
The GPG keys listed for the "cuda" repository are already installed but they are not correct for this package.
Check that the correct key URLs are configured for this repository.. Failing package is: cuda-compat-11-4-1:470.129.06-1.x86_64
GPG Keys are configured as: file:///etc/pki/rpm-gpg/RPM-GPG-KEY-NVIDIA
Public key for cuda-cudart-11-4-11.4.148-1.x86_64.rpm is not installed. Failing package is: cuda-cudart-11-4-11.4.148-1.x86_64
GPG Keys are configured as: file:///etc/pki/rpm-gpg/RPM-GPG-KEY-NVIDIA
Public key for cuda-toolkit-11-4-config-common-11.4.148-1.noarch.rpm is not installed. Failing package is: cuda-toolkit-11-4-config-common-11.4.148-1.noarch
GPG Keys are configured as: file:///etc/pki/rpm-gpg/RPM-GPG-KEY-NVIDIA
Public key for cuda-toolkit-11-config-common-11.7.60-1.noarch.rpm is not installed. Failing package is: cuda-toolkit-11-config-common-11.7.60-1.noarch
GPG Keys are configured as: file:///etc/pki/rpm-gpg/RPM-GPG-KEY-NVIDIA
Public key for cuda-toolkit-config-common-11.7.60-1.noarch.rpm is not installed. Failing package is: cuda-toolkit-config-common-11.7.60-1.noarch
GPG Keys are configured as: file:///etc/pki/rpm-gpg/RPM-GPG-KEY-NVIDIA
GPG key at file:///etc/pki/rpm-gpg/RPM-GPG-KEY-NVIDIA (0x7FA2AF80) is already installed
The downloaded packages were saved in cache until the next successful transaction.
You can remove cached packages by executing 'dnf clean packages'.
Error: GPG check FAILED
Error running pre-update. Stopping now to avoid putting the system in an
unstable state. Please correct the issues shown here and try again.
An error occurred while we were attempting to convert your system to Rocky Linux. Your system may be unstable. Script will now exit to prevent possible damage.
A log of this installation can be found at /var/log/migrate2rocky.log
Error: error building at STEP "RUN yum -y install ncurses && curl https://raw.githubusercontent.com/rocky-linux/rocky-tools/main/migrate2rocky/migrate2rocky.sh -o migrate2rocky.sh && chmod +x migrate2rocky.sh && sed -ri '/^efi_check$/d' migrate2rocky.sh && ./migrate2rocky.sh -r && rm migrate2rocky.sh && yum -y remove ncurses && yum clean all": error while running runtime: exit status 1
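The GPG failure comes from the cuda repo baked into the driver image, not from Rocky itself: dnf cannot verify NVIDIA packages signed with the newer key (d42d0685 in the log above). One possible workaround, assuming nothing from the cuda repo is needed during the migration, is to disable that repo before migrate2rocky.sh runs, e.g. as an extra step in the Dockerfile:

```shell
# Disable the NVIDIA cuda repo so migrate2rocky's pre-update step
# does not try to verify packages signed with the missing key.
yum-config-manager --disable cuda
```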
1. Issue or feature description
CentOS 8 is going end-of-life very soon. We are migrating the k8s hosts to Rocky Linux 8 instead. I am trying to install the GPU operator on Rocky Linux 8.4, but one of the components fails.
Kubernetes 1.20.5
Docker 20.10.3
3. Information to attach (optional if deemed irrelevant)
kubectl get pods --all-namespaces
kubectl get ds --all-namespaces
kubectl describe pod -n NAMESPACE POD_NAME
kubectl logs -n NAMESPACE POD_NAME
Attached file: info.txt