Open lgruendh opened 2 months ago
Hello,
did I open this issue correctly? Is there any update?
Regards, Lukas
Hi Lukas,
Can you clarify where you are seeing the repos as being enabled? Is it within the running container image or on the build host?
--Alex.
Hi Alex,
the repos on the running container are disabled.
Regards
To clarify even more...
I created the container image without any problems.
Afterwards, we installed the GPU operator with the Helm chart below.
apiVersion: helm.cattle.io/v1
kind: HelmChart
metadata:
  name: gpu-operator
  namespace: helminstalls
spec:
  chart: gpu-operator
  repo: https://helm.ngc.nvidia.com/nvidia
  targetNamespace: gpu-operator
  valuesContent: |-
    operator:
      defaultRuntime: containerd
    driver:
      repository: my.repo.url/docker-local/nvidia
      version: 550.54.14
    toolkit:
      env:
      - name: CONTAINERD_CONFIG
        value: /var/lib/rancher/rke2/agent/etc/containerd/config.toml
      - name: CONTAINERD_SOCKET
        value: /run/k3s/containerd/containerd.sock
      - name: CONTAINERD_RUNTIME_CLASS
        value: nvidia
      - name: CONTAINERD_SET_AS_DEFAULT
        value: "true"
Lots of containers get created, and on GPU nodes an nvidia-driver-daemonset pod is started from the previously built image. Inside it, only the SLE_BCI repository is enabled, and it does not provide the right kernel version to install.
nvidia-driver-daemonset-788lq:/drivers # zypper repos
Refreshing service 'container-suseconnect-zypp'.
Repository priorities are without effect. All enabled repositories share the same priority.
# | Alias | Name | Enabled | GPG Check | Refresh
--+----------------+----------------+---------+-----------+--------
1 | SLE_BCI | SLE_BCI | Yes | (r ) Yes | Yes
2 | SLE_BCI_debug | SLE_BCI_debug | No | ---- | ----
3 | SLE_BCI_source | SLE_BCI_source | No | ---- | ----
Also, in the nvidia-driver script, the SLES kernel version to install resolves to 5.14.21-150500.33.51 (the trailing ".1" is missing).
Here is an example where I enable the repositories and search for kernel packages:
nvidia-driver-daemonset-788lq:/drivers # zypper mr -e 2
Repository 'SLE_BCI_debug' has been successfully enabled.
nvidia-driver-daemonset-788lq:/drivers # zypper mr -e 3
Repository 'SLE_BCI_source' has been successfully enabled.
nvidia-driver-daemonset-788lq:/drivers # zypper se -s kernel
Refreshing service 'container-suseconnect-zypp'.
Retrieving repository 'SLE_BCI_source' metadata ..................................................................................................................................................................................................................[done]
Building repository 'SLE_BCI_source' cache .......................................................................................................................................................................................................................[done]
Loading repository data...
Reading installed packages...
S | Name | Type | Version | Arch | Repository
--+--------------------------------+------------+-------------------------------+--------+---------------
| kernel-azure | srcpackage | 5.14.21-150500.33.51.1 | noarch | SLE_BCI_source
| kernel-azure-debuginfo | package | 5.14.21-150500.33.51.1 | x86_64 | SLE_BCI_debug
| kernel-azure-debugsource | package | 5.14.21-150500.33.51.1 | x86_64 | SLE_BCI_debug
| kernel-azure-devel | package | 5.14.21-150500.33.51.1 | x86_64 | SLE_BCI
| kernel-azure-devel-debuginfo | package | 5.14.21-150500.33.51.1 | x86_64 | SLE_BCI_debug
| kernel-default | srcpackage | 5.14.21-150500.55.62.2 | noarch | SLE_BCI_source
| kernel-default-debuginfo | package | 5.14.21-150500.55.62.2 | x86_64 | SLE_BCI_debug
| kernel-default-debugsource | package | 5.14.21-150500.55.62.2 | x86_64 | SLE_BCI_debug
| kernel-default-devel | package | 5.14.21-150500.55.62.2 | x86_64 | SLE_BCI
| kernel-default-devel-debuginfo | package | 5.14.21-150500.55.62.2 | x86_64 | SLE_BCI_debug
| kernel-devel | package | 5.14.21-150500.55.62.2 | noarch | SLE_BCI
| kernel-devel-azure | package | 5.14.21-150500.33.51.1 | noarch | SLE_BCI
| kernel-macros | package | 5.14.21-150500.55.62.2 | noarch | SLE_BCI
| kernel-source | srcpackage | 5.14.21-150500.55.62.2 | noarch | SLE_BCI_source
| kernel-source-azure | srcpackage | 5.14.21-150500.33.51.1 | noarch | SLE_BCI_source
| kernel-syms | package | 5.14.21-150500.55.62.1 | x86_64 | SLE_BCI
| kernel-syms | srcpackage | 5.14.21-150500.55.62.1 | noarch | SLE_BCI_source
| kernel-syms-azure | package | 5.14.21-150500.33.51.1 | x86_64 | SLE_BCI
| kernel-syms-azure | srcpackage | 5.14.21-150500.33.51.1 | noarch | SLE_BCI_source
| kernelshark | package | 2.6.1-2.37 | x86_64 | SLE_BCI
| kernelshark-debuginfo | package | 2.6.1-2.37 | x86_64 | SLE_BCI_debug
| nfs-kernel-server | package | 2.1.1-150500.22.3.1 | x86_64 | SLE_BCI
| nfs-kernel-server-debuginfo | package | 2.1.1-150500.22.3.1 | x86_64 | SLE_BCI_debug
| purge-kernels-service | package | 0-150200.8.6.1 | noarch | SLE_BCI
| purge-kernels-service | srcpackage | 0-150200.8.6.1 | noarch | SLE_BCI_source
| texlive-l3kernel | package | 2021.189.svn57789-150400.17.1 | noarch | SLE_BCI
Hopefully this information helps you.
If you need any more information, feel free to ask.
Regards, Lukas
Hi Lukas,
I think I finally understand your query. In the case of the running container, I haven't checked which repos are enabled, but I'm not surprised by your findings. In all honesty, we use a container image that is hugely overpowered for the job. In the future, we hope NVIDIA will treat this as a cooperative project rather than our unique work supporting their unique work. The real goal here is the most minimal supported container image needed, so the number of repos needed for it would also be minimal.
As well, I think you've solved another mystery for us (thank you!). We have had a small number of complaints that the driver container wouldn't build correctly due to a kernel incompatibility. We used a build variable to resolve it, which often worked, except in a few cases. I think your discovery in the NVIDIA driver script might explain the outliers.
Would you mind offering a pointer to where in that script it's calling out the wrong kernel version? We're working on a few things right now, so it might be a while before we can really dig into this.
Thanks in advance, and again!
As a side note, I'm curious whether you're building on Azure, as the kernel version you referenced comes from a source package called kernel-azure.
We've only done this work on-premises and thus used the kernel-default.
For me it is also kernel-default, because I'm using this on-premises.
Before building the image...
git clone https://gitlab.com/nvidia/container-images/driver/ && cd driver/sle15
We are talking about the nvidia-driver script. The function _resolve_kernel_version cannot find the right package: the variable version_without_flavor gets resolved to 5.14.21-150500.55.52. Running the underlying zypper query manually shows the problem:
zypper -x se -s -t package --match-exact "kernel-devel"
<?xml version='1.0'?>
<stream>
<message type="info">Refreshing service 'container-suseconnect-zypp'.</message>
<progress id="raw-refresh" name="Retrieving repository 'SLE_BCI' metadata" value="0"/>
<progress id="raw-refresh" name="Retrieving repository 'SLE_BCI' metadata"/>
<progress id="raw-refresh" name="Retrieving repository 'SLE_BCI' metadata"/>
<progress id="raw-refresh" name="Retrieving repository 'SLE_BCI' metadata"/>
<progress id="raw-refresh" name="Retrieving repository 'SLE_BCI' metadata"/>
<progress id="raw-refresh" name="Retrieving repository 'SLE_BCI' metadata"/>
<progress id="raw-refresh" name="Retrieving repository 'SLE_BCI' metadata"/>
<progress id="raw-refresh" name="Retrieving repository 'SLE_BCI' metadata"/>
<progress id="raw-refresh" name="Retrieving repository 'SLE_BCI' metadata"/>
<progress id="raw-refresh" name="Retrieving repository 'SLE_BCI' metadata"/>
<progress id="raw-refresh" name="Retrieving repository 'SLE_BCI' metadata"/>
<progress id="raw-refresh" name="Retrieving repository 'SLE_BCI' metadata" done="1"/>
<progress id="10" name="Building repository 'SLE_BCI' cache"/>
<progress id="10" name="Building repository 'SLE_BCI' cache" value="0"/>
<progress id="10" name="Building repository 'SLE_BCI' cache" value="100"/>
<progress id="10" name="Building repository 'SLE_BCI' cache" value="100"/>
<progress id="10" name="Building repository 'SLE_BCI' cache" done="1"/>
<message type="info">Loading repository data...</message>
<message type="info">Reading installed packages...</message>
<search-result version="0.0">
<solvable-list>
<solvable status="not-installed" name="kernel-devel" kind="package" edition="5.14.21-150500.55.62.2" arch="noarch" repository="SLE_BCI"/>
</solvable-list>
</search-result>
</stream>
Line 47 contains sed -e 's/.*edition="\([^"]*\).*/\1/g;s/\(.*\)\..*/\1/'), whose second expression trims the last component of the kernel version (in this case the trailing ".2"). This needs to be rewritten.
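To illustrate the trimming (a hypothetical reconstruction, not code copied from the script; the solvable line is taken from the zypper XML output above):

```shell
# Feed the <solvable> line from the zypper XML output through the sed from
# line 47 of the nvidia-driver script.
line='<solvable status="not-installed" name="kernel-devel" kind="package" edition="5.14.21-150500.55.62.2" arch="noarch" repository="SLE_BCI"/>'
version=$(printf '%s\n' "$line" | sed -e 's/.*edition="\([^"]*\).*/\1/g;s/\(.*\)\..*/\1/')
echo "$version"   # prints 5.14.21-150500.55.62 -- the trailing ".2" has been stripped
```

No package carries the trimmed version string, so the later install step cannot match anything.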
The function _install_prerequisites then tries to install two packages at line 69:
if ! zypper --non-interactive in -y --no-recommends --capability \
    kernel-${FLAVOR} = ${version_without_flavor} \
    kernel-${FLAVOR}-devel = ${version_without_flavor}; then
    echo "FATAL: failed to install kernel packages. Ensure SLES subscription is available."
    exit 1
fi
These resolve to kernel-default and kernel-default-devel with the previously resolved version. Because that version has been trimmed by the sed, no matching packages can be found. Also, kernel-default is not available inside SLE_BCI.
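One possible fix (a sketch, not an official patch): have the sed capture the edition attribute only, without the second expression that drops the release component:

```shell
# Hypothetical corrected version of the line-47 sed: stop the capture at the
# closing quote of edition="..." and do not trim anything afterwards.
line='<solvable status="not-installed" name="kernel-devel" kind="package" edition="5.14.21-150500.55.62.2" arch="noarch" repository="SLE_BCI"/>'
version=$(printf '%s\n' "$line" | sed -e 's/.*edition="\([^"]*\)".*/\1/')
echo "$version"   # prints 5.14.21-150500.55.62.2 -- the full version, which zypper can match
```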
Hopefully this provides the information needed to update the installation guide. Feel free to ask if some information is missing.
That's extremely helpful, thank you very much. We hope to begin addressing this in the next week or two. Overall, I think the structure of the document needs to change: the idea of a separate build host now causes more problems than it solves. We also need to figure out the issue around the kernel-default package.
Hi Lukas, we're about halfway through the rewrite (assuming the rest goes to plan), and here are the repos configured in the running driver container:
#  | Alias                                                                             | Name  | Enabled | GPG Check | Refresh
---+-----------------------------------------------------------------------------------+-------+---------+-----------+--------
 1 | SLE_BCI                                                                           | SLE-> | Yes     | (r ) Yes  | Yes
 2 | SLE_BCI_debug                                                                     | SLE-> | No      | ----      | ----
 3 | SLE_BCI_source                                                                    | SLE-> | No      | ----      | ----
 4 | container-suseconnect-zypp:SLE-Module-Basesystem15-SP5-Debuginfo-Pool             | SLE-> | No      | ----      | ----
 5 | container-suseconnect-zypp:SLE-Module-Basesystem15-SP5-Debuginfo-Updates          | SLE-> | No      | ----      | ----
 6 | container-suseconnect-zypp:SLE-Module-Basesystem15-SP5-Pool                       | SLE-> | Yes     | (r ) Yes  | No
 7 | container-suseconnect-zypp:SLE-Module-Basesystem15-SP5-Source-Pool                | SLE-> | No      | ----      | ----
 8 | container-suseconnect-zypp:SLE-Module-Basesystem15-SP5-Updates                    | SLE-> | Yes     | (r ) Yes  | Yes
 9 | container-suseconnect-zypp:SLE-Module-Python3-15-SP5-Debuginfo-Pool               | SLE-> | No      | ----      | ----
10 | container-suseconnect-zypp:SLE-Module-Python3-15-SP5-Debuginfo-Updates            | SLE-> | No      | ----      | ----
11 | container-suseconnect-zypp:SLE-Module-Python3-15-SP5-Pool                         | SLE-> | Yes     | (r ) Yes  | No
12 | container-suseconnect-zypp:SLE-Module-Python3-15-SP5-Source-Pool                  | SLE-> | No      | ----      | ----
13 | container-suseconnect-zypp:SLE-Module-Python3-15-SP5-Updates                      | SLE-> | Yes     | (r ) Yes  | Yes
14 | container-suseconnect-zypp:SLE-Module-Server-Applications15-SP5-Debuginfo-Pool    | SLE-> | No      | ----      | ----
15 | container-suseconnect-zypp:SLE-Module-Server-Applications15-SP5-Debuginfo-Updates | SLE-> | No      | ----      | ----
16 | container-suseconnect-zypp:SLE-Module-Server-Applications15-SP5-Pool              | SLE-> | Yes     | (r ) Yes  | No
17 | container-suseconnect-zypp:SLE-Module-Server-Applications15-SP5-Source-Pool       | SLE-> | No      | ----      | ----
18 | container-suseconnect-zypp:SLE-Module-Server-Applications15-SP5-Updates           | SLE-> | Yes     | (r ) Yes  | Yes
19 | container-suseconnect-zypp:SLE-Product-SLES15-SP5-Debuginfo-Pool                  | SLE-> | No      | ----      | ----
20 | container-suseconnect-zypp:SLE-Product-SLES15-SP5-Debuginfo-Updates               | SLE-> | No      | ----      | ----
21 | container-suseconnect-zypp:SLE-Product-SLES15-SP5-Pool                            | SLE-> | Yes     | (r ) Yes  | No
22 | container-suseconnect-zypp:SLE-Product-SLES15-SP5-Source-Pool                     | SLE-> | No      | ----      | ----
23 | container-suseconnect-zypp:SLE-Product-SLES15-SP5-Updates                         | SLE-> | Yes     | (r ) Yes  | Yes
24 | container-suseconnect-zypp:SLE15-SP5-Installer-Updates                            | SLE-> | No      | ----      | ----
Looking good, thank you for your help!
Hi Lukas, I've made some progress, but the fixes I'm applying aren't the same as what you discovered. What I've found is that the variable KERNEL_VERSION in the nvidia-driver script gets manipulated several times over and comes out with the wrong value for the containing function.
For example, in the _resolve_kernel_version() function, it is set to "5.14.21-150500.55.65 5.14.21". Note that the second value shouldn't be there.
As well, in the function _install_prerequisites() , it is set to "5.14.21-150500.55.65 5.14.21-150500.55.65-default". In this case only the second value should be there.
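For comparison (a hypothetical sketch, not the script's actual code), a full kernel string like the one above can be split cleanly into the two values the script needs using plain parameter expansion, instead of repeated string rewrites:

```shell
# Assumed input format: <version>-<flavor>, e.g. from `uname -r` on SLES.
KERNEL_VERSION="5.14.21-150500.55.65-default"
FLAVOR="${KERNEL_VERSION##*-}"                 # strip longest prefix up to last "-": default
version_without_flavor="${KERNEL_VERSION%-*}"  # strip shortest "-..." suffix: 5.14.21-150500.55.65
echo "$version_without_flavor $FLAVOR"         # prints: 5.14.21-150500.55.65 default
```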
There are more examples but I've taken it as far as I can. I've tried working through a bunch of make errors but I'm starting to spend too much time on it. Nvidia is going to have to fix that script themselves. I'm going to experiment on a different host that was working previously and then file an issue with them.
Hi Lukas, I think we've finally got a handle on what is causing this. We've discovered that the behavior exists with kernel version 5.14.21-150500.55.65.1, but not with version 5.14.21-150500.55.62.2. This is fully reproducible; however, I see that you had an issue with version 5.14.21-150500.55.62.2 as well. We did run into an issue where the SUSE subscriptions on a host had not been enabled correctly. This didn't seem to cause problems for the host during normal operations, but it kept the container image from running. We cleaned up the registration, as per this doc (https://www.suse.com/support/kb/doc/?id=000019054), and that resolved the problems.
We currently believe the issue is in the way the nvidia-driver script is parsing data, though I'm at a loss as to why it can parse the same string ending in "2.2" but not "5.1". It's possibly pulling data from something else that may have changed format with the last kernel change. I'm going to file an issue against the script.
We are using SUSE Manager, which is why no SUSE subscription is enabled. Thank you for your help so far.
I've created this issue on the Nvidia GitLab repo: https://gitlab.com/nvidia/container-images/driver/-/issues/52
Hi Lukas,
I've updated the upstream document to include the workaround I've provided in the GitLab issue.
Feel free to test out the whole doc (it now leverages Rancher for some of the command line stuff) or just the fix, which is item 3.a. in the section "Building the container image".
NVIDIA GPU Driver and NVIDIA GPU Operator with SUSE:
https://documentation.suse.com/trd/kubernetes/html/gs_rke2-slebci_nvidia-gpu-operator/index.html#
Hello,
I'm trying to create a container image for the NVIDIA GPU operator.
Code:
Once the container image gets deployed, only the SLE_BCI repository is enabled, and it does not contain any kernel-default packages. Once I enable the other repositories, kernel-default is available, but only as a source package.
Also, a ".1" is missing in the version string used to install the package.
Can you please update the documentation, or the base container image?
Regards!