SUSE / technical-reference-documentation

SUSE Technical Reference Documentation
https://documentation.suse.com/trd-supported.html

[doc] Issue in "NVIDIA GPU Driver and NVIDIA GPU Operator with SUSE" #108

Open lgruendh opened 2 months ago

lgruendh commented 2 months ago

NVIDIA GPU Driver and NVIDIA GPU Operator with SUSE:

https://documentation.suse.com/trd/kubernetes/html/gs_rke2-slebci_nvidia-gpu-operator/index.html#

Hello,

I'm trying to create a container image for the NVIDIA GPU Operator.

Code:

cat <<EOF > /tmp/build-variables.sh
export REGISTRY="my.repo.url.com/docker-local/nvidia"
export SLE_VERSION="15"
export SLE_SP="5"
export DRIVER_VERSION="550.54.14"
export OPERATOR_VERSION="v23.9.0"
export CUDA_VERSION="12.4.1"
EOF

source /tmp/build-variables.sh

git clone https://gitlab.com/nvidia/container-images/driver/ && cd driver/sle15

sed -i "/^FROM/ s/golang\:1\.../golang\:1.22/" Dockerfile

sed -i '/^FROM/ s/suse\/sle15/bci\/bci-base/' Dockerfile

sudo podman build -t \
  ${REGISTRY}/nvidia-sle${SLE_VERSION}sp${SLE_SP}-${DRIVER_VERSION}:${DRIVER_VERSION} \
  --build-arg SLES_VERSION="${SLE_VERSION}.${SLE_SP}" \
  --build-arg DRIVER_ARCH="x86_64" \
  --build-arg DRIVER_VERSION="${DRIVER_VERSION}" \
  --build-arg CUDA_VERSION="${CUDA_VERSION}" \
  .
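
For completeness, a sketch of the push step that follows the build (an assumption about the workflow, not shown in the original doc; it presumes podman is already logged in to the registry):

# Push the freshly built driver image to the registry referenced later by the Helm chart
sudo podman push \
  ${REGISTRY}/nvidia-sle${SLE_VERSION}sp${SLE_SP}-${DRIVER_VERSION}:${DRIVER_VERSION}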

Once the container image gets deployed, only the SLE_BCI repository is enabled, and it does not contain any kernel-default packages. Once I enable the other repositories, kernel-default is available, but only as a source package.

Also, a ".1" is missing from the version string used to install the package.

Can you please update the documentation, or the base container image?

Regards!

EDIT: I just changed the formatting of the code blocks ^^

lgruendh commented 1 month ago

Hello,

did I open this issue correctly? Is there any update?

Regards, Lukas

alexarnoldy commented 1 month ago

Hi Lukas,

Can you clarify where you are seeing the repos as being enabled? Is it within the running container image or on the build host?

--Alex.

lgruendh commented 1 month ago

Hi Alex,

the repos in the running container are disabled; only SLE_BCI is enabled.

Regards

lgruendh commented 1 month ago

To clarify even more...

I created the container image without any problems.

Afterwards, we installed the GPU operator with the Helm chart below.

apiVersion: helm.cattle.io/v1
kind: HelmChart
metadata:
  name: gpu-operator
  namespace: helminstalls
spec:
  chart: gpu-operator
  repo: https://helm.ngc.nvidia.com/nvidia
  targetNamespace: gpu-operator
  valuesContent: |-
    operator:
      defaultRuntime: containerd
    driver:
      repository: my.repo.url/docker-local/nvidia
      version: 550.54.14
    toolkit:
      env:
        - name: CONTAINERD_CONFIG
          value: /var/lib/rancher/rke2/agent/etc/containerd/config.toml
        - name: CONTAINERD_SOCKET
          value: /run/k3s/containerd/containerd.sock
        - name: CONTAINERD_RUNTIME_CLASS
          value: nvidia
        - name: CONTAINERD_SET_AS_DEFAULT
          value: "true"

Lots of containers get created, and on GPU nodes an nvidia-driver-daemonset pod using the previously built image gets created. Only the SLE_BCI repository is available, and it does not have the right kernel version to install.

nvidia-driver-daemonset-788lq:/drivers # zypper repos
Refreshing service 'container-suseconnect-zypp'.
Repository priorities are without effect. All enabled repositories share the same priority.

# | Alias          | Name           | Enabled | GPG Check | Refresh
--+----------------+----------------+---------+-----------+--------
1 | SLE_BCI        | SLE_BCI        | Yes     | (r ) Yes  | Yes
2 | SLE_BCI_debug  | SLE_BCI_debug  | No      | ----      | ----
3 | SLE_BCI_source | SLE_BCI_source | No      | ----      | ----

Also, in the nvidia-driver script, the SLES kernel version to install resolves to 5.14.21-150500.33.51 (a ".1" is missing at the end).

Here is an example where I enable the repositories and search for kernel packages.

nvidia-driver-daemonset-788lq:/drivers # zypper mr -e 2
Repository 'SLE_BCI_debug' has been successfully enabled.
nvidia-driver-daemonset-788lq:/drivers # zypper mr -e 3
Repository 'SLE_BCI_source' has been successfully enabled.
nvidia-driver-daemonset-788lq:/drivers # zypper se -s kernel
Refreshing service 'container-suseconnect-zypp'.
Retrieving repository 'SLE_BCI_source' metadata ..................................................................................................................................................................................................................[done]
Building repository 'SLE_BCI_source' cache .......................................................................................................................................................................................................................[done]
Loading repository data...
Reading installed packages...

S | Name                           | Type       | Version                       | Arch   | Repository
--+--------------------------------+------------+-------------------------------+--------+---------------
  | kernel-azure                   | srcpackage | 5.14.21-150500.33.51.1        | noarch | SLE_BCI_source
  | kernel-azure-debuginfo         | package    | 5.14.21-150500.33.51.1        | x86_64 | SLE_BCI_debug
  | kernel-azure-debugsource       | package    | 5.14.21-150500.33.51.1        | x86_64 | SLE_BCI_debug
  | kernel-azure-devel             | package    | 5.14.21-150500.33.51.1        | x86_64 | SLE_BCI
  | kernel-azure-devel-debuginfo   | package    | 5.14.21-150500.33.51.1        | x86_64 | SLE_BCI_debug
  | kernel-default                 | srcpackage | 5.14.21-150500.55.62.2        | noarch | SLE_BCI_source
  | kernel-default-debuginfo       | package    | 5.14.21-150500.55.62.2        | x86_64 | SLE_BCI_debug
  | kernel-default-debugsource     | package    | 5.14.21-150500.55.62.2        | x86_64 | SLE_BCI_debug
  | kernel-default-devel           | package    | 5.14.21-150500.55.62.2        | x86_64 | SLE_BCI
  | kernel-default-devel-debuginfo | package    | 5.14.21-150500.55.62.2        | x86_64 | SLE_BCI_debug
  | kernel-devel                   | package    | 5.14.21-150500.55.62.2        | noarch | SLE_BCI
  | kernel-devel-azure             | package    | 5.14.21-150500.33.51.1        | noarch | SLE_BCI
  | kernel-macros                  | package    | 5.14.21-150500.55.62.2        | noarch | SLE_BCI
  | kernel-source                  | srcpackage | 5.14.21-150500.55.62.2        | noarch | SLE_BCI_source
  | kernel-source-azure            | srcpackage | 5.14.21-150500.33.51.1        | noarch | SLE_BCI_source
  | kernel-syms                    | package    | 5.14.21-150500.55.62.1        | x86_64 | SLE_BCI
  | kernel-syms                    | srcpackage | 5.14.21-150500.55.62.1        | noarch | SLE_BCI_source
  | kernel-syms-azure              | package    | 5.14.21-150500.33.51.1        | x86_64 | SLE_BCI
  | kernel-syms-azure              | srcpackage | 5.14.21-150500.33.51.1        | noarch | SLE_BCI_source
  | kernelshark                    | package    | 2.6.1-2.37                    | x86_64 | SLE_BCI
  | kernelshark-debuginfo          | package    | 2.6.1-2.37                    | x86_64 | SLE_BCI_debug
  | nfs-kernel-server              | package    | 2.1.1-150500.22.3.1           | x86_64 | SLE_BCI
  | nfs-kernel-server-debuginfo    | package    | 2.1.1-150500.22.3.1           | x86_64 | SLE_BCI_debug
  | purge-kernels-service          | package    | 0-150200.8.6.1                | noarch | SLE_BCI
  | purge-kernels-service          | srcpackage | 0-150200.8.6.1                | noarch | SLE_BCI_source
  | texlive-l3kernel               | package    | 2021.189.svn57789-150400.17.1 | noarch | SLE_BCI

Hopefully this information helps you.

If you need any more information, feel free to ask.

Regards, Lukas

alexarnoldy commented 1 month ago

Hi Lukas,

I think I finally understand your query. In the case of the running container, I haven't looked to see which repos are enabled, but I'm not surprised by your findings. In all honesty, we use a container image that is hugely overpowered for the job. In the future, we are hoping NVIDIA will treat this as a cooperative project rather than our unique work supporting their unique work. The real goal here is to have the most minimal supported container image needed; thus, the number of repos needed for it would also be minimal.

As well, I think you've solved another mystery for us (thank you!). We have had a small number of complaints that the driver container wouldn't build correctly due to a kernel incompatibility. We used a build variable to resolve it, which often worked, except in a few cases. I think your discovery in the NVIDIA driver script might be the reason for the outliers.

Would you mind offering a pointer to where in that script it's calling out the wrong kernel version? We're working on a few things right now, and it might be a bit before we can really dig into this.

Thanks in advance, and again!

alexarnoldy commented 1 month ago

As a side note, I'm curious whether you're building on Azure, as the kernel version you referenced is from a source package called kernel-azure.

We've only done this work on-premises and thus used the kernel-default.

lgruendh commented 1 month ago

For me it is also kernel-default, because I'm using this on-premises.

Before building the image, we run git clone https://gitlab.com/nvidia/container-images/driver/ && cd driver/sle15. We are talking about the nvidia-driver script in that directory.

The function _resolve_kernel_version cannot find the right package. The variable version_without_flavor gets resolved to 5.14.21-150500.55.62.

zypper -x se -s -t package --match-exact "kernel-devel"
<?xml version='1.0'?>
<stream>
<message type="info">Refreshing service &apos;container-suseconnect-zypp&apos;.</message>
<progress id="raw-refresh" name="Retrieving repository &apos;SLE_BCI&apos; metadata" value="0"/>
<progress id="raw-refresh" name="Retrieving repository &apos;SLE_BCI&apos; metadata"/>
<progress id="raw-refresh" name="Retrieving repository &apos;SLE_BCI&apos; metadata"/>
<progress id="raw-refresh" name="Retrieving repository &apos;SLE_BCI&apos; metadata"/>
<progress id="raw-refresh" name="Retrieving repository &apos;SLE_BCI&apos; metadata"/>
<progress id="raw-refresh" name="Retrieving repository &apos;SLE_BCI&apos; metadata"/>
<progress id="raw-refresh" name="Retrieving repository &apos;SLE_BCI&apos; metadata"/>
<progress id="raw-refresh" name="Retrieving repository &apos;SLE_BCI&apos; metadata"/>
<progress id="raw-refresh" name="Retrieving repository &apos;SLE_BCI&apos; metadata"/>
<progress id="raw-refresh" name="Retrieving repository &apos;SLE_BCI&apos; metadata"/>
<progress id="raw-refresh" name="Retrieving repository &apos;SLE_BCI&apos; metadata"/>
<progress id="raw-refresh" name="Retrieving repository &apos;SLE_BCI&apos; metadata" done="1"/>
<progress id="10" name="Building repository &apos;SLE_BCI&apos; cache"/>
<progress id="10" name="Building repository &apos;SLE_BCI&apos; cache" value="0"/>
<progress id="10" name="Building repository &apos;SLE_BCI&apos; cache" value="100"/>
<progress id="10" name="Building repository &apos;SLE_BCI&apos; cache" value="100"/>
<progress id="10" name="Building repository &apos;SLE_BCI&apos; cache" done="1"/>
<message type="info">Loading repository data...</message>
<message type="info">Reading installed packages...</message>

<search-result version="0.0">
<solvable-list>
<solvable status="not-installed" name="kernel-devel" kind="package" edition="5.14.21-150500.55.62.2" arch="noarch" repository="SLE_BCI"/>
</solvable-list>
</search-result>
</stream>

Line 47 contains sed -e 's/.*edition="\([^"]*\).*/\1/g;s/\(.*\)\..*/\1/', which trims the last part of the kernel version (in this case the trailing .2); this needs to be rewritten.
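
To illustrate (a sketch using the edition from the zypper XML output above; the fixed expression is a suggestion, not the upstream patch):

# Original expression from the script: the second substitution strips
# everything after the last dot, losing the release suffix.
echo '<solvable edition="5.14.21-150500.55.62.2"/>' \
  | sed -e 's/.*edition="\([^"]*\).*/\1/g;s/\(.*\)\..*/\1/'
# => 5.14.21-150500.55.62

# Possible fix: capture the full edition and drop the second substitution.
echo '<solvable edition="5.14.21-150500.55.62.2"/>' \
  | sed -e 's/.*edition="\([^"]*\)".*/\1/'
# => 5.14.21-150500.55.62.2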

The function _install_prerequisites wants to install two packages at line 69:

if ! zypper --non-interactive in -y --no-recommends --capability kernel-${FLAVOR} = ${version_without_flavor} kernel-${FLAVOR}-devel = ${version_without_flavor}; then
    echo "FATAL: failed to install kernel packages. Ensure SLES subscription is available."
    exit 1
fi

These get resolved to kernel-default and kernel-default-devel with the earlier resolved version. Because the version gets trimmed by the sed, no packages can be found. Also, kernel-default is not available inside SLE_BCI.
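
A sketch of the mismatch using the versions from the search output above (illustrative commands run inside the driver container; kernel-default-devel is the variant that actually exists in SLE_BCI):

# The trimmed version has no provider, so the exact-version capability fails:
zypper --non-interactive in --no-recommends --capability 'kernel-default-devel = 5.14.21-150500.55.62'
# The full edition, as reported by zypper itself, resolves:
zypper --non-interactive in --no-recommends --capability 'kernel-default-devel = 5.14.21-150500.55.62.2'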

Hopefully this provides the information needed to update the installation guide. Feel free to ask if some information is missing.

alexarnoldy commented 3 weeks ago

That's extremely helpful, thank you very much. We'll hopefully be able to begin addressing this in the next week or two. Overall, I think the structure of the document needs to be changed; the idea of a separate build host now causes more problems than it solves. As well, we need to figure out the issue around the kernel-default package.

alexarnoldy commented 3 weeks ago

Hi Lukas, we're about halfway through the rewrite (assuming the rest goes to plan), and here are the repos that are configured in the running driver container:

 # | Alias                                                                              | Name  | Enabled | GPG Check | Refresh
---+------------------------------------------------------------------------------------+-------+---------+-----------+--------
 1 | SLE_BCI                                                                            | SLE-> | Yes     | (r ) Yes  | Yes
 2 | SLE_BCI_debug                                                                      | SLE-> | No      | ----      | ----
 3 | SLE_BCI_source                                                                     | SLE-> | No      | ----      | ----
 4 | container-suseconnect-zypp:SLE-Module-Basesystem15-SP5-Debuginfo-Pool              | SLE-> | No      | ----      | ----
 5 | container-suseconnect-zypp:SLE-Module-Basesystem15-SP5-Debuginfo-Updates           | SLE-> | No      | ----      | ----
 6 | container-suseconnect-zypp:SLE-Module-Basesystem15-SP5-Pool                        | SLE-> | Yes     | (r ) Yes  | No
 7 | container-suseconnect-zypp:SLE-Module-Basesystem15-SP5-Source-Pool                 | SLE-> | No      | ----      | ----
 8 | container-suseconnect-zypp:SLE-Module-Basesystem15-SP5-Updates                     | SLE-> | Yes     | (r ) Yes  | Yes
 9 | container-suseconnect-zypp:SLE-Module-Python3-15-SP5-Debuginfo-Pool                | SLE-> | No      | ----      | ----
10 | container-suseconnect-zypp:SLE-Module-Python3-15-SP5-Debuginfo-Updates             | SLE-> | No      | ----      | ----
11 | container-suseconnect-zypp:SLE-Module-Python3-15-SP5-Pool                          | SLE-> | Yes     | (r ) Yes  | No
12 | container-suseconnect-zypp:SLE-Module-Python3-15-SP5-Source-Pool                   | SLE-> | No      | ----      | ----
13 | container-suseconnect-zypp:SLE-Module-Python3-15-SP5-Updates                       | SLE-> | Yes     | (r ) Yes  | Yes
14 | container-suseconnect-zypp:SLE-Module-Server-Applications15-SP5-Debuginfo-Pool     | SLE-> | No      | ----      | ----
15 | container-suseconnect-zypp:SLE-Module-Server-Applications15-SP5-Debuginfo-Updates  | SLE-> | No      | ----      | ----
16 | container-suseconnect-zypp:SLE-Module-Server-Applications15-SP5-Pool               | SLE-> | Yes     | (r ) Yes  | No
17 | container-suseconnect-zypp:SLE-Module-Server-Applications15-SP5-Source-Pool        | SLE-> | No      | ----      | ----
18 | container-suseconnect-zypp:SLE-Module-Server-Applications15-SP5-Updates            | SLE-> | Yes     | (r ) Yes  | Yes
19 | container-suseconnect-zypp:SLE-Product-SLES15-SP5-Debuginfo-Pool                   | SLE-> | No      | ----      | ----
20 | container-suseconnect-zypp:SLE-Product-SLES15-SP5-Debuginfo-Updates                | SLE-> | No      | ----      | ----
21 | container-suseconnect-zypp:SLE-Product-SLES15-SP5-Pool                             | SLE-> | Yes     | (r ) Yes  | No
22 | container-suseconnect-zypp:SLE-Product-SLES15-SP5-Source-Pool                      | SLE-> | No      | ----      | ----
23 | container-suseconnect-zypp:SLE-Product-SLES15-SP5-Updates                          | SLE-> | Yes     | (r ) Yes  | Yes
24 | container-suseconnect-zypp:SLE15-SP5-Installer-Updates                             | SLE-> | No      | ----      | ----

lgruendh commented 3 weeks ago

Looking good, thank you for your help!

alexarnoldy commented 2 weeks ago

Hi Lukas, I've made some progress, but the fixes I'm applying aren't the same as what you discovered. What I've found is that the variable KERNEL_VERSION in the nvidia-driver script gets manipulated several times and comes out with the wrong value for the containing function.

For example, in the _resolve_kernel_version() function, it is set to "5.14.21-150500.55.65 5.14.21". Note that the second value shouldn't be there.

As well, in the function _install_prerequisites(), it is set to "5.14.21-150500.55.65 5.14.21-150500.55.65-default". In this case, only the second value should be there.

There are more examples, but I've taken it as far as I can. I've tried working through a bunch of make errors, but I'm starting to spend too much time on it. NVIDIA is going to have to fix that script themselves. I'm going to experiment on a different host that was working previously and then file an issue with them.
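
For anyone retracing this, a quick way to see every place the script touches that variable (a sketch; run from the driver/sle15 checkout mentioned earlier):

# List each line of the nvidia-driver script that reads or rewrites KERNEL_VERSION
grep -n 'KERNEL_VERSION' nvidia-driver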

alexarnoldy commented 1 week ago

Hi Lukas, I think we've finally got a handle on what is causing this. We've discovered that the behavior exists with kernel version 5.14.21-150500.55.65.1, but not with version 5.14.21-150500.55.62.2. This is fully reproducible; however, I see that you had an issue with version 5.14.21-150500.55.62.2. We did run into an issue where the SUSE subscriptions on a host had not been enabled correctly. This didn't seem to cause problems for the host during normal operations, but it kept the container image from running. We cleaned up the registration as per this doc (https://www.suse.com/support/kb/doc/?id=000019054), and that resolved the problems.

We currently believe the issue is in the way the nvidia-driver script is parsing data, though I'm at a loss as to why it can parse the same string ending in "2.2" but not "5.1". It's possibly pulling data from something else that may have changed format with the last kernel change. I'm going to file an issue against the script.

lgruendh commented 1 week ago

We are using SUSE Manager; that is why no SUSE subscription is enabled. Thank you for your help so far.

alexarnoldy commented 1 week ago

I've created this issue on the Nvidia GitLab repo: https://gitlab.com/nvidia/container-images/driver/-/issues/52

alexarnoldy commented 6 days ago

Hi Lukas,

I've updated the upstream document to include the workaround I've provided in the GitLab issue.

Feel free to test out the whole doc (it now leverages Rancher for some of the command-line steps) or just the fix, which is item 3.a in the section "Building the container image".