NVIDIA / gpu-operator

NVIDIA GPU Operator creates, configures, and manages GPUs in Kubernetes
https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/index.html
Apache License 2.0

v22.9.0 - nvidia-driver-daemonset/nvidia-driver-ctr fails to start #457

Closed · jeremy-london closed this issue 1 year ago

jeremy-london commented 1 year ago

1. Quick Debug Checklist

root@rke2-server:~# lsmod | grep -i ipmi_msghandler
ipmi_msghandler       106496  1 ipmi_devintf

- [X] Did you apply the CRD (`kubectl describe clusterpolicies --all-namespaces`)

### 1. Issue or feature description
`nvidia-driver-ctr` runs and then exits, which causes `nvidia-driver-daemonset` to fail -- blocking the rest of the process from continuing as the subsequent checks fail

### 2. Steps to reproduce the issue
Ubuntu 20.04.5 Server
Default Hardening: https://github.com/konstruktoid/hardening  `sudo bash ubuntu.sh`
RKE2 Install: https://docs.rke2.io/install/quickstart
[Nvidia GPU Operator Helm Install](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/getting-started.html#install-helm)

Running behind a MITM proxy, with nodes set up with proxy/CA trust:
- the driver and the Helm chart have been configured to add proxy env vars and a `certConfig` for the driver
- public registries are mirrored through a proxy cache, and each node's containerd is configured with matching registry settings (a hedged sketch follows)
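For reference, a sketch of the kind of per-node mirror configuration in use, assuming RKE2's `registries.yaml` mechanism -- the hostnames and CA path below are placeholders, not the actual (redacted) values:

```bash
# Hypothetical per-node containerd mirror config; RKE2 reads
# /etc/rancher/rke2/registries.yaml. Hostnames and CA path are placeholders.
sudo tee /etc/rancher/rke2/registries.yaml <<'EOF'
mirrors:
  nvcr.io:
    endpoint:
      - "https://proxy-cache.example.com/ext.nvcr.io"
configs:
  "proxy-cache.example.com":
    tls:
      ca_file: /usr/local/share/ca-certificates/mitm-proxy-ca.crt
EOF
# Restart rke2-server or rke2-agent (whichever runs on the node) to pick it up.
sudo systemctl restart rke2-agent.service
```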

### 3. Information to [attach](https://help.github.com/articles/file-attachments-on-issues-and-pull-requests/) (optional if deemed irrelevant)

#### Helm Config
> `helm upgrade -i gpu-operator nvidia/gpu-operator --namespace gpu-operator --create-namespace -f values.yaml`
`values.yaml`

# Default values for gpu-operator.
# This is a YAML-formatted file.
# Declare variables to be passed into your templates.

platform:
  openshift: false

nfd:
  enabled: true

psp:
  enabled: true

sandboxWorkloads:
  enabled: true
  defaultWorkload: "container"

daemonsets:
  priorityClassName: system-node-critical
  tolerations:

validator:
  repository: [readactedurl].com/ext.nvcr.io/nvidia/cloud-native
  image: gpu-operator-validator
  version: "v22.9.0"
  imagePullPolicy: IfNotPresent
  imagePullSecrets: []
  env: []
  args: []
  resources: {}
  plugin:
    env:

operator:
  repository: [readactedurl].com/ext.nvcr.io/nvidia
  image: gpu-operator
  version: "v22.9.0"
  imagePullPolicy: IfNotPresent
  imagePullSecrets: []
  priorityClassName: system-node-critical
  defaultRuntime: containerd
  runtimeClass: nvidia
  use_ocp_driver_toolkit: false
  # cleanup CRD on chart un-install
  cleanupCRD: true
  # upgrade CRD on chart upgrade, requires --disable-openapi-validation flag
  # to be passed during helm upgrade.
  upgradeCRD: true
  initContainer:
    image: cuda
    repository: [readactedurl].com/ext.nvcr.io/nvidia
    version: 11.7.1-base-ubuntu20.04
    imagePullPolicy: IfNotPresent
  tolerations:

mig:
  strategy: single

driver:
  enabled: true
  repository: [readactedurl].com/ext.nvcr.io/nvidia
  image: driver
  version: "515-signed"
  imagePullPolicy: IfNotPresent
  imagePullSecrets: []
  rdma:
    enabled: false
    useHostMofed: false
  manager:
    image: k8s-driver-manager
    repository: [readactedurl].com/ext.nvcr.io/nvidia/cloud-native
    version: v0.4.2
    imagePullPolicy: IfNotPresent
    env:

toolkit:
  enabled: true
  repository: [readactedurl].com/ext.nvcr.io/nvidia/k8s
  image: container-toolkit
  version: v1.11.0-ubuntu20.04
  imagePullPolicy: IfNotPresent
  imagePullSecrets: []
  env:

devicePlugin:
  enabled: true
  repository: [readactedurl].com/ext.nvcr.io/nvidia
  image: k8s-device-plugin
  version: v0.12.3-ubuntu20.04
  imagePullPolicy: IfNotPresent
  imagePullSecrets: []
  args: []
  env:

# standalone dcgm hostengine
dcgm:
  # disabled by default to use embedded nv-hostengine by exporter
  enabled: false
  repository: [readactedurl].com/ext.nvcr.io/nvidia/cloud-native
  image: dcgm
  version: 3.0.4-1-ubuntu20.04
  imagePullPolicy: IfNotPresent
  hostPort: 5555
  args: []
  env: []
  resources: {}

dcgmExporter:
  enabled: true
  repository: [readactedurl].com/ext.nvcr.io/nvidia/k8s
  image: dcgm-exporter
  version: 3.0.4-3.0.0-ubuntu20.04
  imagePullPolicy: IfNotPresent
  env:

gfd:
  enabled: true
  repository: [readactedurl].com/ext.nvcr.io/nvidia
  image: gpu-feature-discovery
  version: v0.7.0-ubuntu20.04
  imagePullPolicy: IfNotPresent
  imagePullSecrets: []
  env:

migManager:
  enabled: false
  repository: [readactedurl].com/ext.nvcr.io/nvidia/cloud-native
  image: k8s-mig-manager
  version: v0.5.0-ubuntu20.04
  imagePullPolicy: IfNotPresent
  imagePullSecrets: []
  env:

nodeStatusExporter:
  enabled: false
  repository: [readactedurl].com/ext.nvcr.io/nvidia/cloud-native
  image: gpu-operator-validator
  version: "v22.9.0"
  imagePullPolicy: IfNotPresent
  imagePullSecrets: []
  resources: {}

# Experimental and only deploys nvidia-fs driver on Ubuntu
gds:
  enabled: false
  repository: [readactedurl].com/ext.nvcr.io/nvidia/cloud-native
  image: nvidia-fs
  version: "515.43.04"
  imagePullPolicy: IfNotPresent
  imagePullSecrets: []
  env: []
  args: []

vgpuManager:
  enabled: false
  repository: ""
  image: vgpu-manager
  version: ""
  imagePullPolicy: IfNotPresent
  imagePullSecrets: []
  env: []
  resources: {}
  driverManager:
    image: k8s-driver-manager
    repository: [readactedurl].com/ext.nvcr.io/nvidia/cloud-native
    version: v0.4.2
    imagePullPolicy: IfNotPresent
    env:

vgpuDeviceManager:
  enabled: false
  repository: [readactedurl].com/ext.nvcr.io/nvidia/cloud-native
  image: vgpu-device-manager
  version: "v0.2.0"
  imagePullPolicy: IfNotPresent
  imagePullSecrets: []
  env: []
  config:
    name: ""
    default: "default"

vfioManager:
  enabled: true
  repository: [readactedurl].com/ext.nvcr.io/nvidia
  image: cuda
  version: 11.7.1-base-ubuntu20.04
  imagePullPolicy: IfNotPresent
  imagePullSecrets: []
  env: []
  resources: {}
  driverManager:
    image: k8s-driver-manager
    repository: [readactedurl].com/ext.nvcr.io/nvidia/cloud-native
    version: v0.4.2
    imagePullPolicy: IfNotPresent
    env:

sandboxDevicePlugin:
  enabled: true
  repository: [readactedurl].com/ext.nvcr.io/nvidia
  image: kubevirt-gpu-device-plugin
  version: v1.2.1
  imagePullPolicy: IfNotPresent
  imagePullSecrets: []
  args: []
  env: []
  resources: {}

node-feature-discovery:
  worker:
    tolerations:

========== NVIDIA Software Installer ==========

Starting installation of NVIDIA driver branch 515 for Linux kernel version 5.4.0-125-generic

Stopping NVIDIA persistence daemon...
Unloading NVIDIA driver kernel modules...
Unmounting NVIDIA driver rootfs...
Updating the package cache...
E: Release file for http://us.archive.ubuntu.com/ubuntu/dists/focal-updates/InRelease is not valid yet (invalid for another 3h 58min 53s). Updates for this repository will not be applied.
E: Release file for http://us.archive.ubuntu.com/ubuntu/dists/focal-security/InRelease is not valid yet (invalid for another 3h 57min 43s). Updates for this repository will not be applied.
E: Release file for http://archive.ubuntu.com/ubuntu/dists/focal-updates/InRelease is not valid yet (invalid for another 3h 58min 52s). Updates for this repository will not be applied.
E: Release file for http://archive.ubuntu.com/ubuntu/dists/focal-security/InRelease is not valid yet (invalid for another 3h 57min 42s). Updates for this repository will not be applied.
Stopping NVIDIA persistence daemon...
Unloading NVIDIA driver kernel modules...
Unmounting NVIDIA driver rootfs...


 - [X] ~~Output of running a container on the GPU machine: `docker run -it alpine echo foo`~~
 - [X] ~~Docker configuration file: `cat /etc/docker/daemon.json`~~
 - [X] ~~Docker runtime configuration: `docker info | grep runtime`~~
 - [X] ~~NVIDIA shared directory: `ls -la /run/nvidia`~~
 - [X] ~~NVIDIA packages directory: `ls -la /usr/local/nvidia/toolkit`~~
 - [X] ~~NVIDIA driver directory: `ls -la /run/nvidia/driver`~~
 - [X] kubelet logs `journalctl -u kubelet > kubelet.logs`

Typical pre-driver/pre-toolkit config errors complaining about the runtime class -- nothing out of the ordinary in this log stack.
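For completeness, the runtime-class wiring can be checked directly on the node -- a hedged sketch assuming RKE2's default paths:

```bash
# Confirm the nvidia runtime was added to RKE2's containerd config and that
# the RuntimeClass object exists (paths assume RKE2 defaults).
grep -A3 'nvidia' /var/lib/rancher/rke2/agent/etc/containerd/config.toml
kubectl get runtimeclass nvidia -o yaml
```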

jeremy-london commented 1 year ago

Noticed the 525 version of the driver container was pushed yesterday and tried it out -- same issue. I suspect the package cache might be getting hit with an SSL warning, but I'm not sure, as no logs indicate it.

root@rke2-server:~/gpu-operator# kubectl logs -f nvidia-driver-daemonset-4zzgx --all-containers=true -n gpu-operator
Getting current value of the 'nvidia.com/gpu.deploy.operator-validator' node label
Current value of 'nvidia.com/gpu.deploy.operator-validator=true'
Getting current value of the 'nvidia.com/gpu.deploy.container-toolkit' node label
Current value of 'nvidia.com/gpu.deploy.container-toolkit=true'
Getting current value of the 'nvidia.com/gpu.deploy.device-plugin' node label
Current value of 'nvidia.com/gpu.deploy.device-plugin=true'
Getting current value of the 'nvidia.com/gpu.deploy.gpu-feature-discovery' node label
Current value of 'nvidia.com/gpu.deploy.gpu-feature-discovery=true'
Getting current value of the 'nvidia.com/gpu.deploy.dcgm-exporter' node label
Current value of 'nvidia.com/gpu.deploy.dcgm-exporter=true'
Getting current value of the 'nvidia.com/gpu.deploy.dcgm' node label
Current value of 'nvidia.com/gpu.deploy.dcgm=true'
Getting current value of the 'nvidia.com/gpu.deploy.mig-manager' node label
Current value of 'nvidia.com/gpu.deploy.mig-manager='
Getting current value of the 'nvidia.com/gpu.deploy.nvsm' node label
Current value of 'nvidia.com/gpu.deploy.nvsm='
Getting current value of the 'nvidia.com/gpu.deploy.sandbox-validator' node label
Current value of 'nvidia.com/gpu.deploy.sandbox-validator='
Getting current value of the 'nvidia.com/gpu.deploy.sandbox-device-plugin' node label
Current value of 'nvidia.com/gpu.deploy.sandbox-device-plugin='
Getting current value of the 'nvidia.com/gpu.deploy.vgpu-device-manager' node label
Current value of 'nvidia.com/gpu.deploy.vgpu-device-manager='
Getting current value of the 'nodeType' node label(used by NVIDIA Fleet Command)
Current value of 'nodeType='
Shutting GPU Operator components that must be restarted on driver restarts by disabling their component-specific nodeSelector labels
node/rke2-agent.[readactedurl].com labeled
Waiting for the operator-validator to shutdown
pod/nvidia-operator-validator-ns5t8 condition met
unbinding device 0000:03:00.0
unbinding device 0000:05:00.0
unbinding device 0000:0d:00.0
unbinding device 0000:16:00.0
Uncordoning node rke2-agent.[readactedurl].com...
node/rke2-agent.[readactedurl].com already uncordoned
Rescheduling all GPU clients on the current node by enabling their component-specific nodeSelector labels
node/rke2-agent.[readactedurl].com labeled
DRIVER_ARCH is x86_64
Creating directory NVIDIA-Linux-x86_64-525.60.13
Verifying archive integrity... OK
Uncompressing NVIDIA Accelerated Graphics Driver for Linux-x86_64 525.60.13...................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................

WARNING: Unable to determine the default X library path. The path /tmp/null/lib will be used, but this path was not detected in the ldconfig(8) cache, and no directory exists at this path, so it is likely that libraries installed there will not be found by the loader.

WARNING: You specified the '--no-kernel-modules' command line option, nvidia-installer will not install any kernel modules as part of this driver installation, and it will not remove existing NVIDIA kernel modules not part of an earlier NVIDIA driver installation.  Please ensure that NVIDIA kernel modules matching this driver version are installed separately.

========== NVIDIA Software Installer ==========

Starting installation of NVIDIA driver version 525.60.13 for Linux kernel version 5.4.0-125-generic

Stopping NVIDIA persistence daemon...
Unloading NVIDIA driver kernel modules...
Unmounting NVIDIA driver rootfs...
Checking NVIDIA driver packages...
Updating the package cache...
E: Release file for http://archive.ubuntu.com/ubuntu/dists/focal-updates/InRelease is not valid yet (invalid for another 4h 32min 40s). Updates for this repository will not be applied.
E: Release file for http://archive.ubuntu.com/ubuntu/dists/focal-security/InRelease is not valid yet (invalid for another 4h 3min 56s). Updates for this repository will not be applied.
Stopping NVIDIA persistence daemon...
Unloading NVIDIA driver kernel modules...
Unmounting NVIDIA driver rootfs...
shivamerla commented 1 year ago

@jeremy-london Can you double check if the timezone is correctly configured on your node?

jeremy-london commented 1 year ago

@shivamerla I think this is where I am leaning as well -- it appears there is some sort of apt-get update going on in there, and I found that apt-get update on the host shows the same conditions.

I updated the tzdata and that seemed to fix the host -- I'll report back if it fixes the containerd runtime here.

Possibly setting TZ or /etc/timezone in the container would solve it, but ultimately, if it respects the node it's running on, then I'll get each node configured.
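For reference, the host-side check/fix amounted to something like the following (a sketch assuming a systemd host; pick whatever timezone the site actually uses):

```bash
# Verify the clock/timezone so apt Release files no longer appear to be
# "from the future", then refresh tzdata.
timedatectl status
sudo timedatectl set-timezone Etc/UTC
sudo timedatectl set-ntp true
sudo apt-get update && sudo apt-get install --only-upgrade -y tzdata
```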

jeremy-london commented 1 year ago

Changed some configs around -- the product required a FIPS-compliant kernel, so I had to rebuild today.

DISA STIG + FIPS-updates enabled.

Got things back to the same state and re-ran the nvidia-gpu-operator:

root@rke2-server:~# kubectl logs -f nvidia-driver-daemonset-8l6jj --all-containers=true -n gpu-operator
Getting current value of the 'nvidia.com/gpu.deploy.operator-validator' node label
Current value of 'nvidia.com/gpu.deploy.operator-validator=true'
Getting current value of the 'nvidia.com/gpu.deploy.container-toolkit' node label
Current value of 'nvidia.com/gpu.deploy.container-toolkit=true'
DRIVER_ARCH is x86_64
Getting current value of the 'nvidia.com/gpu.deploy.device-plugin' node label
Creating directory NVIDIA-Linux-x86_64-525.60.13
Current value of 'nvidia.com/gpu.deploy.device-plugin=true'
Verifying archive integrity... OK
Getting current value of the 'nvidia.com/gpu.deploy.gpu-feature-discovery' node label
Current value of 'nvidia.com/gpu.deploy.gpu-feature-discovery=true'
Getting current value of the 'nvidia.com/gpu.deploy.dcgm-exporter' node label
Current value of 'nvidia.com/gpu.deploy.dcgm-exporter=true'
Getting current value of the 'nvidia.com/gpu.deploy.dcgm' node label
Current value of 'nvidia.com/gpu.deploy.dcgm=true'
Getting current value of the 'nvidia.com/gpu.deploy.mig-manager' node label
Current value of 'nvidia.com/gpu.deploy.mig-manager='
Getting current value of the 'nvidia.com/gpu.deploy.nvsm' node label
Current value of 'nvidia.com/gpu.deploy.nvsm='
Getting current value of the 'nvidia.com/gpu.deploy.sandbox-validator' node label
Current value of 'nvidia.com/gpu.deploy.sandbox-validator='
Getting current value of the 'nvidia.com/gpu.deploy.sandbox-device-plugin' node label
Current value of 'nvidia.com/gpu.deploy.sandbox-device-plugin='
Getting current value of the 'nvidia.com/gpu.deploy.vgpu-device-manager' node label
Current value of 'nvidia.com/gpu.deploy.vgpu-device-manager='
Getting current value of the 'nodeType' node label(used by NVIDIA Fleet Command)
Current value of 'nodeType='
Shutting GPU Operator components that must be restarted on driver restarts by disabling their component-specific nodeSelector labels
node/rke2-agent.[redactedurl].com labeled
Waiting for the operator-validator to shutdown
pod/nvidia-operator-validator-v595f condition met
unbinding device 0000:03:00.0
unbinding device 0000:0c:00.0
unbinding device 0000:15:00.0
unbinding device 0000:1e:00.0
Uncordoning node rke2-agent.[redactedurl].com...
node/rke2-agent.[redactedurl].com already uncordoned
Rescheduling all GPU clients on the current node by enabling their component-specific nodeSelector labels
node/rke2-agent.[redactedurl].com labeled
Uncompressing NVIDIA Accelerated Graphics Driver for Linux-x86_64 525.60.13...................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................

WARNING: Unable to determine the default X library path. The path /tmp/null/lib will be used, but this path was not detected in the ldconfig(8) cache, and no directory exists at this path, so it is likely that libraries installed there will not be found by the loader.

WARNING: You specified the '--no-kernel-modules' command line option, nvidia-installer will not install any kernel modules as part of this driver installation, and it will not remove existing NVIDIA kernel modules not part of an earlier NVIDIA driver installation.  Please ensure that NVIDIA kernel modules matching this driver version are installed separately.

========== NVIDIA Software Installer ==========

Starting installation of NVIDIA driver version 525.60.13 for Linux kernel version 5.4.0-1068-fips

Stopping NVIDIA persistence daemon...
Unloading NVIDIA driver kernel modules...
Unmounting NVIDIA driver rootfs...
Checking NVIDIA driver packages...
Updating the package cache...
Resolving Linux kernel version...
Could not resolve Linux kernel version
Stopping NVIDIA persistence daemon...
Unloading NVIDIA driver kernel modules...
Unmounting NVIDIA driver rootfs...
root@rke2-server:~# uname -r
5.4.0-1068-fips

Seeing the error `Could not resolve Linux kernel version` -- is this kernel not supported?

shivamerla commented 1 year ago

@jeremy-london the error is from here. Can you run the command below on the node to make sure kernel headers are available for this kernel?

KERNEL_VERSION=5.4.0-1068-fips && \
apt-cache show "linux-headers-${KERNEL_VERSION}" 2> /dev/null | \
      sed -nE 's/^Version:\s+(([0-9]+\.){2}[0-9]+)[-.]([0-9]+).*/\1-\3/p' | head -1
jeremy-london commented 1 year ago
root@rke2-server:~# KERNEL_VERSION=5.4.0-1068-fips && \
> apt-cache show "linux-headers-${KERNEL_VERSION}" 2> /dev/null | \
>       sed -nE 's/^Version:\s+(([0-9]+\.){2}[0-9]+)[-.]([0-9]+).*/\1-\3/p' | head -1
5.4.0-1068
shivamerla commented 1 year ago

@jeremy-london looks like we need to make the Ubuntu Advantage repositories configured on the host accessible to the driver container. Please follow the instructions here to create a ConfigMap with these repositories and inject them into the driver container.
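The documented approach is roughly the following (the file name and ConfigMap name here are arbitrary; `driver.repoConfig.configMapName` is the chart value that mounts the ConfigMap into the driver container):

```bash
# Sketch: publish the extra apt repositories as a ConfigMap and point the
# driver container at it via the Helm chart.
cat <<'EOF' > custom-repo.list
deb https://esm.ubuntu.com/fips-updates/ubuntu focal-updates main
EOF
kubectl create configmap repo-config -n gpu-operator --from-file=custom-repo.list
helm upgrade -i gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator \
  -f values.yaml \
  --set driver.repoConfig.configMapName=repo-config
```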

jeremy-london commented 1 year ago

Moreover -- I just tested again with 515-signed, hoping that might support FIPS kernels, but no dice.

root@rke2-server:~# kubectl logs -f nvidia-driver-daemonset-2g8hk --all-containers=true -n gpu-operator

========== NVIDIA Software Installer ==========

Starting installation of NVIDIA driver branch 515 for Linux kernel version 5.4.0-1068-fips

Stopping NVIDIA persistence daemon...
Unloading NVIDIA driver kernel modules...
Unmounting NVIDIA driver rootfs...
Updating the package cache...
Getting current value of the 'nvidia.com/gpu.deploy.operator-validator' node label
Current value of 'nvidia.com/gpu.deploy.operator-validator=true'
Getting current value of the 'nvidia.com/gpu.deploy.container-toolkit' node label
Current value of 'nvidia.com/gpu.deploy.container-toolkit=true'
Getting current value of the 'nvidia.com/gpu.deploy.device-plugin' node label
Current value of 'nvidia.com/gpu.deploy.device-plugin=true'
Getting current value of the 'nvidia.com/gpu.deploy.gpu-feature-discovery' node label
Current value of 'nvidia.com/gpu.deploy.gpu-feature-discovery=true'
Getting current value of the 'nvidia.com/gpu.deploy.dcgm-exporter' node label
Current value of 'nvidia.com/gpu.deploy.dcgm-exporter=true'
Getting current value of the 'nvidia.com/gpu.deploy.dcgm' node label
Current value of 'nvidia.com/gpu.deploy.dcgm=true'
Getting current value of the 'nvidia.com/gpu.deploy.mig-manager' node label
Current value of 'nvidia.com/gpu.deploy.mig-manager='
Getting current value of the 'nvidia.com/gpu.deploy.nvsm' node label
Current value of 'nvidia.com/gpu.deploy.nvsm='
Getting current value of the 'nvidia.com/gpu.deploy.sandbox-validator' node label
Current value of 'nvidia.com/gpu.deploy.sandbox-validator='
Getting current value of the 'nvidia.com/gpu.deploy.sandbox-device-plugin' node label
Current value of 'nvidia.com/gpu.deploy.sandbox-device-plugin='
Getting current value of the 'nvidia.com/gpu.deploy.vgpu-device-manager' node label
Current value of 'nvidia.com/gpu.deploy.vgpu-device-manager='
Getting current value of the 'nodeType' node label(used by NVIDIA Fleet Command)
Current value of 'nodeType='
Shutting GPU Operator components that must be restarted on driver restarts by disabling their component-specific nodeSelector labels
node/rke2-agent.[redactedurl].com labeled
Waiting for the operator-validator to shutdown
pod/nvidia-operator-validator-j799c condition met
unbinding device 0000:03:00.0
unbinding device 0000:0c:00.0
unbinding device 0000:15:00.0
unbinding device 0000:1e:00.0
Uncordoning node rke2-agent.[redactedurl].com...
node/rke2-agent.[redactedurl].com already uncordoned
Rescheduling all GPU clients on the current node by enabling their component-specific nodeSelector labels
node/rke2-agent.[redactedurl].com labeled
Installing NVIDIA driver kernel modules...
Hit:1 http://us.archive.ubuntu.com/ubuntu focal-updates InRelease
Hit:2 http://archive.ubuntu.com/ubuntu focal InRelease
Hit:3 http://us.archive.ubuntu.com/ubuntu focal-security InRelease
Hit:4 http://archive.ubuntu.com/ubuntu focal-updates InRelease
Hit:5 http://archive.ubuntu.com/ubuntu focal-security InRelease
Reading package lists...
Reading package lists...
Building dependency tree...
Reading state information...
E: Unable to locate package linux-objects-nvidia-515-server-5.4.0-1068-fips
E: Couldn't find any package by glob 'linux-objects-nvidia-515-server-5.4.0-1068-fips'
E: Couldn't find any package by regex 'linux-objects-nvidia-515-server-5.4.0-1068-fips'
E: Unable to locate package linux-signatures-nvidia-5.4.0-1068-fips
E: Couldn't find any package by glob 'linux-signatures-nvidia-5.4.0-1068-fips'
E: Couldn't find any package by regex 'linux-signatures-nvidia-5.4.0-1068-fips'
E: Unable to locate package linux-modules-nvidia-515-server-5.4.0-1068-fips
E: Couldn't find any package by glob 'linux-modules-nvidia-515-server-5.4.0-1068-fips'
E: Couldn't find any package by regex 'linux-modules-nvidia-515-server-5.4.0-1068-fips'
Stopping NVIDIA persistence daemon...
Unloading NVIDIA driver kernel modules...
Unmounting NVIDIA driver rootfs...

If this kernel is not supported, what options exist? Installing the driver directly on the host, and maybe the toolkit as well if similar issues creep up?

shivamerla commented 1 year ago

@jeremy-london can you follow the instructions from comment https://github.com/NVIDIA/gpu-operator/issues/457#issuecomment-1343449476. Yes, another option is to pre-install drivers on the host in this case. The Container Toolkit doesn't need to be pre-installed, as it doesn't have kernel-specific runtime dependencies like the driver container.
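If the pre-installed-driver route is taken, the chart only needs the driver container disabled -- a minimal sketch:

```bash
# Sketch: with the NVIDIA driver installed directly on the FIPS host, tell the
# operator to skip the driver container and manage only toolkit/plugins.
helm upgrade -i gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator \
  -f values.yaml \
  --set driver.enabled=false
```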

jeremy-london commented 1 year ago

@shivamerla Seems we are getting closer --

I added the following to a file, then created the ConfigMap and ran a helm upgrade with the new settings:

deb https://esm.ubuntu.com/cis/ubuntu focal main
# deb-src https://esm.ubuntu.com/cis/ubuntu focal main

deb https://esm.ubuntu.com/infra/ubuntu focal-infra-security main
# deb-src https://esm.ubuntu.com/infra/ubuntu focal-infra-security main

deb https://esm.ubuntu.com/infra/ubuntu focal-infra-updates main
# deb-src https://esm.ubuntu.com/infra/ubuntu focal-infra-updates main

deb https://esm.ubuntu.com/fips-updates/ubuntu focal-updates main
# deb-src https://esm.ubuntu.com/fips-updates/ubuntu focal-updates main

(That's all the extra ones I have on the host.)

Now I'm dealing with a few other packages not coming through, but I'm seeing an NVIDIA-specific one and wondering whether this kernel version is in the support pool for those packages:

========== NVIDIA Software Installer ==========

Starting installation of NVIDIA driver branch 515 for Linux kernel version 5.4.0-1068-fips

Stopping NVIDIA persistence daemon...
Unloading NVIDIA driver kernel modules...
Unmounting NVIDIA driver rootfs...
Updating the package cache...
Installing NVIDIA driver kernel modules...
Hit:1 http://us.archive.ubuntu.com/ubuntu focal-updates InRelease
Hit:2 http://us.archive.ubuntu.com/ubuntu focal-security InRelease
Hit:3 http://archive.ubuntu.com/ubuntu focal InRelease
Hit:4 http://archive.ubuntu.com/ubuntu focal-updates InRelease
Hit:5 http://archive.ubuntu.com/ubuntu focal-security InRelease
Reading package lists...
Reading package lists...
Building dependency tree...
Reading state information...
E: Unable to locate package linux-objects-nvidia-515-server-5.4.0-1068-fips
E: Couldn't find any package by glob 'linux-objects-nvidia-515-server-5.4.0-1068-fips'
E: Couldn't find any package by regex 'linux-objects-nvidia-515-server-5.4.0-1068-fips'
E: Unable to locate package linux-signatures-nvidia-5.4.0-1068-fips
E: Couldn't find any package by glob 'linux-signatures-nvidia-5.4.0-1068-fips'
E: Couldn't find any package by regex 'linux-signatures-nvidia-5.4.0-1068-fips'
E: Unable to locate package linux-modules-nvidia-515-server-5.4.0-1068-fips
E: Couldn't find any package by glob 'linux-modules-nvidia-515-server-5.4.0-1068-fips'
E: Couldn't find any package by regex 'linux-modules-nvidia-515-server-5.4.0-1068-fips'
Stopping NVIDIA persistence daemon...
Unloading NVIDIA driver kernel modules...
Unmounting NVIDIA driver rootfs...
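A quick way to confirm on the host whether any precompiled NVIDIA module packages exist for this kernel flavour (a hedged sketch; adjust the driver branch as needed):

```bash
# An empty result here means Canonical does not publish precompiled/signed
# NVIDIA modules for this kernel, so the "515-signed" path cannot work.
KERNEL_VERSION=$(uname -r)
apt-cache search "linux-modules-nvidia-.*-${KERNEL_VERSION}"
apt-cache policy "linux-modules-nvidia-515-server-${KERNEL_VERSION}" 2>/dev/null
```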
shivamerla commented 1 year ago

@jeremy-london yes, that is correct -- precompiled packages are not available for this kernel. Please use `driver.version` as `525.60.13` instead of `515-signed`.
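For reference, that change could be applied with something like the following (values file otherwise unchanged):

```bash
# Switch from the precompiled "515-signed" tag to a plain driver version,
# which builds the kernel modules inside the driver container instead.
helm upgrade -i gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator \
  -f values.yaml \
  --set driver.version=525.60.13
```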