NVIDIA / gpu-operator

NVIDIA GPU Operator creates/configures/manages GPUs atop Kubernetes
Apache License 2.0

Daemonset Unable to find RHCOS toolkit-driver image when installing on OKD 4.13 #592

Open lohwanjing opened 10 months ago

lohwanjing commented 10 months ago

1. Quick Debug Information

2. Issue or feature description

When installing the GPU Operator in a fully air-gapped environment on OKD 4.13, I would expect the installation to complete successfully once the images and operators are mirrored and the appropriate ImageContentSourcePolicy is set up. Currently, the gpu-operator itself installs successfully, but after the default CSV is set up, the daemonset fails with a CrashLoopBackOff. Based on the warning, the RHCOS image tag cannot be found, so the operator falls back to the entitlement-based path, which fails because the environment is disconnected.

For OCP installations (and I would assume OKD as well), the GPU Operator should be using the RHCOS container-toolkit, which would allow for entitlement-free installation of the NVIDIA drivers and would therefore work in a disconnected environment.

3. Steps to reproduce the issue

  1. Mirror the OpenShift Node Feature Discovery Operator, the NVIDIA GPU Operator and their associated images using the oc-mirror tool
  2. Upload images to local container registry
  3. Update the catalog information and the ImageContentSourcePolicy in OKD (to allow installation from OperatorHub)
  4. Install NFD, ensure installation is successful, and that GPU is detected on worker node
  5. Install GPU Operator, set up CSV
  6. The daemonset starts on the worker node containing the GPU, but fails to initialise successfully.
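
For step 3, here is a minimal sketch of what the ImageContentSourcePolicy could look like, built as a Python dict and printed as JSON (which `oc apply -f -` accepts). The mirror host `registry.internal.example:5000`, the policy name, and the repository paths are placeholders for illustration, not the values used in this cluster.

```python
import json


def build_icsp(name, mirrors_by_source):
    """Build an ImageContentSourcePolicy manifest as a plain dict.

    mirrors_by_source maps a source repository to the list of mirror
    repositories that should be tried for it.
    """
    return {
        "apiVersion": "operator.openshift.io/v1alpha1",
        "kind": "ImageContentSourcePolicy",
        "metadata": {"name": name},
        "spec": {
            "repositoryDigestMirrors": [
                {"source": source, "mirrors": list(mirrors)}
                for source, mirrors in sorted(mirrors_by_source.items())
            ]
        },
    }


# Hypothetical mirror host and repository paths; substitute your own.
mirror = "registry.internal.example:5000"
policy = build_icsp(
    "gpu-operator-mirror",
    {
        "nvcr.io/nvidia": [f"{mirror}/nvidia"],
        "quay.io/openshift": [f"{mirror}/openshift"],
    },
)
print(json.dumps(policy, indent=2))
```

Keep in mind that `repositoryDigestMirrors` only redirects pulls made by digest; an image referenced by tag is not redirected to the mirror.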

4. Information to attach (optional if deemed irrelevant)

daemonset_logs

daemonset

Note the tag on the daemonset which indicates the image is missing.

I saw issue #428, which was resolved by manually tagging the container toolkit container. In this case, manually tagging the container did not result in any changes.

imageset

Additionally, the container toolkit (at least according to the docs) seems to run as a sidecar container, but none of the pods that were launched seem to contain the image in their specs.
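
One way to check this across the whole namespace, as a rough sketch: take the JSON from `oc get pods -n <namespace> -o json` and list every init and regular container image per pod. The sample data below is made up purely to show the shape of the input; the search string `driver-toolkit` is an assumption about how the image is named.

```python
def images_by_pod(pod_list):
    """Map each pod name to the images of its init and regular containers."""
    result = {}
    for pod in pod_list.get("items", []):
        spec = pod.get("spec", {})
        containers = (spec.get("initContainers") or []) + (spec.get("containers") or [])
        result[pod["metadata"]["name"]] = [c["image"] for c in containers]
    return result


def pods_using(pod_list, needle):
    """Return the sorted names of pods with an image containing `needle`."""
    return sorted(
        name
        for name, images in images_by_pod(pod_list).items()
        if any(needle in image for image in images)
    )


# Made-up sample in the shape of `oc get pods -o json` output.
sample = {
    "items": [
        {
            "metadata": {"name": "example-driver-pod"},
            "spec": {"containers": [{"image": "registry.example/driver:1"}]},
        }
    ]
}
print(images_by_pod(sample))
print(pods_using(sample, "driver-toolkit"))
```

With real output, load the JSON with `json.load` and pass the resulting dict to `pods_using`.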

shivamerla commented 10 months ago

@lohwanjing can you run oc get imagestream driver-toolkit -n openshift -o yaml? Logs from the operator pod will also help.
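
If it helps when reading that output, here is a small sketch that lists the tags an ImageStream actually carries and checks for a specific one. It assumes the object was fetched with `-o json` instead of `-o yaml` and parsed into a dict. The tag to look for (`wanted`) should be whatever version string the daemonset warning reports; the tag and image reference in the sample are invented.

```python
def imagestream_tags(imagestream):
    """Return {tag: [dockerImageReference, ...]} from an ImageStream object."""
    tags = {}
    for entry in imagestream.get("status", {}).get("tags", []):
        # A tag that failed to import can have no items at all.
        items = entry.get("items") or []
        tags[entry["tag"]] = [item.get("dockerImageReference", "") for item in items]
    return tags


def has_tag(imagestream, wanted):
    """True if the ImageStream status lists a tag named `wanted`."""
    return wanted in imagestream_tags(imagestream)


# Made-up sample in the shape of `oc get imagestream ... -o json` output.
sample = {
    "kind": "ImageStream",
    "status": {
        "tags": [
            {
                "tag": "latest",
                "items": [
                    {"dockerImageReference": "registry.example/driver-toolkit@sha256:0000"}
                ],
            }
        ]
    },
}
print(sorted(imagestream_tags(sample)))
print(has_tag(sample, "some-os-version"))
```

If the tag is listed but its items are empty, the import itself failed, which would point at the mirror configuration instead of the tag name.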

lohwanjing commented 10 months ago

Hi, attached is the imagestream output. Note that we use an ImageContentSourcePolicy to point quay.io to our internal mirror. I will get you the operator pod logs later.

driver-toolkit.txt

lohwanjing commented 9 months ago

@shivamerla

Hi, apologies for the delay. Attached are the logs for the operator pods, both before and after installing the CSP.

gpu-operator-after-CSP.log gpu-operator-before-CSP.log

Additionally, this is the list of pods in the namespace:

podlist.txt

KricejJanezMartin commented 1 month ago

@lohwanjing

Hi, did you manage to find a solution to the problem described? I am also currently trying to install the NVIDIA GPU Operator on OKD 4.15. I would appreciate any insight.

Cheers!