NVIDIA / gpu-operator

NVIDIA GPU Operator creates, configures, and manages GPUs in Kubernetes
https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/index.html
Apache License 2.0

imagepullbackoff on the nvidia-operator w/ nvcr.io/nvidia/cuda #560

Open jayunit100 opened 1 year ago

jayunit100 commented 1 year ago

The template below is mostly useful for bug reports and support questions. Feel free to remove anything which doesn't apply to you and add more information where it makes sense.

1. Quick Debug Checklist

yes

1. Issue or feature description

We're seeing image pull back-offs on the operator:

 kubectl get pods -A | grep nvidia
gpu-operator-resources   nvidia-container-toolkit-daemonset-xwp6p                      0/1     Init:ImagePullBackOff   0              73m
gpu-operator-resources   nvidia-driver-daemonset-jk2t8                                 1/1     Running                 12 (12m ago)   73m

like so:

  Back-off pulling image "nvcr.io/nvidia/cuda@sha256:ed723a1339cddd75eb9f2be2f3476edf497a1b189c10c9bf9eb8da4a16a51a59"

The workaround we found is to remove the pinned digest (the @sha256:... suffix) from nvcr.io/nvidia/cuda so that it falls back to latest.
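For anyone scripting the same workaround, here is a minimal Python sketch of what dropping the digest amounts to. The helper names `split_image_reference` and `drop_digest` are made up for illustration and are not part of the GPU Operator or any registry tooling; the only assumption about behaviour is the usual container-runtime default that an untagged reference resolves to `latest`.

```python
def split_image_reference(ref: str) -> dict:
    """Split 'repository[:tag][@digest]' into its parts (hypothetical helper)."""
    name, _, digest = ref.partition("@")
    # A tag colon can only appear after the last '/', so a registry port
    # such as 'localhost:5000/img' is not mistaken for a tag.
    last_slash = name.rfind("/")
    colon = name.rfind(":")
    if colon > last_slash:
        repository, tag = name[:colon], name[colon + 1:]
    else:
        repository, tag = name, None
    return {"repository": repository, "tag": tag, "digest": digest or None}


def drop_digest(ref: str, default_tag: str = "latest") -> str:
    """Return the reference without its digest pin, keeping any existing tag."""
    parts = split_image_reference(ref)
    return f"{parts['repository']}:{parts['tag'] or default_tag}"


# The reference from the back-off message in this issue.
pinned = (
    "nvcr.io/nvidia/cuda@sha256:"
    "ed723a1339cddd75eb9f2be2f3476edf497a1b189c10c9bf9eb8da4a16a51a59"
)
print(drop_digest(pinned))
```

Note that falling back to `latest` gives up reproducibility, so pinning to a known-good tag or digest is probably preferable once one is identified.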

1. Reproducing

We'll add details shortly; we just wanted to make sure we filed this in case others are hitting it as well.

shivamerla commented 1 year ago

@jayunit100 Which version of the GPU Operator is this? Please note that CUDA base images are used as initContainers in some pods deployed by the operator, and the version can be controlled in the ClusterPolicy here.

jayunit100 commented 1 year ago

Hi @shivamerla, I don't recall the operator version. Can you suggest a few image SHAs that I can try, though?

Or is there a way to query the NVIDIA repo's image tags? I tried using

-> % imgpkg tag list -i  nvcr.io/nvidia/cuda    
imgpkg: Error: unrecognized challenge: 

but it looks like they're not accessible as a standard Docker repo that way, and nvcr.io says authorization is required.
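As background on the "unrecognized challenge" error: registries that follow the Docker Registry token authentication scheme answer an unauthenticated request with a `WWW-Authenticate: Bearer ...` header, and the client is expected to fetch a token from the advertised realm before retrying. The Python sketch below only parses such a header and builds the token URL. The header value is a made-up example (`auth.example.com` and `registry.example.com` are placeholders), and nothing here confirms what nvcr.io actually returns or whether it allows anonymous tag listing.

```python
import re
from urllib.parse import urlencode


def parse_bearer_challenge(header: str) -> dict:
    """Parse a 'Bearer key="value",...' challenge into a dict."""
    scheme, _, params = header.strip().partition(" ")
    if scheme.lower() != "bearer":
        raise ValueError(f"unsupported auth scheme: {scheme!r}")
    # Collect every key="value" pair from the parameter list.
    return dict(re.findall(r'(\w+)="([^"]*)"', params))


def token_url(challenge: dict) -> str:
    """Build the URL a client would request to obtain a pull token."""
    query = {k: challenge[k] for k in ("service", "scope") if k in challenge}
    return f"{challenge['realm']}?{urlencode(query)}"


# Hypothetical header, NOT captured from nvcr.io.
example_header = (
    'Bearer realm="https://auth.example.com/token",'
    'service="registry.example.com",'
    'scope="repository:nvidia/cuda:pull"'
)

challenge = parse_bearer_challenge(example_header)
print(challenge["realm"])
print(token_url(challenge))
```

A tool that reports an unrecognized challenge may simply have received a header in a form it does not handle, so inspecting the raw response header is a reasonable first debugging step.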

shivamerla commented 1 year ago

You can try using this image:

nvcr.io/nvidia/cuda:12.2.0-base-ubuntu22.04
 sudo docker regctl manifest get nvcr.io/nvidia/cuda:12.2.0-base-ubuntu22.04
Name:        nvcr.io/nvidia/cuda:12.2.0-base-ubuntu22.04
MediaType:   application/vnd.docker.distribution.manifest.list.v2+json
Digest:      sha256:f8870283bea6a85ba4b4a5e1b65158dd15e8009e433539e7c83c94707e703a1b

Manifests:   

  Name:      nvcr.io/nvidia/cuda:12.2.0-base-ubuntu22.04@sha256:51c9b445ee2a1eb94631ed5dabc755e915db7485fee3cc5c754df9298b16e81e
  Digest:    sha256:51c9b445ee2a1eb94631ed5dabc755e915db7485fee3cc5c754df9298b16e81e
  MediaType: application/vnd.docker.distribution.manifest.v2+json
  Platform:  linux/arm64

  Name:      nvcr.io/nvidia/cuda:12.2.0-base-ubuntu22.04@sha256:1069ccd2910506f68e1d7c0907a32aaa877b8038d1aa24cb7ffb2d2a85d725c7
  Digest:    sha256:1069ccd2910506f68e1d7c0907a32aaa877b8038d1aa24cb7ffb2d2a85d725c7
  MediaType: application/vnd.docker.distribution.manifest.v2+json
  Platform:  linux/amd64
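The output above is a manifest list: the tag has one top-level digest plus one digest per platform. As a rough illustration of picking the right one to pin, here is a Python sketch that selects the per-platform digest from a manifest list in its JSON form. The dict is hand-built from the digests printed above and follows the Docker manifest list v2 layout; `digest_for_platform` is a hypothetical helper, not a function from regctl or the operator.

```python
def digest_for_platform(manifest_list: dict, os_name: str, arch: str):
    """Return the digest of the entry matching os/architecture, or None."""
    for entry in manifest_list.get("manifests", []):
        platform = entry.get("platform", {})
        if platform.get("os") == os_name and platform.get("architecture") == arch:
            return entry["digest"]
    return None


# Hand-built from the regctl output above (only the fields used here).
cuda_12_2_0_base = {
    "mediaType": "application/vnd.docker.distribution.manifest.list.v2+json",
    "manifests": [
        {
            "mediaType": "application/vnd.docker.distribution.manifest.v2+json",
            "digest": "sha256:51c9b445ee2a1eb94631ed5dabc755e915db7485fee3cc5c754df9298b16e81e",
            "platform": {"os": "linux", "architecture": "arm64"},
        },
        {
            "mediaType": "application/vnd.docker.distribution.manifest.v2+json",
            "digest": "sha256:1069ccd2910506f68e1d7c0907a32aaa877b8038d1aa24cb7ffb2d2a85d725c7",
            "platform": {"os": "linux", "architecture": "amd64"},
        },
    ],
}

amd64 = digest_for_platform(cuda_12_2_0_base, "linux", "amd64")
print(f"nvcr.io/nvidia/cuda@{amd64}")
```

Pinning the top-level list digest (sha256:f887...) instead should let each node resolve its own architecture, which may be the better choice on mixed clusters.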