NVIDIA / gpu-operator

NVIDIA GPU Operator creates/configures/manages GPUs atop Kubernetes
Apache License 2.0

Can't install because '--leader-elect' is missing #181

Closed · gecube closed this issue 3 years ago

gecube commented 3 years ago

1. Quick Debug Checklist

Here is what is going on:

  1. I took the instructions from https://developer.nvidia.com/blog/announcing-containerd-support-for-the-nvidia-gpu-operator/
  2. Cloned the current repo.
  3. `cd deployments/gpu-operator`
  4. Ran:

     ```sh
     helm install --wait --generate-name . \
       --set operator.defaultRuntime=containerd \
       --set toolkit.env[0].name=CONTAINERD_CONFIG \
       --set toolkit.env[0].value=/etc/containerd/config.toml \
       --set toolkit.env[1].name=CONTAINERD_SOCKET \
       --set toolkit.env[1].value=/run/containerd/containerd.sock \
       --set toolkit.env[2].name=CONTAINERD_RUNTIME_CLASS \
       --set toolkit.env[2].value=nvidia \
       --set toolkit.env[3].name=CONTAINERD_SET_AS_DEFAULT \
       --set toolkit.env[3].value=true
     ```

  5. I was hit by https://github.com/NVIDIA/gpu-operator/issues/174
  6. The gpu-operator pod is also created, but:

```
kubectl get pods

NAME                                                          READY   STATUS              RESTARTS   AGE
gpu-operator-56bfbfd666-2g27g                                 0/1     CrashLoopBackOff    10         31m
gpu-operator-node-feature-discovery-master-dcf999dc8-rzkv6   0/1     ErrImagePull        0          31m
gpu-operator-node-feature-discovery-worker-69jw9              0/1     ContainerCreating   0          31m
```
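
A quick way to see why each pod is in that state is to check the previous container logs and the pod events (a sketch; the pod names are taken from the listing above):

```sh
# Why is the operator pod crash-looping? Check the logs of the previous container run.
kubectl logs pod/gpu-operator-56bfbfd666-2g27g --previous

# Why can't the node-feature-discovery image be pulled? Check the pod events.
kubectl describe pod gpu-operator-node-feature-discovery-master-dcf999dc8-rzkv6 | tail -n 20
```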


I checked the logs and found that the `--leader-elect` argument is not supported:

```
kubectl logs pod/gpu-operator-56bfbfd666-2g27g

unknown flag: --leader-elect
Usage of gpu-operator:
      --zap-devel                        Enable zap development mode (changes defaults to console encoder, debug log level, disables sampling and stacktrace from 'warning' level)
      --zap-encoder encoder              Zap log encoding ('json' or 'console')
      --zap-level level                  Zap log level (one of 'debug', 'info', 'error' or any integer value > 0) (default info)
      --zap-sample sample                Enable zap log sampling. Sampling will be disabled for integer log levels > 1
      --zap-stacktrace-level level       Set the minimum log level that triggers stacktrace generation (default error)
      --zap-time-encoding timeEncoding   Sets the zap time format ('epoch', 'millis', 'nano', or 'iso8601') (default )
unknown flag: --leader-elect
```


I used the image `nvcr.io/nvidia/gpu-operator:1.6.2`
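
A quick way to compare the operator image running in the cluster against the chart sources that were checked out (a sketch; the deployment name and local clone path are assumptions, not taken from the thread):

```sh
# Image actually running in the cluster (deployment name assumed from the pod names above)
kubectl get deployment gpu-operator \
  -o jsonpath='{.spec.template.spec.containers[0].image}{"\n"}'

# Tag/commit of the chart sources currently checked out (clone path is a placeholder)
git -C gpu-operator describe --tags
```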
kpouget commented 3 years ago

This is failing because the YAML you're using (master) doesn't match the operator image (1.6.2), so you would need to:

  1. undeploy (with helm) what you currently have live,
  2. delete the CRD: `oc --ignore-not-found=true delete crd clusterpolicies.nvidia.com`
  3. check out the 1.6.2 tag,
  4. redeploy (a command sketch follows below)

I got bitten by this error a few times already :)
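
A rough end-to-end sketch of that recovery on plain Kubernetes, using `kubectl` in place of `oc` (the release name is a placeholder and the tag name follows the thread; adjust to your setup):

```sh
# 1. Undeploy the current release (look up the generated release name first)
helm list
helm uninstall <release-name>

# 2. Remove the leftover ClusterPolicy CRD
kubectl delete crd clusterpolicies.nvidia.com --ignore-not-found=true

# 3. Check out the chart sources at the tag matching the operator image
git checkout 1.6.2

# 4. Redeploy from the matching chart
#    (plus the same --set flags for containerd as in the original install command)
cd deployments/gpu-operator
helm install --wait --generate-name .
```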

gecube commented 3 years ago

@kpouget Thanks. I decided to check the tags and found the 1.6.2 tag of this repo, with a completely different Helm chart. Uf... And that helped to deploy. I think it is a good idea for developers to follow the 'always green master branch' principle.

kpouget commented 3 years ago

> I think it is a good idea for developers to follow the 'always green master branch' principle.

I fully agree. I'm actually facing a similar issue right now while trying to deploy the operator as a bundle, as this image doesn't exist yet:

https://github.com/NVIDIA/gpu-operator/blob/d1a6787012175a590b1ffea977cd04d889c5a335/bundle/manifests/gpu-operator.clusterserviceversion.yaml#L163

shivamerla commented 3 years ago

@kpouget We have discussed this internally a few times, but there is no clear conclusion yet. Ideally we would maintain an image with the tag `latest` that always represents changes from master (i.e. it gets updated with every merge). Hopefully we will add this soon, so we can use it with the Helm charts/CSV files etc. in the master branch.
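
For illustration, the per-merge publishing step could look roughly like this (a sketch only; the registry path and CI wiring are assumptions, not the project's actual pipeline):

```sh
# Run by CI on every merge to master: build the operator image and
# publish it under both the commit SHA and a moving "latest" tag.
GIT_SHA=$(git rev-parse --short HEAD)
IMAGE=quay.io/example/gpu-operator   # placeholder registry/repository

docker build -t "${IMAGE}:${GIT_SHA}" .
docker tag "${IMAGE}:${GIT_SHA}" "${IMAGE}:latest"
docker push "${IMAGE}:${GIT_SHA}"
docker push "${IMAGE}:latest"
```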

kpouget commented 3 years ago

@shivamerla Yes, an image tagged `latest` or anything similar would be best; that's easy to automate with tools like Quay.io. I've started using it in the CI for nightly testing.