rupang790 opened this issue 3 years ago
Can you attach the logs of the gpu-operator pod to debug?
@shivamerla, sorry, I forgot to attach the logs of the gpu-operator pod: gpu-operator-76d5d98454-6g727-gpu-operator.log
@rupang790 Looks like you are installing a very old version (--devel). Any reason for that? It is failing because the gpu-operator-resources namespace is missing; newer versions create this namespace automatically.
{"level":"info","ts":1625012388.7674541,"logger":"controller_clusterpolicy","msg":"Couldn't create","ServiceAccount":"nvidia-driver","Namespace":"gpu-operator-resources","Error":"namespaces \"gpu-operator-resources\" not found"}
{"level":"error","ts":1625012388.7676754,"logger":"controller-runtime.controller","msg":"Reconciler error","controller":"clusterpolicy-controller","request":"/cluster-policy","error":"namespaces \"gpu-operator-resources\" not found","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\t/go/pkg/mod/github.com/go-logr/zapr@v0.1.1/zapr.go:128\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.5.2/pkg/internal/controller/controller.go:258\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.5.2/pkg/internal/controller/controller.go:232\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.5.2/pkg/internal/controller/controller.go:211\nk8s.io/apimachinery/pkg/util/wait.JitterUntil.func1\n\t/go/pkg/mod/k8s.io/apimachinery@v0.17.4/pkg/util/wait/wait.go:152\nk8s.io/apimachinery/pkg/util/wait.JitterUntil\n\t/go/pkg/mod/k8s.io/apimachinery@v0.17.4/pkg/util/wait/wait.go:153\nk8s.io/apimachinery/pkg/util/wait.Until\n\t/go/pkg/mod/k8s.io/apimachinery@v0.17.4/pkg/util/wait/wait.go:88"}
@shivamerla, a few weeks ago I tried to install a newer version of the GPU Operator (v1.6.0, maybe) on OKD cluster 4.5.0-0.okd-2020-10-15-235428 and some pods were not running well (sorry, I do not have any logs from that). Then I installed 1.3.0 and it worked well. That is why I am using 1.3.0 on my cluster.
@shivamerla, as you said, I used a very old version, so I am now trying to install version 1.7.1 on my cluster, but it seems to have an issue with the toolkit. The nvidia-operator-validator pod is stuck in Init:CrashLoopBackOff status, and I can see the error on the toolkit validator as below.
On the nvidia-container-toolkit-daemonset pod, the driver-validation container shows the results of the nvidia-smi command, and the nvidia-container-toolkit-ctr container shows these logs:
time="2021-07-07T04:22:19Z" level=info msg="Starting nvidia-toolkit"
time="2021-07-07T04:22:19Z" level=info msg="Parsing arguments"
time="2021-07-07T04:22:19Z" level=info msg="Verifying Flags"
time="2021-07-07T04:22:19Z" level=info msg=Initializing
time="2021-07-07T04:22:19Z" level=info msg="Installing toolkit"
time="2021-07-07T04:22:19Z" level=info msg="Parsing arguments: [/usr/local/nvidia/toolkit]"
time="2021-07-07T04:22:19Z" level=info msg="Successfully parsed arguments"
time="2021-07-07T04:22:19Z" level=info msg="Installing NVIDIA container toolkit to '/usr/local/nvidia/toolkit'"
time="2021-07-07T04:22:19Z" level=info msg="Removing existing NVIDIA container toolkit installation"
time="2021-07-07T04:22:19Z" level=info msg="Creating directory '/usr/local/nvidia/toolkit'"
time="2021-07-07T04:22:19Z" level=info msg="Creating directory '/usr/local/nvidia/toolkit/.config/nvidia-container-runtime'"
time="2021-07-07T04:22:19Z" level=info msg="Installing NVIDIA container library to '/usr/local/nvidia/toolkit'"
time="2021-07-07T04:22:19Z" level=info msg="Checking library candidate '/usr/lib64/libnvidia-container.so.1'"
time="2021-07-07T04:22:19Z" level=info msg="Resolved link: '/usr/lib64/libnvidia-container.so.1' => '/usr/lib64/libnvidia-container.so.1.4.0'"
time="2021-07-07T04:22:19Z" level=info msg="Installing '/usr/lib64/libnvidia-container.so.1.4.0' to '/usr/local/nvidia/toolkit/libnvidia-container.so.1.4.0'"
time="2021-07-07T04:22:19Z" level=info msg="Installed '/usr/lib64/libnvidia-container.so.1' to '/usr/local/nvidia/toolkit/libnvidia-container.so.1.4.0'"
time="2021-07-07T04:22:19Z" level=info msg="Creating symlink '/usr/local/nvidia/toolkit/libnvidia-container.so.1' -> 'libnvidia-container.so.1.4.0'"
time="2021-07-07T04:22:19Z" level=info msg="Installing NVIDIA container runtime from '/usr/bin/nvidia-container-runtime'"
time="2021-07-07T04:22:19Z" level=info msg="Installing executable '/usr/bin/nvidia-container-runtime'"
time="2021-07-07T04:22:19Z" level=info msg="Installing '/usr/bin/nvidia-container-runtime' to '/usr/local/nvidia/toolkit/nvidia-container-runtime.real'"
time="2021-07-07T04:22:19Z" level=info msg="Created '/usr/local/nvidia/toolkit/nvidia-container-runtime.real'"
time="2021-07-07T04:22:19Z" level=info msg="Created wrapper '/usr/local/nvidia/toolkit/nvidia-container-runtime'"
time="2021-07-07T04:22:19Z" level=info msg="Installing NVIDIA container CLI from '/usr/bin/nvidia-container-cli'"
time="2021-07-07T04:22:19Z" level=info msg="Installing executable '/usr/bin/nvidia-container-cli'"
time="2021-07-07T04:22:19Z" level=info msg="Installing '/usr/bin/nvidia-container-cli' to '/usr/local/nvidia/toolkit/nvidia-container-cli.real'"
time="2021-07-07T04:22:19Z" level=info msg="Created '/usr/local/nvidia/toolkit/nvidia-container-cli.real'"
time="2021-07-07T04:22:19Z" level=info msg="Created wrapper '/usr/local/nvidia/toolkit/nvidia-container-cli'"
time="2021-07-07T04:22:19Z" level=info msg="Installing NVIDIA container runtime hook from '/usr/bin/nvidia-container-toolkit'"
time="2021-07-07T04:22:19Z" level=info msg="Installing executable '/usr/bin/nvidia-container-toolkit'"
time="2021-07-07T04:22:19Z" level=info msg="Installing '/usr/bin/nvidia-container-toolkit' to '/usr/local/nvidia/toolkit/nvidia-container-toolkit.real'"
time="2021-07-07T04:22:19Z" level=info msg="Created '/usr/local/nvidia/toolkit/nvidia-container-toolkit.real'"
time="2021-07-07T04:22:19Z" level=info msg="Created wrapper '/usr/local/nvidia/toolkit/nvidia-container-toolkit'"
time="2021-07-07T04:22:19Z" level=info msg="Creating symlink '/usr/local/nvidia/toolkit/nvidia-container-runtime-hook' -> 'nvidia-container-toolkit'"
time="2021-07-07T04:22:19Z" level=info msg="Installing NVIDIA container toolkit config '/usr/local/nvidia/toolkit/.config/nvidia-container-runtime/config.toml'"
time="2021-07-07T04:22:19Z" level=info msg="Setting up runtime"
time="2021-07-07T04:22:19Z" level=info msg="Parsing arguments: [/usr/local/nvidia/toolkit]"
time="2021-07-07T04:22:19Z" level=info msg="Successfully parsed arguments"
time="2021-07-07T04:22:19Z" level=info msg="Starting 'setup' for crio"
time="2021-07-07T04:22:19Z" level=info msg="Waiting for signal"
According to https://github.com/NVIDIA/gpu-operator/issues/167#issuecomment-808524121, I also tried version 1.6.2, but it showed an error on the validation pod as well. I changed the CRI-O hooks.d configuration to /run/containers/oci/hooks.d and restarted the crio service.
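For reference, the hooks.d change mentioned above corresponds to a stanza like the following in the CRI-O configuration (a sketch only; whether it lives in /etc/crio/crio.conf or a drop-in file, and the exact path, depend on the cluster):

```toml
# Sketch of the CRI-O hooks directory setting; the path is taken from the
# comment above and is not verified against this cluster.
[crio.runtime]
hooks_dir = [
    "/run/containers/oci/hooks.d",
]
```

After editing, `systemctl restart crio` picks up the change.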
How can I solve this? Once it is solved, I will stop using the old version for testing the local Helm installation.
@shivamerla, installing GPU Operator 1.5.2 on the OKD 4.5 cluster succeeded, so version 1.5.2 will be used for my project. For the restricted-network installation of 1.5.2, I would like to confirm how to prepare and install. My procedure was:
$ helm lint ./gpu-operator
$ kubectl create ns gpu-operator
$ helm install ./gpu-operator -n gpu-operator --version 1.5.2 --set operator.defaultRuntime=crio,toolkit.version=1.4.0-ubi8 --wait --generate-name
After that procedure, the gpu-operator pod went into CrashLoopBackOff, and I saw these logs from the pod:
$ kubectl logs -n gpu-operator gpu-operator-8678476587-jr24j
unknown flag: --leader-elect
Usage of gpu-operator:
unknown flag: --leader-elect
--zap-devel Enable zap development mode (changes defaults to console encoder, debug log level, disables sampling and stacktrace from 'warning' level)
--zap-encoder encoder Zap log encoding ('json' or 'console')
--zap-level level Zap log level (one of 'debug', 'info', 'error' or any integer value > 0) (default info)
--zap-sample sample Enable zap log sampling. Sampling will be disabled for integer log levels > 1
--zap-stacktrace-level level Set the minimum log level that triggers stacktrace generation (default error)
--zap-time-encoding timeEncoding Sets the zap time format ('epoch', 'millis', 'nano', or 'iso8601') (default )
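A likely cause, offered as a guess: `--leader-elect` does not exist in the 1.5.2 operator binary (its usage output above only lists zap flags), so the rendered Deployment is probably passing arguments from a newer chart to the older image. Note that Helm ignores `--version` for local chart paths, so `helm install ./gpu-operator --version 1.5.2` installs whatever is checked out. A sketch of how to compare (cluster commands are commented out; the version check uses placeholder values, not values read from a cluster):

```shell
# Compare the local chart's appVersion with the deployed operator image tag:
#   grep -E '^(version|appVersion):' ./gpu-operator/Chart.yaml
#   kubectl -n gpu-operator get deploy gpu-operator \
#     -o jsonpath='{.spec.template.spec.containers[0].image}'
# Illustrative mismatch check with placeholder values:
chart_app_version="1.7.1"   # hypothetical value from Chart.yaml
image_tag="1.5.2"           # tag seen in the events below
if [ "$chart_app_version" != "$image_tag" ]; then
    echo "chart/image mismatch: chart=$chart_app_version image=$image_tag"
    echo "try: git -C ./gpu-operator checkout $image_tag"
fi
```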
And the events in the gpu-operator namespace:
$ kubectl get events -n gpu-operator
LAST SEEN TYPE REASON OBJECT MESSAGE
3m43s Normal Scheduled pod/gpu-operator-8678476587-jr24j Successfully assigned gpu-operator/gpu-operator-8678476587-jr24j to k8s-master01
3m43s Normal AddedInterface pod/gpu-operator-8678476587-jr24j Add eth0 [10.244.32.142/32] from k8s-pod-network
2m8s Normal Pulled pod/gpu-operator-8678476587-jr24j Container image "mirror.eluon.okd.com:5000/nvidia/gpu-operator:1.5.2" already present on machine
2m8s Normal Created pod/gpu-operator-8678476587-jr24j Created container gpu-operator
2m7s Normal Started pod/gpu-operator-8678476587-jr24j Started container gpu-operator
2m6s Warning BackOff pod/gpu-operator-8678476587-jr24j Back-off restarting failed container
3m44s Normal SuccessfulCreate replicaset/gpu-operator-8678476587 Created pod: gpu-operator-8678476587-jr24j
3m44s Normal ScalingReplicaSet deployment/gpu-operator Scaled up replica set gpu-operator-8678476587 to 1
Do you have any idea about it?
The template below is mostly useful for bug reports and support questions. Feel free to remove anything which doesn't apply to you and add more information where it makes sense.
1. Quick Debug Checklist
[ ] Are the i2c_core and ipmi_msghandler kernel modules loaded on the nodes?
[ ] Did you apply the CRD? (kubectl describe clusterpolicies --all-namespaces)
1. Issue or feature description
I am trying to install the GPU Operator on an OKD 4.5 cluster in a restricted-network environment. For that, I cloned nvidia/gpu-operator and changed some values in values.yaml and operator.yaml for my cluster, then tried to install with
helm install --devel ./gpu-operator --set platform.openshift=true,operator.defaultRuntime=crio,toolkit.version=1.3.0-ubi8,nfd.enabled=false --wait --generate-name
and checked that gpu-operator was running well (without any error), but there is no gpu-operator-resources namespace and no pods such as dcgm, toolkit, validation, etc. I already checked the gpu-operator installation with
helm install nvidia/gpu-operator
. What am I missing?
2. Steps to reproduce the issue
git clone https://github.com/NVIDIA/gpu-operator.git
Use helm install to install.
3. Information to attach (optional if deemed irrelevant)
[ ] kubernetes pods status:
kubectl get pods --all-namespaces
[ ] kubernetes daemonset status:
kubectl get ds --all-namespaces
[ ] If a pod/ds is in an error state or pending state
kubectl describe pod -n NAMESPACE POD_NAME
[ ] If a pod/ds is in an error state or pending state
kubectl logs -n NAMESPACE POD_NAME
[ ] Output of running a container on the GPU machine:
docker run -it alpine echo foo
[ ] Docker configuration file:
cat /etc/docker/daemon.json
[ ] Docker runtime configuration:
docker info | grep runtime
[ ] NVIDIA shared directory:
ls -la /run/nvidia
[ ] NVIDIA packages directory:
ls -la /usr/local/nvidia/toolkit
[ ] NVIDIA driver directory:
ls -la /run/nvidia/driver
[ ] kubelet logs
journalctl -u kubelet > kubelet.logs
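The checklist outputs above are easiest to attach as a single bundle. A small sketch (file names are illustrative; failures are ignored so the script degrades gracefully on nodes missing a tool):

```shell
# Collect the debug outputs into one temp directory for attaching to the issue.
out=$(mktemp -d)
kubectl get pods --all-namespaces > "$out/pods.txt"       2>&1 || true
kubectl get ds --all-namespaces   > "$out/daemonsets.txt" 2>&1 || true
ls -la /run/nvidia                > "$out/run-nvidia.txt" 2>&1 || true
ls -la /usr/local/nvidia/toolkit  > "$out/toolkit.txt"    2>&1 || true
journalctl -u kubelet             > "$out/kubelet.logs"   2>&1 || true
echo "collected in $out"
```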