Closed mshaikh786 closed 2 years ago
@mshaikh786 :
I'm guessing there's some issue here with GPU Operator not detecting the GPUs. I'm basing this on the lack of GPUs found, and on the pod `nvidia-gpu-operator-node-feature-discovery-worker-2pxl4` in CrashLoopBackOff. 😄
Can you please share the following information? This will hopefully give us an idea for why this isn't working, or at least tell us where to dig next.
- `kubectl logs nvidia-gpu-operator-node-feature-discovery-worker-2pxl4`
- `kubectl describe` for the nodes where you expect to find GPUs
- `group_vars/k8s-cluster.yml` file

Dear @ajdeco, after running the ansible-playbook to install gpu-operator separately, I investigated the pod in question (`nvidia-gpu-operator-node-feature-discovery`), which was in CrashLoopBackOff. It appeared to have an issue connecting to the node with the GPU. Fixing the iptables did the job, and after a short time the pod was able to connect to the endpoint. `scripts/k8s/verify_gpu.sh` is working now.
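For anyone who needs to gather the same diagnostics requested above, here is a minimal shell sketch that collects them into one directory. The function name `collect_gpu_diagnostics`, the `KUBECTL` override, the namespace argument, and the output directory are all assumptions made for illustration; the thread does not say which namespace the pod runs in, so it has to be supplied by the caller.

```shell
#!/usr/bin/env bash
# Sketch only: names and paths below are illustrative assumptions, not part of DeepOps.
KUBECTL="${KUBECTL:-kubectl}"

collect_gpu_diagnostics() {
  local pod="$1"        # the crashing node-feature-discovery worker pod
  local namespace="$2"  # namespace the pod runs in (not stated in the thread)
  local node="$3"       # a node where GPUs are expected
  local outdir="${4:-gpu-diagnostics}"

  mkdir -p "$outdir" || return 1

  # Current logs, plus logs of the previously crashed container if there is one.
  "$KUBECTL" logs "$pod" -n "$namespace" > "$outdir/pod.log" 2>&1
  "$KUBECTL" logs "$pod" -n "$namespace" --previous > "$outdir/pod-previous.log" 2>&1

  # Node description, including capacity, allocatable resources, and labels.
  "$KUBECTL" describe node "$node" > "$outdir/node-describe.txt" 2>&1

  # Copy the cluster config if it exists at the assumed relative path.
  cp config/group_vars/k8s-cluster.yml "$outdir"/ 2>/dev/null

  echo "$outdir"
}
```

A call such as `collect_gpu_diagnostics <pod> <namespace> <node>` prints the directory holding the collected files, which can then be attached to the issue.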
Wonderful, happy to hear it's working! I'll close this issue now, but please feel free to open another if you run into further problems.
Hello, I am using the following instances on GCP:

DeepOps tag: 22.01
The setup script on the bootstrap node runs fine and installs Ansible and other tools. I have set `deepops_gpu_operator_enabled: true` in `config/group_vars/k8s-cluster.yml`.
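A quick way to confirm the setting is actually active (and not commented out) is a small grep-based check. This is a sketch under assumptions: the function name `gpu_operator_enabled` is hypothetical, and it only matches the literal value `true`, not other truthy spellings that Ansible may accept.

```shell
# Sketch only: returns 0 if the file sets deepops_gpu_operator_enabled to the literal value true.
# The default path is the one mentioned in this thread, relative to the DeepOps checkout.
gpu_operator_enabled() {
  local file="${1:-config/group_vars/k8s-cluster.yml}"
  # Match an uncommented key, allowing leading whitespace and a trailing comment.
  grep -Eq '^[[:space:]]*deepops_gpu_operator_enabled:[[:space:]]*true[[:space:]]*(#.*)?$' "$file"
}
```

For example, `gpu_operator_enabled && echo "GPU Operator enabled"` prints the message only when the uncommented setting is present.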
The following runs and installs K8s on both the management and worker nodes without any issue. When I try to verify the installation for the presence of GPUs, I get the following:

    $ ./scripts/k8s/verify_gpu.sh
    Starting './scripts/k8s/verify_gpu.sh'; DeepOps version ''
    job_name=cluster-gpu-tests
    total_gpus=0
    Creating/Deleting sandbox Namespace
    updating test yml
    downloading containers ...
    job.batch/cluster-gpu-tests condition met
    executing ...
    No resources found in cluster-gpu-verify namespace.
    GPU verification can't be executed, please check GPU node directly
The state of pods from all namespaces is as follows:
I am seeing the same behaviour on Oracle Cloud instances too. Any prompt guidance will be highly appreciated.
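When `verify_gpu.sh` reports `total_gpus=0`, one way to check the nodes directly is to ask Kubernetes how many `nvidia.com/gpu` resources each node advertises, and to list pods that are not healthy. The sketch below assumes `kubectl` is configured for the cluster; the function names and the `KUBECTL` override are hypothetical helpers, not part of DeepOps.

```shell
# Sketch only: helper names are illustrative assumptions.
KUBECTL="${KUBECTL:-kubectl}"

# Print each node with the number of GPUs Kubernetes reports as allocatable.
# Nodes that do not advertise the resource typically show <none>.
list_node_gpus() {
  "$KUBECTL" get nodes \
    -o 'custom-columns=NODE:.metadata.name,GPUS:.status.allocatable.nvidia\.com/gpu'
}

# List pods in all namespaces whose STATUS column is neither Running nor Completed.
list_unhealthy_pods() {
  "$KUBECTL" get pods --all-namespaces --no-headers |
    awk '$4 != "Running" && $4 != "Completed"'
}
```

If `list_node_gpus` shows no GPUs on a node that has them, the pods reported by `list_unhealthy_pods` (for example, one in CrashLoopBackOff) are a reasonable place to look next.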