NVIDIA / deepops

Tools for building GPU clusters
BSD 3-Clause "New" or "Revised" License
1.25k stars 326 forks source link

Issue with installation of multi-node K8s for Kubeflow on OCI and GCP instances #1222

Closed mshaikh786 closed 2 years ago

mshaikh786 commented 2 years ago

Hello, I am using the following instances on GCP: ` Deepops tag : 22.01

The setup script on bootstrap node runs fine and installs ansible and other tools. I have set the deepops_gpu_operator_enabled: true in config/group_vars/k8s-cluster.yml The following runs and installs K8s on both management and worker node without any issue. When I try to verify the installation for presence of GPU, I get the following: $ ./scripts/k8s/verify_gpu.sh Starting './scripts/k8s/verify_gpu.sh'; DeepOps version '' job_name=cluster-gpu-tests total_gpus=0 Creating/Deleting sandbox Namespace updating test yml downloading containers ... job.batch/cluster-gpu-tests condition met executing ... No resources found in cluster-gpu-verify namespace. GPU verification can't be executed, please check GPU node directly

The state of pods from all namespaces is as follows: `

`

I am seeing the same behaviour on Oracle cloud instances too. any prompt guidance will be highly appreciated.

ajdecon commented 2 years ago

@mshaikh786 :

I'm guessing there's some issue here with GPU Operator not detecting the GPUs. I'm basing this on the lack of GPUs found, and on the pod nvidia-gpu-operator-node-feature-discovery-worker-2pxl4 in CrashLoopBackoff. 😄

Can you please share the following information? This will hopefully give us an idea for why this isn't working, or at least tell us where to dig next.

mshaikh786 commented 2 years ago

Dear @ajdeco, After running the ansible-playbook to install gpu-operator separately, I investigated the pod (nvidia-gpu-operator-node-feature-discovery) in question with CrashLoopBackOff . It appeared to have issue with connecting to node with GPU. Fixing the iptables did the job, and after a short time, the pod was able to connect to the endpoint. The scripts/k8s/verify_gpu.sh is working now.

ajdecon commented 2 years ago

Wonderful, happy to hear it's working! I'll close this issue now, but please feel free to open another if you run into further problems.