NVIDIA / cloud-native-stack

Run cloud native workloads on NVIDIA GPUs
Apache License 2.0

With egx-platform 4.1 on Ubuntu, Validating CUDA with GPU task hangs #9

Closed kjw3 closed 2 years ago

kjw3 commented 2 years ago

Host OS: Ubuntu 20.04 LTS
EGX-Platform: 4.1 (installed via playbook)

When running "setup.sh validate", the "Validating the CUDA with GPU" task hangs.

TASK [Collecting Number of GPU's] **************************************************************************************
changed: [172.16.100.10]

TASK [Validating the nvidia-smi on Kubernetes] *************************************************************************
changed: [172.16.100.10]

TASK [Validating the CUDA with GPU] ************************************************************************************

It just hangs here. If I cancel and rerun, all the other validation tasks pass, but this task fails, saying the pod already exists.

Looking at the logs of the cuda-vector-add pod, things look good.

nvidia@kejones-egx-stack-01:~$ kubectl get pods
NAME                                                              READY   STATUS      RESTARTS   AGE
cuda-vector-add                                                   0/1     Completed   0          5m29s
gpu-operator-1634331603-node-feature-discovery-master-6cccwnnmg   1/1     Running     0          31m
gpu-operator-1634331603-node-feature-discovery-worker-nkkdg       1/1     Running     0          31m
gpu-operator-7d5bf78f5c-z9xfs                                     1/1     Running     0          31m
nvidia@kejones-egx-stack-01:~$ kubectl logs cuda-vector-add
[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done
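
The manual log check above can be scripted. This is a minimal sketch, assuming the pod is named cuda-vector-add as in the output above. The function name `vector_add_passed` and the `KUBECTL` override variable are hypothetical and are not part of the cloud-native-stack playbooks.

```shell
# Hypothetical helper: succeeds if the pod's logs contain "Test PASSED".
# KUBECTL may be overridden (for example, with a stub); it defaults to kubectl.
vector_add_passed() {
  pod="${1:-cuda-vector-add}"
  # The pipeline's exit status is grep's: 0 if the marker line is present.
  "${KUBECTL:-kubectl}" logs "$pod" | grep -q "Test PASSED"
}
```

For example, `vector_add_passed && echo "CUDA validation passed"` prints the message only when the sample reported success.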

kjw3 commented 2 years ago

Rerunning shows the following:

TASK [Validating the CUDA with GPU] ************************************************************************************
fatal: [172.16.100.10]: FAILED! => {"changed": true, "cmd": "timeout 60 kubectl run cuda-vector-add --rm -t -i --restart=Never --image=k8s.gcr.io/cuda-vector-add:v0.1", "delta": "0:00:00.045969", "end": "2021-10-15 21:32:45.031626", "msg": "non-zero return code", "rc": 1, "start": "2021-10-15 21:32:44.985657", "stderr": "Error from server (AlreadyExists): pods \"cuda-vector-add\" already exists", "stderr_lines": ["Error from server (AlreadyExists): pods \"cuda-vector-add\" already exists"], "stdout": "", "stdout_lines": []}
...ignoring
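
The AlreadyExists error above appears to come from the Completed cuda-vector-add pod left behind by the interrupted run. The sketch below clears it before validation is rerun. The function name `cleanup_validation_pod` and the `KUBECTL` override variable are hypothetical, and this is a manual workaround, not something the playbook is confirmed to do.

```shell
# Hypothetical helper: removes the leftover validation pod, if any.
# KUBECTL may be overridden (for example, with a stub); it defaults to kubectl.
cleanup_validation_pod() {
  pod="${1:-cuda-vector-add}"
  # --ignore-not-found makes the delete a no-op when the pod is already gone.
  "${KUBECTL:-kubectl}" delete pod "$pod" --ignore-not-found
}
```

Running `cleanup_validation_pod` and then "setup.sh validate" again should let the task create a fresh pod. It does not address why the first run hung.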

angudadevops commented 2 years ago

@kjw3

We couldn't replicate this issue on our end. Please let us know if you still see it.

Thanks,
Anurag G