NVIDIA / deepops

Tools for building GPU clusters
BSD 3-Clause "New" or "Revised" License
1.25k stars 326 forks source link

./scripts/k8s/verify_gpu.sh fail #1240

Closed TranThanh96 closed 1 year ago

TranThanh96 commented 1 year ago

version: tag/22.08 ubuntu: Ubuntu 20.04.5 LTS

Starting './scripts/k8s/verify_gpu.sh';  DeepOps version ''
job_name=cluster-gpu-tests
total_gpus=0
Creating/Deleting sandbox Namespace
updating test yml
downloading containers ...
job.batch/cluster-gpu-tests condition met
executing ...
No resources found in cluster-gpu-verify namespace.
GPU verification can't be executed, please check GPU node directly

check k9s: image

How can I fix this. someime k9s not error: image

TranThanh96 commented 1 year ago

nvidia-driver still work: image

supertetelman commented 1 year ago

It looks like the cluster eventually came up in a health state. Are you still having problems with this? It look like the node labeler was having issues that eventually cleared up. It is hard to debug this issue without any logs from the failing POD or the output from describe nodes.

github-actions[bot] commented 1 year ago

This issue is stale because it has been open for 60 days with no activity. Please update the issue or it will be closed in 7 days.