After `kubelet`, `containerd`, and the `akash-provider` pod were restarted and all leases deleted, I've noticed that the provider reports 0 GPU:
```
$ date; provider_info.sh provider.hurricane.akash.pub
Wed Sep 27 01:17:25 PM CEST 2023
type        cpu     gpu  ram                 ephemeral           persistent
used        0       0    0                   0                   0
pending     0       0    0                   0                   0
available   92.895  0    174.18182563781738  1808.7646561246365  883.5681761829183
node        92.895  0    174.18182563781738  1808.7646561246365  N/A
```
Same in the provider logs:
```
D[2023-09-27|12:05:38.259] cluster resources dump={"nodes":[{"name":"worker-01.hurricane2","allocatable":{"cpu":102000,"gpu":0,"memory":210936557568,"storage_ephemeral":1942146261054},"available":{"cpu":63895,"gpu":0,"memory":81799612416,"storage_ephemeral":1618949972030}}],"total_allocatable":{"cpu":102000,"gpu":0,"memory":210936557568,"storage_ephemeral":1942146261054,"storage":{"beta3":881387044864}},"total_available":{"cpu":63895,"gpu":0,"memory":81799612416,"storage_ephemeral":1618949972030,"storage":{"beta3":846816845716}}} module=provider-cluster cmp=provider cmp=service cmp=inventory-service
```
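For reference, a minimal sketch of pulling the GPU counts back out of that `cluster resources dump` line with `jq` (it assumes the provider runs as `statefulset/akash-provider` in the `akash-services` namespace; adjust to your deployment):

```bash
# Grab the latest "cluster resources dump" line from the provider,
# strip the log prefix/suffix and print the GPU totals.
kubectl -n akash-services logs statefulset/akash-provider --tail=10000 \
  | grep 'cluster resources dump' | tail -n1 \
  | sed -n 's/.*dump=\(.*\) module=provider-cluster.*/\1/p' \
  | jq '{gpu_allocatable: .total_allocatable.gpu, gpu_available: .total_available.gpu}'
```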
`nvidia-smi` hangs on the worker node. Stracing it shows it hangs on accessing `/proc/driver/nvidia/params`:
```
root@worker-01:~# strace nvidia-smi
...
...
read(3, "h_ipportnet,ip_set_hash_ipportip"..., 1024) = 1024
close(3)                                = 0
openat(AT_FDCWD, "/proc/driver/nvidia/params", O_RDONLY
```
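A hang on a `/proc/driver/nvidia/*` read usually means something is blocked inside the driver. A general sketch (not output from this incident) for spotting tasks stuck in uninterruptible sleep and related kernel messages:

```bash
# Tasks stuck in uninterruptible (D) sleep, typically blocked inside the driver
ps -eo pid,stat,wchan:32,comm | awk 'NR==1 || $2 ~ /^D/'

# Kernel-side view: NVIDIA/Xid errors and hung-task warnings
dmesg -T | grep -iE 'nvidia|xid|hung task' | tail -n 50
```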
The GPU isn't accessible on the host:
```
root@worker-01:~# cat /proc/driver/nvidia/params
<hangs>

root@worker-01:~# ls -la /dev/dri
total 0
drwxr-xr-x 3 root root 120 Sep 23 12:35 .
drwxr-xr-x 23 root root 4800 Sep 27 11:20 ..
drwxr-xr-x 2 root root 100 Sep 23 12:35 by-path
crw-rw---- 1 root video 226, 0 Sep 23 12:35 card0
crw-rw---- 1 root video 226, 1 Sep 23 12:35 card1
crw-rw---- 1 root render 226, 128 Sep 23 12:35 renderD128

root@worker-01:~# lsmod |grep nvid
nvidia_uvm 1523712 0
nvidia_drm 77824 0
nvidia_modeset 1302528 1 nvidia_drm
nvidia 56537088 17 nvidia_uvm,nvidia_modeset
drm_kms_helper 311296 5 bochs,drm_vram_helper,nvidia_drm
drm 622592 8 drm_kms_helper,bochs,drm_vram_helper,nvidia,drm_ttm_helper,nvidia_drm,ttm
```
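It may also be worth cross-checking what the kubelet itself currently advertises for the node; if the device plugin never re-registers, `nvidia.com/gpu` drops to 0 or disappears from allocatable. A sketch using the node name from the logs above:

```bash
# GPU count the kubelet currently advertises for this node
kubectl get node worker-01.hurricane2 -o json \
  | jq '{capacity: .status.capacity["nvidia.com/gpu"], allocatable: .status.allocatable["nvidia.com/gpu"]}'
```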
- the `nvidia-device-plugin` pod is still reported as `Running`:
```
$ kubectl -n nvidia-device-plugin get pods
NAME                              READY   STATUS    RESTARTS   AGE
nvdp-nvidia-device-plugin-xhhdf   1/1     Running   0          3d23h
```
```
$ kubectl -n nvidia-device-plugin logs nvdp-nvidia-device-plugin-xhhdf
I0923 12:35:44.212685 1 main.go:154] Starting FS watcher.
I0923 12:35:44.212740 1 main.go:161] Starting OS watcher.
I0923 12:35:44.213078 1 main.go:176] Starting Plugins.
I0923 12:35:44.213105 1 main.go:234] Loading configuration.
I0923 12:35:44.213221 1 main.go:242] Updating config with default resource matching patterns.
I0923 12:35:44.213416 1 main.go:253] Running with config: { "version": "v1", "flags": { "migStrategy": "none", "failOnInitError": true, "nvidiaDriverRoot": "/", "gdsEnabled": false, "mofedEnabled": false, "plugin": { "passDeviceSpecs": false, "deviceListStrategy": [ "envvar" ], "deviceIDStrategy": "uuid", "cdiAnnotationPrefix": "cdi.k8s.io/", "nvidiaCTKPath": "/usr/bin/nvidia-ctk", "containerDriverRoot": "/driver-root" } }, "resources": { "gpus": [ { "pattern": "", "name": "nvidia.com/gpu" } ] }, "sharing": { "timeSlicing": {} } }
I0923 12:35:44.213429 1 main.go:256] Retreiving plugins.
I0923 12:35:44.213905 1 factory.go:107] Detected NVML platform: found NVML library
I0923 12:35:44.213943 1 factory.go:107] Detected non-Tegra platform: /sys/devices/soc0/family file not found
I0923 12:35:44.231058 1 server.go:165] Starting GRPC server for 'nvidia.com/gpu'
I0923 12:35:44.231676 1 server.go:117] Starting to serve 'nvidia.com/gpu' on /var/lib/kubelet/device-plugins/nvidia-gpu.sock
I0923 12:35:44.237076 1 server.go:125] Registered device plugin for 'nvidia.com/gpu' with Kubelet
I0927 10:46:42.436154 1 main.go:202] inotify: /var/lib/kubelet/device-plugins/kubelet.sock created, restarting.
I0927 10:46:42.436176 1 main.go:294] Stopping plugins.
I0927 10:46:42.436183 1 server.go:142] Stopping to serve 'nvidia.com/gpu' on /var/lib/kubelet/device-plugins/nvidia-gpu.sock
I0927 10:46:42.436273 1 main.go:176] Starting Plugins.
I0927 10:46:42.436281 1 main.go:234] Loading configuration.
I0927 10:46:42.436388 1 main.go:242] Updating config with default resource matching patterns.
I0927 10:46:42.436447 1 main.go:253] Running with config: { "version": "v1", "flags": { "migStrategy": "none", "failOnInitError": true, "nvidiaDriverRoot": "/", "gdsEnabled": false, "mofedEnabled": false, "plugin": { "passDeviceSpecs": false, "deviceListStrategy": [ "envvar" ], "deviceIDStrategy": "uuid", "cdiAnnotationPrefix": "cdi.k8s.io/", "nvidiaCTKPath": "/usr/bin/nvidia-ctk", "containerDriverRoot": "/driver-root" } }, "resources": { "gpus": [ { "pattern": "", "name": "nvidia.com/gpu" } ] }, "sharing": { "timeSlicing": {} } }
I0927 10:46:42.436457 1 main.go:256] Retreiving plugins.
I0927 10:46:42.436480 1 factory.go:107] Detected NVML platform: found NVML library
I0927 10:46:42.436498 1 factory.go:107] Detected non-Tegra platform: /sys/devices/soc0/family file not found
```
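Note that after the `inotify: /var/lib/kubelet/device-plugins/kubelet.sock created, restarting.` event the plugin never logs `Registered device plugin for 'nvidia.com/gpu' with Kubelet` again. A quick way to check on the node whether the plugin socket exists and is being served (a sketch, assuming the default kubelet paths):

```bash
# The plugin's gRPC socket should sit next to kubelet.sock
ls -l /var/lib/kubelet/device-plugins/

# Which process (if any) is listening on the plugin socket
ss -xlp | grep nvidia-gpu.sock
```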
- bouncing that pod:

```
$ kubectl -n nvidia-device-plugin delete pod nvdp-nvidia-device-plugin-xhhdf
pod "nvdp-nvidia-device-plugin-xhhdf" deleted

$ kubectl -n nvidia-device-plugin get pod
NAME                              READY   STATUS              RESTARTS   AGE
nvdp-nvidia-device-plugin-4bwdn   0/1     ContainerCreating   0          10s

$ kubectl -n nvidia-device-plugin describe pod ...
Events:
  Type    Reason     Age   From               Message
  Normal  Scheduled  42s   default-scheduler  Successfully assigned nvidia-device-plugin/nvdp-nvidia-device-plugin-4bwdn to worker-01.hurricane2
```
- looks like the pod got stuck in `ContainerCreating` state:
```
$ kubectl -n nvidia-device-plugin get pod
NAME                              READY   STATUS              RESTARTS   AGE
nvdp-nvidia-device-plugin-4bwdn   0/1     ContainerCreating   0          2m31s
```
- manually unloading the nvidia kernel modules doesn't work either:
```
root@worker-01:~# lsmod |grep nvid
nvidia_uvm 1523712 0
nvidia_drm 77824 0
nvidia_modeset 1302528 1 nvidia_drm
nvidia 56537088 17 nvidia_uvm,nvidia_modeset
drm_kms_helper 311296 5 bochs,drm_vram_helper,nvidia_drm
drm 622592 8 drm_kms_helper,bochs,drm_vram_helper,nvidia,drm_ttm_helper,nvidia_drm,ttm

root@worker-01:~# modprobe -r nvidia_drm
```
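`modprobe -r` will refuse or hang while anything still holds the modules, and processes stuck in D state cannot be killed, so a node reboot is often the only way out. A general sketch for checking what keeps the driver busy:

```bash
# Reference counts on the NVIDIA modules (non-zero means something still uses them)
for m in nvidia nvidia_uvm nvidia_drm nvidia_modeset; do
  printf '%-16s refcnt=%s\n' "$m" "$(cat /sys/module/$m/refcnt 2>/dev/null)"
done

# Processes that still hold the NVIDIA device nodes open
fuser -v /dev/nvidia* 2>/dev/null
lsof /dev/nvidia* 2>/dev/null
```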
duplicate of #121
provider v0.4.6
I've noticed there was a user deployment creating >24k `defunct` processes, causing its pods to get stuck in `Terminating` state (after closing the deployment as normal) until force deleted (e.g. `kubectl delete pod app-0 -n n4hjommp3gk39apinr95ak8pscadlkd5l8r9ilnr8urjm --grace-period=0 --force`). Killing that alone didn't change anything, so I've restarted the `containerd` and `kubelet` processes, and then the `akash-provider` pod. Restarting the `akash-provider` pod caused it to delete all the leases.
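For reference, a rough sketch (run on the worker node, not taken from this incident) for counting the defunct processes and finding which parent is leaking them:

```bash
# Total number of zombie (defunct) processes on the node
ps -eo stat | awk '$1 ~ /^Z/' | wc -l

# Which parent PIDs own them (the leaking workload shows up here)
ps -eo stat,ppid | awk '$1 ~ /^Z/ {print $2}' | sort | uniq -c | sort -rn | head
```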
Logs

hurricane-deleted-all-leases.log

Excerpt from the logs: