akash-network / support

Akash Support and Issue Tracking

all leases got deleted upon `akash-provider` pod restart on Hurricane provider #128

Closed andy108369 closed 1 year ago

andy108369 commented 1 year ago

provider v0.4.6

I noticed a user deployment had created >24k defunct (zombie) processes, which caused its pods to get stuck in the Terminating state after the deployment was closed normally, until they were force deleted (e.g. kubectl delete pod app-0 -n n4hjommp3gk39apinr95ak8pscadlkd5l8r9ilnr8urjm --grace-period=0 --force).
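
For reference, a minimal sketch of how such stuck pods can be listed and force-deleted; the jq one-liner is a generic addition, while the delete command is the one used here:

# pods whose deletion was requested but which are still present
$ kubectl get pods -A -o json | jq -r '.items[] | select(.metadata.deletionTimestamp != null) | .metadata.namespace + "/" + .metadata.name'
# force-delete a stuck pod
$ kubectl delete pod app-0 -n n4hjommp3gk39apinr95ak8pscadlkd5l8r9ilnr8urjm --grace-period=0 --force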

root@worker-01:~# ps -ef | grep defunct | wc -l
24869
root@worker-01:~# ps -ef | grep defunct | wc -l
24870
root@worker-01:~# ps -ef | grep defunct | head
syslog       775   28612  0 Sep26 ?        00:00:00 [sshd] <defunct>
syslog       916   28612  0 Sep26 ?        00:00:00 [sshd] <defunct>
syslog      1324   28612  0 Sep24 ?        00:00:00 [sshd] <defunct>
syslog      1413   28612  0 Sep26 ?        00:00:00 [sshd] <defunct>
syslog      1696   28612  0 Sep25 ?        00:00:00 [sshd] <defunct>
syslog      1846   28612  0 Sep26 ?        00:00:00 [sshd] <defunct>
syslog      2233   28612  0 Sep25 ?        00:00:00 [sshd] <defunct>
syslog      2330   28612  0 Sep26 ?        00:00:00 [sshd] <defunct>
syslog      2608   28612  0 Sep26 ?        00:00:00 [sshd] <defunct>
syslog      2877   28612  0 Sep26 ?        00:00:00 [sshd] <defunct>
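
In ps -ef output the third column is the parent PID, so the zombies' shared parent (28612 here) can be confirmed with e.g.:

$ ps -ef | grep defunct | awk '{print $3}' | sort | uniq -c | sort -rn | head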

root@worker-01:~# cat /proc/28612/environ | xargs -n1 -0 |grep AKASH
AKASH_ORDER_SEQUENCE=1
AKASH_OWNER=akash16p6lrlxf7f03c0ka8cv4sznr29rym27uv0qz0d
AKASH_DEPLOYMENT_SEQUENCE=12655391
AKASH_CLUSTER_PUBLIC_HOSTNAME=provider.hurricane.akash.pub
AKASH_GROUP_SEQUENCE=1
AKASH_PROVIDER=akash15tl6v6gd0nte0syyxnv57zmmspgju4c3xfmdhk
root@worker-01:~# ps -ef|grep 28612 | grep -v defunct
root       28612   28447  0 Sep23 ?        00:00:00 sleep infinity
root       53683   28612  0 Sep23 ?        00:00:14 sshd: /usr/sbin/sshd [listener] 1 of 10-100 startups
pollina+ 1015346   28612  1 Sep25 ?        00:31:47 haproxy -f /etc/haproxy/haproxy.cfg -db
root     1406854   28612  0 Sep24 ?        00:00:00 SCREEN -S change_wallpaper
root     1408887   28612  0 Sep24 ?        00:00:03 SCREEN -S get_template
root     1416357   28612  0 Sep24 ?        00:00:00 SCREEN -S caddy
root     1912455 1650660  0 10:29 pts/0    00:00:00 grep --color=auto 28612

root@worker-01:~# kill -9 28612
root@worker-01:~# ps -ef|grep 28612 | grep -v defunct
root     1987880 1650660  0 10:43 pts/0    00:00:00 grep --color=auto 28612
root@worker-01:~# ps -ef|grep defunct | wc -l
1

Killing that parent process alone didn't change anything, so I restarted the containerd and kubelet processes.

And then the akash-provider pod.

Restarting the akash-provider pod caused it to delete all the leases.
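
For reference, a sketch of that restart sequence; the systemd unit names are the usual ones, and the akash-services namespace / akash-provider-0 pod name assume a standard Helm-chart provider install, so adjust to your setup:

root@worker-01:~# systemctl restart containerd
root@worker-01:~# systemctl restart kubelet
$ kubectl -n akash-services delete pod akash-provider-0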

Logs

hurricane-deleted-all-leases.log

Excerpt from the logs

D[2023-09-27|10:47:09.418] found existing hostname                      module=provider-cluster cmp=provider cmp=service hostname=www.yuuera.com id=akash1tccyrt46zfygeve328dr0727s2n56864kgk9yx/11996487/1/1/akash15tl6v6gd0nte0syyxnv57zmmspgju4c3xfmdhk
D[2023-09-27|10:47:09.456] inventory ready                              module=provider-cluster cmp=provider cmp=service cmp=inventory-service
D[2023-09-27|10:47:09.456] cluster resources dump={"nodes":[{"name":"worker-01.hurricane2","allocatable":{"cpu":102000,"gpu":0,"memory":210936557568,"storage_ephemeral":1942146261054},"available":{"cpu":51195,"gpu":0,"memory":64919190062,"storage_ephemeral":1532512632894}}],"total_allocatable":{"cpu":102000,"gpu":0,"memory":210936557568,"storage_ephemeral":1942146261054,"storage":{"beta3":806857998336}},"total_available":{"cpu":51195,"gpu":0,"memory":64919190062,"storage_ephemeral":1532512632894,"storage":{"beta3":734864473955}}} module=provider-cluster cmp=provider cmp=service cmp=inventory-service
E[2023-09-27|10:47:09.456] adjust inventory for pending reservation     module=provider-cluster cmp=provider cmp=service cmp=inventory-service error="insufficient capacity"
E[2023-09-27|10:47:09.456] adjust inventory for pending reservation     module=provider-cluster cmp=provider cmp=service cmp=inventory-service error="insufficient capacity"
E[2023-09-27|10:47:09.456] adjust inventory for pending reservation     module=provider-cluster cmp=provider cmp=service cmp=inventory-service error="insufficient capacity"
E[2023-09-27|10:47:09.456] adjust inventory for pending reservation     module=provider-cluster cmp=provider cmp=service cmp=inventory-service error="insufficient capacity"
E[2023-09-27|10:47:09.478] lease not active, not deploying              module=provider-cluster cmp=provider cmp=service cmp=deployment-manager lease=akash1h2adh8s6ptsx33m6hda7p9kahcdwy09dhr5x90/12814829/1/1/akash15tl6v6gd0nte0syyxnv57zmmspgju4c3xfmdhk manifest-group=dcloud
E[2023-09-27|10:47:09.478] execution error                              module=provider-cluster cmp=provider cmp=service cmp=deployment-manager lease=akash1h2adh8s6ptsx33m6hda7p9kahcdwy09dhr5x90/12814829/1/1/akash15tl6v6gd0nte0syyxnv57zmmspgju4c3xfmdhk manifest-group=dcloud state=deploy-active err="inactive Lease: akash1h2adh8s6ptsx33m6hda7p9kahcdwy09dhr5x90/12814829/1/1/akash15tl6v6gd0nte0syyxnv57zmmspgju4c3xfmdhk"
E[2023-09-27|10:47:09.479] lease not active, not deploying              module=provider-cluster cmp=provider cmp=service cmp=deployment-manager lease=akash155q5556fwlnusglyuq6jrxmnpvevyeum6er0s3/12862374/1/1/akash15tl6v6gd0nte0syyxnv57zmmspgju4c3xfmdhk manifest-group=dcloud
E[2023-09-27|10:47:09.479] execution error                              module=provider-cluster cmp=provider cmp=service cmp=deployment-manager lease=akash155q5556fwlnusglyuq6jrxmnpvevyeum6er0s3/12862374/1/1/akash15tl6v6gd0nte0syyxnv57zmmspgju4c3xfmdhk manifest-group=dcloud state=deploy-active err="inactive Lease: akash155q5556fwlnusglyuq6jrxmnpvevyeum6er0s3/12862374/1/1/akash15tl6v6gd0nte0syyxnv57zmmspgju4c3xfmdhk"
E[2023-09-27|10:47:09.479] lease not active, not deploying              module=provider-cluster cmp=provider cmp=service cmp=deployment-manager lease=akash1numg2898s79u8qspt3fk7yg9vecdmckwg5q2tl/12672375/1/1/akash15tl6v6gd0nte0syyxnv57zmmspgju4c3xfmdhk manifest-group=argoPlacement
E[2023-09-27|10:47:09.479] execution error                              module=provider-cluster cmp=provider cmp=service cmp=deployment-manager lease=akash1numg2898s79u8qspt3fk7yg9vecdmckwg5q2tl/12672375/1/1/akash15tl6v6gd0nte0syyxnv57zmmspgju4c3xfmdhk manifest-group=argoPlacement state=deploy-active err="inactive Lease: akash1numg2898s79u8qspt3fk7yg9vecdmckwg5q2tl/12672375/1/1/akash15tl6v6gd0nte0syyxnv57zmmspgju4c3xfmdhk"
andy108369 commented 1 year ago

GPU-related observation

After kubelet, containerd and the akash-provider pod were restarted and all leases were deleted, I noticed that the provider reports 0 GPUs:

$ date; provider_info.sh provider.hurricane.akash.pub
Wed Sep 27 01:17:25 PM CEST 2023
type       cpu     gpu  ram                 ephemeral           persistent
used       0       0    0                   0                   0
pending    0       0    0                   0                   0
available  92.895  0    174.18182563781738  1808.7646561246365  883.5681761829183
node       92.895  0    174.18182563781738  1808.7646561246365  N/A

Same in the provider logs:

D[2023-09-27|12:05:38.259] cluster resources dump={"nodes":[{"name":"worker-01.hurricane2","allocatable":{"cpu":102000,"gpu":0,"memory":210936557568,"storage_ephemeral":1942146261054},"available":{"cpu":63895,"gpu":0,"memory":81799612416,"storage_ephemeral":1618949972030}}],"total_allocatable":{"cpu":102000,"gpu":0,"memory":210936557568,"storage_ephemeral":1942146261054,"storage":{"beta3":881387044864}},"total_available":{"cpu":63895,"gpu":0,"memory":81799612416,"storage_ephemeral":1618949972030,"storage":{"beta3":846816845716}}} module=provider-cluster cmp=provider cmp=service cmp=inventory-service
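
A quick way to cross-check what the kubelet itself advertises for this node (plain kubectl; the node name is taken from the dump above):

$ kubectl get node worker-01.hurricane2 -o jsonpath='{.status.allocatable.nvidia\.com/gpu}{"\n"}'
$ kubectl describe node worker-01.hurricane2 | grep -A10 'Allocatable:'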

The GPU isn't accessible on the host:

root@worker-01:~# cat /proc/driver/nvidia/params
<hangs>
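
When a read of /proc/driver/nvidia/params hangs like this, a generic check (not something run here) is to look for processes stuck in uninterruptible sleep (D state), which usually means they are blocked inside the kernel/driver:

$ ps -eo pid,stat,cmd | awk '$2 ~ /^D/'
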
root@worker-01:~# ls -la /dev/dri
total 0
drwxr-xr-x  3 root root        120 Sep 23 12:35 .
drwxr-xr-x 23 root root       4800 Sep 27 11:20 ..
drwxr-xr-x  2 root root        100 Sep 23 12:35 by-path
crw-rw----  1 root video  226,   0 Sep 23 12:35 card0
crw-rw----  1 root video  226,   1 Sep 23 12:35 card1
crw-rw----  1 root render 226, 128 Sep 23 12:35 renderD128

root@worker-01:~# lsmod |grep nvid
nvidia_uvm           1523712  0
nvidia_drm             77824  0
nvidia_modeset       1302528  1 nvidia_drm
nvidia              56537088  17 nvidia_uvm,nvidia_modeset
drm_kms_helper        311296  5 bochs,drm_vram_helper,nvidia_drm
drm                   622592  8 drm_kms_helper,bochs,drm_vram_helper,nvidia,drm_ttm_helper,nvidia_drm,ttm

$ kubectl -n nvidia-device-plugin logs nvdp-nvidia-device-plugin-xhhdf
I0923 12:35:44.212685 1 main.go:154] Starting FS watcher.
I0923 12:35:44.212740 1 main.go:161] Starting OS watcher.
I0923 12:35:44.213078 1 main.go:176] Starting Plugins.
I0923 12:35:44.213105 1 main.go:234] Loading configuration.
I0923 12:35:44.213221 1 main.go:242] Updating config with default resource matching patterns.
I0923 12:35:44.213416 1 main.go:253] Running with config: { "version": "v1", "flags": { "migStrategy": "none", "failOnInitError": true, "nvidiaDriverRoot": "/", "gdsEnabled": false, "mofedEnabled": false, "plugin": { "passDeviceSpecs": false, "deviceListStrategy": [ "envvar" ], "deviceIDStrategy": "uuid", "cdiAnnotationPrefix": "cdi.k8s.io/", "nvidiaCTKPath": "/usr/bin/nvidia-ctk", "containerDriverRoot": "/driver-root" } }, "resources": { "gpus": [ { "pattern": "", "name": "nvidia.com/gpu" } ] }, "sharing": { "timeSlicing": {} } }
I0923 12:35:44.213429 1 main.go:256] Retreiving plugins.
I0923 12:35:44.213905 1 factory.go:107] Detected NVML platform: found NVML library
I0923 12:35:44.213943 1 factory.go:107] Detected non-Tegra platform: /sys/devices/soc0/family file not found
I0923 12:35:44.231058 1 server.go:165] Starting GRPC server for 'nvidia.com/gpu'
I0923 12:35:44.231676 1 server.go:117] Starting to serve 'nvidia.com/gpu' on /var/lib/kubelet/device-plugins/nvidia-gpu.sock
I0923 12:35:44.237076 1 server.go:125] Registered device plugin for 'nvidia.com/gpu' with Kubelet
I0927 10:46:42.436154 1 main.go:202] inotify: /var/lib/kubelet/device-plugins/kubelet.sock created, restarting.
I0927 10:46:42.436176 1 main.go:294] Stopping plugins.
I0927 10:46:42.436183 1 server.go:142] Stopping to serve 'nvidia.com/gpu' on /var/lib/kubelet/device-plugins/nvidia-gpu.sock
I0927 10:46:42.436273 1 main.go:176] Starting Plugins.
I0927 10:46:42.436281 1 main.go:234] Loading configuration.
I0927 10:46:42.436388 1 main.go:242] Updating config with default resource matching patterns.
I0927 10:46:42.436447 1 main.go:253] Running with config: { "version": "v1", "flags": { "migStrategy": "none", "failOnInitError": true, "nvidiaDriverRoot": "/", "gdsEnabled": false, "mofedEnabled": false, "plugin": { "passDeviceSpecs": false, "deviceListStrategy": [ "envvar" ], "deviceIDStrategy": "uuid", "cdiAnnotationPrefix": "cdi.k8s.io/", "nvidiaCTKPath": "/usr/bin/nvidia-ctk", "containerDriverRoot": "/driver-root" } }, "resources": { "gpus": [ { "pattern": "", "name": "nvidia.com/gpu" } ] }, "sharing": { "timeSlicing": {} } }
I0927 10:46:42.436457 1 main.go:256] Retreiving plugins.
I0927 10:46:42.436480 1 factory.go:107] Detected NVML platform: found NVML library
I0927 10:46:42.436498 1 factory.go:107] Detected non-Tegra platform: /sys/devices/soc0/family file not found


- bouncing that pod

$ kubectl -n nvidia-device-plugin delete pod nvdp-nvidia-device-plugin-xhhdf
pod "nvdp-nvidia-device-plugin-xhhdf" deleted

$ kubectl -n nvidia-device-plugin get pod
NAME                              READY   STATUS              RESTARTS   AGE
nvdp-nvidia-device-plugin-4bwdn   0/1     ContainerCreating   0          10s

$ kubectl -n nvidia-device-plugin describe pod ...
Events:
  Type    Reason     Age   From               Message
  Normal  Scheduled  42s   default-scheduler  Successfully assigned nvidia-device-plugin/nvdp-nvidia-device-plugin-4bwdn to worker-01.hurricane2


- looks like the pod got stuck in `ContainerCreating` state:

$ kubectl -n nvidia-device-plugin get pod
NAME                              READY   STATUS              RESTARTS   AGE
nvdp-nvidia-device-plugin-4bwdn   0/1     ContainerCreating   0          2m31s


- manually removing the nvidia kernel drivers isn't working:

root@worker-01:~# lsmod |grep nvid
nvidia_uvm           1523712  0
nvidia_drm             77824  0
nvidia_modeset       1302528  1 nvidia_drm
nvidia              56537088  17 nvidia_uvm,nvidia_modeset
drm_kms_helper        311296  5 bochs,drm_vram_helper,nvidia_drm
drm                   622592  8 drm_kms_helper,bochs,drm_vram_helper,nvidia,drm_ttm_helper,nvidia_drm,ttm

root@worker-01:~# modprobe -r nvidia_drm

root@worker-01:~# dmesg -T | tail
[Wed Sep 27 12:12:05 2023] [drm] [nvidia-drm] [GPU ID 0x00000100] Unloading driver

- Going to reboot the worker node

**Update:** The GPU is back (accessible) again after the reboot!
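
After the reboot, a quick sanity check (generic commands, not output captured from this provider) is to confirm the driver sees the GPU again and that the device plugin pod is Running:

root@worker-01:~# nvidia-smi -L
$ kubectl -n nvidia-device-plugin get pod
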
troian commented 1 year ago

duplicate of #121