Closed. andy108369 closed this issue 4 months ago.
The current memory limit of `512MiB` set for `akash-inventory-operator` might be too small: https://github.com/akash-network/helm-charts/blob/provider-8.0.3/charts/akash-inventory-operator/templates/deployment.yaml#L36

### Env
```
$ kubectl -n akash-services get pods
NAME                                        READY   STATUS    RESTARTS      AGE
akash-hostname-operator-6795445db-jf46g     1/1     Running   0             27m
akash-inventory-operator-75d7758b86-kqk6s   1/1     Running   4 (68s ago)   29m
akash-node-1-0                              1/1     Running   0             48m
akash-provider-0                            1/1     Running   0             26m
```
```
root@node3:~# dmesg -T -l alert -l crit -l emerg -l err
...
[Tue Feb 20 19:10:18 2024] Memory cgroup out of memory: Killed process 80061 (provider-servic) total-vm:5025880kB, anon-rss:517808kB, file-rss:0kB, shmem-rss:0kB, UID:0 pgtables:1620kB oom_score_adj:999
[Tue Feb 20 19:17:12 2024] Memory cgroup out of memory: Killed process 125599 (provider-servic) total-vm:5173344kB, anon-rss:516256kB, file-rss:0kB, shmem-rss:0kB, UID:0 pgtables:1636kB oom_score_adj:999
[Tue Feb 20 19:17:12 2024] Memory cgroup out of memory: Killed process 125701 (provider-servic) total-vm:5173344kB, anon-rss:516256kB, file-rss:0kB, shmem-rss:0kB, UID:0 pgtables:1636kB oom_score_adj:999
[Tue Feb 20 19:24:38 2024] Memory cgroup out of memory: Killed process 166260 (provider-servic) total-vm:5322344kB, anon-rss:517112kB, file-rss:0kB, shmem-rss:0kB, UID:0 pgtables:1660kB oom_score_adj:999
[Tue Feb 20 19:24:38 2024] Memory cgroup out of memory: Killed process 166287 (provider-servic) total-vm:5322344kB, anon-rss:517112kB, file-rss:0kB, shmem-rss:0kB, UID:0 pgtables:1660kB oom_score_adj:999
```
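The restarts can also be confirmed as OOM kills from the Kubernetes side (the `anon-rss` values above sit right at the 512MiB limit). A minimal sketch, using the pod name from the listing above; adjust it to whatever the current pod name is:

```shell
# Inspect the container's last terminated state; an OOM-killed container
# reports "OOMKilled" as the termination reason.
kubectl -n akash-services get pod akash-inventory-operator-75d7758b86-kqk6s \
  -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'
```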
Provider: `sg.lneq`
```
$ kubectl -n akash-services get pods -o custom-columns='NAME:.metadata.name,IMAGE:.spec.containers[*].image'
NAME                                        IMAGE
akash-hostname-operator-6795445db-jf46g     ghcr.io/akash-network/provider:0.4.8
akash-inventory-operator-544c75d855-qs8lh   ghcr.io/akash-network/provider:0.4.8
akash-node-1-0                              ghcr.io/akash-network/node:0.30.0
akash-provider-0                            ghcr.io/akash-network/provider:0.4.8
```
```
$ kubectl get nodes -o wide
NAME    STATUS   ROLES           AGE   VERSION   INTERNAL-IP    EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION     CONTAINER-RUNTIME
node1   Ready    control-plane   71m   v1.28.6   10.74.43.129   <none>        Ubuntu 22.04.4 LTS   6.5.0-18-generic   containerd://1.7.11
node2   Ready    control-plane   71m   v1.28.6   10.74.43.133   <none>        Ubuntu 22.04.4 LTS   6.5.0-18-generic   containerd://1.7.11
node3   Ready    <none>          69m   v1.28.6   10.74.43.131   <none>        Ubuntu 22.04.4 LTS   6.5.0-18-generic   containerd://1.7.11
node4   Ready    <none>          69m   v1.28.6   10.8.68.129    <none>        Ubuntu 22.04.4 LTS   6.5.0-18-generic   containerd://1.7.11
```
```
$ kubectl -n rook-ceph exec -i $(kubectl -n rook-ceph get pod -l "app=rook-ceph-tools" -o jsonpath='{.items[0].metadata.name}') -- ceph status
  cluster:
    id:     69d6af8d-3dfa-47cd-8f6e-bcbc5320987f
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum a,b,c (age 42m)
    mgr: b(active, since 39m), standbys: a
    osd: 8 osds: 8 up (since 40m), 8 in (since 40m)

  data:
    pools:   2 pools, 257 pgs
    objects: 7 objects, 577 KiB
    usage:   4.9 GiB used, 28 TiB / 28 TiB avail
    pgs:     257 active+clean

$ kubectl -n rook-ceph exec -i $(kubectl -n rook-ceph get pod -l "app=rook-ceph-tools" -o jsonpath='{.items[0].metadata.name}') -- ceph osd tree
ID  CLASS  WEIGHT    TYPE NAME       STATUS  REWEIGHT  PRI-AFF
-1         27.94470  root default
-5          6.98618      host node1
 2   nvme   3.49309          osd.2       up   1.00000  1.00000
 5   nvme   3.49309          osd.5       up   1.00000  1.00000
-9          6.98618      host node2
 6   nvme   3.49309          osd.6       up   1.00000  1.00000
 7   nvme   3.49309          osd.7       up   1.00000  1.00000
-3          6.98618      host node3
 1   nvme   3.49309          osd.1       up   1.00000  1.00000
 4   nvme   3.49309          osd.4       up   1.00000  1.00000
-7          6.98618      host node4
 0   nvme   3.49309          osd.0       up   1.00000  1.00000
 3   nvme   3.49309          osd.3       up   1.00000  1.00000
```
### Observation: 3 vs 4 nodes

Interestingly, comparing this provider `sg.lneq` to `sg.lnlm`, the latter doesn't experience this issue:
Provider: `sg.lnlm`

```
$ kubectl -n akash-services get pods
NAME                                        READY   STATUS    RESTARTS   AGE
akash-hostname-operator-6795445db-5xhq5     1/1     Running   0          5d7h
akash-inventory-operator-75d7758b86-gh2wj   1/1     Running   0          5d6h
akash-node-1-0                              1/1     Running   0          5d7h
akash-provider-0                            1/1     Running   0          21h

$ kubectl -n rook-ceph exec -i $(kubectl -n rook-ceph get pod -l "app=rook-ceph-tools" -o jsonpath='{.items[0].metadata.name}') -- ceph status
  cluster:
    id:     661a3fe0-5ff2-4575-a421-f812501f463c
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum a,b,c (age 5d)
    mgr: a(active, since 5d), standbys: b
    osd: 6 osds: 6 up (since 5d), 6 in (since 5d)

  data:
    pools:   2 pools, 257 pgs
    objects: 491 objects, 889 MiB
    usage:   9.8 GiB used, 5.2 TiB / 5.2 TiB avail
    pgs:     257 active+clean

  io:
    client: 341 B/s wr, 0 op/s rd, 0 op/s wr

$ kubectl -n rook-ceph exec -i $(kubectl -n rook-ceph get pod -l "app=rook-ceph-tools" -o jsonpath='{.items[0].metadata.name}') -- ceph osd tree
ID  CLASS  WEIGHT   TYPE NAME       STATUS  REWEIGHT  PRI-AFF
-1         5.23975  root default
-3         1.74658      host node1
 0   nvme  0.87329          osd.0       up   1.00000  1.00000
 2   nvme  0.87329          osd.2       up   1.00000  1.00000
-5         1.74658      host node2
 1   nvme  0.87329          osd.1       up   1.00000  1.00000
 4   nvme  0.87329          osd.4       up   1.00000  1.00000
-7         1.74658      host node3
 3   nvme  0.87329          osd.3       up   1.00000  1.00000
 5   nvme  0.87329          osd.5       up   1.00000  1.00000

$ kubectl get nodes -o wide
NAME    STATUS   ROLES           AGE    VERSION   INTERNAL-IP     EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION      CONTAINER-RUNTIME
node1   Ready    control-plane   5d8h   v1.28.6   192.168.0.100   <none>        Ubuntu 22.04.3 LTS   5.15.0-94-generic   containerd://1.7.11
node2   Ready    control-plane   5d8h   v1.28.6   192.168.0.101   <none>        Ubuntu 22.04.3 LTS   5.15.0-94-generic   containerd://1.7.11
node3   Ready    <none>          5d8h   v1.28.6   192.168.0.102   <none>        Ubuntu 22.04.3 LTS   5.15.0-94-generic   containerd://1.7.11
```
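Since the 4-node/8-OSD cluster OOMs while the 3-node/6-OSD cluster does not, it may be worth watching the operator's actual memory consumption over time on both providers. A minimal sketch, assuming metrics-server is installed (the `grep` pattern simply matches the pod name prefix):

```shell
# Print the inventory operator's per-container memory usage every 30 seconds.
# Requires metrics-server to be running in the cluster.
while true; do
  kubectl -n akash-services top pod --containers | grep akash-inventory-operator
  sleep 30
done
```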
### Next steps

I've lifted the RAM limit to `1GiB` to see if it helps.
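For reference, the limit can be raised in place without editing the chart by patching the deployment (a sketch; container index `0` assumes the single-container pod spec from the chart, and Helm will revert this on the next upgrade unless the chart values are changed too):

```shell
# Replace the memory limit of the first container in the deployment with 1Gi.
kubectl -n akash-services patch deployment akash-inventory-operator --type=json \
  -p='[{"op":"replace","path":"/spec/template/spec/containers/0/resources/limits/memory","value":"1Gi"}]'
```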