akash-network / support

Akash Support and Issue Tracking

[helm-charts] akash-inventory-operator needs more memory #185

Closed andy108369 closed 4 months ago

andy108369 commented 4 months ago

The current memory limit of 512MiB set for akash-inventory-operator might be too small: https://github.com/akash-network/helm-charts/blob/provider-8.0.3/charts/akash-inventory-operator/templates/deployment.yaml#L36
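
To double-check the limit that is actually applied to the running deployment (rather than the chart template), the resources stanza can be read straight from the live object; this assumes the deployment is named akash-inventory-operator in the akash-services namespace, as in the listing below:

# prints the requests/limits of the operator container
$ kubectl -n akash-services get deployment akash-inventory-operator \
    -o jsonpath='{.spec.template.spec.containers[0].resources}'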

$ kubectl -n akash-services get pods
NAME                                        READY   STATUS    RESTARTS      AGE
akash-hostname-operator-6795445db-jf46g     1/1     Running   0             27m
akash-inventory-operator-75d7758b86-kqk6s   1/1     Running   4 (68s ago)   29m
akash-node-1-0                              1/1     Running   0             48m
akash-provider-0                            1/1     Running   0             26m
root@node3:~# dmesg -T -l alert -l crit -l emerg -l err 
...
[Tue Feb 20 19:10:18 2024] Memory cgroup out of memory: Killed process 80061 (provider-servic) total-vm:5025880kB, anon-rss:517808kB, file-rss:0kB, shmem-rss:0kB, UID:0 pgtables:1620kB oom_score_adj:999
[Tue Feb 20 19:17:12 2024] Memory cgroup out of memory: Killed process 125599 (provider-servic) total-vm:5173344kB, anon-rss:516256kB, file-rss:0kB, shmem-rss:0kB, UID:0 pgtables:1636kB oom_score_adj:999
[Tue Feb 20 19:17:12 2024] Memory cgroup out of memory: Killed process 125701 (provider-servic) total-vm:5173344kB, anon-rss:516256kB, file-rss:0kB, shmem-rss:0kB, UID:0 pgtables:1636kB oom_score_adj:999
[Tue Feb 20 19:24:38 2024] Memory cgroup out of memory: Killed process 166260 (provider-servic) total-vm:5322344kB, anon-rss:517112kB, file-rss:0kB, shmem-rss:0kB, UID:0 pgtables:1660kB oom_score_adj:999
[Tue Feb 20 19:24:38 2024] Memory cgroup out of memory: Killed process 166287 (provider-servic) total-vm:5322344kB, anon-rss:517112kB, file-rss:0kB, shmem-rss:0kB, UID:0 pgtables:1660kB oom_score_adj:999
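
The 4 restarts above appear to line up with the OOM kills in dmesg. As a sanity check, the last termination reason can also be read directly from the pod status (using the pod name from the listing above):

# should print "OOMKilled" if the restarts were caused by the memory limit
$ kubectl -n akash-services get pod akash-inventory-operator-75d7758b86-kqk6s \
    -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'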

Env

provider: sg.lneq

$ kubectl -n akash-services get pods -o custom-columns='NAME:.metadata.name,IMAGE:.spec.containers[*].image'
NAME                                        IMAGE
akash-hostname-operator-6795445db-jf46g     ghcr.io/akash-network/provider:0.4.8
akash-inventory-operator-544c75d855-qs8lh   ghcr.io/akash-network/provider:0.4.8
akash-node-1-0                              ghcr.io/akash-network/node:0.30.0
akash-provider-0                            ghcr.io/akash-network/provider:0.4.8
$ kubectl get nodes -o wide
NAME    STATUS   ROLES           AGE   VERSION   INTERNAL-IP    EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION     CONTAINER-RUNTIME
node1   Ready    control-plane   71m   v1.28.6   10.74.43.129   <none>        Ubuntu 22.04.4 LTS   6.5.0-18-generic   containerd://1.7.11
node2   Ready    control-plane   71m   v1.28.6   10.74.43.133   <none>        Ubuntu 22.04.4 LTS   6.5.0-18-generic   containerd://1.7.11
node3   Ready    <none>          69m   v1.28.6   10.74.43.131   <none>        Ubuntu 22.04.4 LTS   6.5.0-18-generic   containerd://1.7.11
node4   Ready    <none>          69m   v1.28.6   10.8.68.129    <none>        Ubuntu 22.04.4 LTS   6.5.0-18-generic   containerd://1.7.11
$ kubectl -n rook-ceph exec -i $(kubectl -n rook-ceph get pod -l "app=rook-ceph-tools" -o jsonpath='{.items[0].metadata.name}') -- ceph status
  cluster:
    id:     69d6af8d-3dfa-47cd-8f6e-bcbc5320987f
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum a,b,c (age 42m)
    mgr: b(active, since 39m), standbys: a
    osd: 8 osds: 8 up (since 40m), 8 in (since 40m)

  data:
    pools:   2 pools, 257 pgs
    objects: 7 objects, 577 KiB
    usage:   4.9 GiB used, 28 TiB / 28 TiB avail
    pgs:     257 active+clean

$ kubectl -n rook-ceph exec -i $(kubectl -n rook-ceph get pod -l "app=rook-ceph-tools" -o jsonpath='{.items[0].metadata.name}') -- ceph osd tree
ID  CLASS  WEIGHT    TYPE NAME       STATUS  REWEIGHT  PRI-AFF
-1         27.94470  root default                             
-5          6.98618      host node1                           
 2   nvme   3.49309          osd.2       up   1.00000  1.00000
 5   nvme   3.49309          osd.5       up   1.00000  1.00000
-9          6.98618      host node2                           
 6   nvme   3.49309          osd.6       up   1.00000  1.00000
 7   nvme   3.49309          osd.7       up   1.00000  1.00000
-3          6.98618      host node3                           
 1   nvme   3.49309          osd.1       up   1.00000  1.00000
 4   nvme   3.49309          osd.4       up   1.00000  1.00000
-7          6.98618      host node4                           
 0   nvme   3.49309          osd.0       up   1.00000  1.00000
 3   nvme   3.49309          osd.3       up   1.00000  1.00000

Observation: 3 vs 4 nodes

Interestingly, comparing this provider (sg.lneq, 4 nodes) with sg.lnlm (3 nodes), the latter does not experience this issue; see the monitoring note after the output below:

$ kubectl -n akash-services get pods
NAME                                        READY   STATUS    RESTARTS   AGE
akash-hostname-operator-6795445db-5xhq5     1/1     Running   0          5d7h
akash-inventory-operator-75d7758b86-gh2wj   1/1     Running   0          5d6h
akash-node-1-0                              1/1     Running   0          5d7h
akash-provider-0                            1/1     Running   0          21h

$ kubectl -n rook-ceph exec -i $(kubectl -n rook-ceph get pod -l "app=rook-ceph-tools" -o jsonpath='{.items[0].metadata.name}') -- ceph status
  cluster:
    id:     661a3fe0-5ff2-4575-a421-f812501f463c
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum a,b,c (age 5d)
    mgr: a(active, since 5d), standbys: b
    osd: 6 osds: 6 up (since 5d), 6 in (since 5d)

  data:
    pools:   2 pools, 257 pgs
    objects: 491 objects, 889 MiB
    usage:   9.8 GiB used, 5.2 TiB / 5.2 TiB avail
    pgs:     257 active+clean

  io:
    client:   341 B/s wr, 0 op/s rd, 0 op/s wr

$ kubectl -n rook-ceph exec -i $(kubectl -n rook-ceph get pod -l "app=rook-ceph-tools" -o jsonpath='{.items[0].metadata.name}') -- ceph osd tree
ID  CLASS  WEIGHT   TYPE NAME       STATUS  REWEIGHT  PRI-AFF
-1         5.23975  root default                             
-3         1.74658      host node1                           
 0   nvme  0.87329          osd.0       up   1.00000  1.00000
 2   nvme  0.87329          osd.2       up   1.00000  1.00000
-5         1.74658      host node2                           
 1   nvme  0.87329          osd.1       up   1.00000  1.00000
 4   nvme  0.87329          osd.4       up   1.00000  1.00000
-7         1.74658      host node3                           
 3   nvme  0.87329          osd.3       up   1.00000  1.00000
 5   nvme  0.87329          osd.5       up   1.00000  1.00000

$ kubectl get nodes -o wide
NAME    STATUS   ROLES           AGE    VERSION   INTERNAL-IP     EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION      CONTAINER-RUNTIME
node1   Ready    control-plane   5d8h   v1.28.6   192.168.0.100   <none>        Ubuntu 22.04.3 LTS   5.15.0-94-generic   containerd://1.7.11
node2   Ready    control-plane   5d8h   v1.28.6   192.168.0.101   <none>        Ubuntu 22.04.3 LTS   5.15.0-94-generic   containerd://1.7.11
node3   Ready    <none>          5d8h   v1.28.6   192.168.0.102   <none>        Ubuntu 22.04.3 LTS   5.15.0-94-generic   containerd://1.7.11
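
To compare the two providers more directly, the operator's actual memory usage can be sampled on each cluster and watched against the 512MiB limit (this assumes metrics-server is installed):

$ kubectl -n akash-services top pods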

Next steps

I've raised the memory limit to 1GiB to see whether it helps.
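
For reference, one quick way to apply such an override on a running cluster, without waiting for a new chart release, is a JSON patch on the deployment (temporary only; a subsequent helm upgrade will revert it, so the value in the chart template linked above still needs to be bumped):

$ kubectl -n akash-services patch deployment akash-inventory-operator --type=json \
    -p='[{"op": "replace", "path": "/spec/template/spec/containers/0/resources/limits/memory", "value": "1Gi"}]'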