akash-network / support

Akash Support and Issue Tracking

provider reports excessively high amount of Allocatable cpu & ram when inventory operator hits an ERROR #192

Closed andy108369 closed 2 months ago

andy108369 commented 3 months ago

akash network: 0.30.0
provider: 0.5.4

Observation

  1. I installed the nvdp/nvidia-device-plugin helm chart by mistake and removed it after a short time (see the removal sketch after this list):

    helm upgrade --install nvdp nvdp/nvidia-device-plugin   --namespace nvidia-device-plugin   --create-namespace   --version 0.14.5   --set runtimeClassName="nvidia"   --set deviceListStrategy=volume-mounts
  2. sometimes the provider reports an excessively large amount of allocatable CPU & RAM
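
For reference, the removal was presumably something like the following (my sketch, assuming the release name and namespace from the install command above):

$ helm uninstall nvdp -n nvidia-device-plugin
$ kubectl delete namespace nvidia-device-plugin   # optional, the namespace was created via --create-namespace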

I reinstalled the operator-inventory, which seemed to help at first. However, after some time I noticed the issue had appeared again:

PROVIDER INFO
"hostname"                    "address"
"provider.sg.lnlm.akash.pub"  "akash1zsdzjknq6u475ul8ef4gxh527kz82k6jph8vrs"

Total/Allocatable/Used (t/a/u) per node:
"name"   "cpu(t/a/u)"                               "gpu(t/a/u)"  "mem(t/a/u GiB)"                         "ephemeral(t/a/u GiB)"
"node1"  "64/18446744073708244/-18446744073708180"  "0/0/0"       "251.45/17179863965.33/-17179863713.88"  "395.37/395.37/0"
"node2"  "64/18446744073708440/-18446744073708376"  "0/0/0"       "251.45/17179864746.7/-17179864495.25"   "395.37/395.37/0"
"node3"  "64/18446744073708440/-18446744073708376"  "0/0/0"       "251.45/17179864745.56/-17179864494.11"  "395.37/395.37/0"

ACTIVE TOTAL:
"cpu(cores)"  "gpu"  "mem(GiB)"  "ephemeral(GiB)"  "beta1(GiB)"  "beta2(GiB)"  "beta3(GiB)"
31.5          0      126         0                 0             0             31.5

PERSISTENT STORAGE:
"storage class"  "available space(GiB)"
"beta3"          1661.95

PENDING TOTAL:
"cpu(cores)"  "gpu"  "mem(GiB)"  "ephemeral(GiB)"  "beta1(GiB)"  "beta2(GiB)"  "beta3(GiB)"

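Side note (my own reading, not confirmed against the operator code): the bogus allocatable values are suspiciously close to 2^64, which looks like an unsigned 64-bit subtraction wrapping around once "used" exceeds the tracked allocatable amount. Quick arithmetic check:

$ echo '2^64' | bc            # raw uint64 wraparound value
18446744073709551616
$ echo '2^64 / 1000' | bc     # CPU is tracked in millicores, hence ~18446744073709xxx "cores"
18446744073709551
$ echo '2^64 / 1024^3' | bc   # memory in bytes rendered as GiB, hence ~17179869184 GiB
17179869184
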
Additionally, I noticed this error in the operator-inventory logs, but soon figured it doesn't seem to be the cause, judging by other providers that have seen the same error in their inventory operator:

$ kubectl -n akash-services logs deployment/operator-inventory | grep -v 'MODIFIED monitoring CephCluster'
...
ERROR   watcher.registry    couldn't query pci.ids  {"error": "Get \"\": unsupported protocol scheme \"\""}
...

Provider logs

sg.lnlm.provider.log

Detailed info (8443/status)

sg.lnlm.provider-info-detailed.log

Additional observations

andy108369 commented 3 months ago

sg.lnlm - provider after 16 hours of uptime

NAME    STATUS   ROLES           AGE   VERSION   INTERNAL-IP     EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION      CONTAINER-RUNTIME
node1   Ready    control-plane   26d   v1.28.6   192.168.0.100   <none>        Ubuntu 22.04.4 LTS   5.15.0-97-generic   containerd://1.7.13
node2   Ready    control-plane   26d   v1.28.6   192.168.0.101   <none>        Ubuntu 22.04.4 LTS   5.15.0-97-generic   containerd://1.7.13
node3   Ready    <none>          26d   v1.28.6   192.168.0.102   <none>        Ubuntu 22.04.4 LTS   5.15.0-97-generic   containerd://1.7.13

NAME               READY   STATUS    RESTARTS   AGE
akash-provider-0   1/1     Running   0          16h

akash-node-9.0.0                0.30.0
provider-9.1.2                  0.5.4
akash-hostname-operator-9.0.5   0.5.4
akash-inventory-operator-9.0.6  0.5.4
ingress-nginx-4.10.0            1.10.0
rook-ceph-v1.13.4               v1.13.4
rook-ceph-cluster-v1.13.4       v1.13.4

PROVIDER INFO
"hostname"                    "address"
"provider.sg.lnlm.akash.pub"  "akash1zsdzjknq6u475ul8ef4gxh527kz82k6jph8vrs"

Total/Allocatable/Used (t/a/u) per node:
"name"   "cpu(t/a/u)"                               "gpu(t/a/u)"  "mem(t/a/u GiB)"                         "ephemeral(t/a/u GiB)"
"node1"  "64/18446744073709356/-18446744073709292"  "0/0/0"       "251.45/17179868409.45/-17179868158.01"  "395.37/395.37/0"
"node2"  "64/18446744073709400/-18446744073709336"  "0/0/0"       "251.45/17179868584.58/-17179868333.13"  "395.37/395.37/0"
"node3"  "64/18446744073709400/-18446744073709336"  "0/0/0"       "251.45/17179868583.06/-17179868331.61"  "395.37/395.37/0"

ACTIVE TOTAL:
"cpu(cores)"  "gpu"  "mem(GiB)"  "ephemeral(GiB)"  "beta1(GiB)"  "beta2(GiB)"  "beta3(GiB)"
31.5          0      126         0                 0             0             31.5

PERSISTENT STORAGE:
"storage class"  "available space(GiB)"
"beta3"          1661.88

PENDING TOTAL:
"cpu(cores)"  "gpu"  "mem(GiB)"  "ephemeral(GiB)"  "beta1(GiB)"  "beta2(GiB)"  "beta3(GiB)"
$ kubectl -n akash-services logs operator-inventory-bb568b575-dtflg |grep -v 'MODIFIED monitoring CephCluster'
I[2024-03-12|18:07:29.163] using in cluster kube config                 cmp=provider
INFO    rook-ceph      ADDED monitoring CephCluster {"ns": "rook-ceph", "name": "rook-ceph"}
INFO    watcher.storageclasses  started
INFO    nodes.nodes waiting for nodes to finish
INFO    grpc listening on ":8081"
INFO    watcher.config  started
INFO    rest listening on ":8080"
INFO    rook-ceph      ADDED monitoring StorageClass    {"name": "beta3"}
INFO    nodes.node.discovery    starting hardware discovery pod {"node": "node2"}
INFO    nodes.node.discovery    starting hardware discovery pod {"node": "node3"}
INFO    nodes.node.monitor  starting    {"node": "node2"}
INFO    nodes.node.monitor  starting    {"node": "node1"}
INFO    nodes.node.monitor  starting    {"node": "node3"}
INFO    nodes.node.discovery    starting hardware discovery pod {"node": "node1"}
INFO    rancher    ADDED monitoring StorageClass    {"name": "beta3"}
INFO    nodes.node.discovery    started hardware discovery pod  {"node": "node3"}
INFO    nodes.node.discovery    started hardware discovery pod  {"node": "node2"}
ERROR   nodes.node.monitor  unable to query cpu {"error": "error trying to reach service: dial tcp 10.233.75.4:8081: connect: invalid argument"}
ERROR   nodes.node.monitor  unable to query gpu {"error": "error trying to reach service: dial tcp 10.233.75.4:8081: connect: invalid argument"}
INFO    nodes.node.monitor  started {"node": "node2"}
INFO    nodes.node.monitor  started {"node": "node3"}
INFO    nodes.node.discovery    started hardware discovery pod  {"node": "node1"}
INFO    nodes.node.monitor  started {"node": "node1"}
$ kubectl -n akash-services get pods  -o wide
NAME                                          READY   STATUS    RESTARTS      AGE   IP               NODE    NOMINATED NODE   READINESS GATES
akash-node-1-0                                1/1     Running   1 (12d ago)   27d   10.233.71.36     node3   <none>           <none>
akash-provider-0                              1/1     Running   0             20h   10.233.71.58     node3   <none>           <none>
operator-hostname-cdb556d74-x9kb6             1/1     Running   0             8d    10.233.102.158   node1   <none>           <none>
operator-inventory-bb568b575-dtflg            1/1     Running   0             20h   10.233.75.5      node2   <none>           <none>
operator-inventory-hardware-discovery-node1   1/1     Running   0             20h   10.233.102.143   node1   <none>           <none>
operator-inventory-hardware-discovery-node2   1/1     Running   0             20h   10.233.75.4      node2   <none>           <none>
operator-inventory-hardware-discovery-node3   1/1     Running   0             20h   10.233.71.50     node3   <none>           <none>
$ kubectl -n akash-services logs operator-inventory-hardware-discovery-node2
listening on :8081
$ 

Logs


recovered after operator-inventory restart

$ kubectl rollout restart deployment/operator-inventory -n akash-services
deployment.apps/operator-inventory restarted

$ kubectl -n akash-services get pods  -o wide
NAME                                          READY   STATUS    RESTARTS      AGE   IP               NODE    NOMINATED NODE   READINESS GATES
akash-node-1-0                                1/1     Running   1 (12d ago)   27d   10.233.71.36     node3   <none>           <none>
akash-provider-0                              1/1     Running   0             20h   10.233.71.58     node3   <none>           <none>
operator-hostname-cdb556d74-x9kb6             1/1     Running   0             8d    10.233.102.158   node1   <none>           <none>
operator-inventory-7b5cb44f6c-9w5dn           1/1     Running   0             5s    10.233.75.32     node2   <none>           <none>
operator-inventory-hardware-discovery-node1   1/1     Running   0             3s    10.233.102.187   node1   <none>           <none>
operator-inventory-hardware-discovery-node2   1/1     Running   0             3s    10.233.75.8      node2   <none>           <none>
operator-inventory-hardware-discovery-node3   1/1     Running   0             3s    10.233.71.45     node3   <none>           <none>

$ kubectl -n akash-services logs deployment/operator-inventory -f | grep -v rook
I[2024-03-13|14:34:08.008] using in cluster kube config                 cmp=provider
INFO    nodes.nodes waiting for nodes to finish
INFO    rest listening on ":8080"
INFO    watcher.storageclasses  started
INFO    watcher.config  started
INFO    grpc listening on ":8081"
INFO    nodes.node.discovery    starting hardware discovery pod {"node": "node1"}
INFO    nodes.node.monitor  starting    {"node": "node2"}
INFO    nodes.node.monitor  starting    {"node": "node3"}
INFO    nodes.node.monitor  starting    {"node": "node1"}
INFO    nodes.node.discovery    starting hardware discovery pod {"node": "node2"}
INFO    nodes.node.discovery    starting hardware discovery pod {"node": "node3"}
INFO    rancher    ADDED monitoring StorageClass    {"name": "beta3"}
INFO    nodes.node.discovery    started hardware discovery pod  {"node": "node1"}
ERROR   nodes.node.monitor  unable to query cpu {"error": "error trying to reach service: dial tcp 10.233.102.187:8081: connect: connection refused"}
ERROR   nodes.node.monitor  unable to query gpu {"error": "error trying to reach service: dial tcp 10.233.102.187:8081: connect: connection refused"}
INFO    nodes.node.monitor  started {"node": "node1"}
INFO    nodes.node.discovery    started hardware discovery pod  {"node": "node2"}
INFO    nodes.node.discovery    started hardware discovery pod  {"node": "node3"}
INFO    nodes.node.monitor  started {"node": "node3"}
INFO    nodes.node.monitor  started {"node": "node2"}

recovered:

$ provider_info2.sh provider.sg.lnlm.akash.pub
PROVIDER INFO
"hostname"                    "address"
"provider.sg.lnlm.akash.pub"  "akash1zsdzjknq6u475ul8ef4gxh527kz82k6jph8vrs"

Total/Allocatable/Used (t/a/u) per node:
"name"   "cpu(t/a/u)"        "gpu(t/a/u)"  "mem(t/a/u GiB)"       "ephemeral(t/a/u GiB)"
"node1"  "64/46.53/17.47"    "0/0/0"       "251.45/193.45/57.99"  "395.37/395.37/0"
"node2"  "64/46.6/17.4"      "0/0/0"       "251.45/198.58/52.87"  "395.37/395.37/0"
"node3"  "64/45.995/18.005"  "0/0/0"       "251.45/197.06/54.39"  "395.37/395.37/0"

ACTIVE TOTAL:
"cpu(cores)"  "gpu"  "mem(GiB)"  "ephemeral(GiB)"  "beta1(GiB)"  "beta2(GiB)"  "beta3(GiB)"
31.5          0      126         0                 0             0             31.5

PERSISTENT STORAGE:
"storage class"  "available space(GiB)"
"beta3"          1663.24

PENDING TOTAL:
"cpu(cores)"  "gpu"  "mem(GiB)"  "ephemeral(GiB)"  "beta1(GiB)"  "beta2(GiB)"  "beta3(GiB)"
andy108369 commented 3 months ago

clue 1

mon.obl provider reports an excessively large amount of GPU for node2: [screenshot]

There was a network attack on this provider earlier today, and node2 was powered off for an unknown reason.

Here is the current state:

$ kubectl -n akash-services get pods -l app.kubernetes.io/name=inventory
NAME                                           READY   STATUS    RESTARTS      AGE
operator-inventory-bb568b575-mmcjp             1/1     Running   2 (18h ago)   2d3h
operator-inventory-hardware-discovery-node1    1/1     Running   0             18h
operator-inventory-hardware-discovery-node10   1/1     Running   0             18h
operator-inventory-hardware-discovery-node11   1/1     Running   0             18h
operator-inventory-hardware-discovery-node12   1/1     Running   0             18h
operator-inventory-hardware-discovery-node13   1/1     Running   0             18h
operator-inventory-hardware-discovery-node14   1/1     Running   0             18h
operator-inventory-hardware-discovery-node15   1/1     Running   0             18h
operator-inventory-hardware-discovery-node16   1/1     Running   0             18h
operator-inventory-hardware-discovery-node2    1/1     Running   0             6h15m
operator-inventory-hardware-discovery-node3    1/1     Running   0             18h
operator-inventory-hardware-discovery-node4    1/1     Running   0             18h
operator-inventory-hardware-discovery-node5    1/1     Running   0             18h
operator-inventory-hardware-discovery-node6    1/1     Running   0             18h
operator-inventory-hardware-discovery-node7    1/1     Running   0             18h
operator-inventory-hardware-discovery-node8    1/1     Running   0             18h
operator-inventory-hardware-discovery-node9    1/1     Running   0             18h
$ kubectl -n akash-services logs deployment/operator-inventory | grep -v 'MODIFIED monitoring CephCluster'
I[2024-03-14|04:22:51.569] using in cluster kube config                 cmp=provider
INFO    nodes.nodes waiting for nodes to finish
INFO    watcher.storageclasses  started
INFO    rest listening on ":8080"
INFO    grpc listening on ":8081"
INFO    watcher.config  started
INFO    nodes.node.monitor  starting    {"node": "node10"}
INFO    nodes.node.monitor  starting    {"node": "node1"}
INFO    nodes.node.monitor  starting    {"node": "node12"}
INFO    nodes.node.discovery    starting hardware discovery pod {"node": "node11"}
INFO    nodes.node.monitor  starting    {"node": "node11"}
INFO    nodes.node.discovery    starting hardware discovery pod {"node": "node1"}
INFO    nodes.node.discovery    starting hardware discovery pod {"node": "node12"}
INFO    nodes.node.monitor  starting    {"node": "node14"}
INFO    nodes.node.discovery    starting hardware discovery pod {"node": "node13"}
INFO    nodes.node.discovery    starting hardware discovery pod {"node": "node14"}
INFO    nodes.node.monitor  starting    {"node": "node13"}
INFO    nodes.node.discovery    starting hardware discovery pod {"node": "node10"}
INFO    nodes.node.monitor  starting    {"node": "node16"}
INFO    nodes.node.monitor  starting    {"node": "node15"}
INFO    nodes.node.monitor  starting    {"node": "node2"}
INFO    nodes.node.discovery    starting hardware discovery pod {"node": "node2"}
INFO    nodes.node.monitor  starting    {"node": "node3"}
INFO    nodes.node.discovery    starting hardware discovery pod {"node": "node3"}
INFO    nodes.node.discovery    starting hardware discovery pod {"node": "node16"}
INFO    nodes.node.monitor  starting    {"node": "node4"}
INFO    nodes.node.discovery    starting hardware discovery pod {"node": "node4"}
INFO    nodes.node.discovery    starting hardware discovery pod {"node": "node15"}
INFO    nodes.node.monitor  starting    {"node": "node5"}
INFO    nodes.node.discovery    starting hardware discovery pod {"node": "node5"}
INFO    nodes.node.monitor  starting    {"node": "node6"}
INFO    nodes.node.discovery    starting hardware discovery pod {"node": "node6"}
INFO    nodes.node.monitor  starting    {"node": "node7"}
INFO    nodes.node.discovery    starting hardware discovery pod {"node": "node7"}
INFO    nodes.node.monitor  starting    {"node": "node9"}
INFO    nodes.node.monitor  starting    {"node": "node8"}
INFO    nodes.node.discovery    starting hardware discovery pod {"node": "node9"}
INFO    nodes.node.discovery    starting hardware discovery pod {"node": "node8"}
INFO    nodes.node.discovery    started hardware discovery pod  {"node": "node10"}
INFO    nodes.node.discovery    started hardware discovery pod  {"node": "node3"}
INFO    nodes.node.discovery    started hardware discovery pod  {"node": "node9"}
INFO    nodes.node.discovery    started hardware discovery pod  {"node": "node5"}
INFO    nodes.node.monitor  started {"node": "node9"}
INFO    nodes.node.discovery    started hardware discovery pod  {"node": "node14"}
INFO    nodes.node.discovery    started hardware discovery pod  {"node": "node13"}
INFO    nodes.node.discovery    started hardware discovery pod  {"node": "node7"}
INFO    nodes.node.discovery    started hardware discovery pod  {"node": "node1"}
INFO    nodes.node.discovery    started hardware discovery pod  {"node": "node4"}
INFO    nodes.node.discovery    started hardware discovery pod  {"node": "node16"}
INFO    nodes.node.discovery    started hardware discovery pod  {"node": "node11"}
INFO    nodes.node.monitor  started {"node": "node13"}
INFO    nodes.node.discovery    started hardware discovery pod  {"node": "node8"}
INFO    nodes.node.discovery    started hardware discovery pod  {"node": "node12"}
INFO    nodes.node.monitor  started {"node": "node11"}
INFO    nodes.node.discovery    started hardware discovery pod  {"node": "node6"}
INFO    nodes.node.monitor  started {"node": "node7"}
INFO    nodes.node.monitor  started {"node": "node1"}
INFO    nodes.node.discovery    started hardware discovery pod  {"node": "node15"}
INFO    nodes.node.monitor  started {"node": "node10"}
INFO    nodes.node.monitor  started {"node": "node12"}
INFO    nodes.node.monitor  started {"node": "node3"}
INFO    nodes.node.monitor  started {"node": "node15"}
INFO    nodes.node.monitor  started {"node": "node14"}
INFO    nodes.node.monitor  started {"node": "node6"}
INFO    nodes.node.monitor  started {"node": "node4"}
INFO    nodes.node.monitor  started {"node": "node8"}
INFO    nodes.node.monitor  started {"node": "node5"}
INFO    nodes.node.monitor  started {"node": "node16"}
ERROR   watcher.registry    couldn't query inventory registry   {"error": "Get \"https://provider-configs.akash.network/devices/gpus\": read tcp 10.233.74.86:39682->172.64.80.1:443: read: connection reset by peer"}
ERROR   watcher.registry    couldn't query inventory registry   {"error": "Get \"https://provider-configs.akash.network/devices/gpus\": dial tcp: lookup provider-configs.akash.network on 169.254.25.10:53: read udp 10.233.74.86:58858->169.254.25.10:53: i/o timeout"}
ERROR   watcher.registry    couldn't query inventory registry   {"error": "Get \"https://provider-configs.akash.network/devices/gpus\": read tcp 10.233.74.86:58328->172.64.80.1:443: read: connection reset by peer"}
INFO    nodes.node.discovery    started hardware discovery pod  {"node": "node2"}
INFO    nodes.node.monitor  started {"node": "node2"}

After bouncing the inventory-operator it normalized:

[screenshot]

the clue

It seems that the nvdp-nvidia-device-plugin-dgfdg pod did not have enough time to fully initialize before operator-inventory-hardware-discovery-node2 assessed the number of GPUs available on node2.
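
One way to cross-check what the kubelet itself advertises for the node, independently of the inventory operator (assuming the device plugin registers the usual nvidia.com/gpu resource):

$ kubectl describe node node2 | grep -A10 '^Allocatable:'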

deathlessdd commented 3 months ago

I'm also having some weird issues. When this happens I cannot bid for GPUs on a different node. Fixing it requires bouncing the operator-inventory.

PROVIDER INFO
"hostname"                    "address"
"provider.pcgameservers.com"  "akash17l0f3kf7gv4kmgqjmgc0ksj3em6lqgcc4kl4dg"

Total/Allocatable/Used (t/a/u) per node:
"name"   "cpu(t/a/u)"           "gpu(t/a/u)"  "mem(t/a/u GiB)"                        "ephemeral(t/a/u GiB)"
"node1"  "8/5.88/2.12"          "0/0/0"       "7.51/5.87/1.64"                        "43.13/43.13/0"
"node2"  "48/38.575/9.425"      "4/0/4"       "115.12/66.07/49.04"                    "586.82/536.82/50"
"node3"  "128/111.025/16.975"   "0/0/0"       "143.76/96.29/47.47"                    "352.06/193.06/159"
"node4"  "128/44.145/83.855"    "1/1/0"       "52.58/17179869145.17/-17179869092.59"  "290.06/110.44/179.62"
"node5"  "8/3.825/4.175"        "2/1/1"       "52.57/36.05/16.52"                     "453.94/412.03/41.91"
"node6"  "32/18.425/13.575"     "1/0/1"       "47.8/30.14/17.66"                      "175.12/155.12/20"
"node7"  "256/132.275/123.725"  "3/1/2"       "288.16/204.36/83.8"                    "352.06/287.56/64.5"
andy108369 commented 3 months ago

Narrowing the issue down based on the providers' uptime (~5 days), it appears that only providers that have or had nvdp/nvidia-device-plugin installed are experiencing this issue.
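
A quick way to check whether a provider currently runs the plugin (this won't catch installs that were already removed):

$ helm list -A | grep -i nvdp
$ kubectl get daemonset -A | grep -i nvidia-device-plugin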

andy108369 commented 3 months ago

A couple of additional observations:

  1. whenever I reboot a worker node that has GPU resources, it more often than not (if not always) reports an excessive amount of allocatable resources unless I restart the inventory operator (kubectl rollout restart deployment/operator-inventory -n akash-services)
  2. most of the time there is this error in the inventory operator logs, after which I see the provider report an excessive amount of allocatable CPU resources (see the grep sketch after this list):
    ERROR   watcher.registry    couldn't query pci.ids  {"error": "Get \"\": unsupported protocol scheme \"\""}
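
To correlate the onset of the bogus numbers with that error, the operator logs can be grepped with timestamps (same log source as used elsewhere in this thread):

$ kubectl -n akash-services logs deployment/operator-inventory --timestamps | grep 'pci.ids'
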
andy108369 commented 3 months ago

Now the Hurricane provider keeps reporting 18446744073709524 allocatable CPUs even after I restart the inventory-operator, which until now had usually fixed the issue temporarily.

deathlessdd commented 3 months ago

New issue after restarting worker node node5, which had akash-provider-0 and operator-inventory running on it. I have 11 GPUs total; it says 7 are active, but the inventory says all 11 GPUs are used and 0 GPUs are pending. Fixed by bouncing akash-provider-0 and operator-inventory, after which the inventory started to report correctly again.
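
For reference, "bouncing" here presumably means restarting both workloads, roughly (assuming akash-provider is the StatefulSet and operator-inventory the Deployment in the akash-services namespace):

$ kubectl -n akash-services rollout restart statefulset/akash-provider
$ kubectl -n akash-services rollout restart deployment/operator-inventory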

PROVIDER INFO
"hostname"                    "address"
"provider.pcgameservers.com"  "akash17l0f3kf7gv4kmgqjmgc0ksj3em6lqgcc4kl4dg"

Total/Allocatable/Used (t/a/u) per node:
"name"   "cpu(t/a/u)"           "gpu(t/a/u)"  "mem(t/a/u GiB)"       "ephemeral(t/a/u GiB)"
"node1"  "8/0.38/7.62"          "0/0/0"       "7.51/0.37/7.14"       "43.13/37.63/5.5"
"node2"  "48/15.575/32.425"     "4/0/4"       "115.12/40.12/74.99"   "586.82/248.06/338.76"
"node3"  "128/114.025/13.975"   "0/0/0"       "143.76/97.89/45.87"   "352.06/249.93/102.13"
"node4"  "128/108.795/19.205"   "1/0/1"       "52.58/12.05/40.52"    "290.06/97.62/192.44"
"node5"  "8/0.525/7.475"        "2/0/2"       "52.57/27.37/25.2"     "453.94/284.24/169.7"
"node6"  "32/18.425/13.575"     "1/0/1"       "47.8/30.14/17.66"     "175.12/155.12/20"
"node7"  "256/108.275/147.725"  "3/0/3"       "288.16/161.36/126.8"  "352.06/217.56/134.5"

ACTIVE TOTAL:
"cpu(cores)"  "gpu"  "mem(GiB)"  "ephemeral(GiB)"  "beta1(GiB)"  "beta2(GiB)"  "beta3(GiB)"
166.5         7      174.31      290.07            0             0             52.5

PERSISTENT STORAGE:
"storage class"  "available space(GiB)"
"beta3"          356.66

PENDING TOTAL:
"cpu(cores)"  "gpu"  "mem(GiB)"  "ephemeral(GiB)"  "beta1(GiB)"  "beta2(GiB)"  "beta3(GiB)"
15            0      0.5         0.5               0             0             0

Inventory completely stopped working.

{"cluster":{"leases":12,"inventory":{"active":[{"cpu":4000,"gpu":4,"memory":37580963840,"storage_ephemeral":53687091200},{"cpu":1500,"gpu":0,"memory":5368709120,"storage_ephemeral":8589934592},{"cpu":1000,"gpu":0,"memory":2147483648,"storage_ephemeral":1073741824,"storage":{"beta3":1073741824}},{"cpu":2000,"gpu":0,"memory":16000000000,"storage_ephemeral":100000000000},{"cpu":128000,"gpu":0,"memory":34359738368,"storage_ephemeral":32212254720},{"cpu":4000,"gpu":0,"memory":12884901888,"storage_ephemeral":1610612736},{"cpu":1000,"gpu":0,"memory":8000000000,"storage_ephemeral":30000000000},{"cpu":2000,"gpu":2,"memory":37580963840,"storage_ephemeral":53687091200},{"cpu":4000,"gpu":0,"memory":8589934592,"storage_ephemeral":2147483648,"storage":{"beta3":10737418240}},{"cpu":12000,"gpu":1,"memory":17179869184,"storage_ephemeral":21474836480},{"cpu":5000,"gpu":0,"memory":5368709120,"storage_ephemeral":5368709120,"storage":{"beta3":42949672960}},{"cpu":2000,"gpu":0,"memory":2097741824,"storage_ephemeral":1610612736,"storage":{"beta3":1610612736}}],"available":{"nodes":[{"name":"node1","allocatable":{"cpu":8000,"gpu":0,"memory":8068288512,"storage_ephemeral":46314425473},"available":{"cpu":380,"gpu":0,"memory":400015360,"storage_ephemeral":40408845441}},{"name":"node2","allocatable":{"cpu":48000,"gpu":4,"memory":123604434944,"storage_ephemeral":630096038893},"available":{"cpu":15575,"gpu":0,"memory":43081193472,"storage_ephemeral":266350410733}},{"name":"node3","allocatable":{"cpu":128000,"gpu":0,"memory":154365534208,"storage_ephemeral":378025411573},"available":{"cpu":114025,"gpu":0,"memory":105111463936,"storage_ephemeral":268361735157}},{"name":"node4","allocatable":{"cpu":128000,"gpu":1,"memory":56455852032,"storage_ephemeral":311444659299},"available":{"cpu":108795,"gpu":0,"memory":12943570944,"storage_ephemeral":104814129251}},{"name":"node5","allocatable":{"cpu":8000,"gpu":2,"memory":56443244544,"storage_ephemeral":487414664409},"available":{"cpu":525,"gpu":0,"memory":29388990464,"storage_ephemeral":305202409689}},{"name":"node6","allocatable":{"cpu":32000,"gpu":1,"memory":51326119936,"storage_ephemeral":188036982064},"available":{"cpu":18425,"gpu":0,"memory":32362043392,"storage_ephemeral":166562145584}},{"name":"node7","allocatable":{"cpu":256000,"gpu":3,"memory":309405798400,"storage_ephemeral":378025411573},"available":{"cpu":108275,"gpu":0,"memory":173257062400,"storage_ephemeral":233607136245}}],"storage":[{"class":"beta3","size":382961909760}]}}},"bidengine":{"orders":0},"manifest":{"deployments":0},"cluster_public_hostname":"provider.pcgameservers.com","address":"akash17l0f3kf7gv4kmgqjmgc0ksj3em6lqgcc4kl4dg"}

After restarting both services (akash-provider-0, operator-inventory):

PROVIDER INFO
"hostname"                    "address"
"provider.pcgameservers.com"  "akash17l0f3kf7gv4kmgqjmgc0ksj3em6lqgcc4kl4dg"

Total/Allocatable/Used (t/a/u) per node:
"name"   "cpu(t/a/u)"           "gpu(t/a/u)"  "mem(t/a/u GiB)"      "ephemeral(t/a/u GiB)"
"node1"  "8/5.38/2.62"          "0/0/0"       "7.51/5.62/1.89"      "43.13/43.13/0"
"node2"  "48/38.575/9.425"      "4/0/4"       "115.12/66.07/49.04"  "586.82/536.82/50"
"node3"  "128/114.025/13.975"   "0/0/0"       "143.76/97.89/45.87"  "352.06/249.93/102.13"
"node4"  "128/93.795/34.205"    "1/1/0"       "52.58/7.55/45.02"    "290.06/227.62/62.44"
"node5"  "8/6.025/1.975"        "2/2/0"       "55.45/53.45/2"       "453.94/453.94/0"
"node6"  "32/18.425/13.575"     "1/0/1"       "47.8/30.14/17.66"    "175.12/155.12/20"
"node7"  "256/114.275/141.725"  "3/1/2"       "288.16/196.36/91.8"  "352.06/267.56/84.5"

ACTIVE TOTAL:
"cpu(cores)"  "gpu"  "mem(GiB)"  "ephemeral(GiB)"  "beta1(GiB)"  "beta2(GiB)"  "beta3(GiB)"
166.5         7      174.31      290.07            0             0             52.5

PERSISTENT STORAGE:
"storage class"  "available space(GiB)"
"beta3"          396.65

PENDING TOTAL:
"cpu(cores)"  "gpu"  "mem(GiB)"  "ephemeral(GiB)"  "beta1(GiB)"  "beta2(GiB)"  "beta3(GiB)"
andy108369 commented 3 months ago

Now the Hurricane provider keeps reporting 18446744073709524 allocatable CPUs even after I restart the inventory-operator, which until now had usually fixed the issue temporarily.

Fixed the Hurricane reporting. Possibly it was caused by some deployments in a Failed state. I cleaned them up, after which the reporting looks good:

arno@x1:~$ kubectl get pods -A --sort-by='{.metadata.creationTimestamp}' -o wide --field-selector status.phase=Failed 
NAMESPACE                                       NAME                           READY   STATUS                   RESTARTS        AGE     IP       NODE                   NOMINATED NODE   READINESS GATES
hg49sq80mpk3e7q7m43asnrhe1tu9639usr0psr7fkq7m   miner-xmrig-7877f4f8d9-9txlz   0/1     Error                    1               17d     <none>   worker-01.hurricane2   <none>           <none>
2kmrpu3u3u5psi5hip2oqt0bajako0e9o9phc5j5c5dmu   web-f7785f6c6-p6m69            0/1     ContainerStatusUnknown   2 (13d ago)     14d     <none>   worker-01.hurricane2   <none>           <none>
2kmrpu3u3u5psi5hip2oqt0bajako0e9o9phc5j5c5dmu   web-f7785f6c6-twxrr            0/1     ContainerStatusUnknown   1 (11d ago)     11d     <none>   worker-01.hurricane2   <none>           <none>
2kmrpu3u3u5psi5hip2oqt0bajako0e9o9phc5j5c5dmu   web-f7785f6c6-5dgbl            0/1     ContainerStatusUnknown   2 (6d22h ago)   7d16h   <none>   worker-01.hurricane2   <none>           <none>
2kmrpu3u3u5psi5hip2oqt0bajako0e9o9phc5j5c5dmu   web-f7785f6c6-95b2k            0/1     ContainerStatusUnknown   1               5d6h    <none>   worker-01.hurricane2   <none>           <none>
2kmrpu3u3u5psi5hip2oqt0bajako0e9o9phc5j5c5dmu   web-f7785f6c6-9nh5b            0/1     ContainerStatusUnknown   1               4d17h   <none>   worker-01.hurricane2   <none>           <none>

arno@x1:~$ kubectl -n 2kmrpu3u3u5psi5hip2oqt0bajako0e9o9phc5j5c5dmu get rs
NAME            DESIRED   CURRENT   READY   AGE
web-f7785f6c6   1         1         1       14d

arno@x1:~$ kubectl -n 2kmrpu3u3u5psi5hip2oqt0bajako0e9o9phc5j5c5dmu get pods
NAME                  READY   STATUS                   RESTARTS        AGE
web-f7785f6c6-2r2j6   0/1     Completed                0               2d11h
web-f7785f6c6-4m462   0/1     Completed                2 (5d18h ago)   6d10h
web-f7785f6c6-5dgbl   0/1     ContainerStatusUnknown   2 (6d22h ago)   7d16h
web-f7785f6c6-95b2k   0/1     ContainerStatusUnknown   1               5d6h
web-f7785f6c6-9nh5b   0/1     ContainerStatusUnknown   1               4d17h
web-f7785f6c6-dsjp8   0/1     Completed                7 (8d ago)      11d
web-f7785f6c6-fl49h   0/1     Completed                0               3d4h
web-f7785f6c6-g2sfx   0/1     Completed                2 (12d ago)     12d
web-f7785f6c6-j2prf   1/1     Running                  4 (86m ago)     2d1h
web-f7785f6c6-p6m69   0/1     ContainerStatusUnknown   2 (13d ago)     14d
web-f7785f6c6-pk98k   0/1     Completed                3 (3d13h ago)   4d5h
web-f7785f6c6-q89gg   0/1     Completed                0               8d
web-f7785f6c6-twxrr   0/1     ContainerStatusUnknown   1 (11d ago)     11d
web-f7785f6c6-z8w9f   0/1     Completed                0               2d20h

arno@x1:~$ kubectl -n hg49sq80mpk3e7q7m43asnrhe1tu9639usr0psr7fkq7m get rs
NAME                     DESIRED   CURRENT   READY   AGE
miner-xmrig-7877f4f8d9   1         1         1       21d

arno@x1:~$ kubectl -n hg49sq80mpk3e7q7m43asnrhe1tu9639usr0psr7fkq7m get pods
NAME                           READY   STATUS    RESTARTS      AGE
miner-xmrig-7877f4f8d9-8mfmt   1/1     Running   1 (86m ago)   7d15h
miner-xmrig-7877f4f8d9-9txlz   0/1     Error     1             17d

arno@x1:~$ kubectl delete pods -A --field-selector status.phase=Failed 
pod "web-f7785f6c6-5dgbl" deleted
pod "web-f7785f6c6-95b2k" deleted
pod "web-f7785f6c6-9nh5b" deleted
pod "web-f7785f6c6-p6m69" deleted
pod "web-f7785f6c6-twxrr" deleted
pod "miner-xmrig-7877f4f8d9-9txlz" deleted

arno@x1:~$ kubectl -n hg49sq80mpk3e7q7m43asnrhe1tu9639usr0psr7fkq7m get pods
NAME                           READY   STATUS    RESTARTS      AGE
miner-xmrig-7877f4f8d9-8mfmt   1/1     Running   1 (86m ago)   7d15h

arno@x1:~$ kubectl -n 2kmrpu3u3u5psi5hip2oqt0bajako0e9o9phc5j5c5dmu get pods
NAME                  READY   STATUS      RESTARTS        AGE
web-f7785f6c6-2r2j6   0/1     Completed   0               2d11h
web-f7785f6c6-4m462   0/1     Completed   2 (5d18h ago)   6d10h
web-f7785f6c6-dsjp8   0/1     Completed   7 (8d ago)      11d
web-f7785f6c6-fl49h   0/1     Completed   0               3d4h
web-f7785f6c6-g2sfx   0/1     Completed   2 (12d ago)     12d
web-f7785f6c6-j2prf   1/1     Running     4 (86m ago)     2d1h
web-f7785f6c6-pk98k   0/1     Completed   3 (3d13h ago)   4d5h
web-f7785f6c6-q89gg   0/1     Completed   0               8d
web-f7785f6c6-z8w9f   0/1     Completed   0               2d20h
$ provider_info2.sh provider.hurricane.akash.pub
PROVIDER INFO
"hostname"                      "address"
"provider.hurricane.akash.pub"  "akash15tl6v6gd0nte0syyxnv57zmmspgju4c3xfmdhk"

Total/Allocatable/Used (t/a/u) per node:
"name"                   "cpu(t/a/u)"         "gpu(t/a/u)"  "mem(t/a/u GiB)"      "ephemeral(t/a/u GiB)"
"control-01.hurricane2"  "2/1.2/0.8"          "0/0/0"       "1.82/1.69/0.13"      "25.54/25.54/0"
"worker-01.hurricane2"   "102/18.795/83.205"  "1/1/0"       "196.45/102.2/94.25"  "1808.76/1548.18/260.58"

ACTIVE TOTAL:
"cpu(cores)"  "gpu"  "mem(GiB)"  "ephemeral(GiB)"  "beta1(GiB)"  "beta2(GiB)"  "beta3(GiB)"
63            0      59.73       236.68            0             0             0

PERSISTENT STORAGE:
"storage class"  "available space(GiB)"
"beta3"          276.44

PENDING TOTAL:
"cpu(cores)"  "gpu"  "mem(GiB)"  "ephemeral(GiB)"  "beta1(GiB)"  "beta2(GiB)"  "beta3(GiB)"
1.8           0      8           10                0             0             0
andy108369 commented 3 months ago

This still appears to be an issue with provider-services v0.5.9 & v0.5.11:

$ curl -s -k https://provider.sg.lnlm.akash.pub:8443/status | jq -r . | grep -C1 1844
            "available": {
              "cpu": 18446744073709520000,
              "gpu": 0,
              "memory": 18446743950448112000,
              "storage_ephemeral": 424525602114
--
            "available": {
              "cpu": 18446744073709496000,
              "gpu": 0,
              "memory": 18446743844816663000,
              "storage_ephemeral": 424525602114
--
            "available": {
              "cpu": 18446744073709537000,
              "gpu": 0,
              "memory": 18446744023303657000,
              "storage_ephemeral": 424525602114
andy108369 commented 3 months ago

(pdx.nb.akash.pub 4090s provider) It appears that this also has something to do with failing pods. For instance: 3 nodes with 8x 4090s each.

node1 and node2 had the wrong nvidia.ko driver version installed (550 instead of 535). I reinstalled it while these deployments were running, then restarted all 3 nodes.
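
A quick sketch for verifying which driver version each node actually ended up with after the reinstall (run on the node itself):

$ nvidia-smi --query-gpu=driver_version --format=csv,noheader
$ modinfo nvidia | grep ^version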

This caused some pods to get stuck in the ContainerStatusUnknown state:

$ kubectl get pods -A --sort-by='{.metadata.creationTimestamp}' -o wide --field-selector status.phase=Failed 
NAMESPACE                                       NAME                         READY   STATUS                   RESTARTS   AGE   IP       NODE    NOMINATED NODE   READINESS GATES
3n3mvl6qqh1bkk41pou3dkfttkjglo6tua36udmh6n4fm   service-1-dd746bf44-4745c    0/1     ContainerStatusUnknown   0          43m   <none>   node1   <none>           <none>
1qhlsoi0sqj2rot1otov7vhfao2j0cnmbuvkj2qd16ese   service-1-76f8b9cf6d-dp7x2   0/1     ContainerStatusUnknown   0          33m   <none>   node2   <none>           <none>
va7h4phadmnfld29qd5rdtdsgk6eupf39d6etke05t0fu   service-1-7c59bdb7df-j586f   0/1     ContainerStatusUnknown   0          32m   <none>   node2   <none>           <none>
qg9lq6q8tcta1p2m9fuc1pdbjfispht8q7e7iun6t5s2e   service-1-59974dfd89-sgvjg   0/1     Unknown                  0          32m   <none>   node2   <none>           <none>
6ng2gu6vf5p8qg5bde5udse1e34igb1bn15kaeupiuhva   service-1-59c44cd758-mzvsj   0/1     Unknown                  0          30m   <none>   node2   <none>           <none>
d62adnou0v7b5s7h3t8gnh0av540fcok9bk56u72f3je2   service-1-7cffd45f48-w25vt   0/1     ContainerStatusUnknown   0          28m   <none>   node2   <none>           <none>
of3uincqjlja5ekk8cbfuormpm4dmn8v403c535f4dc4m   service-1-7dd5dffbc4-brgvg   0/1     ContainerStatusUnknown   0          25m   <none>   node1   <none>           <none>
rfij4esvggf9cqqnpf2hq266o0nba01t5iq918bu1v9iu   service-1-58bf676fdc-ph4f9   0/1     ContainerStatusUnknown   0          23m   <none>   node2   <none>           <none>
6pgohd98lm7gs5rb2kv5bnc4c9920jtvfvmg4ikqvhn8a   service-1-55858fc545-6fr5l   0/1     ContainerStatusUnknown   0          22m   <none>   node1   <none>           <none>
guf6r2fhenpfljbhncip9sbei3ss43av4kaau95kl4rpq   service-1-598c857c89-wtkvh   0/1     ContainerStatusUnknown   0          22m   <none>   node2   <none>           <none>
tsu3ue9housp0ehjsr51psu4aambpaqvtpuninpl07hqs   service-1-55fd66f6f5-2hfqv   0/1     ContainerStatusUnknown   0          21m   <none>   node1   <none>           <none>
qrka4mab6esns6blt8jaeos663j0e6sbp9cfghi02jc4i   service-1-84c67b446-5v55h    0/1     ContainerStatusUnknown   0          20m   <none>   node1   <none>           <none>
eogo7cjtebo8fr0g9l1mfmo17j5r3efi1hitlpki3g04g   service-1-84988c5fb6-8rhtq   0/1     ContainerStatusUnknown   0          20m   <none>   node1   <none>           <none>
g5i1ml6bhnfso9faglp1gv167f8acegsv335hlkq0dlfc   service-1-7d66cdd98c-572vx   0/1     ContainerStatusUnknown   0          18m   <none>   node1   <none>           <none>

This in turn triggered the bug (note the GPU count for node1 and node2):

$ provider_info2.sh provider.pdx.nb.akash.pub
PROVIDER INFO
"hostname"                   "address"
"provider.pdx.nb.akash.pub"  "akash1t0sk5nhc8n3xply5ft60x9det0s7jwplzzycnv"

Total/Available/Used (t/a/u) per node:
"name"   "cpu(t/a/u)"          "gpu(t/a/u)"                                    "mem(t/a/u GiB)"        "ephemeral(t/a/u GiB)"
"node1"  "128/21.38/106.62"    "0/18446744073709552000/-18446744073709552000"  "503.61/401.45/102.16"  "6385.77/5515.21/870.55"
"node2"  "128/24.12/103.88"    "8/18446744073709552000/-18446744073709552000"  "503.61/375.96/127.65"  "6385.77/4059.33/2326.43"
"node3"  "128/22.425/105.575"  "8/1/7"                                         "503.61/400.99/102.62"  "6385.77/5515.21/870.55"

ACTIVE TOTAL:
"cpu(cores)"  "gpu"  "mem(GiB)"  "ephemeral(GiB)"  "beta1(GiB)"  "beta2(GiB)"  "beta3(GiB)"
149           14     152.74      2033.77           0             0             0

PERSISTENT STORAGE:
"storage class"  "available space(GiB)"
"beta3"          1699.04

PENDING TOTAL:
"cpu(cores)"  "gpu"  "mem(GiB)"  "ephemeral(GiB)"  "beta1(GiB)"  "beta2(GiB)"  "beta3(GiB)"

I deleted the pods stuck in ContainerStatusUnknown and the stats immediately recovered:

kubectl delete pods -A --field-selector status.phase=Failed 
$ provider_info2.sh provider.pdx.nb.akash.pub
PROVIDER INFO
"hostname"                   "address"
"provider.pdx.nb.akash.pub"  "akash1t0sk5nhc8n3xply5ft60x9det0s7jwplzzycnv"

Total/Available/Used (t/a/u) per node:
"name"   "cpu(t/a/u)"          "gpu(t/a/u)"  "mem(t/a/u GiB)"        "ephemeral(t/a/u GiB)"
"node1"  "128/57.38/70.62"     "8/4/4"       "503.61/434.97/68.64"   "6385.77/5938.73/447.03"
"node2"  "128/73.12/54.88"     "8/1/7"       "503.61/435.56/68.05"   "6385.77/5222.55/1163.22"
"node3"  "128/22.425/105.575"  "8/1/7"       "503.61/400.99/102.62"  "6385.77/5515.21/870.55"

ACTIVE TOTAL:
"cpu(cores)"  "gpu"  "mem(GiB)"  "ephemeral(GiB)"  "beta1(GiB)"  "beta2(GiB)"  "beta3(GiB)"
181           16     182.54      2257.29           0             0             0

PERSISTENT STORAGE:
"storage class"  "available space(GiB)"
"beta3"          1699.04

PENDING TOTAL:
"cpu(cores)"  "gpu"  "mem(GiB)"  "ephemeral(GiB)"  "beta1(GiB)"  "beta2(GiB)"  "beta3(GiB)"
16            1      14.9        111.76            0             0             0
andy108369 commented 2 months ago

pdx.nb provider - issue happened in under 55 mins after operator-inventory restart

I think the pdx.nb provider is a good candidate for monitoring this issue, since it started occurring frequently after the node1.pdx.nb.akash.pub node was replaced yesterday (mainboard, GPUs & ceph disk; everything except the main OS disks / rootfs).
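
A minimal monitoring sketch against the same 8443/status endpoint used above; it only flags when values near 2^64 show up (hostname and polling interval are placeholders):

$ while true; do curl -sk https://provider.pdx.nb.akash.pub:8443/status | grep -q 184467440737 && echo "$(date -u) wraparound values present in /status"; sleep 300; done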

$ provider_info2.sh provider.pdx.nb.akash.pub
PROVIDER INFO
"hostname"                   "address"
"provider.pdx.nb.akash.pub"  "akash1t0sk5nhc8n3xply5ft60x9det0s7jwplzzycnv"

Total/Available/Used (t/a/u) per node:
"name"   "cpu(t/a/u)"         "gpu(t/a/u)"  "mem(t/a/u GiB)"        "ephemeral(t/a/u GiB)"
"node1"  "128/11.65/116.35"   "8/1/7"       "503.59/240.78/262.82"  "6385.77/5603.46/782.31"
"node2"  "128/39.95/88.05"    "8/0/8"       "503.61/418.51/85.1"    "6385.77/5138.73/1247.03"
"node3"  "128/6.325/121.675"  "8/0/8"       "503.61/368.93/134.68"  "6385.77/5403.46/982.31"

ACTIVE TOTAL:
"cpu(cores)"  "gpu"  "mem(GiB)"  "ephemeral(GiB)"  "beta1(GiB)"  "beta2(GiB)"  "beta3(GiB)"
356           26     583.64      3346.93           0             0             770

PERSISTENT STORAGE:
"storage class"  "available space(GiB)"
"beta3"          846.13

PENDING TOTAL:
"cpu(cores)"  "gpu"  "mem(GiB)"  "ephemeral(GiB)"  "beta1(GiB)"  "beta2(GiB)"  "beta3(GiB)"
$ provider_info2.sh provider.pdx.nb.akash.pub
PROVIDER INFO
"hostname"                   "address"
"provider.pdx.nb.akash.pub"  "akash1t0sk5nhc8n3xply5ft60x9det0s7jwplzzycnv"

Total/Available/Used (t/a/u) per node:
"name"   "cpu(t/a/u)"                                "gpu(t/a/u)"                                    "mem(t/a/u GiB)"        "ephemeral(t/a/u GiB)"
"node1"  "128/18446744073709468/-18446744073709340"  "8/18446744073709552000/-18446744073709552000"  "503.59/16.78/486.82"   "6385.77/4932.9/1452.86"
"node2"  "128/39.95/88.05"                           "8/0/8"                                         "503.61/418.51/85.1"    "6385.77/5138.73/1247.03"
"node3"  "128/6.325/121.675"                         "8/0/8"                                         "503.61/368.93/134.68"  "6385.77/5403.46/982.31"

ACTIVE TOTAL:
"cpu(cores)"  "gpu"  "mem(GiB)"  "ephemeral(GiB)"  "beta1(GiB)"  "beta2(GiB)"  "beta3(GiB)"
356           26     583.64      3346.93           0             0             770

PERSISTENT STORAGE:
"storage class"  "available space(GiB)"
"beta3"          846.13

PENDING TOTAL:
"cpu(cores)"  "gpu"  "mem(GiB)"  "ephemeral(GiB)"  "beta1(GiB)"  "beta2(GiB)"  "beta3(GiB)"
$ kubectl -n akash-services get pods 
NAME                                          READY   STATUS    RESTARTS        AGE
akash-node-1-0                                1/1     Running   2 (5d16h ago)   5d17h
akash-provider-0                              1/1     Running   0               15h
operator-hostname-574d8699d-c22w5             1/1     Running   3 (3d11h ago)   5d17h
operator-inventory-75df5b6fb5-2k897           1/1     Running   0               55m
operator-inventory-hardware-discovery-node1   1/1     Running   0               55m
operator-inventory-hardware-discovery-node2   1/1     Running   0               55m
operator-inventory-hardware-discovery-node3   1/1     Running   0               55m
$ kubectl -n akash-services logs deployment/operator-inventory --timestamps
2024-04-10T09:19:38.827373163Z I[2024-04-10|09:19:38.827] using in cluster kube config                 cmp=provider
2024-04-10T09:19:39.849250140Z INFO rook-ceph      ADDED monitoring CephCluster {"ns": "rook-ceph", "name": "rook-ceph"}
2024-04-10T09:19:39.874528936Z INFO nodes.nodes waiting for nodes to finish
2024-04-10T09:19:39.874554196Z INFO grpc listening on ":8081"
2024-04-10T09:19:39.874575036Z INFO watcher.storageclasses  started
2024-04-10T09:19:39.874577926Z INFO watcher.config  started
2024-04-10T09:19:39.874582506Z INFO rest listening on ":8080"
2024-04-10T09:19:39.877335127Z INFO rook-ceph      ADDED monitoring StorageClass    {"name": "beta3"}
2024-04-10T09:19:39.878384595Z INFO nodes.node.monitor  starting    {"node": "node1"}
2024-04-10T09:19:39.878393235Z INFO nodes.node.discovery    starting hardware discovery pod {"node": "node1"}
2024-04-10T09:19:39.878405655Z INFO nodes.node.monitor  starting    {"node": "node2"}
2024-04-10T09:19:39.878415995Z INFO nodes.node.discovery    starting hardware discovery pod {"node": "node2"}
2024-04-10T09:19:39.878422595Z INFO nodes.node.monitor  starting    {"node": "node3"}
2024-04-10T09:19:39.878454075Z INFO nodes.node.discovery    starting hardware discovery pod {"node": "node3"}
2024-04-10T09:19:39.885778559Z INFO rancher    ADDED monitoring StorageClass    {"name": "beta3"}
2024-04-10T09:19:42.756543562Z INFO nodes.node.discovery    started hardware discovery pod  {"node": "node3"}
2024-04-10T09:19:42.916743883Z INFO nodes.node.discovery    started hardware discovery pod  {"node": "node2"}
2024-04-10T09:19:43.183058728Z INFO nodes.node.discovery    started hardware discovery pod  {"node": "node1"}
2024-04-10T09:19:43.202991245Z INFO nodes.node.monitor  started {"node": "node2"}
2024-04-10T09:19:43.426359532Z INFO nodes.node.monitor  started {"node": "node1"}
2024-04-10T09:19:44.084855728Z INFO nodes.node.monitor  started {"node": "node3"}
2024-04-10T09:19:49.948809750Z INFO rook-ceph   MODIFIED monitoring CephCluster {"ns": "rook-ceph", "name": "rook-ceph"}
2024-04-10T09:20:50.587788813Z INFO rook-ceph   MODIFIED monitoring CephCluster {"ns": "rook-ceph", "name": "rook-ceph"}
2024-04-10T09:21:51.246518697Z INFO rook-ceph   MODIFIED monitoring CephCluster {"ns": "rook-ceph", "name": "rook-ceph"}
2024-04-10T09:22:51.918597738Z INFO rook-ceph   MODIFIED monitoring CephCluster {"ns": "rook-ceph", "name": "rook-ceph"}
2024-04-10T09:23:52.573545877Z INFO rook-ceph   MODIFIED monitoring CephCluster {"ns": "rook-ceph", "name": "rook-ceph"}
2024-04-10T09:24:53.223854209Z INFO rook-ceph   MODIFIED monitoring CephCluster {"ns": "rook-ceph", "name": "rook-ceph"}
2024-04-10T09:25:53.901694220Z INFO rook-ceph   MODIFIED monitoring CephCluster {"ns": "rook-ceph", "name": "rook-ceph"}
2024-04-10T09:26:54.559292611Z INFO rook-ceph   MODIFIED monitoring CephCluster {"ns": "rook-ceph", "name": "rook-ceph"}
2024-04-10T09:27:55.217895603Z INFO rook-ceph   MODIFIED monitoring CephCluster {"ns": "rook-ceph", "name": "rook-ceph"}
2024-04-10T09:28:55.871364360Z INFO rook-ceph   MODIFIED monitoring CephCluster {"ns": "rook-ceph", "name": "rook-ceph"}
2024-04-10T09:29:56.526719703Z INFO rook-ceph   MODIFIED monitoring CephCluster {"ns": "rook-ceph", "name": "rook-ceph"}
2024-04-10T09:30:57.202289863Z INFO rook-ceph   MODIFIED monitoring CephCluster {"ns": "rook-ceph", "name": "rook-ceph"}
2024-04-10T09:31:57.853717274Z INFO rook-ceph   MODIFIED monitoring CephCluster {"ns": "rook-ceph", "name": "rook-ceph"}
2024-04-10T09:32:58.516702894Z INFO rook-ceph   MODIFIED monitoring CephCluster {"ns": "rook-ceph", "name": "rook-ceph"}
2024-04-10T09:33:59.173653862Z INFO rook-ceph   MODIFIED monitoring CephCluster {"ns": "rook-ceph", "name": "rook-ceph"}
2024-04-10T09:34:59.835751168Z INFO rook-ceph   MODIFIED monitoring CephCluster {"ns": "rook-ceph", "name": "rook-ceph"}
2024-04-10T09:36:00.494909031Z INFO rook-ceph   MODIFIED monitoring CephCluster {"ns": "rook-ceph", "name": "rook-ceph"}
2024-04-10T09:37:01.155779061Z INFO rook-ceph   MODIFIED monitoring CephCluster {"ns": "rook-ceph", "name": "rook-ceph"}
2024-04-10T09:38:01.817805371Z INFO rook-ceph   MODIFIED monitoring CephCluster {"ns": "rook-ceph", "name": "rook-ceph"}
2024-04-10T09:39:02.475386563Z INFO rook-ceph   MODIFIED monitoring CephCluster {"ns": "rook-ceph", "name": "rook-ceph"}
2024-04-10T09:40:03.139422647Z INFO rook-ceph   MODIFIED monitoring CephCluster {"ns": "rook-ceph", "name": "rook-ceph"}
2024-04-10T09:41:03.806580305Z INFO rook-ceph   MODIFIED monitoring CephCluster {"ns": "rook-ceph", "name": "rook-ceph"}
2024-04-10T09:42:04.457760854Z INFO rook-ceph   MODIFIED monitoring CephCluster {"ns": "rook-ceph", "name": "rook-ceph"}
2024-04-10T09:43:05.106685230Z INFO rook-ceph   MODIFIED monitoring CephCluster {"ns": "rook-ceph", "name": "rook-ceph"}
2024-04-10T09:44:05.749807485Z INFO rook-ceph   MODIFIED monitoring CephCluster {"ns": "rook-ceph", "name": "rook-ceph"}
2024-04-10T09:45:06.406388585Z INFO rook-ceph   MODIFIED monitoring CephCluster {"ns": "rook-ceph", "name": "rook-ceph"}
2024-04-10T09:46:07.065620189Z INFO rook-ceph   MODIFIED monitoring CephCluster {"ns": "rook-ceph", "name": "rook-ceph"}
2024-04-10T09:47:07.734040162Z INFO rook-ceph   MODIFIED monitoring CephCluster {"ns": "rook-ceph", "name": "rook-ceph"}
2024-04-10T09:48:08.401818233Z INFO rook-ceph   MODIFIED monitoring CephCluster {"ns": "rook-ceph", "name": "rook-ceph"}
2024-04-10T09:49:09.064330123Z INFO rook-ceph   MODIFIED monitoring CephCluster {"ns": "rook-ceph", "name": "rook-ceph"}
2024-04-10T09:50:09.719065483Z INFO rook-ceph   MODIFIED monitoring CephCluster {"ns": "rook-ceph", "name": "rook-ceph"}
2024-04-10T09:51:10.362797022Z INFO rook-ceph   MODIFIED monitoring CephCluster {"ns": "rook-ceph", "name": "rook-ceph"}
2024-04-10T09:52:11.025336455Z INFO rook-ceph   MODIFIED monitoring CephCluster {"ns": "rook-ceph", "name": "rook-ceph"}
2024-04-10T09:53:11.682972539Z INFO rook-ceph   MODIFIED monitoring CephCluster {"ns": "rook-ceph", "name": "rook-ceph"}
2024-04-10T09:54:12.349148024Z INFO rook-ceph   MODIFIED monitoring CephCluster {"ns": "rook-ceph", "name": "rook-ceph"}
2024-04-10T09:55:13.014021018Z INFO rook-ceph   MODIFIED monitoring CephCluster {"ns": "rook-ceph", "name": "rook-ceph"}
2024-04-10T09:56:13.689351710Z INFO rook-ceph   MODIFIED monitoring CephCluster {"ns": "rook-ceph", "name": "rook-ceph"}
2024-04-10T09:57:14.351208230Z INFO rook-ceph   MODIFIED monitoring CephCluster {"ns": "rook-ceph", "name": "rook-ceph"}
2024-04-10T09:58:15.017461998Z INFO rook-ceph   MODIFIED monitoring CephCluster {"ns": "rook-ceph", "name": "rook-ceph"}
2024-04-10T09:59:15.672108483Z INFO rook-ceph   MODIFIED monitoring CephCluster {"ns": "rook-ceph", "name": "rook-ceph"}
2024-04-10T10:00:16.317829774Z INFO rook-ceph   MODIFIED monitoring CephCluster {"ns": "rook-ceph", "name": "rook-ceph"}
2024-04-10T10:01:16.969684860Z INFO rook-ceph   MODIFIED monitoring CephCluster {"ns": "rook-ceph", "name": "rook-ceph"}
2024-04-10T10:02:17.638128924Z INFO rook-ceph   MODIFIED monitoring CephCluster {"ns": "rook-ceph", "name": "rook-ceph"}
2024-04-10T10:03:18.284770571Z INFO rook-ceph   MODIFIED monitoring CephCluster {"ns": "rook-ceph", "name": "rook-ceph"}
2024-04-10T10:04:18.947803687Z INFO rook-ceph   MODIFIED monitoring CephCluster {"ns": "rook-ceph", "name": "rook-ceph"}
2024-04-10T10:05:19.593905389Z INFO rook-ceph   MODIFIED monitoring CephCluster {"ns": "rook-ceph", "name": "rook-ceph"}
2024-04-10T10:06:20.263411687Z INFO rook-ceph   MODIFIED monitoring CephCluster {"ns": "rook-ceph", "name": "rook-ceph"}
2024-04-10T10:07:20.903124077Z INFO rook-ceph   MODIFIED monitoring CephCluster {"ns": "rook-ceph", "name": "rook-ceph"}
2024-04-10T10:08:21.553959836Z INFO rook-ceph   MODIFIED monitoring CephCluster {"ns": "rook-ceph", "name": "rook-ceph"}
2024-04-10T10:09:22.205702368Z INFO rook-ceph   MODIFIED monitoring CephCluster {"ns": "rook-ceph", "name": "rook-ceph"}
2024-04-10T10:10:22.857968005Z INFO rook-ceph   MODIFIED monitoring CephCluster {"ns": "rook-ceph", "name": "rook-ceph"}
2024-04-10T10:11:23.532570698Z INFO rook-ceph   MODIFIED monitoring CephCluster {"ns": "rook-ceph", "name": "rook-ceph"}
2024-04-10T10:12:24.181899391Z INFO rook-ceph   MODIFIED monitoring CephCluster {"ns": "rook-ceph", "name": "rook-ceph"}
2024-04-10T10:13:24.839654285Z INFO rook-ceph   MODIFIED monitoring CephCluster {"ns": "rook-ceph", "name": "rook-ceph"}
2024-04-10T10:14:25.493307725Z INFO rook-ceph   MODIFIED monitoring CephCluster {"ns": "rook-ceph", "name": "rook-ceph"}

This might also be the trigger:


NAMESPACE                                       LAST SEEN   TYPE      REASON              OBJECT                                            MESSAGE
sj264h0mg6bq9alvkqtnq69ubjd9ptq4ubdfkuc9i6rdm   14m         Warning   FailedScheduling    pod/service-1-0                                   0/3 nodes are available: 2 Insufficient cpu, 2 Insufficient nvidia.com/gpu. preemption: 0/3 nodes are available: 3 No preemption victims found for incoming pod..
akash-services                                  60m         Normal    Scheduled           pod/operator-inventory-75df5b6fb5-2k897           Successfully assigned akash-services/operator-inventory-75df5b6fb5-2k897 to node2
eg0vmr4qmf9kdumohtdahhqq14aa3i5q1dutblo0jugc2   14m         Warning   FailedScheduling    pod/service-1-0                                   0/3 nodes are available: 2 Insufficient cpu, 2 Insufficient nvidia.com/gpu. preemption: 0/3 nodes are available: 3 No preemption victims found for incoming pod..
eip4fok1c0g4eome40s2r4u3941at4sua6rlh842c07dq   14m         Warning   FailedScheduling    pod/service-1-0                                   0/3 nodes are available: 2 Insufficient cpu, 2 Insufficient nvidia.com/gpu. preemption: 0/3 nodes are available: 3 No preemption victims found for incoming pod..

FWIW, `11/26` leases are using `beta3` persistent storage.
I know how to contact the owners of most of the 26 leases on the pdx.nb provider if needed.
andy108369 commented 2 months ago

sg.lnlm.akash.pub - issue got triggered

Looks like this triggered the "excessively large stats" issue on sg.lnlm.akash.pub:

2024-04-11T08:10:21.154985407Z ERROR    watcher.registry    couldn't query pci.ids  {"error": "Get \"\": unsupported protocol scheme \"\""}

Complete logs with timestamps: sg.lnlm.akash.pub.deployment-operator-inventory.log

Additionally

There have been no lease-created nor lease-closed events for this provider in the past week.

However, there have been some bid-created & bid-closed events just today:

andy108369 commented 2 months ago

Provider 0.5.12 does not exhibit the excessive resource reporting issue :rocket:

Next steps: