akash-network / support

Akash Support and Issue Tracking

provider status endpoint: hurricane provider reports an excessively large number of available CPUs #232

Open andy108369 opened 6 days ago

andy108369 commented 6 days ago

The hurricane provider reports an excessively large number of available CPUs:

$ provider_info2.sh provider.hurricane.akash.pub
PROVIDER INFO
BALANCE: 405.635368
"hostname"                      "address"
"provider.hurricane.akash.pub"  "akash15tl6v6gd0nte0syyxnv57zmmspgju4c3xfmdhk"

Total/Available/Used (t/a/u) per node:
"name"                   "cpu(t/a/u)"                                "gpu(t/a/u)"  "mem(t/a/u GiB)"       "ephemeral(t/a/u GiB)"
"control-01.hurricane2"  "2/1.2/0.8"                                 "0/0/0"       "1.82/1.69/0.13"       "25.54/25.54/0"
"worker-01.hurricane2"   "102/18446744073709504/-18446744073709404"  "1/1/0"       "196.45/57.48/138.97"  "1808.76/1443.1/365.67"

ACTIVE TOTAL:
"cpu(cores)"  "gpu"  "mem(GiB)"  "ephemeral(GiB)"  "beta1(GiB)"  "beta2(GiB)"  "beta3(GiB)"
34.2          0      64.88       314.4             0             0             11

PERSISTENT STORAGE:
"storage class"  "available space(GiB)"
"beta3"          575.7

PENDING TOTAL:
"cpu(cores)"  "gpu"  "mem(GiB)"  "ephemeral(GiB)"  "beta1(GiB)"  "beta2(GiB)"  "beta3(GiB)"

The provider_info2.sh script: https://github.com/arno01/akash-tools/blob/main/provider_info2.sh
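
The magnitude of the bogus value suggests an unsigned 64-bit underflow: the "used" column is a huge negative number, and the "available" figure is roughly 2^64 millicores divided by 1000. A quick shell illustration of that arithmetic (the concrete numbers, and the assumption that the inventory accounting effectively subtracts in unsigned 64-bit arithmetic, are mine and not taken from the operator source):

# 102 cores allocatable on worker-01, expressed in millicores
allocatable=102000
# assume stale Failed pods are still counted, pushing the allocated figure past the total
allocated=102100

# the signed difference is -100; reinterpreted as an unsigned 64-bit value it wraps to just under 2^64
printf '%u millicores\n' $(( allocatable - allocated ))      # 18446744073709551516 on a 64-bit system

# divided by 1000 that is ~1.84e16 "cores" -- the same magnitude as the 18446744073709504 reported above
awk 'BEGIN { printf "%.0f cores\n", 18446744073709551516 / 1000 }'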

Versions

$ kubectl -n akash-services get pods -o custom-columns='NAME:.metadata.name,IMAGE:.spec.containers[*].image'
NAME                                                          IMAGE
akash-node-1-0                                                ghcr.io/akash-network/node:0.36.0
akash-provider-0                                              ghcr.io/akash-network/provider:0.6.2
operator-hostname-6dddc6db79-hmmxd                            ghcr.io/akash-network/provider:0.6.2
operator-inventory-6fdf575d44-rnfj4                           ghcr.io/akash-network/provider:0.6.2
operator-inventory-hardware-discovery-control-01.hurricane2   ghcr.io/akash-network/provider:0.6.2
operator-inventory-hardware-discovery-worker-01.hurricane2    ghcr.io/akash-network/provider:0.6.2
operator-ip-d9d6df8cd-t9zw9                                   ghcr.io/akash-network/provider:0.6.2

Logs

I tried restarting the operator-inventory deployment, which previously used to "fix" this issue, but to no avail this time.

kubectl -n akash-services rollout restart deployment/operator-inventory
$ kubectl -n akash-services logs deployment/operator-inventory --timestamps
2024-06-27T15:25:29.979755238Z I[2024-06-27|15:25:29.979] using in cluster kube config                 cmp=provider
2024-06-27T15:25:30.993714193Z INFO rook-ceph      ADDED monitoring CephCluster {"ns": "rook-ceph", "name": "rook-ceph"}
2024-06-27T15:25:31.022718552Z INFO rest listening on ":8080"
2024-06-27T15:25:31.022730122Z INFO nodes.nodes waiting for nodes to finish
2024-06-27T15:25:31.022777911Z INFO grpc listening on ":8081"
2024-06-27T15:25:31.022824901Z INFO watcher.storageclasses  started
2024-06-27T15:25:31.022976338Z INFO watcher.config  started
2024-06-27T15:25:31.027880682Z INFO rook-ceph      ADDED monitoring StorageClass    {"name": "beta3"}
2024-06-27T15:25:31.029378292Z INFO nodes.node.monitor  starting    {"node": "worker-01.hurricane2"}
2024-06-27T15:25:31.029383612Z INFO nodes.node.discovery    starting hardware discovery pod {"node": "control-01.hurricane2"}
2024-06-27T15:25:31.029386222Z INFO nodes.node.monitor  starting    {"node": "control-01.hurricane2"}
2024-06-27T15:25:31.029390481Z INFO nodes.node.discovery    starting hardware discovery pod {"node": "worker-01.hurricane2"}
2024-06-27T15:25:31.063512161Z INFO rancher    ADDED monitoring StorageClass    {"name": "beta3"}
2024-06-27T15:25:31.066705538Z W0627 15:25:31.066598       7 warnings.go:70] metadata.name: this is used in the Pod's hostname, which can result in surprising behavior; a DNS label is recommended: [must not contain dots]
2024-06-27T15:25:31.066875795Z W0627 15:25:31.066601       7 warnings.go:70] metadata.name: this is used in the Pod's hostname, which can result in surprising behavior; a DNS label is recommended: [must not contain dots]
2024-06-27T15:25:32.087372741Z W0627 15:25:32.087218       7 warnings.go:70] metadata.name: this is used in the Pod's hostname, which can result in surprising behavior; a DNS label is recommended: [must not contain dots]
2024-06-27T15:25:32.087522389Z W0627 15:25:32.087456       7 warnings.go:70] metadata.name: this is used in the Pod's hostname, which can result in surprising behavior; a DNS label is recommended: [must not contain dots]
2024-06-27T15:25:33.093759624Z W0627 15:25:33.093649       7 warnings.go:70] metadata.name: this is used in the Pod's hostname, which can result in surprising behavior; a DNS label is recommended: [must not contain dots]
2024-06-27T15:25:33.096327860Z W0627 15:25:33.096250       7 warnings.go:70] metadata.name: this is used in the Pod's hostname, which can result in surprising behavior; a DNS label is recommended: [must not contain dots]
2024-06-27T15:25:35.614448848Z INFO rook-ceph   MODIFIED monitoring CephCluster {"ns": "rook-ceph", "name": "rook-ceph"}
2024-06-27T15:25:35.664476772Z INFO nodes.node.discovery    started hardware discovery pod  {"node": "control-01.hurricane2"}
2024-06-27T15:25:35.780999348Z INFO nodes.node.monitor  started {"node": "control-01.hurricane2"}
2024-06-27T15:25:36.239976215Z INFO nodes.node.discovery    started hardware discovery pod  {"node": "worker-01.hurricane2"}
2024-06-27T15:25:36.454713184Z INFO nodes.node.monitor  started {"node": "worker-01.hurricane2"}
2024-06-27T15:26:36.900875467Z INFO rook-ceph   MODIFIED monitoring CephCluster {"ns": "rook-ceph", "name": "rook-ceph"}
2024-06-27T15:27:38.206330676Z INFO rook-ceph   MODIFIED monitoring CephCluster {"ns": "rook-ceph", "name": "rook-ceph"}
2024-06-27T15:28:39.486188220Z INFO rook-ceph   MODIFIED monitoring CephCluster {"ns": "rook-ceph", "name": "rook-ceph"}
2024-06-27T15:29:40.787165193Z INFO rook-ceph   MODIFIED monitoring CephCluster {"ns": "rook-ceph", "name": "rook-ceph"}
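
For reference, the log shows the operator's gRPC listener on :8081. The raw inventory it serves can presumably be inspected with grpcurl along these lines (the service and method names are an assumption based on the akash.inventory.v1 protos and may differ between provider versions):

# forward the operator's gRPC port to the local machine
kubectl -n akash-services port-forward deployment/operator-inventory 8081:8081 &

# list exposed services (only works if server reflection is enabled)
grpcurl -plaintext localhost:8081 list

# dump the raw cluster inventory (assumed method name)
grpcurl -plaintext localhost:8081 akash.inventory.v1.ClusterRPC/QueryCluster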
andy108369 commented 1 day ago

The CPU value returned to normal, even without restarting the operator-inventory, after deleting the pods stuck in the "ContainerStatusUnknown" state:

$ provider_info2.sh provider.hurricane.akash.pub
PROVIDER INFO
BALANCE: 408.364243
"hostname"                      "address"
"provider.hurricane.akash.pub"  "akash15tl6v6gd0nte0syyxnv57zmmspgju4c3xfmdhk"

Total/Available/Used (t/a/u) per node:
"name"                   "cpu(t/a/u)"                                "gpu(t/a/u)"  "mem(t/a/u GiB)"       "ephemeral(t/a/u GiB)"
"control-01.hurricane2"  "2/1.2/0.8"                                 "0/0/0"       "1.82/1.69/0.13"       "25.54/25.54/0"
"worker-01.hurricane2"   "102/18446744073709490/-18446744073709384"  "1/1/0"       "196.45/49.67/146.78"  "1808.76/1435.28/373.48"

ACTIVE TOTAL:
"cpu(cores)"  "gpu"  "mem(GiB)"  "ephemeral(GiB)"  "beta1(GiB)"  "beta2(GiB)"  "beta3(GiB)"
34.2          0      64.88       314.4             0             0             11

PERSISTENT STORAGE:
"storage class"  "available space(GiB)"
"beta3"          575.7

PENDING TOTAL:
"cpu(cores)"  "gpu"  "mem(GiB)"  "ephemeral(GiB)"  "beta1(GiB)"  "beta2(GiB)"  "beta3(GiB)"

arno@x1:~$ kubectl get pods -A --field-selector status.phase=Failed
NAMESPACE                                       NAME                   READY   STATUS                   RESTARTS   AGE
2kmrpu3u3u5psi5hip2oqt0bajako0e9o9phc5j5c5dmu   web-5df9f7c798-bbgqz   0/1     ContainerStatusUnknown   1          2d22h
2kmrpu3u3u5psi5hip2oqt0bajako0e9o9phc5j5c5dmu   web-5df9f7c798-f2fpj   0/1     ContainerStatusUnknown   1          3d20h
2kmrpu3u3u5psi5hip2oqt0bajako0e9o9phc5j5c5dmu   web-5df9f7c798-g7xbd   0/1     ContainerStatusUnknown   1          3d3h
2kmrpu3u3u5psi5hip2oqt0bajako0e9o9phc5j5c5dmu   web-5df9f7c798-hv4qs   0/1     ContainerStatusUnknown   1          9h
2kmrpu3u3u5psi5hip2oqt0bajako0e9o9phc5j5c5dmu   web-5df9f7c798-p4h7j   0/1     ContainerStatusUnknown   1          4d7h
2kmrpu3u3u5psi5hip2oqt0bajako0e9o9phc5j5c5dmu   web-5df9f7c798-rcr45   0/1     ContainerStatusUnknown   1          30h
2kmrpu3u3u5psi5hip2oqt0bajako0e9o9phc5j5c5dmu   web-7f5fdfd87c-4cq86   0/1     ContainerStatusUnknown   1          20d
2kmrpu3u3u5psi5hip2oqt0bajako0e9o9phc5j5c5dmu   web-7f5fdfd87c-5ddrg   0/1     ContainerStatusUnknown   1          13d
2kmrpu3u3u5psi5hip2oqt0bajako0e9o9phc5j5c5dmu   web-7f5fdfd87c-7nl6p   0/1     ContainerStatusUnknown   1          5d7h
2kmrpu3u3u5psi5hip2oqt0bajako0e9o9phc5j5c5dmu   web-7f5fdfd87c-9jsn7   0/1     ContainerStatusUnknown   1          19d
2kmrpu3u3u5psi5hip2oqt0bajako0e9o9phc5j5c5dmu   web-7f5fdfd87c-bnjfh   0/1     ContainerStatusUnknown   1          20d
2kmrpu3u3u5psi5hip2oqt0bajako0e9o9phc5j5c5dmu   web-7f5fdfd87c-d2nfr   0/1     ContainerStatusUnknown   1          7d12h
2kmrpu3u3u5psi5hip2oqt0bajako0e9o9phc5j5c5dmu   web-7f5fdfd87c-dk95v   0/1     ContainerStatusUnknown   1          17d
2kmrpu3u3u5psi5hip2oqt0bajako0e9o9phc5j5c5dmu   web-7f5fdfd87c-fgfl4   0/1     ContainerStatusUnknown   1          7d19h
2kmrpu3u3u5psi5hip2oqt0bajako0e9o9phc5j5c5dmu   web-7f5fdfd87c-gh9bb   0/1     ContainerStatusUnknown   1          16d
2kmrpu3u3u5psi5hip2oqt0bajako0e9o9phc5j5c5dmu   web-7f5fdfd87c-gltgh   0/1     ContainerStatusUnknown   1          9d
2kmrpu3u3u5psi5hip2oqt0bajako0e9o9phc5j5c5dmu   web-7f5fdfd87c-j9tnr   0/1     ContainerStatusUnknown   1          15d
2kmrpu3u3u5psi5hip2oqt0bajako0e9o9phc5j5c5dmu   web-7f5fdfd87c-mmqfk   0/1     ContainerStatusUnknown   1          6d5h
2kmrpu3u3u5psi5hip2oqt0bajako0e9o9phc5j5c5dmu   web-7f5fdfd87c-ph89h   0/1     ContainerStatusUnknown   1          11d
2kmrpu3u3u5psi5hip2oqt0bajako0e9o9phc5j5c5dmu   web-7f5fdfd87c-pjrg4   0/1     ContainerStatusUnknown   1          17d
2kmrpu3u3u5psi5hip2oqt0bajako0e9o9phc5j5c5dmu   web-7f5fdfd87c-pwbzv   0/1     ContainerStatusUnknown   1          13d
2kmrpu3u3u5psi5hip2oqt0bajako0e9o9phc5j5c5dmu   web-7f5fdfd87c-rd7z5   0/1     ContainerStatusUnknown   1          12d
2kmrpu3u3u5psi5hip2oqt0bajako0e9o9phc5j5c5dmu   web-7f5fdfd87c-t6vt9   0/1     ContainerStatusUnknown   1          6d15h
2kmrpu3u3u5psi5hip2oqt0bajako0e9o9phc5j5c5dmu   web-7f5fdfd87c-vht5l   0/1     ContainerStatusUnknown   1          9d
2kmrpu3u3u5psi5hip2oqt0bajako0e9o9phc5j5c5dmu   web-7f5fdfd87c-wd8w4   0/1     ContainerStatusUnknown   1          7d23h
2kmrpu3u3u5psi5hip2oqt0bajako0e9o9phc5j5c5dmu   web-7f5fdfd87c-xnsvt   0/1     ContainerStatusUnknown   1          13d
2kmrpu3u3u5psi5hip2oqt0bajako0e9o9phc5j5c5dmu   web-7f5fdfd87c-zmzbf   0/1     ContainerStatusUnknown   1          12d
2kmrpu3u3u5psi5hip2oqt0bajako0e9o9phc5j5c5dmu   web-7f5fdfd87c-zw2st   0/1     ContainerStatusUnknown   1          10d

arno@x1:~$ ns=2kmrpu3u3u5psi5hip2oqt0bajako0e9o9phc5j5c5dmu
arno@x1:~$ kubectl -n $ns get deployment
NAME   READY   UP-TO-DATE   AVAILABLE   AGE
web    1/1     1            1           116d
arno@x1:~$ kubectl -n $ns get rs
NAME             DESIRED   CURRENT   READY   AGE
web-57478ff56c   0         0         0       4d17h
web-5df9f7c798   1         1         1       4d17h
web-7f5fdfd87c   0         0         0       53d
web-85fc6b7694   0         0         0       54d
web-85ff75fdc5   0         0         0       70d
arno@x1:~$ kubectl -n $ns delete rs web-85ff75fdc5
replicaset.apps "web-85ff75fdc5" deleted
arno@x1:~$ kubectl -n $ns delete rs web-85fc6b7694
replicaset.apps "web-85fc6b7694" deleted
arno@x1:~$ kubectl -n $ns delete rs web-7f5fdfd87c
replicaset.apps "web-7f5fdfd87c" deleted
arno@x1:~$ kubectl -n $ns delete rs web-57478ff56c
replicaset.apps "web-57478ff56c" deleted
arno@x1:~$ kubectl -n $ns get rs
NAME             DESIRED   CURRENT   READY   AGE
web-5df9f7c798   1         1         1       4d17h
arno@x1:~$ kubectl get pods -A --field-selector status.phase=Failed 
NAMESPACE                                       NAME                   READY   STATUS                   RESTARTS   AGE
2kmrpu3u3u5psi5hip2oqt0bajako0e9o9phc5j5c5dmu   web-5df9f7c798-bbgqz   0/1     ContainerStatusUnknown   1          2d22h
2kmrpu3u3u5psi5hip2oqt0bajako0e9o9phc5j5c5dmu   web-5df9f7c798-f2fpj   0/1     ContainerStatusUnknown   1          3d20h
2kmrpu3u3u5psi5hip2oqt0bajako0e9o9phc5j5c5dmu   web-5df9f7c798-g7xbd   0/1     ContainerStatusUnknown   1          3d3h
2kmrpu3u3u5psi5hip2oqt0bajako0e9o9phc5j5c5dmu   web-5df9f7c798-hv4qs   0/1     ContainerStatusUnknown   1          9h
2kmrpu3u3u5psi5hip2oqt0bajako0e9o9phc5j5c5dmu   web-5df9f7c798-p4h7j   0/1     ContainerStatusUnknown   1          4d7h
2kmrpu3u3u5psi5hip2oqt0bajako0e9o9phc5j5c5dmu   web-5df9f7c798-rcr45   0/1     ContainerStatusUnknown   1          30h
arno@x1:~$ kubectl delete pods -A --field-selector status.phase=Failed 
pod "web-5df9f7c798-bbgqz" deleted
pod "web-5df9f7c798-f2fpj" deleted
pod "web-5df9f7c798-g7xbd" deleted
pod "web-5df9f7c798-hv4qs" deleted
pod "web-5df9f7c798-p4h7j" deleted
pod "web-5df9f7c798-rcr45" deleted

After deleting the Failed pods, the available CPU value for worker-01 is back to normal:

$ provider_info2.sh provider.hurricane.akash.pub
PROVIDER INFO
BALANCE: 408.364243
"hostname"                      "address"
"provider.hurricane.akash.pub"  "akash15tl6v6gd0nte0syyxnv57zmmspgju4c3xfmdhk"

Total/Available/Used (t/a/u) per node:
"name"                   "cpu(t/a/u)"         "gpu(t/a/u)"  "mem(t/a/u GiB)"       "ephemeral(t/a/u GiB)"
"control-01.hurricane2"  "2/1.2/0.8"          "0/0/0"       "1.82/1.69/0.13"       "25.54/25.54/0"
"worker-01.hurricane2"   "102/47.995/54.005"  "1/1/0"       "196.45/104.36/92.09"  "1808.76/1489.97/318.79"

ACTIVE TOTAL:
"cpu(cores)"  "gpu"  "mem(GiB)"  "ephemeral(GiB)"  "beta1(GiB)"  "beta2(GiB)"  "beta3(GiB)"
34.2          0      64.88       314.4             0             0             11

PERSISTENT STORAGE:
"storage class"  "available space(GiB)"
"beta3"          575.7

PENDING TOTAL:
"cpu(cores)"  "gpu"  "mem(GiB)"  "ephemeral(GiB)"  "beta1(GiB)"  "beta2(GiB)"  "beta3(GiB)"