Closed by andy108369, 2 months ago
NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
node1 Ready control-plane 26d v1.28.6 192.168.0.100 <none> Ubuntu 22.04.4 LTS 5.15.0-97-generic containerd://1.7.13
node2 Ready control-plane 26d v1.28.6 192.168.0.101 <none> Ubuntu 22.04.4 LTS 5.15.0-97-generic containerd://1.7.13
node3 Ready <none> 26d v1.28.6 192.168.0.102 <none> Ubuntu 22.04.4 LTS 5.15.0-97-generic containerd://1.7.13
NAME READY STATUS RESTARTS AGE
akash-provider-0 1/1 Running 0 16h
akash-node-9.0.0 0.30.0
provider-9.1.2 0.5.4
akash-hostname-operator-9.0.5 0.5.4
akash-inventory-operator-9.0.6 0.5.4
ingress-nginx-4.10.0 1.10.0
rook-ceph-v1.13.4 v1.13.4
rook-ceph-cluster-v1.13.4 v1.13.4
PROVIDER INFO
"hostname" "address"
"provider.sg.lnlm.akash.pub" "akash1zsdzjknq6u475ul8ef4gxh527kz82k6jph8vrs"
Total/Allocatable/Used (t/a/u) per node:
"name" "cpu(t/a/u)" "gpu(t/a/u)" "mem(t/a/u GiB)" "ephemeral(t/a/u GiB)"
"node1" "64/18446744073709356/-18446744073709292" "0/0/0" "251.45/17179868409.45/-17179868158.01" "395.37/395.37/0"
"node2" "64/18446744073709400/-18446744073709336" "0/0/0" "251.45/17179868584.58/-17179868333.13" "395.37/395.37/0"
"node3" "64/18446744073709400/-18446744073709336" "0/0/0" "251.45/17179868583.06/-17179868331.61" "395.37/395.37/0"
ACTIVE TOTAL:
"cpu(cores)" "gpu" "mem(GiB)" "ephemeral(GiB)" "beta1(GiB)" "beta2(GiB)" "beta3(GiB)"
31.5 0 126 0 0 0 31.5
PERSISTENT STORAGE:
"storage class" "available space(GiB)"
"beta3" 1661.88
PENDING TOTAL:
"cpu(cores)" "gpu" "mem(GiB)" "ephemeral(GiB)" "beta1(GiB)" "beta2(GiB)" "beta3(GiB)"
$ kubectl -n akash-services logs operator-inventory-bb568b575-dtflg |grep -v 'MODIFIED monitoring CephCluster'
I[2024-03-12|18:07:29.163] using in cluster kube config cmp=provider
INFO rook-ceph ADDED monitoring CephCluster {"ns": "rook-ceph", "name": "rook-ceph"}
INFO watcher.storageclasses started
INFO nodes.nodes waiting for nodes to finish
INFO grpc listening on ":8081"
INFO watcher.config started
INFO rest listening on ":8080"
INFO rook-ceph ADDED monitoring StorageClass {"name": "beta3"}
INFO nodes.node.discovery starting hardware discovery pod {"node": "node2"}
INFO nodes.node.discovery starting hardware discovery pod {"node": "node3"}
INFO nodes.node.monitor starting {"node": "node2"}
INFO nodes.node.monitor starting {"node": "node1"}
INFO nodes.node.monitor starting {"node": "node3"}
INFO nodes.node.discovery starting hardware discovery pod {"node": "node1"}
INFO rancher ADDED monitoring StorageClass {"name": "beta3"}
INFO nodes.node.discovery started hardware discovery pod {"node": "node3"}
INFO nodes.node.discovery started hardware discovery pod {"node": "node2"}
ERROR nodes.node.monitor unable to query cpu {"error": "error trying to reach service: dial tcp 10.233.75.4:8081: connect: invalid argument"}
ERROR nodes.node.monitor unable to query gpu {"error": "error trying to reach service: dial tcp 10.233.75.4:8081: connect: invalid argument"}
INFO nodes.node.monitor started {"node": "node2"}
INFO nodes.node.monitor started {"node": "node3"}
INFO nodes.node.discovery started hardware discovery pod {"node": "node1"}
INFO nodes.node.monitor started {"node": "node1"}
$ kubectl -n akash-services get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
akash-node-1-0 1/1 Running 1 (12d ago) 27d 10.233.71.36 node3 <none> <none>
akash-provider-0 1/1 Running 0 20h 10.233.71.58 node3 <none> <none>
operator-hostname-cdb556d74-x9kb6 1/1 Running 0 8d 10.233.102.158 node1 <none> <none>
operator-inventory-bb568b575-dtflg 1/1 Running 0 20h 10.233.75.5 node2 <none> <none>
operator-inventory-hardware-discovery-node1 1/1 Running 0 20h 10.233.102.143 node1 <none> <none>
operator-inventory-hardware-discovery-node2 1/1 Running 0 20h 10.233.75.4 node2 <none> <none>
operator-inventory-hardware-discovery-node3 1/1 Running 0 20h 10.233.71.50 node3 <none> <none>
$ kubectl -n akash-services logs operator-inventory-hardware-discovery-node2
listening on :8081
$
$ kubectl rollout restart deployment/operator-inventory -n akash-services
deployment.apps/operator-inventory restarted
$ kubectl -n akash-services get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
akash-node-1-0 1/1 Running 1 (12d ago) 27d 10.233.71.36 node3 <none> <none>
akash-provider-0 1/1 Running 0 20h 10.233.71.58 node3 <none> <none>
operator-hostname-cdb556d74-x9kb6 1/1 Running 0 8d 10.233.102.158 node1 <none> <none>
operator-inventory-7b5cb44f6c-9w5dn 1/1 Running 0 5s 10.233.75.32 node2 <none> <none>
operator-inventory-hardware-discovery-node1 1/1 Running 0 3s 10.233.102.187 node1 <none> <none>
operator-inventory-hardware-discovery-node2 1/1 Running 0 3s 10.233.75.8 node2 <none> <none>
operator-inventory-hardware-discovery-node3 1/1 Running 0 3s 10.233.71.45 node3 <none> <none>
$ kubectl -n akash-services logs deployment/operator-inventory -f | grep -v rook
I[2024-03-13|14:34:08.008] using in cluster kube config cmp=provider
INFO nodes.nodes waiting for nodes to finish
INFO rest listening on ":8080"
INFO watcher.storageclasses started
INFO watcher.config started
INFO grpc listening on ":8081"
INFO nodes.node.discovery starting hardware discovery pod {"node": "node1"}
INFO nodes.node.monitor starting {"node": "node2"}
INFO nodes.node.monitor starting {"node": "node3"}
INFO nodes.node.monitor starting {"node": "node1"}
INFO nodes.node.discovery starting hardware discovery pod {"node": "node2"}
INFO nodes.node.discovery starting hardware discovery pod {"node": "node3"}
INFO rancher ADDED monitoring StorageClass {"name": "beta3"}
INFO nodes.node.discovery started hardware discovery pod {"node": "node1"}
ERROR nodes.node.monitor unable to query cpu {"error": "error trying to reach service: dial tcp 10.233.102.187:8081: connect: connection refused"}
ERROR nodes.node.monitor unable to query gpu {"error": "error trying to reach service: dial tcp 10.233.102.187:8081: connect: connection refused"}
INFO nodes.node.monitor started {"node": "node1"}
INFO nodes.node.discovery started hardware discovery pod {"node": "node2"}
INFO nodes.node.discovery started hardware discovery pod {"node": "node3"}
INFO nodes.node.monitor started {"node": "node3"}
INFO nodes.node.monitor started {"node": "node2"}
recovered:
$ provider_info2.sh provider.sg.lnlm.akash.pub
PROVIDER INFO
"hostname" "address"
"provider.sg.lnlm.akash.pub" "akash1zsdzjknq6u475ul8ef4gxh527kz82k6jph8vrs"
Total/Allocatable/Used (t/a/u) per node:
"name" "cpu(t/a/u)" "gpu(t/a/u)" "mem(t/a/u GiB)" "ephemeral(t/a/u GiB)"
"node1" "64/46.53/17.47" "0/0/0" "251.45/193.45/57.99" "395.37/395.37/0"
"node2" "64/46.6/17.4" "0/0/0" "251.45/198.58/52.87" "395.37/395.37/0"
"node3" "64/45.995/18.005" "0/0/0" "251.45/197.06/54.39" "395.37/395.37/0"
ACTIVE TOTAL:
"cpu(cores)" "gpu" "mem(GiB)" "ephemeral(GiB)" "beta1(GiB)" "beta2(GiB)" "beta3(GiB)"
31.5 0 126 0 0 0 31.5
PERSISTENT STORAGE:
"storage class" "available space(GiB)"
"beta3" 1663.24
PENDING TOTAL:
"cpu(cores)" "gpu" "mem(GiB)" "ephemeral(GiB)" "beta1(GiB)" "beta2(GiB)" "beta3(GiB)"
The mon.obl provider reports an excessively large GPU count for node2. There was a network attack on this provider earlier today, and node2 was powered off for an unknown reason.
Here is the current state:
$ kubectl -n akash-services get pods -l app.kubernetes.io/name=inventory
NAME READY STATUS RESTARTS AGE
operator-inventory-bb568b575-mmcjp 1/1 Running 2 (18h ago) 2d3h
operator-inventory-hardware-discovery-node1 1/1 Running 0 18h
operator-inventory-hardware-discovery-node10 1/1 Running 0 18h
operator-inventory-hardware-discovery-node11 1/1 Running 0 18h
operator-inventory-hardware-discovery-node12 1/1 Running 0 18h
operator-inventory-hardware-discovery-node13 1/1 Running 0 18h
operator-inventory-hardware-discovery-node14 1/1 Running 0 18h
operator-inventory-hardware-discovery-node15 1/1 Running 0 18h
operator-inventory-hardware-discovery-node16 1/1 Running 0 18h
operator-inventory-hardware-discovery-node2 1/1 Running 0 6h15m
operator-inventory-hardware-discovery-node3 1/1 Running 0 18h
operator-inventory-hardware-discovery-node4 1/1 Running 0 18h
operator-inventory-hardware-discovery-node5 1/1 Running 0 18h
operator-inventory-hardware-discovery-node6 1/1 Running 0 18h
operator-inventory-hardware-discovery-node7 1/1 Running 0 18h
operator-inventory-hardware-discovery-node8 1/1 Running 0 18h
operator-inventory-hardware-discovery-node9 1/1 Running 0 18h
$ kubectl -n akash-services logs deployment/operator-inventory | grep -v 'MODIFIED monitoring CephCluster'
I[2024-03-14|04:22:51.569] using in cluster kube config cmp=provider
INFO nodes.nodes waiting for nodes to finish
INFO watcher.storageclasses started
INFO rest listening on ":8080"
INFO grpc listening on ":8081"
INFO watcher.config started
INFO nodes.node.monitor starting {"node": "node10"}
INFO nodes.node.monitor starting {"node": "node1"}
INFO nodes.node.monitor starting {"node": "node12"}
INFO nodes.node.discovery starting hardware discovery pod {"node": "node11"}
INFO nodes.node.monitor starting {"node": "node11"}
INFO nodes.node.discovery starting hardware discovery pod {"node": "node1"}
INFO nodes.node.discovery starting hardware discovery pod {"node": "node12"}
INFO nodes.node.monitor starting {"node": "node14"}
INFO nodes.node.discovery starting hardware discovery pod {"node": "node13"}
INFO nodes.node.discovery starting hardware discovery pod {"node": "node14"}
INFO nodes.node.monitor starting {"node": "node13"}
INFO nodes.node.discovery starting hardware discovery pod {"node": "node10"}
INFO nodes.node.monitor starting {"node": "node16"}
INFO nodes.node.monitor starting {"node": "node15"}
INFO nodes.node.monitor starting {"node": "node2"}
INFO nodes.node.discovery starting hardware discovery pod {"node": "node2"}
INFO nodes.node.monitor starting {"node": "node3"}
INFO nodes.node.discovery starting hardware discovery pod {"node": "node3"}
INFO nodes.node.discovery starting hardware discovery pod {"node": "node16"}
INFO nodes.node.monitor starting {"node": "node4"}
INFO nodes.node.discovery starting hardware discovery pod {"node": "node4"}
INFO nodes.node.discovery starting hardware discovery pod {"node": "node15"}
INFO nodes.node.monitor starting {"node": "node5"}
INFO nodes.node.discovery starting hardware discovery pod {"node": "node5"}
INFO nodes.node.monitor starting {"node": "node6"}
INFO nodes.node.discovery starting hardware discovery pod {"node": "node6"}
INFO nodes.node.monitor starting {"node": "node7"}
INFO nodes.node.discovery starting hardware discovery pod {"node": "node7"}
INFO nodes.node.monitor starting {"node": "node9"}
INFO nodes.node.monitor starting {"node": "node8"}
INFO nodes.node.discovery starting hardware discovery pod {"node": "node9"}
INFO nodes.node.discovery starting hardware discovery pod {"node": "node8"}
INFO nodes.node.discovery started hardware discovery pod {"node": "node10"}
INFO nodes.node.discovery started hardware discovery pod {"node": "node3"}
INFO nodes.node.discovery started hardware discovery pod {"node": "node9"}
INFO nodes.node.discovery started hardware discovery pod {"node": "node5"}
INFO nodes.node.monitor started {"node": "node9"}
INFO nodes.node.discovery started hardware discovery pod {"node": "node14"}
INFO nodes.node.discovery started hardware discovery pod {"node": "node13"}
INFO nodes.node.discovery started hardware discovery pod {"node": "node7"}
INFO nodes.node.discovery started hardware discovery pod {"node": "node1"}
INFO nodes.node.discovery started hardware discovery pod {"node": "node4"}
INFO nodes.node.discovery started hardware discovery pod {"node": "node16"}
INFO nodes.node.discovery started hardware discovery pod {"node": "node11"}
INFO nodes.node.monitor started {"node": "node13"}
INFO nodes.node.discovery started hardware discovery pod {"node": "node8"}
INFO nodes.node.discovery started hardware discovery pod {"node": "node12"}
INFO nodes.node.monitor started {"node": "node11"}
INFO nodes.node.discovery started hardware discovery pod {"node": "node6"}
INFO nodes.node.monitor started {"node": "node7"}
INFO nodes.node.monitor started {"node": "node1"}
INFO nodes.node.discovery started hardware discovery pod {"node": "node15"}
INFO nodes.node.monitor started {"node": "node10"}
INFO nodes.node.monitor started {"node": "node12"}
INFO nodes.node.monitor started {"node": "node3"}
INFO nodes.node.monitor started {"node": "node15"}
INFO nodes.node.monitor started {"node": "node14"}
INFO nodes.node.monitor started {"node": "node6"}
INFO nodes.node.monitor started {"node": "node4"}
INFO nodes.node.monitor started {"node": "node8"}
INFO nodes.node.monitor started {"node": "node5"}
INFO nodes.node.monitor started {"node": "node16"}
ERROR watcher.registry couldn't query inventory registry {"error": "Get \"https://provider-configs.akash.network/devices/gpus\": read tcp 10.233.74.86:39682->172.64.80.1:443: read: connection reset by peer"}
ERROR watcher.registry couldn't query inventory registry {"error": "Get \"https://provider-configs.akash.network/devices/gpus\": dial tcp: lookup provider-configs.akash.network on 169.254.25.10:53: read udp 10.233.74.86:58858->169.254.25.10:53: i/o timeout"}
ERROR watcher.registry couldn't query inventory registry {"error": "Get \"https://provider-configs.akash.network/devices/gpus\": read tcp 10.233.74.86:58328->172.64.80.1:443: read: connection reset by peer"}
INFO nodes.node.discovery started hardware discovery pod {"node": "node2"}
INFO nodes.node.monitor started {"node": "node2"}
After bouncing the inventory operator, it normalized.
It seems that nvdp-nvidia-device-plugin-dgfdg did not have enough time to fully initialize before operator-inventory-hardware-discovery-node2 assessed the number of GPUs available on node2.
I'm also having some weird issues: when this happens I cannot bid for GPUs on a different node. Fixing it requires bouncing the operator-inventory.
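If the working theory is that the device plugin had not finished registering when the discovery pod sampled the node, one way to see what the kubelet is advertising at that moment is to read the node's allocatable directly (node name is from this report; the jq filter is my own illustration):

```shell
# Print the GPU count node2's kubelet currently advertises; an absent
# entry (shown as "0" here) while nvidia-device-plugin is still starting
# would explain the discovery pod miscounting GPUs.
kubectl get node node2 -o json \
  | jq -r '.status.allocatable["nvidia.com/gpu"] // "0"'
```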
PROVIDER INFO
"hostname" "address"
"provider.pcgameservers.com" "akash17l0f3kf7gv4kmgqjmgc0ksj3em6lqgcc4kl4dg"
Total/Allocatable/Used (t/a/u) per node:
"name" "cpu(t/a/u)" "gpu(t/a/u)" "mem(t/a/u GiB)" "ephemeral(t/a/u GiB)"
"node1" "8/5.88/2.12" "0/0/0" "7.51/5.87/1.64" "43.13/43.13/0"
"node2" "48/38.575/9.425" "4/0/4" "115.12/66.07/49.04" "586.82/536.82/50"
"node3" "128/111.025/16.975" "0/0/0" "143.76/96.29/47.47" "352.06/193.06/159"
"node4" "128/44.145/83.855" "1/1/0" "52.58/17179869145.17/-17179869092.59" "290.06/110.44/179.62"
"node5" "8/3.825/4.175" "2/1/1" "52.57/36.05/16.52" "453.94/412.03/41.91"
"node6" "32/18.425/13.575" "1/0/1" "47.8/30.14/17.66" "175.12/155.12/20"
"node7" "256/132.275/123.725" "3/1/2" "288.16/204.36/83.8" "352.06/287.56/64.5"
Narrowing the issue down based on the providers' uptime (~5 days): it appears that only providers that have or had nvdp/nvidia-device-plugin installed are experiencing this issue.
A couple of additional observations:
kubectl rollout restart deployment/operator-inventory -n akash-services
ERROR watcher.registry couldn't query pci.ids {"error": "Get \"\": unsupported protocol scheme \"\""}
Now the Hurricane provider keeps reporting 18446744073709524 allocatable CPUs even after I restart the inventory-operator, which until now usually fixed the issue temporarily.
New issue after restarting worker node node5, which had akash-provider-0 and operator-inventory running on it. I have 11 GPUs total; the status says 7 are active, but the inventory says all 11 GPUs are used and 0 GPUs are pending. Fixed by bouncing akash-provider-0 and operator-inventory, after which the inventory showed correctly again.
PROVIDER INFO
"hostname" "address"
"provider.pcgameservers.com" "akash17l0f3kf7gv4kmgqjmgc0ksj3em6lqgcc4kl4dg"
Total/Allocatable/Used (t/a/u) per node:
"name" "cpu(t/a/u)" "gpu(t/a/u)" "mem(t/a/u GiB)" "ephemeral(t/a/u GiB)"
"node1" "8/0.38/7.62" "0/0/0" "7.51/0.37/7.14" "43.13/37.63/5.5"
"node2" "48/15.575/32.425" "4/0/4" "115.12/40.12/74.99" "586.82/248.06/338.76"
"node3" "128/114.025/13.975" "0/0/0" "143.76/97.89/45.87" "352.06/249.93/102.13"
"node4" "128/108.795/19.205" "1/0/1" "52.58/12.05/40.52" "290.06/97.62/192.44"
"node5" "8/0.525/7.475" "2/0/2" "52.57/27.37/25.2" "453.94/284.24/169.7"
"node6" "32/18.425/13.575" "1/0/1" "47.8/30.14/17.66" "175.12/155.12/20"
"node7" "256/108.275/147.725" "3/0/3" "288.16/161.36/126.8" "352.06/217.56/134.5"
ACTIVE TOTAL:
"cpu(cores)" "gpu" "mem(GiB)" "ephemeral(GiB)" "beta1(GiB)" "beta2(GiB)" "beta3(GiB)"
166.5 7 174.31 290.07 0 0 52.5
PERSISTENT STORAGE:
"storage class" "available space(GiB)"
"beta3" 356.66
PENDING TOTAL:
"cpu(cores)" "gpu" "mem(GiB)" "ephemeral(GiB)" "beta1(GiB)" "beta2(GiB)" "beta3(GiB)"
15 0 0.5 0.5 0 0 0
Inventory completely stopped working.
{"cluster":{"leases":12,"inventory":{"active":[{"cpu":4000,"gpu":4,"memory":37580963840,"storage_ephemeral":53687091200},{"cpu":1500,"gpu":0,"memory":5368709120,"storage_ephemeral":8589934592},{"cpu":1000,"gpu":0,"memory":2147483648,"storage_ephemeral":1073741824,"storage":{"beta3":1073741824}},{"cpu":2000,"gpu":0,"memory":16000000000,"storage_ephemeral":100000000000},{"cpu":128000,"gpu":0,"memory":34359738368,"storage_ephemeral":32212254720},{"cpu":4000,"gpu":0,"memory":12884901888,"storage_ephemeral":1610612736},{"cpu":1000,"gpu":0,"memory":8000000000,"storage_ephemeral":30000000000},{"cpu":2000,"gpu":2,"memory":37580963840,"storage_ephemeral":53687091200},{"cpu":4000,"gpu":0,"memory":8589934592,"storage_ephemeral":2147483648,"storage":{"beta3":10737418240}},{"cpu":12000,"gpu":1,"memory":17179869184,"storage_ephemeral":21474836480},{"cpu":5000,"gpu":0,"memory":5368709120,"storage_ephemeral":5368709120,"storage":{"beta3":42949672960}},{"cpu":2000,"gpu":0,"memory":2097741824,"storage_ephemeral":1610612736,"storage":{"beta3":1610612736}}],"available":{"nodes":[{"name":"node1","allocatable":{"cpu":8000,"gpu":0,"memory":8068288512,"storage_ephemeral":46314425473},"available":{"cpu":380,"gpu":0,"memory":400015360,"storage_ephemeral":40408845441}},{"name":"node2","allocatable":{"cpu":48000,"gpu":4,"memory":123604434944,"storage_ephemeral":630096038893},"available":{"cpu":15575,"gpu":0,"memory":43081193472,"storage_ephemeral":266350410733}},{"name":"node3","allocatable":{"cpu":128000,"gpu":0,"memory":154365534208,"storage_ephemeral":378025411573},"available":{"cpu":114025,"gpu":0,"memory":105111463936,"storage_ephemeral":268361735157}},{"name":"node4","allocatable":{"cpu":128000,"gpu":1,"memory":56455852032,"storage_ephemeral":311444659299},"available":{"cpu":108795,"gpu":0,"memory":12943570944,"storage_ephemeral":104814129251}},{"name":"node5","allocatable":{"cpu":8000,"gpu":2,"memory":56443244544,"storage_ephemeral":487414664409},"available":{"cpu":525,"gpu":0,"memory":
29388990464,"storage_ephemeral":305202409689}},{"name":"node6","allocatable":{"cpu":32000,"gpu":1,"memory":51326119936,"storage_ephemeral":188036982064},"available":{"cpu":18425,"gpu":0,"memory":32362043392,"storage_ephemeral":166562145584}},{"name":"node7","allocatable":{"cpu":256000,"gpu":3,"memory":309405798400,"storage_ephemeral":378025411573},"available":{"cpu":108275,"gpu":0,"memory":173257062400,"storage_ephemeral":233607136245}}],"storage":[{"class":"beta3","size":382961909760}]}}},"bidengine":{"orders":0},"manifest":{"deployments":0},"cluster_public_hostname":"provider.pcgameservers.com","address":"akash17l0f3kf7gv4kmgqjmgc0ksj3em6lqgcc4kl4dg"}
After restarting both akash-provider-0 and operator-inventory:
PROVIDER INFO
"hostname" "address"
"provider.pcgameservers.com" "akash17l0f3kf7gv4kmgqjmgc0ksj3em6lqgcc4kl4dg"
Total/Allocatable/Used (t/a/u) per node:
"name" "cpu(t/a/u)" "gpu(t/a/u)" "mem(t/a/u GiB)" "ephemeral(t/a/u GiB)"
"node1" "8/5.38/2.62" "0/0/0" "7.51/5.62/1.89" "43.13/43.13/0"
"node2" "48/38.575/9.425" "4/0/4" "115.12/66.07/49.04" "586.82/536.82/50"
"node3" "128/114.025/13.975" "0/0/0" "143.76/97.89/45.87" "352.06/249.93/102.13"
"node4" "128/93.795/34.205" "1/1/0" "52.58/7.55/45.02" "290.06/227.62/62.44"
"node5" "8/6.025/1.975" "2/2/0" "55.45/53.45/2" "453.94/453.94/0"
"node6" "32/18.425/13.575" "1/0/1" "47.8/30.14/17.66" "175.12/155.12/20"
"node7" "256/114.275/141.725" "3/1/2" "288.16/196.36/91.8" "352.06/267.56/84.5"
ACTIVE TOTAL:
"cpu(cores)" "gpu" "mem(GiB)" "ephemeral(GiB)" "beta1(GiB)" "beta2(GiB)" "beta3(GiB)"
166.5 7 174.31 290.07 0 0 52.5
PERSISTENT STORAGE:
"storage class" "available space(GiB)"
"beta3" 396.65
PENDING TOTAL:
"cpu(cores)" "gpu" "mem(GiB)" "ephemeral(GiB)" "beta1(GiB)" "beta2(GiB)" "beta3(GiB)"
Fixed the Hurricane reporting. Possibly it was caused by some of the deployments in the Failed state.
I've cleaned them up, after which reporting looks good:
arno@x1:~$ kubectl get pods -A --sort-by='{.metadata.creationTimestamp}' -o wide --field-selector status.phase=Failed
NAMESPACE NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
hg49sq80mpk3e7q7m43asnrhe1tu9639usr0psr7fkq7m miner-xmrig-7877f4f8d9-9txlz 0/1 Error 1 17d <none> worker-01.hurricane2 <none> <none>
2kmrpu3u3u5psi5hip2oqt0bajako0e9o9phc5j5c5dmu web-f7785f6c6-p6m69 0/1 ContainerStatusUnknown 2 (13d ago) 14d <none> worker-01.hurricane2 <none> <none>
2kmrpu3u3u5psi5hip2oqt0bajako0e9o9phc5j5c5dmu web-f7785f6c6-twxrr 0/1 ContainerStatusUnknown 1 (11d ago) 11d <none> worker-01.hurricane2 <none> <none>
2kmrpu3u3u5psi5hip2oqt0bajako0e9o9phc5j5c5dmu web-f7785f6c6-5dgbl 0/1 ContainerStatusUnknown 2 (6d22h ago) 7d16h <none> worker-01.hurricane2 <none> <none>
2kmrpu3u3u5psi5hip2oqt0bajako0e9o9phc5j5c5dmu web-f7785f6c6-95b2k 0/1 ContainerStatusUnknown 1 5d6h <none> worker-01.hurricane2 <none> <none>
2kmrpu3u3u5psi5hip2oqt0bajako0e9o9phc5j5c5dmu web-f7785f6c6-9nh5b 0/1 ContainerStatusUnknown 1 4d17h <none> worker-01.hurricane2 <none> <none>
arno@x1:~$ kubectl -n 2kmrpu3u3u5psi5hip2oqt0bajako0e9o9phc5j5c5dmu get rs
NAME DESIRED CURRENT READY AGE
web-f7785f6c6 1 1 1 14d
arno@x1:~$ kubectl -n 2kmrpu3u3u5psi5hip2oqt0bajako0e9o9phc5j5c5dmu get pods
NAME READY STATUS RESTARTS AGE
web-f7785f6c6-2r2j6 0/1 Completed 0 2d11h
web-f7785f6c6-4m462 0/1 Completed 2 (5d18h ago) 6d10h
web-f7785f6c6-5dgbl 0/1 ContainerStatusUnknown 2 (6d22h ago) 7d16h
web-f7785f6c6-95b2k 0/1 ContainerStatusUnknown 1 5d6h
web-f7785f6c6-9nh5b 0/1 ContainerStatusUnknown 1 4d17h
web-f7785f6c6-dsjp8 0/1 Completed 7 (8d ago) 11d
web-f7785f6c6-fl49h 0/1 Completed 0 3d4h
web-f7785f6c6-g2sfx 0/1 Completed 2 (12d ago) 12d
web-f7785f6c6-j2prf 1/1 Running 4 (86m ago) 2d1h
web-f7785f6c6-p6m69 0/1 ContainerStatusUnknown 2 (13d ago) 14d
web-f7785f6c6-pk98k 0/1 Completed 3 (3d13h ago) 4d5h
web-f7785f6c6-q89gg 0/1 Completed 0 8d
web-f7785f6c6-twxrr 0/1 ContainerStatusUnknown 1 (11d ago) 11d
web-f7785f6c6-z8w9f 0/1 Completed 0 2d20h
arno@x1:~$ kubectl -n hg49sq80mpk3e7q7m43asnrhe1tu9639usr0psr7fkq7m get rs
NAME DESIRED CURRENT READY AGE
miner-xmrig-7877f4f8d9 1 1 1 21d
arno@x1:~$ kubectl -n hg49sq80mpk3e7q7m43asnrhe1tu9639usr0psr7fkq7m get pods
NAME READY STATUS RESTARTS AGE
miner-xmrig-7877f4f8d9-8mfmt 1/1 Running 1 (86m ago) 7d15h
miner-xmrig-7877f4f8d9-9txlz 0/1 Error 1 17d
arno@x1:~$ kubectl delete pods -A --field-selector status.phase=Failed
pod "web-f7785f6c6-5dgbl" deleted
pod "web-f7785f6c6-95b2k" deleted
pod "web-f7785f6c6-9nh5b" deleted
pod "web-f7785f6c6-p6m69" deleted
pod "web-f7785f6c6-twxrr" deleted
pod "miner-xmrig-7877f4f8d9-9txlz" deleted
arno@x1:~$ kubectl -n hg49sq80mpk3e7q7m43asnrhe1tu9639usr0psr7fkq7m get pods
NAME READY STATUS RESTARTS AGE
miner-xmrig-7877f4f8d9-8mfmt 1/1 Running 1 (86m ago) 7d15h
arno@x1:~$ kubectl -n 2kmrpu3u3u5psi5hip2oqt0bajako0e9o9phc5j5c5dmu get pods
NAME READY STATUS RESTARTS AGE
web-f7785f6c6-2r2j6 0/1 Completed 0 2d11h
web-f7785f6c6-4m462 0/1 Completed 2 (5d18h ago) 6d10h
web-f7785f6c6-dsjp8 0/1 Completed 7 (8d ago) 11d
web-f7785f6c6-fl49h 0/1 Completed 0 3d4h
web-f7785f6c6-g2sfx 0/1 Completed 2 (12d ago) 12d
web-f7785f6c6-j2prf 1/1 Running 4 (86m ago) 2d1h
web-f7785f6c6-pk98k 0/1 Completed 3 (3d13h ago) 4d5h
web-f7785f6c6-q89gg 0/1 Completed 0 8d
web-f7785f6c6-z8w9f 0/1 Completed 0 2d20h
$ provider_info2.sh provider.hurricane.akash.pub
PROVIDER INFO
"hostname" "address"
"provider.hurricane.akash.pub" "akash15tl6v6gd0nte0syyxnv57zmmspgju4c3xfmdhk"
Total/Allocatable/Used (t/a/u) per node:
"name" "cpu(t/a/u)" "gpu(t/a/u)" "mem(t/a/u GiB)" "ephemeral(t/a/u GiB)"
"control-01.hurricane2" "2/1.2/0.8" "0/0/0" "1.82/1.69/0.13" "25.54/25.54/0"
"worker-01.hurricane2" "102/18.795/83.205" "1/1/0" "196.45/102.2/94.25" "1808.76/1548.18/260.58"
ACTIVE TOTAL:
"cpu(cores)" "gpu" "mem(GiB)" "ephemeral(GiB)" "beta1(GiB)" "beta2(GiB)" "beta3(GiB)"
63 0 59.73 236.68 0 0 0
PERSISTENT STORAGE:
"storage class" "available space(GiB)"
"beta3" 276.44
PENDING TOTAL:
"cpu(cores)" "gpu" "mem(GiB)" "ephemeral(GiB)" "beta1(GiB)" "beta2(GiB)" "beta3(GiB)"
1.8 0 8 10 0 0 0
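Until the operator tolerates Failed pods on its own, the cleanup step above can be automated. A minimal sketch as a cron entry (the file path and 10-minute schedule are my own choices, not an official recommendation):

```shell
# /etc/cron.d/akash-failed-pod-cleanup (example path): delete pods stuck
# in the Failed phase (Error / ContainerStatusUnknown) cluster-wide so
# they cannot skew operator-inventory accounting.
*/10 * * * * root kubectl delete pods -A --field-selector status.phase=Failed --ignore-not-found >/dev/null 2>&1
```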
This still appears to be an issue with provider-services v0.5.9 & v0.5.11:
$ curl -s -k https://provider.sg.lnlm.akash.pub:8443/status | jq -r . | grep -C1 1844
"available": {
"cpu": 18446744073709520000,
"gpu": 0,
"memory": 18446743950448112000,
"storage_ephemeral": 424525602114
--
"available": {
"cpu": 18446744073709496000,
"gpu": 0,
"memory": 18446743844816663000,
"storage_ephemeral": 424525602114
--
"available": {
"cpu": 18446744073709537000,
"gpu": 0,
"memory": 18446744023303657000,
"storage_ephemeral": 424525602114
(pdx.nb.akash.pub 4090s provider) It appears this has something to do with the failing pods too. For instance: 3 nodes with 8x 4090s each. node1 and node2 had the wrong nvidia.ko driver version installed (550 instead of 535). I reinstalled it while these deployments were running and restarted all 3 nodes.
This left some pods stuck in the ContainerStatusUnknown state:
$ kubectl get pods -A --sort-by='{.metadata.creationTimestamp}' -o wide --field-selector status.phase=Failed
NAMESPACE NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
3n3mvl6qqh1bkk41pou3dkfttkjglo6tua36udmh6n4fm service-1-dd746bf44-4745c 0/1 ContainerStatusUnknown 0 43m <none> node1 <none> <none>
1qhlsoi0sqj2rot1otov7vhfao2j0cnmbuvkj2qd16ese service-1-76f8b9cf6d-dp7x2 0/1 ContainerStatusUnknown 0 33m <none> node2 <none> <none>
va7h4phadmnfld29qd5rdtdsgk6eupf39d6etke05t0fu service-1-7c59bdb7df-j586f 0/1 ContainerStatusUnknown 0 32m <none> node2 <none> <none>
qg9lq6q8tcta1p2m9fuc1pdbjfispht8q7e7iun6t5s2e service-1-59974dfd89-sgvjg 0/1 Unknown 0 32m <none> node2 <none> <none>
6ng2gu6vf5p8qg5bde5udse1e34igb1bn15kaeupiuhva service-1-59c44cd758-mzvsj 0/1 Unknown 0 30m <none> node2 <none> <none>
d62adnou0v7b5s7h3t8gnh0av540fcok9bk56u72f3je2 service-1-7cffd45f48-w25vt 0/1 ContainerStatusUnknown 0 28m <none> node2 <none> <none>
of3uincqjlja5ekk8cbfuormpm4dmn8v403c535f4dc4m service-1-7dd5dffbc4-brgvg 0/1 ContainerStatusUnknown 0 25m <none> node1 <none> <none>
rfij4esvggf9cqqnpf2hq266o0nba01t5iq918bu1v9iu service-1-58bf676fdc-ph4f9 0/1 ContainerStatusUnknown 0 23m <none> node2 <none> <none>
6pgohd98lm7gs5rb2kv5bnc4c9920jtvfvmg4ikqvhn8a service-1-55858fc545-6fr5l 0/1 ContainerStatusUnknown 0 22m <none> node1 <none> <none>
guf6r2fhenpfljbhncip9sbei3ss43av4kaau95kl4rpq service-1-598c857c89-wtkvh 0/1 ContainerStatusUnknown 0 22m <none> node2 <none> <none>
tsu3ue9housp0ehjsr51psu4aambpaqvtpuninpl07hqs service-1-55fd66f6f5-2hfqv 0/1 ContainerStatusUnknown 0 21m <none> node1 <none> <none>
qrka4mab6esns6blt8jaeos663j0e6sbp9cfghi02jc4i service-1-84c67b446-5v55h 0/1 ContainerStatusUnknown 0 20m <none> node1 <none> <none>
eogo7cjtebo8fr0g9l1mfmo17j5r3efi1hitlpki3g04g service-1-84988c5fb6-8rhtq 0/1 ContainerStatusUnknown 0 20m <none> node1 <none> <none>
g5i1ml6bhnfso9faglp1gv167f8acegsv335hlkq0dlfc service-1-7d66cdd98c-572vx 0/1 ContainerStatusUnknown 0 18m <none> node1 <none> <none>
Which in turn triggered this bug (see the GPU count for node1 and node2):
$ provider_info2.sh provider.pdx.nb.akash.pub
PROVIDER INFO
"hostname" "address"
"provider.pdx.nb.akash.pub" "akash1t0sk5nhc8n3xply5ft60x9det0s7jwplzzycnv"
Total/Available/Used (t/a/u) per node:
"name" "cpu(t/a/u)" "gpu(t/a/u)" "mem(t/a/u GiB)" "ephemeral(t/a/u GiB)"
"node1" "128/21.38/106.62" "0/18446744073709552000/-18446744073709552000" "503.61/401.45/102.16" "6385.77/5515.21/870.55"
"node2" "128/24.12/103.88" "8/18446744073709552000/-18446744073709552000" "503.61/375.96/127.65" "6385.77/4059.33/2326.43"
"node3" "128/22.425/105.575" "8/1/7" "503.61/400.99/102.62" "6385.77/5515.21/870.55"
ACTIVE TOTAL:
"cpu(cores)" "gpu" "mem(GiB)" "ephemeral(GiB)" "beta1(GiB)" "beta2(GiB)" "beta3(GiB)"
149 14 152.74 2033.77 0 0 0
PERSISTENT STORAGE:
"storage class" "available space(GiB)"
"beta3" 1699.04
PENDING TOTAL:
"cpu(cores)" "gpu" "mem(GiB)" "ephemeral(GiB)" "beta1(GiB)" "beta2(GiB)" "beta3(GiB)"
I've deleted the pods stuck in ContainerStatusUnknown and the stats immediately recovered:
kubectl delete pods -A --field-selector status.phase=Failed
$ provider_info2.sh provider.pdx.nb.akash.pub
PROVIDER INFO
"hostname" "address"
"provider.pdx.nb.akash.pub" "akash1t0sk5nhc8n3xply5ft60x9det0s7jwplzzycnv"
Total/Available/Used (t/a/u) per node:
"name" "cpu(t/a/u)" "gpu(t/a/u)" "mem(t/a/u GiB)" "ephemeral(t/a/u GiB)"
"node1" "128/57.38/70.62" "8/4/4" "503.61/434.97/68.64" "6385.77/5938.73/447.03"
"node2" "128/73.12/54.88" "8/1/7" "503.61/435.56/68.05" "6385.77/5222.55/1163.22"
"node3" "128/22.425/105.575" "8/1/7" "503.61/400.99/102.62" "6385.77/5515.21/870.55"
ACTIVE TOTAL:
"cpu(cores)" "gpu" "mem(GiB)" "ephemeral(GiB)" "beta1(GiB)" "beta2(GiB)" "beta3(GiB)"
181 16 182.54 2257.29 0 0 0
PERSISTENT STORAGE:
"storage class" "available space(GiB)"
"beta3" 1699.04
PENDING TOTAL:
"cpu(cores)" "gpu" "mem(GiB)" "ephemeral(GiB)" "beta1(GiB)" "beta2(GiB)" "beta3(GiB)"
16 1 14.9 111.76 0 0 0
I think the pdx.nb provider is a good candidate for monitoring this issue, since it started occurring frequently after the node1.pdx.nb.akash.pub node was replaced yesterday (the mainboard, GPUs & Ceph disk; everything except the main OS disks (rootfs)).
$ provider_info2.sh provider.pdx.nb.akash.pub
PROVIDER INFO
"hostname" "address"
"provider.pdx.nb.akash.pub" "akash1t0sk5nhc8n3xply5ft60x9det0s7jwplzzycnv"
Total/Available/Used (t/a/u) per node:
"name" "cpu(t/a/u)" "gpu(t/a/u)" "mem(t/a/u GiB)" "ephemeral(t/a/u GiB)"
"node1" "128/11.65/116.35" "8/1/7" "503.59/240.78/262.82" "6385.77/5603.46/782.31"
"node2" "128/39.95/88.05" "8/0/8" "503.61/418.51/85.1" "6385.77/5138.73/1247.03"
"node3" "128/6.325/121.675" "8/0/8" "503.61/368.93/134.68" "6385.77/5403.46/982.31"
ACTIVE TOTAL:
"cpu(cores)" "gpu" "mem(GiB)" "ephemeral(GiB)" "beta1(GiB)" "beta2(GiB)" "beta3(GiB)"
356 26 583.64 3346.93 0 0 770
PERSISTENT STORAGE:
"storage class" "available space(GiB)"
"beta3" 846.13
PENDING TOTAL:
"cpu(cores)" "gpu" "mem(GiB)" "ephemeral(GiB)" "beta1(GiB)" "beta2(GiB)" "beta3(GiB)"
$ provider_info2.sh provider.pdx.nb.akash.pub
PROVIDER INFO
"hostname" "address"
"provider.pdx.nb.akash.pub" "akash1t0sk5nhc8n3xply5ft60x9det0s7jwplzzycnv"
Total/Available/Used (t/a/u) per node:
"name" "cpu(t/a/u)" "gpu(t/a/u)" "mem(t/a/u GiB)" "ephemeral(t/a/u GiB)"
"node1" "128/18446744073709468/-18446744073709340" "8/18446744073709552000/-18446744073709552000" "503.59/16.78/486.82" "6385.77/4932.9/1452.86"
"node2" "128/39.95/88.05" "8/0/8" "503.61/418.51/85.1" "6385.77/5138.73/1247.03"
"node3" "128/6.325/121.675" "8/0/8" "503.61/368.93/134.68" "6385.77/5403.46/982.31"
ACTIVE TOTAL:
"cpu(cores)" "gpu" "mem(GiB)" "ephemeral(GiB)" "beta1(GiB)" "beta2(GiB)" "beta3(GiB)"
356 26 583.64 3346.93 0 0 770
PERSISTENT STORAGE:
"storage class" "available space(GiB)"
"beta3" 846.13
PENDING TOTAL:
"cpu(cores)" "gpu" "mem(GiB)" "ephemeral(GiB)" "beta1(GiB)" "beta2(GiB)" "beta3(GiB)"
$ kubectl -n akash-services get pods
NAME READY STATUS RESTARTS AGE
akash-node-1-0 1/1 Running 2 (5d16h ago) 5d17h
akash-provider-0 1/1 Running 0 15h
operator-hostname-574d8699d-c22w5 1/1 Running 3 (3d11h ago) 5d17h
operator-inventory-75df5b6fb5-2k897 1/1 Running 0 55m
operator-inventory-hardware-discovery-node1 1/1 Running 0 55m
operator-inventory-hardware-discovery-node2 1/1 Running 0 55m
operator-inventory-hardware-discovery-node3 1/1 Running 0 55m
$ kubectl -n akash-services logs deployment/operator-inventory --timestamps
2024-04-10T09:19:38.827373163Z I[2024-04-10|09:19:38.827] using in cluster kube config cmp=provider
2024-04-10T09:19:39.849250140Z INFO rook-ceph ADDED monitoring CephCluster {"ns": "rook-ceph", "name": "rook-ceph"}
2024-04-10T09:19:39.874528936Z INFO nodes.nodes waiting for nodes to finish
2024-04-10T09:19:39.874554196Z INFO grpc listening on ":8081"
2024-04-10T09:19:39.874575036Z INFO watcher.storageclasses started
2024-04-10T09:19:39.874577926Z INFO watcher.config started
2024-04-10T09:19:39.874582506Z INFO rest listening on ":8080"
2024-04-10T09:19:39.877335127Z INFO rook-ceph ADDED monitoring StorageClass {"name": "beta3"}
2024-04-10T09:19:39.878384595Z INFO nodes.node.monitor starting {"node": "node1"}
2024-04-10T09:19:39.878393235Z INFO nodes.node.discovery starting hardware discovery pod {"node": "node1"}
2024-04-10T09:19:39.878405655Z INFO nodes.node.monitor starting {"node": "node2"}
2024-04-10T09:19:39.878415995Z INFO nodes.node.discovery starting hardware discovery pod {"node": "node2"}
2024-04-10T09:19:39.878422595Z INFO nodes.node.monitor starting {"node": "node3"}
2024-04-10T09:19:39.878454075Z INFO nodes.node.discovery starting hardware discovery pod {"node": "node3"}
2024-04-10T09:19:39.885778559Z INFO rancher ADDED monitoring StorageClass {"name": "beta3"}
2024-04-10T09:19:42.756543562Z INFO nodes.node.discovery started hardware discovery pod {"node": "node3"}
2024-04-10T09:19:42.916743883Z INFO nodes.node.discovery started hardware discovery pod {"node": "node2"}
2024-04-10T09:19:43.183058728Z INFO nodes.node.discovery started hardware discovery pod {"node": "node1"}
2024-04-10T09:19:43.202991245Z INFO nodes.node.monitor started {"node": "node2"}
2024-04-10T09:19:43.426359532Z INFO nodes.node.monitor started {"node": "node1"}
2024-04-10T09:19:44.084855728Z INFO nodes.node.monitor started {"node": "node3"}
2024-04-10T09:19:49.948809750Z INFO rook-ceph MODIFIED monitoring CephCluster {"ns": "rook-ceph", "name": "rook-ceph"}
2024-04-10T09:20:50.587788813Z INFO rook-ceph MODIFIED monitoring CephCluster {"ns": "rook-ceph", "name": "rook-ceph"}
2024-04-10T09:21:51.246518697Z INFO rook-ceph MODIFIED monitoring CephCluster {"ns": "rook-ceph", "name": "rook-ceph"}
2024-04-10T09:22:51.918597738Z INFO rook-ceph MODIFIED monitoring CephCluster {"ns": "rook-ceph", "name": "rook-ceph"}
[... the same MODIFIED monitoring CephCluster line repeats roughly once a minute until 2024-04-10T10:14:25 ...]
This might also be the trigger:
NAMESPACE LAST SEEN TYPE REASON OBJECT MESSAGE
sj264h0mg6bq9alvkqtnq69ubjd9ptq4ubdfkuc9i6rdm 14m Warning FailedScheduling pod/service-1-0 0/3 nodes are available: 2 Insufficient cpu, 2 Insufficient nvidia.com/gpu. preemption: 0/3 nodes are available: 3 No preemption victims found for incoming pod..
akash-services 60m Normal Scheduled pod/operator-inventory-75df5b6fb5-2k897 Successfully assigned akash-services/operator-inventory-75df5b6fb5-2k897 to node2
eg0vmr4qmf9kdumohtdahhqq14aa3i5q1dutblo0jugc2 14m Warning FailedScheduling pod/service-1-0 0/3 nodes are available: 2 Insufficient cpu, 2 Insufficient nvidia.com/gpu. preemption: 0/3 nodes are available: 3 No preemption victims found for incoming pod..
eip4fok1c0g4eome40s2r4u3941at4sua6rlh842c07dq 14m Warning FailedScheduling pod/service-1-0 0/3 nodes are available: 2 Insufficient cpu, 2 Insufficient nvidia.com/gpu. preemption: 0/3 nodes are available: 3 No preemption victims found for incoming pod..
FWIW, `11/26` leases are using `beta3` persistent storage.
I know how to contact the owners of most of the 26 leases on the pdx.nb provider if needed.
Looks like this triggered the "excessively large stats" issue on sg.lnlm.akash.pub:
2024-04-11T08:10:21.154985407Z ERROR watcher.registry couldn't query pci.ids {"error": "Get \"\": unsupported protocol scheme \"\""}
Complete logs with timestamps: sg.lnlm.akash.pub.deployment-operator-inventory.log
There have been no lease-created or lease-closed events for this provider in the past week.
lease-created
$ provider-services query txs --events "akash.v1.provider=akash1zsdzjknq6u475ul8ef4gxh527kz82k6jph8vrs&akash.v1.module=market&akash.v1.action=lease-created" --page 1 --limit 100 -o json | jq -r '.txs[] | [.timestamp, .height, .txhash, .code, (.tx.body.messages[] | ."@type"), (.logs[].events[].attributes[] | (select(.key == "action") | .value), (select(.key == "dseq") | .value), (select(.key == "provider") | .value), (select(.key == "price-amount") | .value))] | @csv'
lease-closed
provider-services query txs --events "akash.v1.provider=akash1zsdzjknq6u475ul8ef4gxh527kz82k6jph8vrs&akash.v1.module=market&akash.v1.action=lease-closed" --page 1 --limit 100 -o json | jq -r '.txs[] | [.timestamp, .height, .txhash, .code, (.tx.body.messages[] | ."@type"), (.logs[].events[].attributes[] | (select(.key == "action") | .value), (select(.key == "dseq") | .value), (select(.key == "provider") | .value), (select(.key == "price-amount") | .value))] | @csv'
However, there have been some bid-created & bid-closed events just today:
bid-created
$ provider-services query txs --events "akash.v1.provider=akash1zsdzjknq6u475ul8ef4gxh527kz82k6jph8vrs&akash.v1.module=market&akash.v1.action=bid-created" --page 1 --limit 100 -o json | jq -r '.txs[] | [.timestamp, .height, .txhash, .code, (.tx.body.messages[] | ."@type"), (.logs[].events[].attributes[] | (select(.key == "action") | .value), (select(.key == "dseq") | .value), (select(.key == "provider") | .value), (select(.key == "price-amount") | .value))] | @csv'
...
"2024-04-05T07:27:16Z","15741103","96056FFA015683ECAD1DC2C8E19FC886B552F04712BBA43B07DF446CD3E910B8",0,"/akash.market.v1beta4.MsgCreateBid","bid-created","15741100","akash1zsdzjknq6u475ul8ef4gxh527kz82k6jph8vrs","1.439409000000000000","/akash.market.v1beta4.MsgCreateBid"
"2024-04-10T08:26:10Z","15813611","84BC3A4364E8E5D6B7D843C094009FE4076A7FE9721F126B69AD7D146B1913C9",0,"/akash.market.v1beta4.MsgCreateBid","bid-created","15813608","akash1zsdzjknq6u475ul8ef4gxh527kz82k6jph8vrs","1.426759000000000000","/akash.market.v1beta4.MsgCreateBid"
"2024-04-10T10:12:10Z","15814669","CD0ABD45313F9C0EE96359D6D0197E92E1CCF1359B4B146ACEAF7AEEEB009B4D",0,"/akash.market.v1beta4.MsgCreateBid","bid-created","15814666","akash1zsdzjknq6u475ul8ef4gxh527kz82k6jph8vrs","1.430669000000000000","/akash.market.v1beta4.MsgCreateBid"
"2024-04-11T05:36:13Z","15826304","5EA500D96198A996371D2D304BA75A720D6D89C428A263BB2F4E19DA9CAD829D",0,"/akash.market.v1beta4.MsgCreateBid","bid-created","15826301","akash1zsdzjknq6u475ul8ef4gxh527kz82k6jph8vrs","1.386220000000000000","/akash.market.v1beta4.MsgCreateBid"
"2024-04-11T07:45:06Z","15827589","79B2A29D9CA2CB9BB1FC9D242B7963A6F86BE318C1189ABEA0544273F807DDE6",0,"/akash.market.v1beta4.MsgCreateBid","bid-created","15827587","akash1zsdzjknq6u475ul8ef4gxh527kz82k6jph8vrs","1.377504000000000000","/akash.market.v1beta4.MsgCreateBid"
"2024-04-11T07:46:06Z","15827599","744DBD61B2E7EF42EB6DE1D992AED30CD1A5BEADA676F21D336B9EF37AAE2E9D",0,"/akash.market.v1beta4.MsgCreateBid","bid-created","15827597","akash1zsdzjknq6u475ul8ef4gxh527kz82k6jph8vrs","1.377504000000000000","/akash.market.v1beta4.MsgCreateBid"
"2024-04-11T07:47:07Z","15827609","1FC2EB06D72087FD033EA5090F65C945AAB88D832995F8DDC4CE5AC3710834AA",0,"/akash.market.v1beta4.MsgCreateBid","bid-created","15827607","akash1zsdzjknq6u475ul8ef4gxh527kz82k6jph8vrs","1.377504000000000000","/akash.market.v1beta4.MsgCreateBid"
"2024-04-11T07:48:19Z","15827621","AA70E01582A99C6FE9752B28084B05A9BCEEF1B27A6006B5711581AEC0668DA5",0,"/akash.market.v1beta4.MsgCreateBid","bid-created","15827617","akash1zsdzjknq6u475ul8ef4gxh527kz82k6jph8vrs","1.377504000000000000","/akash.market.v1beta4.MsgCreateBid"
"2024-04-11T07:49:07Z","15827629","66E5102E111D40E33C5CCA6EF5CA48F07B51ABAB4E497C817F3DB7646D6F9A72",0,"/akash.market.v1beta4.MsgCreateBid","bid-created","15827627","akash1zsdzjknq6u475ul8ef4gxh527kz82k6jph8vrs","1.377504000000000000","/akash.market.v1beta4.MsgCreateBid"
"2024-04-11T07:50:20Z","15827641","774A22CD3D09CCA86968B07E1D6DE89C4C190645B541C73F5B56BE0D55B6C18A",0,"/akash.market.v1beta4.MsgCreateBid","bid-created","15827638","akash1zsdzjknq6u475ul8ef4gxh527kz82k6jph8vrs","1.377504000000000000","/akash.market.v1beta4.MsgCreateBid"
"2024-04-11T07:51:20Z","15827651","273BEB331C3392D5D3F0CE04A9B8503DCCF95DDC7E519B90CC6C258DA06CD88B",0,"/akash.market.v1beta4.MsgCreateBid","bid-created","15827648","akash1zsdzjknq6u475ul8ef4gxh527kz82k6jph8vrs","1.377504000000000000","/akash.market.v1beta4.MsgCreateBid"
"2024-04-11T07:52:20Z","15827661","8DBAB0C611FA344EAEE61B5270A0284EF5C6B4865C0288C8DCEDAF9A0BAF9818",0,"/akash.market.v1beta4.MsgCreateBid","bid-created","15827658","akash1zsdzjknq6u475ul8ef4gxh527kz82k6jph8vrs","1.377504000000000000","/akash.market.v1beta4.MsgCreateBid"
"2024-04-11T07:53:17Z","15827670","A97A93C446D05E85E1277EF5451329F2F229C6694AA9836BA7C012DCF9600FE2",0,"/akash.market.v1beta4.MsgCreateBid","bid-created","15827668","akash1zsdzjknq6u475ul8ef4gxh527kz82k6jph8vrs","1.377504000000000000","/akash.market.v1beta4.MsgCreateBid"
"2024-04-11T07:54:17Z","15827680","3BDF8918413223D81E5802FAA7307657276AB9DDB60BE792CD2FFF7B8DF97F64",0,"/akash.market.v1beta4.MsgCreateBid","bid-created","15827678","akash1zsdzjknq6u475ul8ef4gxh527kz82k6jph8vrs","1.377504000000000000","/akash.market.v1beta4.MsgCreateBid"
"2024-04-11T07:55:33Z","15827692","054DA97FD371249CD17069D197992469A6F8EB39DB14400EC901E5A11912276B",0,"/akash.market.v1beta4.MsgCreateBid","bid-created","15827689","akash1zsdzjknq6u475ul8ef4gxh527kz82k6jph8vrs","1.377504000000000000","/akash.market.v1beta4.MsgCreateBid"
"2024-04-11T07:56:32Z","15827702","EFAA5C8E885E211D0109202217E8AC3A9D05D3381406208F0C8CC65FC573E8F4",0,"/akash.market.v1beta4.MsgCreateBid","bid-created","15827699","akash1zsdzjknq6u475ul8ef4gxh527kz82k6jph8vrs","1.377504000000000000","/akash.market.v1beta4.MsgCreateBid"
"2024-04-11T07:57:32Z","15827712","17E2C69CF1BAE820FF25C00CF7CBD0640A5A7A202872A93371D3554B55C90099",0,"/akash.market.v1beta4.MsgCreateBid","bid-created","15827709","akash1zsdzjknq6u475ul8ef4gxh527kz82k6jph8vrs","1.377504000000000000","/akash.market.v1beta4.MsgCreateBid"
"2024-04-11T07:58:31Z","15827722","56498D973432ECC50E19029BA23269AB77CE44A2E172BD3597D3753F1B2F5A7C",0,"/akash.market.v1beta4.MsgCreateBid","bid-created","15827720","akash1zsdzjknq6u475ul8ef4gxh527kz82k6jph8vrs","1.377504000000000000","/akash.market.v1beta4.MsgCreateBid"
"2024-04-11T07:59:32Z","15827732","BA0CAB4E094D1BFC24F4B5BBA7A21EA5EEA42912F6BFD862D8DB83253497B17D",0,"/akash.market.v1beta4.MsgCreateBid","bid-created","15827730","akash1zsdzjknq6u475ul8ef4gxh527kz82k6jph8vrs","1.377504000000000000","/akash.market.v1beta4.MsgCreateBid"
"2024-04-11T08:00:52Z","15827745","66E8BD7FDF6EFFF86FC247EE1BB1503891C94E3EE53426C1D94377AF64916AC9",0,"/akash.market.v1beta4.MsgCreateBid","bid-created","15827742","akash1zsdzjknq6u475ul8ef4gxh527kz82k6jph8vrs","1.377504000000000000","/akash.market.v1beta4.MsgCreateBid"
"2024-04-11T08:01:50Z","15827754","177463C0403E1C21BA5E5B9AC88D21264BA338509DFC8AB462DAD424D8DFC40E",0,"/akash.market.v1beta4.MsgCreateBid","bid-created","15827751","akash1zsdzjknq6u475ul8ef4gxh527kz82k6jph8vrs","1.377504000000000000","/akash.market.v1beta4.MsgCreateBid"
"2024-04-11T08:02:49Z","15827764","14FCD0FF5C688C1E66F7D25229E6DE10AD9F8429DC32524235CE3D07251649DC",0,"/akash.market.v1beta4.MsgCreateBid","bid-created","15827762","akash1zsdzjknq6u475ul8ef4gxh527kz82k6jph8vrs","1.377504000000000000","/akash.market.v1beta4.MsgCreateBid"
"2024-04-11T08:03:19Z","15827769","7356195DD0C8E5F5AC77552C3B654A0012EDF12B08FC438701CF5BB26DE6C045",0,"/akash.market.v1beta4.MsgCreateBid","bid-created","15827766","akash1zsdzjknq6u475ul8ef4gxh527kz82k6jph8vrs","1.377504000000000000","/akash.market.v1beta4.MsgCreateBid"
"2024-04-11T08:03:38Z","15827772","CE57B2AB2612A2136B1BAFCDC6DC618634D6D8BE7E503CAFBC8BE3197AD46A44",0,"/akash.market.v1beta4.MsgCreateBid","bid-created","15827770","akash1zsdzjknq6u475ul8ef4gxh527kz82k6jph8vrs","1.377504000000000000","/akash.market.v1beta4.MsgCreateBid"
bid-closed
$ provider-services query txs --events "akash.v1.provider=akash1zsdzjknq6u475ul8ef4gxh527kz82k6jph8vrs&akash.v1.module=market&akash.v1.action=bid-closed" --page 1 --limit 100 -o json | jq -r '.txs[] | [.timestamp, .height, .txhash, .code, (.tx.body.messages[] | ."@type"), (.logs[].events[].attributes[] | (select(.key == "action") | .value), (select(.key == "dseq") | .value), (select(.key == "provider") | .value), (select(.key == "price-amount") | .value))] | @csv'
...
"2024-03-25T13:27:13Z","15585782","40182C4E571FA08B191477AB133E0476FD4340896AE83BF4EAD73CF17079E436",0,"/akash.market.v1beta4.MsgCloseBid","bid-closed","15585729","akash1zsdzjknq6u475ul8ef4gxh527kz82k6jph8vrs","1.023612000000000000","/akash.market.v1beta4.MsgCloseBid"
"2024-03-29T13:39:57Z","15643714","23F0F6A6F91E9CA5B7DF9D5D00016BA9CF99EC70007995B9381F2363BA6629A1",0,"/akash.market.v1beta4.MsgCloseBid","bid-closed","15643660","akash1zsdzjknq6u475ul8ef4gxh527kz82k6jph8vrs","1.076710000000000000","/akash.market.v1beta4.MsgCloseBid"
"2024-04-11T07:50:08Z","15827639","65A7CD516C85A4AD78AEED5A172A334C4211B1657BDABE461A6D79B3CF1DBB3E",0,"/akash.market.v1beta4.MsgCloseBid","bid-closed","15827587","akash1zsdzjknq6u475ul8ef4gxh527kz82k6jph8vrs","1.377504000000000000","/akash.market.v1beta4.MsgCloseBid"
"2024-04-11T07:51:14Z","15827650","7F87BB1729137662B4BB83F4D433874D83214321157EDBAB96B8966F387656BB",0,"/akash.market.v1beta4.MsgCloseBid","bid-closed","15827597","akash1zsdzjknq6u475ul8ef4gxh527kz82k6jph8vrs","1.377504000000000000","/akash.market.v1beta4.MsgCloseBid"
"2024-04-11T07:52:14Z","15827660","021A2C39A5325FB64CC82DE6FFB72388DC73779F6B4BA92215E4AF9689E74C2D",0,"/akash.market.v1beta4.MsgCloseBid","bid-closed","15827607","akash1zsdzjknq6u475ul8ef4gxh527kz82k6jph8vrs","1.377504000000000000","/akash.market.v1beta4.MsgCloseBid"
"2024-04-11T07:53:23Z","15827671","F84104F7B3D92762AA7AD7BD934B1718FBBCFFB98330BE25B488E60C19593A28",0,"/akash.market.v1beta4.MsgCloseBid","bid-closed","15827617","akash1zsdzjknq6u475ul8ef4gxh527kz82k6jph8vrs","1.377504000000000000","/akash.market.v1beta4.MsgCloseBid"
"2024-04-11T07:54:11Z","15827679","2B5B6C5E421DE0391805B8A6F2E02CAFF63B180B4E28A66EFC68EB4B83CA8D72",0,"/akash.market.v1beta4.MsgCloseBid","bid-closed","15827627","akash1zsdzjknq6u475ul8ef4gxh527kz82k6jph8vrs","1.377504000000000000","/akash.market.v1beta4.MsgCloseBid"
"2024-04-11T07:55:27Z","15827691","3DC611BF65C1950AAC75FA6F5C91FA7F82000450F1B9D5ADF539D001081E0D6F",0,"/akash.market.v1beta4.MsgCloseBid","bid-closed","15827638","akash1zsdzjknq6u475ul8ef4gxh527kz82k6jph8vrs","1.377504000000000000","/akash.market.v1beta4.MsgCloseBid"
"2024-04-11T07:56:26Z","15827701","90A4C06DFE401B45D6E24EB30687A25297A559A718065E700074BEC266A3022C",0,"/akash.market.v1beta4.MsgCloseBid","bid-closed","15827648","akash1zsdzjknq6u475ul8ef4gxh527kz82k6jph8vrs","1.377504000000000000","/akash.market.v1beta4.MsgCloseBid"
"2024-04-11T07:57:26Z","15827711","097BD4458BE4D85A92304AAC44E48AEB1A1D40C951F2457A55122D425DAE5DC0",0,"/akash.market.v1beta4.MsgCloseBid","bid-closed","15827658","akash1zsdzjknq6u475ul8ef4gxh527kz82k6jph8vrs","1.377504000000000000","/akash.market.v1beta4.MsgCloseBid"
"2024-04-11T07:58:19Z","15827720","B921C408906F0DDCD0093C588AEBAAC3B36AA47193561E8B091FC9945F28CB10",0,"/akash.market.v1beta4.MsgCloseBid","bid-closed","15827668","akash1zsdzjknq6u475ul8ef4gxh527kz82k6jph8vrs","1.377504000000000000","/akash.market.v1beta4.MsgCloseBid"
"2024-04-11T07:59:20Z","15827730","50935281BEE2C55275096D7CE46DA21484335B06C72EC6CF34C629F5C6E5649B",0,"/akash.market.v1beta4.MsgCloseBid","bid-closed","15827678","akash1zsdzjknq6u475ul8ef4gxh527kz82k6jph8vrs","1.377504000000000000","/akash.market.v1beta4.MsgCloseBid"
"2024-04-11T08:00:38Z","15827743","9C1314BBD1CA12C18750D87A68098F2B6E86B8D8AAEFFF20946FCD970057F882",0,"/akash.market.v1beta4.MsgCloseBid","bid-closed","15827689","akash1zsdzjknq6u475ul8ef4gxh527kz82k6jph8vrs","1.377504000000000000","/akash.market.v1beta4.MsgCloseBid"
"2024-04-11T08:01:33Z","15827752","744924EA176AE8F769D46D036720175188CE769FA4DCBEBBBF815FD78E02C266",0,"/akash.market.v1beta4.MsgCloseBid","bid-closed","15827699","akash1zsdzjknq6u475ul8ef4gxh527kz82k6jph8vrs","1.377504000000000000","/akash.market.v1beta4.MsgCloseBid"
"2024-04-11T08:04:45Z","15827783","A8BA8BA520E485F64AA00116811EC463BEBAF5585C8664A694E87F01AF52AAE9",0,"/akash.market.v1beta4.MsgCloseBid","bid-closed","15827730","akash1zsdzjknq6u475ul8ef4gxh527kz82k6jph8vrs","1.377504000000000000","/akash.market.v1beta4.MsgCloseBid"
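Pairing a few of the bid-created / bid-closed rows above by dseq shows that each bid was closed about five minutes after it was placed, with no lease in between, which is consistent with unmatched bids timing out rather than being withdrawn. A quick check (dseq values and timestamps copied from the output above):

```python
from datetime import datetime

# dseq -> timestamp, copied from the bid-created / bid-closed output above
created = {"15827587": "2024-04-11T07:45:06Z", "15827597": "2024-04-11T07:46:06Z"}
closed = {"15827587": "2024-04-11T07:50:08Z", "15827597": "2024-04-11T07:51:14Z"}

def parse(ts: str) -> datetime:
    # fromisoformat in older Pythons doesn't accept the trailing "Z"
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))

for dseq, t0 in created.items():
    # prints a gap of roughly five minutes for each dseq
    print(dseq, parse(closed[dseq]) - parse(t0))
```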
Provider 0.5.12 does not exhibit the excessive resource reporting issue :rocket:
Next steps:
Akash network 0.30.0, provider 0.5.4
Observation
I installed the nvdp/nvidia-device-plugin helm chart by mistake and removed it after a short time; sometimes the provider will report an excessively large amount of Allocatable CPU & RAM.
I reinstalled the operator-inventory, which seemed to help at first. However, after some time I noticed the issue appeared again. Additionally, I noticed this error in the operator-inventory, but soon figured it doesn't seem to be the cause, given that other providers have seen the same error in their inventory operator.
Provider logs
sg.lnlm.provider.log
Detailed info (8443/status)
sg.lnlm.provider-info-detailed.log
Additional observations
With the reinstalled operator-inventory running for over 16 minutes, the issue hasn't appeared yet. I'll keep monitoring it.
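For continued monitoring, a simple sanity check could flag the wrapped figures automatically: any allocatable or used value outside the `[0, total]` range indicates the uint64 wraparound seen earlier. A hypothetical helper (the function name and approach are my own, not part of the provider tooling):

```python
def looks_wrapped(value: float, total: float) -> bool:
    """Return True when an allocatable/used figure is impossible for the node,
    i.e. negative or larger than the node's total capacity."""
    return not (0 <= value <= total)

# Figures taken from the status output earlier in this thread:
print(looks_wrapped(18446744073709468, 128))  # True  -> node1's bogus allocatable CPU
print(looks_wrapped(39.95, 128))              # False -> node2's sane allocatable CPU
```

Polling the 8443/status endpoint and alerting when this check fires would catch the regression without having to eyeball the numbers.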