andy108369 opened this issue 3 months ago
Have observed the same on the provider.h100.wdc.val.akash.pub provider:
$ grpcurl -insecure provider.h100.wdc.val.akash.pub:8444 akash.provider.v1.ProviderRPC.GetStatus | jq '.cluster.inventory.cluster.nodes[] | {node: .name, cpu_info: .resources.cpu.info, gpu_info: .resources.gpu.info}'
...
...
}
{
"node": "node6",
"cpu_info": null,
"gpu_info": null
}
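For reference, a quick filter (just a sketch, reusing the same endpoint and JSON paths shown above) that prints only the nodes reporting null cpu_info or gpu_info:

$ grpcurl -insecure provider.h100.wdc.val.akash.pub:8444 akash.provider.v1.ProviderRPC.GetStatus \
  | jq -r '.cluster.inventory.cluster.nodes[]
           | select(.resources.cpu.info == null or .resources.gpu.info == null)
           | .name'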
Fixed by bouncing the operator-inventory:
- kubectl -n akash-services rollout restart deployment/operator-inventory
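Until the root cause is known, a minimal watchdog sketch (not part of the provider tooling; the endpoint is a placeholder) could combine the check above with this restart:

# restart operator-inventory whenever any node reports null cpu_info/gpu_info
PROVIDER_ENDPOINT=provider.h100.wdc.val.akash.pub:8444   # placeholder, adjust per provider
missing=$(grpcurl -insecure "$PROVIDER_ENDPOINT" akash.provider.v1.ProviderRPC.GetStatus \
  | jq -r '.cluster.inventory.cluster.nodes[]
           | select(.resources.cpu.info == null or .resources.gpu.info == null)
           | .name')
if [ -n "$missing" ]; then
  echo "nodes with missing inventory info: $missing"
  kubectl -n akash-services rollout restart deployment/operator-inventory
fi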
root@node1:~# kubectl -n akash-services logs deployment/operator-inventory --timestamps |grep -v Ceph
2024-07-27T08:21:07.926076173Z I[2024-07-27|08:21:07.926] using in cluster kube config cmp=provider
2024-07-27T08:21:09.003050511Z INFO rest listening on ":8080"
2024-07-27T08:21:09.003185275Z INFO watcher.storageclasses started
2024-07-27T08:21:09.003263541Z INFO nodes.nodes waiting for nodes to finish
2024-07-27T08:21:09.003347497Z INFO grpc listening on ":8081"
2024-07-27T08:21:09.003855573Z INFO watcher.config started
2024-07-27T08:21:09.005745505Z INFO rook-ceph ADDED monitoring StorageClass {"name": "beta3"}
2024-07-27T08:21:09.009702561Z INFO nodes.node.monitor starting {"node": "node6"}
2024-07-27T08:21:09.009718398Z INFO nodes.node.discovery starting hardware discovery pod {"node": "node3"}
2024-07-27T08:21:09.009739736Z INFO nodes.node.discovery starting hardware discovery pod {"node": "node1"}
2024-07-27T08:21:09.009754127Z INFO nodes.node.monitor starting {"node": "node3"}
2024-07-27T08:21:09.009760696Z INFO nodes.node.discovery starting hardware discovery pod {"node": "node5"}
2024-07-27T08:21:09.009777807Z INFO nodes.node.monitor starting {"node": "node4"}
2024-07-27T08:21:09.009784127Z INFO nodes.node.monitor starting {"node": "node2"}
2024-07-27T08:21:09.009789636Z INFO nodes.node.discovery starting hardware discovery pod {"node": "node4"}
2024-07-27T08:21:09.009795215Z INFO nodes.node.monitor starting {"node": "node1"}
2024-07-27T08:21:09.009806046Z INFO nodes.node.discovery starting hardware discovery pod {"node": "node2"}
2024-07-27T08:21:09.009812526Z INFO nodes.node.monitor starting {"node": "node5"}
2024-07-27T08:21:09.009876008Z INFO nodes.node.discovery starting hardware discovery pod {"node": "node6"}
2024-07-27T08:21:09.015424597Z INFO rancher ADDED monitoring StorageClass {"name": "beta3"}
2024-07-27T08:21:10.842061838Z INFO nodes.node.discovery started hardware discovery pod {"node": "node1"}
2024-07-27T08:21:11.237795598Z INFO nodes.node.monitor started {"node": "node1"}
2024-07-27T08:21:11.504769748Z INFO nodes.node.discovery started hardware discovery pod {"node": "node2"}
2024-07-27T08:21:11.703728596Z INFO nodes.node.discovery started hardware discovery pod {"node": "node4"}
2024-07-27T08:21:12.113093042Z INFO nodes.node.discovery started hardware discovery pod {"node": "node3"}
2024-07-27T08:21:12.198401612Z INFO nodes.node.monitor started {"node": "node4"}
2024-07-27T08:21:12.311559647Z INFO nodes.node.discovery started hardware discovery pod {"node": "node6"}
2024-07-27T08:21:12.370969406Z INFO nodes.node.discovery started hardware discovery pod {"node": "node5"}
2024-07-27T08:21:12.459074609Z INFO nodes.node.monitor started {"node": "node3"}
2024-07-27T08:21:12.794276565Z INFO nodes.node.monitor started {"node": "node6"}
2024-07-27T08:21:12.843802757Z INFO nodes.node.monitor started {"node": "node2"}
2024-07-27T08:21:13.722493046Z INFO nodes.node.monitor started {"node": "node5"}
2024-07-27T08:21:15.228564039Z INFO nodes.node.monitor successfully applied labels and/or annotations patches for node "node6" {"labels": {"akash.network":"true","akash.network/capabilities.gpu.vendor.nvidia.model.h100":"8","akash.network/capabilities.gpu.vendor.nvidia.model.h100.interface.sxm":"8","akash.network/capabilities.gpu.vendor.nvidia.model.h100.ram.80Gi":"8","akash.network/capabilities.storage.class.beta3":"1","nvidia.com/gpu.present":"true"}}
I've noticed that operator-inventory is consuming 100% CPU (out of the 2 CPUs it is allocated via the Helm chart) on the Valdi H100 provider, and on the Oblivus H100 provider as well. Maybe that's something that could contribute to the issue.
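To confirm the pod's actual usage and its configured limits in-cluster (a quick check; kubectl top requires metrics-server):

# current CPU/memory usage of the operator-inventory pod
$ kubectl -n akash-services top pods | grep operator-inventory
# CPU/memory requests and limits set by the Helm chart
$ kubectl -n akash-services get deployment operator-inventory -o jsonpath='{.spec.template.spec.containers[0].resources}'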
The operator-inventory occasionally fails to detect the GPU/CPU, resulting in worker nodes remaining unlabeled (a quick label check is sketched at the bottom of this issue). Consequently, the gRPC status endpoint returns null for cpu_info and/or gpu_info, which in turn affects the Cloudmos / Console API statistics.

SW versions
Logs
https://gist.github.com/andy108369/49bcc40a15b85de75cb3f1808a32c1f9
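For completeness, a quick way to check whether a worker node ended up with the expected labels (a sketch; node6 and the H100 capability label key are taken from the operator log above):

# show the akash.network labels operator-inventory should have applied
$ kubectl get node node6 --show-labels | tr ',' '\n' | grep akash.network
# or list all nodes with the GPU capability label as a column
$ kubectl get nodes -L akash.network/capabilities.gpu.vendor.nvidia.model.h100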