akash-network / support

Akash Support and Issue Tracking

Bug: `operator-inventory` fails to detect GPU/CPU, causing unlabeled nodes and `null` in GRPC status #240

Open andy108369 opened 1 month ago

andy108369 commented 1 month ago

The operator-inventory occasionally fails to detect GPUs/CPUs, leaving worker nodes unlabeled. Consequently, the gRPC status endpoint returns null for cpu_info and/or gpu_info, which in turn skews the Cloudmos / Console API statistics.

curl https://api.cloudmos.io/internal/gpu | jq '.gpus.details.nvidia[] | select(.model == "rtx4090")'
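
A quick way to check whether a given worker node actually received its labels (the akash.network label keys are the ones operator-inventory applies, as seen in its logs further below; node6 is just an example):

$ kubectl get nodes -L nvidia.com/gpu.present
$ kubectl get node node6 -o json | jq '.metadata.labels | with_entries(select(.key | startswith("akash.network")))'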

SW versions

$ kubectl -n akash-services get pods -o custom-columns='NAME:.metadata.name,IMAGE:.spec.containers[*].image'
NAME                                          IMAGE
akash-node-1-0                                ghcr.io/akash-network/node:0.36.0
akash-provider-0                              ghcr.io/akash-network/provider:0.6.2
operator-hostname-6dddc6db79-kj48g            ghcr.io/akash-network/provider:0.6.2
operator-inventory-55776b97f7-ksrt4           ghcr.io/akash-network/provider:0.6.2
operator-inventory-hardware-discovery-node1   ghcr.io/akash-network/provider:0.6.2
operator-inventory-hardware-discovery-node2   ghcr.io/akash-network/provider:0.6.2
operator-inventory-hardware-discovery-node3   ghcr.io/akash-network/provider:0.6.2

Logs

https://gist.github.com/andy108369/49bcc40a15b85de75cb3f1808a32c1f9

andy108369 commented 1 month ago

I have observed the same on the provider.h100.wdc.val.akash.pub provider:

$ grpcurl -insecure provider.h100.wdc.val.akash.pub:8444 akash.provider.v1.ProviderRPC.GetStatus | jq '.cluster.inventory.cluster.nodes[] | {node: .name, cpu_info: .resources.cpu.info, gpu_info: .resources.gpu.info}'
...
...
}
{
  "node": "node6",
  "cpu_info": null,
  "gpu_info": null
}

Fixed by bouncing operator-inventory: kubectl -n akash-services rollout restart deployment/operator-inventory (see the watchdog sketch after the log below)

root@node1:~# kubectl -n akash-services logs deployment/operator-inventory --timestamps  |grep -v Ceph
2024-07-27T08:21:07.926076173Z I[2024-07-27|08:21:07.926] using in cluster kube config                 cmp=provider
2024-07-27T08:21:09.003050511Z INFO rest listening on ":8080"
2024-07-27T08:21:09.003185275Z INFO watcher.storageclasses  started
2024-07-27T08:21:09.003263541Z INFO nodes.nodes waiting for nodes to finish
2024-07-27T08:21:09.003347497Z INFO grpc listening on ":8081"
2024-07-27T08:21:09.003855573Z INFO watcher.config  started
2024-07-27T08:21:09.005745505Z INFO rook-ceph      ADDED monitoring StorageClass    {"name": "beta3"}
2024-07-27T08:21:09.009702561Z INFO nodes.node.monitor  starting    {"node": "node6"}
2024-07-27T08:21:09.009718398Z INFO nodes.node.discovery    starting hardware discovery pod {"node": "node3"}
2024-07-27T08:21:09.009739736Z INFO nodes.node.discovery    starting hardware discovery pod {"node": "node1"}
2024-07-27T08:21:09.009754127Z INFO nodes.node.monitor  starting    {"node": "node3"}
2024-07-27T08:21:09.009760696Z INFO nodes.node.discovery    starting hardware discovery pod {"node": "node5"}
2024-07-27T08:21:09.009777807Z INFO nodes.node.monitor  starting    {"node": "node4"}
2024-07-27T08:21:09.009784127Z INFO nodes.node.monitor  starting    {"node": "node2"}
2024-07-27T08:21:09.009789636Z INFO nodes.node.discovery    starting hardware discovery pod {"node": "node4"}
2024-07-27T08:21:09.009795215Z INFO nodes.node.monitor  starting    {"node": "node1"}
2024-07-27T08:21:09.009806046Z INFO nodes.node.discovery    starting hardware discovery pod {"node": "node2"}
2024-07-27T08:21:09.009812526Z INFO nodes.node.monitor  starting    {"node": "node5"}
2024-07-27T08:21:09.009876008Z INFO nodes.node.discovery    starting hardware discovery pod {"node": "node6"}
2024-07-27T08:21:09.015424597Z INFO rancher    ADDED monitoring StorageClass    {"name": "beta3"}
2024-07-27T08:21:10.842061838Z INFO nodes.node.discovery    started hardware discovery pod  {"node": "node1"}
2024-07-27T08:21:11.237795598Z INFO nodes.node.monitor  started {"node": "node1"}
2024-07-27T08:21:11.504769748Z INFO nodes.node.discovery    started hardware discovery pod  {"node": "node2"}
2024-07-27T08:21:11.703728596Z INFO nodes.node.discovery    started hardware discovery pod  {"node": "node4"}
2024-07-27T08:21:12.113093042Z INFO nodes.node.discovery    started hardware discovery pod  {"node": "node3"}
2024-07-27T08:21:12.198401612Z INFO nodes.node.monitor  started {"node": "node4"}
2024-07-27T08:21:12.311559647Z INFO nodes.node.discovery    started hardware discovery pod  {"node": "node6"}
2024-07-27T08:21:12.370969406Z INFO nodes.node.discovery    started hardware discovery pod  {"node": "node5"}
2024-07-27T08:21:12.459074609Z INFO nodes.node.monitor  started {"node": "node3"}
2024-07-27T08:21:12.794276565Z INFO nodes.node.monitor  started {"node": "node6"}
2024-07-27T08:21:12.843802757Z INFO nodes.node.monitor  started {"node": "node2"}
2024-07-27T08:21:13.722493046Z INFO nodes.node.monitor  started {"node": "node5"}
2024-07-27T08:21:15.228564039Z INFO nodes.node.monitor  successfully applied labels and/or annotations patches for node "node6" {"labels": {"akash.network":"true","akash.network/capabilities.gpu.vendor.nvidia.model.h100":"8","akash.network/capabilities.gpu.vendor.nvidia.model.h100.interface.sxm":"8","akash.network/capabilities.gpu.vendor.nvidia.model.h100.ram.80Gi":"8","akash.network/capabilities.storage.class.beta3":"1","nvidia.com/gpu.present":"true"}}
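
Until the root cause is found, a crude workaround would be to periodically check the gRPC status for nodes reporting null cpu info and bounce operator-inventory automatically. A rough sketch only (same endpoint and jq path as the grpcurl call above; needs grpcurl, jq and kubectl access, e.g. run from cron):

#!/usr/bin/env bash
# Crude watchdog sketch: restart operator-inventory whenever any node reports
# null cpu info over the provider gRPC status endpoint.
set -euo pipefail

provider="provider.h100.wdc.val.akash.pub:8444"   # example endpoint from above

nulls=$(grpcurl -insecure "$provider" akash.provider.v1.ProviderRPC.GetStatus \
  | jq '[.cluster.inventory.cluster.nodes[] | select(.resources.cpu.info == null)] | length')

if [ "$nulls" -gt 0 ]; then
  echo "$nulls node(s) reporting null cpu info - bouncing operator-inventory"
  kubectl -n akash-services rollout restart deployment/operator-inventory
fi
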
andy108369 commented 4 weeks ago

I've noticed that operator-inventory is consuming 100% CPU (out of the 2 CPUs it is allocated via the Helm chart) on the Valdi H100 provider: (screenshot)

On the Oblivus H100 provider as well: (screenshot)

Maybe that's something that could contribute to the issue.
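
To confirm, the actual usage and the limits set by the chart can be checked with something like (kubectl top assumes metrics-server is installed):

$ kubectl -n akash-services top pods | grep operator-inventory
$ kubectl -n akash-services get deployment operator-inventory -o jsonpath='{.spec.template.spec.containers[0].resources}{"\n"}'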