akash-network / support

Akash Support and Issue Tracking

operator-inventory keeps restarting `panic: interface conversion: runtime.Object is *v1.Status, not *v1.Pod` #222

Closed (andy108369 closed this 1 month ago)

andy108369 commented 1 month ago

Seeing the following errors on the provider.medc1.com provider (operator-inventory keeps panicking and restarting):

$ kubectl -n akash-services logs operator-inventory-dfdd44d64-zxcfn  --timestamps
2024-05-07T11:01:25.772363804Z I[2024-05-07|11:01:25.772] using in cluster kube config                 cmp=provider
2024-05-07T11:01:25.888968153Z INFO nodes.nodes waiting for nodes to finish
2024-05-07T11:01:25.888990335Z INFO watcher.storageclasses  started
2024-05-07T11:01:25.889143853Z INFO rest listening on ":8080"
2024-05-07T11:01:25.889255333Z INFO grpc listening on ":8081"
2024-05-07T11:01:25.889400085Z INFO watcher.config  started
2024-05-07T11:01:25.895360878Z INFO rook-ceph      ADDED monitoring StorageClass    {"name": "local-path"}
2024-05-07T11:01:25.899218417Z INFO nodes.node.monitor  starting    {"node": "node4"}
2024-05-07T11:01:25.899222505Z INFO nodes.node.discovery    starting hardware discovery pod {"node": "node1"}
2024-05-07T11:01:25.899228226Z INFO nodes.node.monitor  starting    {"node": "node2"}
2024-05-07T11:01:25.899231632Z INFO nodes.node.discovery    starting hardware discovery pod {"node": "node2"}
2024-05-07T11:01:25.899274843Z INFO nodes.node.monitor  starting    {"node": "xg-4090-002"}
2024-05-07T11:01:25.899310170Z INFO nodes.node.discovery    starting hardware discovery pod {"node": "node4"}
2024-05-07T11:01:25.899313075Z INFO nodes.node.discovery    starting hardware discovery pod {"node": "xg-4090-002"}
2024-05-07T11:01:25.899337712Z INFO nodes.node.monitor  starting    {"node": "node1"}
2024-05-07T11:01:25.899340256Z INFO nodes.node.monitor  starting    {"node": "xg-4090-003"}
2024-05-07T11:01:25.899360544Z INFO nodes.node.discovery    starting hardware discovery pod {"node": "xg-4090-003"}
2024-05-07T11:01:25.899363059Z INFO nodes.node.discovery    starting hardware discovery pod {"node": "xg-4090-001"}
2024-05-07T11:01:25.899379300Z INFO nodes.node.monitor  starting    {"node": "xg-4090-001"}
2024-05-07T11:01:25.899428402Z INFO nodes.node.monitor  starting    {"node": "xg-4090-004"}
2024-05-07T11:01:25.899431077Z INFO nodes.node.discovery    starting hardware discovery pod {"node": "xg-4090-004"}
2024-05-07T11:01:25.899504144Z INFO nodes.node.monitor  starting    {"node": "xg-4090-005"}
2024-05-07T11:01:25.899528570Z INFO nodes.node.discovery    starting hardware discovery pod {"node": "xg-4090-005"}
2024-05-07T11:01:25.899570008Z INFO nodes.node.monitor  starting    {"node": "xg-4090-006"}
2024-05-07T11:01:25.899594534Z INFO nodes.node.discovery    starting hardware discovery pod {"node": "xg-4090-006"}
2024-05-07T11:01:25.899639298Z INFO nodes.node.monitor  starting    {"node": "xg-4090-007"}
2024-05-07T11:01:25.899672901Z INFO nodes.node.discovery    starting hardware discovery pod {"node": "xg-4090-007"}
2024-05-07T11:01:25.899729508Z INFO nodes.node.monitor  starting    {"node": "xg-4090-008"}
2024-05-07T11:01:25.899756659Z INFO nodes.node.discovery    starting hardware discovery pod {"node": "xg-4090-008"}
2024-05-07T11:01:25.899869381Z INFO rancher    ADDED monitoring StorageClass    {"name": "local-path"}
2024-05-07T11:01:25.899892244Z INFO nodes.node.monitor  starting    {"node": "xg-4090-009"}
2024-05-07T11:01:25.899894788Z INFO nodes.node.discovery    starting hardware discovery pod {"node": "xg-4090-009"}
2024-05-07T11:01:25.900035313Z INFO nodes.node.discovery    starting hardware discovery pod {"node": "xg-4090-010"}
2024-05-07T11:01:25.900041815Z INFO nodes.node.monitor  starting    {"node": "xg-4090-010"}
2024-05-07T11:01:25.900081700Z INFO nodes.node.monitor  starting    {"node": "xg3"}
2024-05-07T11:01:25.900097479Z INFO nodes.node.discovery    starting hardware discovery pod {"node": "xg3"}
2024-05-07T11:01:27.831711875Z INFO nodes.node.discovery    started hardware discovery pod  {"node": "xg-4090-010"}
2024-05-07T11:01:27.832484658Z INFO nodes.node.discovery    started hardware discovery pod  {"node": "xg-4090-006"}
2024-05-07T11:01:27.867445812Z INFO nodes.node.discovery    started hardware discovery pod  {"node": "xg-4090-007"}
2024-05-07T11:01:27.869880457Z INFO nodes.node.discovery    started hardware discovery pod  {"node": "xg-4090-005"}
2024-05-07T11:01:27.986084014Z INFO nodes.node.monitor  started {"node": "xg-4090-005"}
2024-05-07T11:01:28.411173473Z INFO nodes.node.discovery    started hardware discovery pod  {"node": "xg-4090-001"}
2024-05-07T11:01:28.451953112Z INFO nodes.node.discovery    started hardware discovery pod  {"node": "xg-4090-002"}
2024-05-07T11:01:28.457704931Z INFO nodes.node.discovery    started hardware discovery pod  {"node": "xg3"}
2024-05-07T11:01:28.492491468Z INFO nodes.node.discovery    started hardware discovery pod  {"node": "node1"}
2024-05-07T11:01:28.504803947Z INFO nodes.node.monitor  started {"node": "xg-4090-001"}
2024-05-07T11:01:28.559504638Z INFO nodes.node.monitor  started {"node": "xg-4090-002"}
2024-05-07T11:01:28.589371298Z INFO nodes.node.discovery    started hardware discovery pod  {"node": "xg-4090-008"}
2024-05-07T11:01:28.615090122Z INFO nodes.node.discovery    started hardware discovery pod  {"node": "node4"}
2024-05-07T11:01:28.703609732Z INFO nodes.node.monitor  started {"node": "xg-4090-008"}
2024-05-07T11:01:28.732181427Z INFO nodes.node.discovery    started hardware discovery pod  {"node": "xg-4090-004"}
2024-05-07T11:01:28.834987215Z INFO nodes.node.monitor  started {"node": "xg-4090-004"}
2024-05-07T11:01:28.883371365Z INFO nodes.node.monitor  started {"node": "node1"}
2024-05-07T11:01:28.893755821Z INFO nodes.node.monitor  started {"node": "node4"}
2024-05-07T11:01:28.950694098Z INFO nodes.node.monitor  started {"node": "xg-4090-006"}
2024-05-07T11:01:28.956650192Z INFO nodes.node.monitor  started {"node": "xg-4090-010"}
2024-05-07T11:01:28.958932581Z INFO nodes.node.discovery    started hardware discovery pod  {"node": "xg-4090-003"}
2024-05-07T11:01:28.979685423Z INFO nodes.node.monitor  started {"node": "xg-4090-007"}
2024-05-07T11:01:29.083981691Z INFO nodes.node.discovery    started hardware discovery pod  {"node": "xg-4090-009"}
2024-05-07T11:01:29.090695338Z INFO nodes.node.monitor  started {"node": "xg-4090-003"}
2024-05-07T11:01:29.189918924Z INFO nodes.node.monitor  started {"node": "xg-4090-009"}
2024-05-07T11:01:29.224653021Z INFO nodes.node.discovery    started hardware discovery pod  {"node": "node2"}
2024-05-07T11:01:29.417372818Z INFO nodes.node.monitor  started {"node": "node2"}
2024-05-07T11:01:29.426747266Z INFO nodes.node.monitor  successfully applied labels and/or annotations patches for node "node2" {"labels": {"akash.network":"true","akash.network/capabilities.gpu.vendor.nvidia.model.t4":"1","akash.network/capabilities.gpu.vendor.nvidia.model.t4.interface.pcie":"1","akash.network/capabilities.gpu.vendor.nvidia.model.t4.ram.16Gi":"1","nvidia.com/gpu.present":"true"}}
2024-05-07T11:01:29.677761599Z INFO nodes.node.monitor  started {"node": "xg3"}
2024-05-07T11:04:15.652488204Z INFO nodes.node.monitor  shutting down monitor   {"node": "xg-4090-001"}
2024-05-07T11:04:15.652507019Z INFO nodes.node.monitor  shutting down monitor   {"node": "xg-4090-010"}
2024-05-07T11:04:15.652530985Z INFO nodes.node.monitor  shutting down monitor   {"node": "xg-4090-002"}
2024-05-07T11:04:15.652559648Z W0507 11:04:15.652437       7 reflector.go:347] k8s.io/client-go@v0.26.1/tools/cache/reflector.go:169: watch of *v1.StorageClass ended with: an error on the server ("unable to decode an event from the watch stream: http2: client connection lost") has prevented the request from succeeding
2024-05-07T11:04:15.652572242Z INFO nodes.node.monitor  shutting down monitor   {"node": "xg-4090-007"}
2024-05-07T11:04:15.652604322Z INFO nodes.node.monitor  shutting down monitor   {"node": "node1"}
2024-05-07T11:04:15.652615724Z W0507 11:04:15.652571       7 reflector.go:347] k8s.io/client-go@v0.26.1/tools/cache/reflector.go:169: watch of *v1.Node ended with: an error on the server ("unable to decode an event from the watch stream: http2: client connection lost") has prevented the request from succeeding
2024-05-07T11:04:15.652629319Z W0507 11:04:15.652591       7 reflector.go:347] k8s.io/client-go@v0.26.1/tools/cache/reflector.go:169: watch of *v1.PersistentVolume ended with: an error on the server ("unable to decode an event from the watch stream: http2: client connection lost") has prevented the request from succeeding
2024-05-07T11:04:15.655867896Z panic: interface conversion: runtime.Object is *v1.Status, not *v1.Pod
2024-05-07T11:04:15.655874839Z 
2024-05-07T11:04:15.655879859Z goroutine 123 [running]:
2024-05-07T11:04:15.655891100Z github.com/akash-network/provider/operator/inventory.(*nodeDiscovery).monitor(0xc001c16180)
2024-05-07T11:04:15.655900267Z  github.com/akash-network/provider/operator/inventory/node-discovery.go:524 +0x24c8
2024-05-07T11:04:15.655913863Z golang.org/x/sync/errgroup.(*Group).Go.func1()
2024-05-07T11:04:15.655929923Z  golang.org/x/sync@v0.7.0/errgroup/errgroup.go:78 +0x56
2024-05-07T11:04:15.655941735Z created by golang.org/x/sync/errgroup.(*Group).Go in goroutine 301
2024-05-07T11:04:15.655946494Z  golang.org/x/sync@v0.7.0/errgroup/errgroup.go:75 +0x96
$ kubectl -n akash-services get pods -l app.kubernetes.io/name=inventory -o wide
NAME                                                READY   STATUS    RESTARTS         AGE     IP               NODE          NOMINATED NODE   READINESS GATES
operator-inventory-dfdd44d64-zxcfn                  1/1     Running   43 (7m56s ago)   5h13m   10.233.102.129   node1         <none>           <none>
operator-inventory-hardware-discovery-node1         1/1     Running   0                2m44s   10.233.102.159   node1         <none>           <none>
operator-inventory-hardware-discovery-node2         1/1     Running   0                2m43s   10.233.75.47     node2         <none>           <none>
operator-inventory-hardware-discovery-node4         1/1     Running   0                2m44s   10.233.74.67     node4         <none>           <none>
operator-inventory-hardware-discovery-xg-4090-001   1/1     Running   0                2m44s   10.233.71.193    xg-4090-001   <none>           <none>
operator-inventory-hardware-discovery-xg-4090-002   1/1     Running   0                2m44s   10.233.120.60    xg-4090-002   <none>           <none>
operator-inventory-hardware-discovery-xg-4090-003   1/1     Running   0                2m43s   10.233.100.127   xg-4090-003   <none>           <none>
operator-inventory-hardware-discovery-xg-4090-004   1/1     Running   0                2m44s   10.233.104.59    xg-4090-004   <none>           <none>
operator-inventory-hardware-discovery-xg-4090-005   1/1     Running   0                2m44s   10.233.70.51     xg-4090-005   <none>           <none>
operator-inventory-hardware-discovery-xg-4090-006   1/1     Running   0                2m44s   10.233.70.185    xg-4090-006   <none>           <none>
operator-inventory-hardware-discovery-xg-4090-007   1/1     Running   0                2m44s   10.233.109.233   xg-4090-007   <none>           <none>
operator-inventory-hardware-discovery-xg-4090-008   1/1     Running   0                2m44s   10.233.77.109    xg-4090-008   <none>           <none>
operator-inventory-hardware-discovery-xg-4090-009   1/1     Running   0                2m43s   10.233.100.242   xg-4090-009   <none>           <none>
operator-inventory-hardware-discovery-xg-4090-010   1/1     Running   0                2m44s   10.233.105.236   xg-4090-010   <none>           <none>
operator-inventory-hardware-discovery-xg3           1/1     Running   0                2m44s   10.233.64.148    xg3           <none>           <none>

Software versions

$ kubectl -n akash-services get pods -o custom-columns='NAME:.metadata.name,IMAGE:.spec.containers[*].image'
NAME                                                IMAGE
akash-node-1-0                                      ghcr.io/akash-network/node:0.34.1
akash-provider-0                                    ghcr.io/akash-network/provider:0.6.1
operator-hostname-7b98cb78db-xdc9r                  ghcr.io/akash-network/provider:0.6.1
operator-inventory-dfdd44d64-zxcfn                  ghcr.io/akash-network/provider:0.6.1
operator-inventory-hardware-discovery-node1         ghcr.io/akash-network/provider:0.6.1
operator-inventory-hardware-discovery-node2         ghcr.io/akash-network/provider:0.6.1
operator-inventory-hardware-discovery-node4         ghcr.io/akash-network/provider:0.6.1
operator-inventory-hardware-discovery-xg-4090-001   ghcr.io/akash-network/provider:0.6.1
operator-inventory-hardware-discovery-xg-4090-002   ghcr.io/akash-network/provider:0.6.1
operator-inventory-hardware-discovery-xg-4090-003   ghcr.io/akash-network/provider:0.6.1
operator-inventory-hardware-discovery-xg-4090-004   ghcr.io/akash-network/provider:0.6.1
operator-inventory-hardware-discovery-xg-4090-005   ghcr.io/akash-network/provider:0.6.1
operator-inventory-hardware-discovery-xg-4090-006   ghcr.io/akash-network/provider:0.6.1
operator-inventory-hardware-discovery-xg-4090-007   ghcr.io/akash-network/provider:0.6.1
operator-inventory-hardware-discovery-xg-4090-008   ghcr.io/akash-network/provider:0.6.1
operator-inventory-hardware-discovery-xg-4090-009   ghcr.io/akash-network/provider:0.6.1
operator-inventory-hardware-discovery-xg-4090-010   ghcr.io/akash-network/provider:0.6.1
operator-inventory-hardware-discovery-xg3           ghcr.io/akash-network/provider:0.6.1

kubectl events log

medc1-kubectl-events.log (attached)

K8s nodes

$ kubectl get nodes -o wide
NAME          STATUS   ROLES           AGE     VERSION   INTERNAL-IP     EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION       CONTAINER-RUNTIME
node1         Ready    control-plane   160d    v1.27.7   192.168.99.41   <none>        Ubuntu 22.04.4 LTS   5.15.0-105-generic   containerd://1.7.5
node2         Ready    control-plane   17d     v1.27.7   192.168.99.42   <none>        Ubuntu 22.04.4 LTS   5.15.0-105-generic   containerd://1.7.5
node4         Ready    <none>          17d     v1.27.7   192.168.99.44   <none>        Ubuntu 22.04.4 LTS   5.15.0-105-generic   containerd://1.7.5
xg-4090-001   Ready    <none>          8d      v1.27.7   192.168.99.71   <none>        Ubuntu 22.04.4 LTS   5.15.0-105-generic   containerd://1.7.5
xg-4090-002   Ready    <none>          6d22h   v1.27.7   192.168.99.72   <none>        Ubuntu 22.04.4 LTS   5.15.0-105-generic   containerd://1.7.5
xg-4090-003   Ready    <none>          8d      v1.27.7   192.168.99.73   <none>        Ubuntu 22.04.4 LTS   5.15.0-105-generic   containerd://1.7.5
xg-4090-004   Ready    <none>          25d     v1.27.7   192.168.99.74   <none>        Ubuntu 22.04.4 LTS   5.15.0-105-generic   containerd://1.7.5
xg-4090-005   Ready    <none>          24d     v1.27.7   192.168.99.75   <none>        Ubuntu 22.04.4 LTS   5.15.0-105-generic   containerd://1.7.5
xg-4090-006   Ready    <none>          24d     v1.27.7   192.168.99.76   <none>        Ubuntu 22.04.4 LTS   5.15.0-105-generic   containerd://1.7.5
xg-4090-007   Ready    <none>          24d     v1.27.7   192.168.99.77   <none>        Ubuntu 22.04.4 LTS   5.15.0-105-generic   containerd://1.7.5
xg-4090-008   Ready    <none>          23d     v1.27.7   192.168.99.78   <none>        Ubuntu 22.04.4 LTS   5.15.0-105-generic   containerd://1.7.5
xg-4090-009   Ready    <none>          21d     v1.27.7   192.168.99.79   <none>        Ubuntu 22.04.4 LTS   5.15.0-105-generic   containerd://1.7.5
xg-4090-010   Ready    <none>          21d     v1.27.7   192.168.99.80   <none>        Ubuntu 22.04.4 LTS   5.15.0-105-generic   containerd://1.7.5
xg3           Ready    <none>          28d     v1.27.7   192.168.99.53   <none>        Ubuntu 22.04.4 LTS   5.15.0-105-generic   containerd://1.7.5
andy108369 commented 1 month ago

Might be related to something being wrong on node1:

$ kubectl get events -A --sort-by='.lastTimestamp' | grep dial
ingress-nginx                                   22m         Warning   Unhealthy   pod/ingress-nginx-controller-6v2kb                      Readiness probe failed: Get "http://10.233.102.170:10254/healthz": dial tcp 10.233.102.170:10254: connect: invalid argument
kube-system                                     16m         Warning   Unhealthy   pod/coredns-5c469774b8-wzfkh                            Liveness probe failed: Get "http://10.233.102.151:8080/health": dial tcp 10.233.102.151:8080: connect: invalid argument
ingress-nginx                                   7m49s       Warning   Unhealthy   pod/ingress-nginx-controller-6v2kb                      Liveness probe failed: Get "http://10.233.102.170:10254/healthz": dial tcp 10.233.102.170:10254: connect: invalid argument
kube-system                                     57s         Warning   Unhealthy   pod/coredns-5c469774b8-wzfkh                            Readiness probe failed: Get "http://10.233.102.151:8181/ready": dial tcp 10.233.102.151:8181: connect: invalid argument

$ kubectl get pods -A -o wide |grep 10.233.102.170
ingress-nginx                                   ingress-nginx-controller-6v2kb                      0/1     CrashLoopBackOff   151 (11s ago)   40d     10.233.102.170   node1         <none>           <none>

$ kubectl get pods -A -o wide |grep 10.233.102.151
kube-system                                     coredns-5c469774b8-wzfkh                            1/1     Running            20 (175m ago)   12d     10.233.102.151   node1         <none>           <none>
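
Both failing probes target pods that landed on node1, which points at a node-local problem rather than an application fault ("connect: invalid argument" on a probe dial generally indicates a broken network path on the host itself). For reference, here is a hypothetical client-go sketch (not from the issue) doing the same correlation programmatically: it lists every pod bound to node1 via a field selector, so the probe targets can be checked in one pass. It assumes a reachable cluster via the default kubeconfig.

package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Load the default kubeconfig (~/.kube/config).
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	cs, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}
	// Equivalent of `kubectl get pods -A -o wide | grep node1`: the field
	// selector restricts the list to pods scheduled on node1.
	pods, err := cs.CoreV1().Pods("").List(context.Background(), metav1.ListOptions{
		FieldSelector: "spec.nodeName=node1",
	})
	if err != nil {
		panic(err)
	}
	for _, p := range pods.Items {
		fmt.Printf("%-15s %-50s %s\n", p.Namespace, p.Name, p.Status.PodIP)
	}
}
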

Will check with medc1.

andy108369 commented 1 month ago

They've corrected the issue on their end.

Will re-open if I see these errors again.