akash-network / support

Akash Support and Issue Tracking
Apache License 2.0
5 stars 4 forks source link

`ip-operator` (provider-services `0.2.1`) won't self-recover #105

Open andy108369 opened 1 year ago

andy108369 commented 1 year ago

I've deployed the akash-provider with the IP leasing enabled in the sandbox. IP leasing was working until I've restarted the provider (all-in-one config) and it stopped working.

To fix it I just had to bounce the IP-operator.

Most likely the IP-Operator did not wait long enough for the metallb services (metallb-controller & metallb-speaker) to initialize, hence it's been failing in loop throwing barrier is locked, can't service request messages.

The IP-Operator should figure how to self-recover.

Additional info / Reproducer

I remember to see this problem before, it is easy to reproduce it. Most likely you just need to stop the metallb services and then bounce the IP-Operator so it falls into barrier is locked, can't service request loop. You can start the metallb services back up again but that won't help, the IP-Operator will keep running in the same broken state until it gets bounced.

We also have documented this issue in the Troubleshooting IP Leases Issues documentation section here https://docs.akash.network/providers/build-a-cloud-provider/ip-leases-provider-enablement-optional-step/troubleshooting-ip-leases-issues

Versions

Symptoms

Logs

provider

I[2023-07-06|16:49:16.337] using in cluster kube config                 cmp=provider
I[2023-07-06|16:49:16.375] using in cluster kube config                 cmp=provider
D[2023-07-06|16:49:16.382] service being found via autodetection        cmp=provider service=ip-operator
D[2023-07-06|16:49:16.382] service being found via autodetection        cmp=provider service=hostname-operator
I[2023-07-06|16:49:16.383] dns discovery success                        cmp=provider cmp=service-discovery-agent addrs="[{Target:akash-hostname-operator.akash-services.svc.cluster.local. Port:8188 Priority:0 Weight:100}]" portName=status service-name=akash-hostname-operator namespace=akash-services
D[2023-07-06|16:49:16.383] satisfying pending requests                  cmp=provider cmp=service-discovery-agent qty=1
I[2023-07-06|16:49:16.384] check result                                 cmp=provider operator=hostname status=200
I[2023-07-06|16:49:16.384] ready                                        cmp=provider cmp=waiter waitable="<*operatorclients.hostnameOperatorClient 0xc0000cc580>"
I[2023-07-06|16:49:16.384] dns discovery success                        cmp=provider cmp=service-discovery-agent addrs="[{Target:akash-ip-operator.akash-services.svc.cluster.local. Port:8086 Priority:0 Weight:100}]" portName=api service-name=akash-ip-operator namespace=akash-services
D[2023-07-06|16:49:16.384] satisfying pending requests                  cmp=provider cmp=service-discovery-agent qty=1
I[2023-07-06|16:49:16.384] starting with existing reservations          module=provider-cluster cmp=provider cmp=service cmp=inventory-service qty=1
I[2023-07-06|16:49:16.385] check result                                 cmp=provider operator=ip status=503
E[2023-07-06|16:49:16.385] not yet ready                                cmp=provider cmp=waiter waitable="<*operatorclients.ipOperatorClient 0xc00028f400>" error="ip operator is not yet alive"
D[2023-07-06|16:49:16.386] found existing hostname                      module=provider-cluster cmp=provider cmp=service hostname=caujtgbgst9uj14p0o5fm6vc8k.ingress.sandbox-01.aksh.pw id=akash1h24fljt7p0nh82cq0za0uhsct3sfwsfu9w3c9h/598/1/1/akash133fr5pdkl9xk8gg79um3zc84eunymluvqsvfz8
I[2023-07-06|16:49:16.387] found orders                                 module=bidengine-service cmp=provider count=0
I[2023-07-06|16:49:16.388] fetched provider attributes                  module=bidengine-service cmp=provider provider=akash133fr5pdkl9xk8gg79um3zc84eunymluvqsvfz8
I[2023-07-06|16:49:18.386] check result                                 cmp=provider operator=ip status=503
E[2023-07-06|16:49:18.386] not yet ready                                cmp=provider cmp=waiter waitable="<*operatorclients.ipOperatorClient 0xc00028f400>" error="ip operator is not yet alive"
I[2023-07-06|16:49:20.388] check result                                 cmp=provider operator=ip status=503
E[2023-07-06|16:49:20.388] not yet ready                                cmp=provider cmp=waiter waitable="<*operatorclients.ipOperatorClient 0xc00028f400>" error="ip operator is not yet alive"
I[2023-07-06|16:49:22.389] check result                                 cmp=provider operator=ip status=503
E[2023-07-06|16:49:22.389] not yet ready                                cmp=provider cmp=waiter waitable="<*operatorclients.ipOperatorClient 0xc00028f400>" error="ip operator is not yet alive"

ip-operator

$ kubectl -n akash-services get pods
NAME                                        READY   STATUS    RESTARTS      AGE
akash-hostname-operator-86d4596d6c-gqmdj    1/1     Running   1 (58m ago)   144m
akash-inventory-operator-69464fbdff-5mlzm   1/1     Running   1 (58m ago)   141m
akash-ip-operator-5d858f8fc8-d7kbj          1/1     Running   1 (58m ago)   76m
akash-provider-0                            1/1     Running   0             2m56s

$ kubectl -n akash-services logs akash-ip-operator-5d858f8fc8-d7kbj
I[2023-07-06|16:47:07.188] using in cluster kube config                 operator=ip
D[2023-07-06|16:47:07.198] service being found via autodetection        operator=ip service=metal-lb
I[2023-07-06|16:47:07.198] using in cluster kube config                 operator=ip
I[2023-07-06|16:47:07.198] clients                                      operator=ip kube="kube client 0xc0015c9f20 ns=lease" metallb="metal LB client 0xc0015ec1e0"
I[2023-07-06|16:47:07.198] HTTP listening                               operator=ip address=0.0.0.0:8086
I[2023-07-06|16:47:07.199] ip operator start                            operator=ip
I[2023-07-06|16:47:07.199] associated provider                          operator=ip addr=akash133fr5pdkl9xk8gg79um3zc84eunymluvqsvfz8
I[2023-07-06|16:47:07.199] fetching existing IP passthroughs            operator=ip
E[2023-07-06|16:47:17.206] dns discovery failed                         operator=ip cmp=service-discovery-agent error="lookup _monitoring._TCP.controller.metallb-system.svc.cluster.local on 169.254.25.10:53: server misbehaving" portName=monitoring service-name=controller namespace=metallb-system
D[2023-07-06|16:47:17.206] satisfying pending requests                  operator=ip cmp=service-discovery-agent qty=1
E[2023-07-06|16:47:17.206] observation stopped                          operator=ip err="lookup _monitoring._TCP.controller.metallb-system.svc.cluster.local on 169.254.25.10:53: server misbehaving"
I[2023-07-06|16:47:17.206] associated provider                          operator=ip addr=akash133fr5pdkl9xk8gg79um3zc84eunymluvqsvfz8
I[2023-07-06|16:47:17.206] fetching existing IP passthroughs            operator=ip
E[2023-07-06|16:48:17.209] observation stopped                          operator=ip err="context deadline exceeded"
I[2023-07-06|16:48:17.209] associated provider                          operator=ip addr=akash133fr5pdkl9xk8gg79um3zc84eunymluvqsvfz8
I[2023-07-06|16:48:17.209] fetching existing IP passthroughs            operator=ip
E[2023-07-06|16:49:16.385] barrier is locked, can't service request     operator=ip path=/health
E[2023-07-06|16:49:17.212] observation stopped                          operator=ip err="context deadline exceeded"
I[2023-07-06|16:49:17.212] associated provider                          operator=ip addr=akash133fr5pdkl9xk8gg79um3zc84eunymluvqsvfz8
I[2023-07-06|16:49:17.212] fetching existing IP passthroughs            operator=ip
E[2023-07-06|16:49:18.386] barrier is locked, can't service request     operator=ip path=/health
E[2023-07-06|16:49:20.388] barrier is locked, can't service request     operator=ip path=/health
E[2023-07-06|16:49:22.389] barrier is locked, can't service request     operator=ip path=/health
E[2023-07-06|16:49:24.390] barrier is locked, can't service request     operator=ip path=/health
E[2023-07-06|16:49:26.391] barrier is locked, can't service request     operator=ip path=/health
E[2023-07-06|16:49:28.392] barrier is locked, can't service request     operator=ip path=/health
E[2023-07-06|16:49:30.393] barrier is locked, can't service request     operator=ip path=/health
E[2023-07-06|16:49:32.394] barrier is locked, can't service request     operator=ip path=/health
E[2023-07-06|16:49:34.395] barrier is locked, can't service request     operator=ip path=/health
E[2023-07-06|16:49:36.397] barrier is locked, can't service request     operator=ip path=/health
E[2023-07-06|16:49:38.398] barrier is locked, can't service request     operator=ip path=/health
E[2023-07-06|16:49:40.399] barrier is locked, can't service request     operator=ip path=/health
E[2023-07-06|16:49:42.400] barrier is locked, can't service request     operator=ip path=/health
E[2023-07-06|16:49:44.401] barrier is locked, can't service request     operator=ip path=/health
E[2023-07-06|16:49:46.403] barrier is locked, can't service request     operator=ip path=/health
E[2023-07-06|16:49:48.404] barrier is locked, can't service request     operator=ip path=/health
E[2023-07-06|16:49:50.405] barrier is locked, can't service request     operator=ip path=/health
E[2023-07-06|16:49:52.407] barrier is locked, can't service request     operator=ip path=/health
E[2023-07-06|16:49:54.408] barrier is locked, can't service request     operator=ip path=/health
E[2023-07-06|16:49:56.410] barrier is locked, can't service request     operator=ip path=/health
E[2023-07-06|16:49:58.411] barrier is locked, can't service request     operator=ip path=/health
E[2023-07-06|16:50:00.412] barrier is locked, can't service request     operator=ip path=/health
E[2023-07-06|16:50:02.413] barrier is locked, can't service request     operator=ip path=/health
E[2023-07-06|16:50:04.415] barrier is locked, can't service request     operator=ip path=/health
E[2023-07-06|16:50:06.416] barrier is locked, can't service request     operator=ip path=/health
E[2023-07-06|16:50:08.417] barrier is locked, can't service request     operator=ip path=/health
E[2023-07-06|16:50:10.419] barrier is locked, can't service request     operator=ip path=/health
E[2023-07-06|16:50:12.420] barrier is locked, can't service request     operator=ip path=/health
E[2023-07-06|16:50:14.422] barrier is locked, can't service request     operator=ip path=/health
E[2023-07-06|16:50:16.423] barrier is locked, can't service request     operator=ip path=/health
E[2023-07-06|16:50:17.215] observation stopped                          operator=ip err="context deadline exceeded"
I[2023-07-06|16:50:17.215] associated provider                          operator=ip addr=akash133fr5pdkl9xk8gg79um3zc84eunymluvqsvfz8
I[2023-07-06|16:50:17.215] fetching existing IP passthroughs            operator=ip
E[2023-07-06|16:50:18.425] barrier is locked, can't service request     operator=ip path=/health
E[2023-07-06|16:50:20.426] barrier is locked, can't service request     operator=ip path=/health
E[2023-07-06|16:50:22.428] barrier is locked, can't service request     operator=ip path=/health
E[2023-07-06|16:50:24.430] barrier is locked, can't service request     operator=ip path=/health
E[2023-07-06|16:50:26.431] barrier is locked, can't service request     operator=ip path=/health
E[2023-07-06|16:50:28.432] barrier is locked, can't service request     operator=ip path=/health
E[2023-07-06|16:50:30.433] barrier is locked, can't service request     operator=ip path=/health
E[2023-07-06|16:50:32.435] barrier is locked, can't service request     operator=ip path=/health
E[2023-07-06|16:50:34.436] barrier is locked, can't service request     operator=ip path=/health
E[2023-07-06|16:50:36.437] barrier is locked, can't service request     operator=ip path=/health
E[2023-07-06|16:50:38.438] barrier is locked, can't service request     operator=ip path=/health
E[2023-07-06|16:50:40.439] barrier is locked, can't service request     operator=ip path=/health
E[2023-07-06|16:50:42.441] barrier is locked, can't service request     operator=ip path=/health
E[2023-07-06|16:50:44.442] barrier is locked, can't service request     operator=ip path=/health
E[2023-07-06|16:50:46.443] barrier is locked, can't service request     operator=ip path=/health
E[2023-07-06|16:50:48.445] barrier is locked, can't service request     operator=ip path=/health
E[2023-07-06|16:50:50.446] barrier is locked, can't service request     operator=ip path=/health
E[2023-07-06|16:50:52.448] barrier is locked, can't service request     operator=ip path=/health
E[2023-07-06|16:50:54.449] barrier is locked, can't service request     operator=ip path=/health
E[2023-07-06|16:50:56.450] barrier is locked, can't service request     operator=ip path=/health

Workaround

Workaround is to kick the IP Operator (akash-provider does NOT have to be kicked, it will automatically recover once ip-operator recovers):

$ kubectl -n akash-services delete pod akash-ip-operator-5d858f8fc8-d7kbj
pod "akash-ip-operator-5d858f8fc8-d7kbj" deleted
$ kubectl -n akash-services logs akash-ip-operator-5d858f8fc8-n6llv
I[2023-07-06|17:45:45.263] using in cluster kube config                 operator=ip
D[2023-07-06|17:45:45.268] service being found via autodetection        operator=ip service=metal-lb
I[2023-07-06|17:45:45.268] using in cluster kube config                 operator=ip
I[2023-07-06|17:45:45.268] clients                                      operator=ip kube="kube client 0xc000ff5f20 ns=lease" metallb="metal LB client 0xc0010421e0"
I[2023-07-06|17:45:45.268] HTTP listening                               operator=ip address=0.0.0.0:8086
I[2023-07-06|17:45:45.268] ip operator start                            operator=ip
I[2023-07-06|17:45:45.269] associated provider                          operator=ip addr=akash133fr5pdkl9xk8gg79um3zc84eunymluvqsvfz8
I[2023-07-06|17:45:45.269] fetching existing IP passthroughs            operator=ip
I[2023-07-06|17:45:45.272] dns discovery success                        operator=ip cmp=service-discovery-agent addrs="[{Target:controller.metallb-system.svc.cluster.local. Port:7472 Priority:0 Weight:100}]" portName=monitoring service-name=controller namespace=metallb-system
D[2023-07-06|17:45:45.272] satisfying pending requests                  operator=ip cmp=service-discovery-agent qty=1
I[2023-07-06|17:45:45.274] ip address inventory                         operator=ip in-use=1 available=3
I[2023-07-06|17:45:45.274] starting observation                         operator=ip
I[2023-07-06|17:45:45.276] starting ip passthrough watch                operator=ip client=kube resourceVersion=22730
I[2023-07-06|17:45:45.277] barrier can now be passed                    operator=ip
D[2023-07-06|17:45:45.277] apply event                                  operator=ip event-type=add lease=akash1h24fljt7p0nh82cq0za0uhsct3sfwsfu9w3c9h/598/1/1/akash133fr5pdkl9xk8gg79um3zc84eunymluvqsvfz8
I[2023-07-06|17:45:45.277] connecting                                   operator=ip lease=akash1h24fljt7p0nh82cq0za0uhsct3sfwsfu9w3c9h/598/1/1/akash133fr5pdkl9xk8gg79um3zc84eunymluvqsvfz8 service=supermario externalPort=80
D[2023-07-06|17:45:45.277] ip passthrough has changed, applying         operator=ip lease=akash1h24fljt7p0nh82cq0za0uhsct3sfwsfu9w3c9h/598/1/1/akash133fr5pdkl9xk8gg79um3zc84eunymluvqsvfz8
D[2023-07-06|17:45:45.277] Updating ip passthrough                      operator=ip lease=akash1h24fljt7p0nh82cq0za0uhsct3sfwsfu9w3c9h/598/1/1/akash133fr5pdkl9xk8gg79um3zc84eunymluvqsvfz8
D[2023-07-06|17:45:45.278] metal LB config change event                 operator=ip client=metallb event-type=ADDED
D[2023-07-06|17:45:45.279] creating metal-lb service                    operator=ip client=metallb service=supermario port=8080 external-port=80 sharing-key=akash1h24fljt7p0nh82cq0za0uhsct3sfwsfu9w3c9h-ip-superippersist exists=true
I[2023-07-06|17:45:45.282] update complete                              operator=ip lease=akash1h24fljt7p0nh82cq0za0uhsct3sfwsfu9w3c9h/598/1/1/akash133fr5pdkl9xk8gg79um3zc84eunymluvqsvfz8
I[2023-07-06|17:45:45.284] ip address inventory                         operator=ip in-use=1 available=3
I[2023-07-06|17:45:45.286] ip address inventory                         operator=ip in-use=1 available=3
D[2023-07-06|17:45:47.486] apply event                                  operator=ip event-type=delete lease=akash1h24fljt7p0nh82cq0za0uhsct3sfwsfu9w3c9h/598/1/1/akash133fr5pdkl9xk8gg79um3zc84eunymluvqsvfz8
I[2023-07-06|17:45:47.506] ip address inventory                         operator=ip in-use=0 available=3
$ curl -sk https://provider.sandbox-01.aksh.pw:8443/status | jq -r
{
  "cluster": {
    "leases": 0,
    "inventory": {
      "available": {
        "nodes": [
          {
            "cpu": 17650,
            "memory": 22853373952,
            "storage_ephemeral": 119292733243
          }
        ],
        "storage": [
          {
            "class": "beta3",
            "size": 130054156288
          }
        ]
      }
    }
  },
  "bidengine": {
    "orders": 0
  },
  "manifest": {
    "deployments": 0
  },
  "cluster_public_hostname": "provider.sandbox-01.aksh.pw",
  "address": "akash133fr5pdkl9xk8gg79um3zc84eunymluvqsvfz8"
}

Additional ideas

Until it gets fixed in the ip-operator, we can try to leverage the livenessProbe/readinessProbe so the pod restarts when it sees the ip-operator hasn't been ready/functioning for longer than 10 minutes or so.

andy108369 commented 5 months ago

just for the record: this is still an issue with provider 0.5.13

planning on making a workaround until it gets solved in the go code https://github.com/akash-network/helm-charts/issues/276