I've deployed the akash-provider with IP leasing enabled in the sandbox.
IP leasing was working until I restarted the provider (all-in-one config), at which point it stopped working.
To fix it, I just had to bounce the ip-operator.
Most likely the ip-operator did not wait long enough for the metallb services (metallb-controller & metallb-speaker) to initialize, hence it kept failing in a loop, throwing barrier is locked, can't service request messages.
The ip-operator should figure out how to self-recover.
Additional info / Reproducer
I remember seeing this problem before; it is easy to reproduce. Most likely you just need to stop the metallb services and then bounce the ip-operator so it falls into the barrier is locked, can't service request loop. Starting the metallb services back up again won't help: the ip-operator will keep running in the same broken state until it gets bounced.
We have also documented this issue in the Troubleshooting IP Leases Issues documentation section here: https://docs.akash.network/providers/build-a-cloud-provider/ip-leases-provider-enablement-optional-step/troubleshooting-ip-leases-issues
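A minimal reproducer sketch of the steps above, assuming metallb is installed in the metallb-system namespace (as the DNS lookups in the ip-operator logs below indicate) with a deployment named metallb-controller; the exact resource names depend on how metallb was installed:

```shell
# 1. Take metallb down so its controller service becomes unresolvable
#    (the deployment name metallb-controller is an assumption; adjust to your install)
kubectl -n metallb-system scale deployment metallb-controller --replicas=0

# 2. Bounce the ip-operator; on startup its discovery fails and the barrier stays locked
kubectl -n akash-services rollout restart deployment akash-ip-operator

# 3. Bring metallb back up; the ip-operator nevertheless stays stuck in the
#    "barrier is locked, can't service request" loop until bounced again
kubectl -n metallb-system scale deployment metallb-controller --replicas=1
```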
Versions
ghcr.io/akash-network/provider:0.2.1
0.22.3
Symptoms
provider won't reply to curl -sk https://provider.sandbox-01.aksh.pw:8443/status (curl just hangs, though probing the port with nc -vz provider 8443 succeeds).
Logs
provider
I[2023-07-06|16:49:16.337] using in cluster kube config cmp=provider
I[2023-07-06|16:49:16.375] using in cluster kube config cmp=provider
D[2023-07-06|16:49:16.382] service being found via autodetection cmp=provider service=ip-operator
D[2023-07-06|16:49:16.382] service being found via autodetection cmp=provider service=hostname-operator
I[2023-07-06|16:49:16.383] dns discovery success cmp=provider cmp=service-discovery-agent addrs="[{Target:akash-hostname-operator.akash-services.svc.cluster.local. Port:8188 Priority:0 Weight:100}]" portName=status service-name=akash-hostname-operator namespace=akash-services
D[2023-07-06|16:49:16.383] satisfying pending requests cmp=provider cmp=service-discovery-agent qty=1
I[2023-07-06|16:49:16.384] check result cmp=provider operator=hostname status=200
I[2023-07-06|16:49:16.384] ready cmp=provider cmp=waiter waitable="<*operatorclients.hostnameOperatorClient 0xc0000cc580>"
I[2023-07-06|16:49:16.384] dns discovery success cmp=provider cmp=service-discovery-agent addrs="[{Target:akash-ip-operator.akash-services.svc.cluster.local. Port:8086 Priority:0 Weight:100}]" portName=api service-name=akash-ip-operator namespace=akash-services
D[2023-07-06|16:49:16.384] satisfying pending requests cmp=provider cmp=service-discovery-agent qty=1
I[2023-07-06|16:49:16.384] starting with existing reservations module=provider-cluster cmp=provider cmp=service cmp=inventory-service qty=1
I[2023-07-06|16:49:16.385] check result cmp=provider operator=ip status=503
E[2023-07-06|16:49:16.385] not yet ready cmp=provider cmp=waiter waitable="<*operatorclients.ipOperatorClient 0xc00028f400>" error="ip operator is not yet alive"
D[2023-07-06|16:49:16.386] found existing hostname module=provider-cluster cmp=provider cmp=service hostname=caujtgbgst9uj14p0o5fm6vc8k.ingress.sandbox-01.aksh.pw id=akash1h24fljt7p0nh82cq0za0uhsct3sfwsfu9w3c9h/598/1/1/akash133fr5pdkl9xk8gg79um3zc84eunymluvqsvfz8
I[2023-07-06|16:49:16.387] found orders module=bidengine-service cmp=provider count=0
I[2023-07-06|16:49:16.388] fetched provider attributes module=bidengine-service cmp=provider provider=akash133fr5pdkl9xk8gg79um3zc84eunymluvqsvfz8
I[2023-07-06|16:49:18.386] check result cmp=provider operator=ip status=503
E[2023-07-06|16:49:18.386] not yet ready cmp=provider cmp=waiter waitable="<*operatorclients.ipOperatorClient 0xc00028f400>" error="ip operator is not yet alive"
I[2023-07-06|16:49:20.388] check result cmp=provider operator=ip status=503
E[2023-07-06|16:49:20.388] not yet ready cmp=provider cmp=waiter waitable="<*operatorclients.ipOperatorClient 0xc00028f400>" error="ip operator is not yet alive"
I[2023-07-06|16:49:22.389] check result cmp=provider operator=ip status=503
E[2023-07-06|16:49:22.389] not yet ready cmp=provider cmp=waiter waitable="<*operatorclients.ipOperatorClient 0xc00028f400>" error="ip operator is not yet alive"
ip-operator
$ kubectl -n akash-services get pods
NAME                                        READY   STATUS    RESTARTS      AGE
akash-hostname-operator-86d4596d6c-gqmdj    1/1     Running   1 (58m ago)   144m
akash-inventory-operator-69464fbdff-5mlzm   1/1     Running   1 (58m ago)   141m
akash-ip-operator-5d858f8fc8-d7kbj          1/1     Running   1 (58m ago)   76m
akash-provider-0                            1/1     Running   0             2m56s
$ kubectl -n akash-services logs akash-ip-operator-5d858f8fc8-d7kbj
I[2023-07-06|16:47:07.188] using in cluster kube config operator=ip
D[2023-07-06|16:47:07.198] service being found via autodetection operator=ip service=metal-lb
I[2023-07-06|16:47:07.198] using in cluster kube config operator=ip
I[2023-07-06|16:47:07.198] clients operator=ip kube="kube client 0xc0015c9f20 ns=lease" metallb="metal LB client 0xc0015ec1e0"
I[2023-07-06|16:47:07.198] HTTP listening operator=ip address=0.0.0.0:8086
I[2023-07-06|16:47:07.199] ip operator start operator=ip
I[2023-07-06|16:47:07.199] associated provider operator=ip addr=akash133fr5pdkl9xk8gg79um3zc84eunymluvqsvfz8
I[2023-07-06|16:47:07.199] fetching existing IP passthroughs operator=ip
E[2023-07-06|16:47:17.206] dns discovery failed operator=ip cmp=service-discovery-agent error="lookup _monitoring._TCP.controller.metallb-system.svc.cluster.local on 169.254.25.10:53: server misbehaving" portName=monitoring service-name=controller namespace=metallb-system
D[2023-07-06|16:47:17.206] satisfying pending requests operator=ip cmp=service-discovery-agent qty=1
E[2023-07-06|16:47:17.206] observation stopped operator=ip err="lookup _monitoring._TCP.controller.metallb-system.svc.cluster.local on 169.254.25.10:53: server misbehaving"
I[2023-07-06|16:47:17.206] associated provider operator=ip addr=akash133fr5pdkl9xk8gg79um3zc84eunymluvqsvfz8
I[2023-07-06|16:47:17.206] fetching existing IP passthroughs operator=ip
E[2023-07-06|16:48:17.209] observation stopped operator=ip err="context deadline exceeded"
I[2023-07-06|16:48:17.209] associated provider operator=ip addr=akash133fr5pdkl9xk8gg79um3zc84eunymluvqsvfz8
I[2023-07-06|16:48:17.209] fetching existing IP passthroughs operator=ip
E[2023-07-06|16:49:16.385] barrier is locked, can't service request operator=ip path=/health
E[2023-07-06|16:49:17.212] observation stopped operator=ip err="context deadline exceeded"
I[2023-07-06|16:49:17.212] associated provider operator=ip addr=akash133fr5pdkl9xk8gg79um3zc84eunymluvqsvfz8
I[2023-07-06|16:49:17.212] fetching existing IP passthroughs operator=ip
E[2023-07-06|16:49:18.386] barrier is locked, can't service request operator=ip path=/health
E[2023-07-06|16:49:20.388] barrier is locked, can't service request operator=ip path=/health
E[2023-07-06|16:49:22.389] barrier is locked, can't service request operator=ip path=/health
E[2023-07-06|16:49:24.390] barrier is locked, can't service request operator=ip path=/health
E[2023-07-06|16:49:26.391] barrier is locked, can't service request operator=ip path=/health
E[2023-07-06|16:49:28.392] barrier is locked, can't service request operator=ip path=/health
E[2023-07-06|16:49:30.393] barrier is locked, can't service request operator=ip path=/health
E[2023-07-06|16:49:32.394] barrier is locked, can't service request operator=ip path=/health
E[2023-07-06|16:49:34.395] barrier is locked, can't service request operator=ip path=/health
E[2023-07-06|16:49:36.397] barrier is locked, can't service request operator=ip path=/health
E[2023-07-06|16:49:38.398] barrier is locked, can't service request operator=ip path=/health
E[2023-07-06|16:49:40.399] barrier is locked, can't service request operator=ip path=/health
E[2023-07-06|16:49:42.400] barrier is locked, can't service request operator=ip path=/health
E[2023-07-06|16:49:44.401] barrier is locked, can't service request operator=ip path=/health
E[2023-07-06|16:49:46.403] barrier is locked, can't service request operator=ip path=/health
E[2023-07-06|16:49:48.404] barrier is locked, can't service request operator=ip path=/health
E[2023-07-06|16:49:50.405] barrier is locked, can't service request operator=ip path=/health
E[2023-07-06|16:49:52.407] barrier is locked, can't service request operator=ip path=/health
E[2023-07-06|16:49:54.408] barrier is locked, can't service request operator=ip path=/health
E[2023-07-06|16:49:56.410] barrier is locked, can't service request operator=ip path=/health
E[2023-07-06|16:49:58.411] barrier is locked, can't service request operator=ip path=/health
E[2023-07-06|16:50:00.412] barrier is locked, can't service request operator=ip path=/health
E[2023-07-06|16:50:02.413] barrier is locked, can't service request operator=ip path=/health
E[2023-07-06|16:50:04.415] barrier is locked, can't service request operator=ip path=/health
E[2023-07-06|16:50:06.416] barrier is locked, can't service request operator=ip path=/health
E[2023-07-06|16:50:08.417] barrier is locked, can't service request operator=ip path=/health
E[2023-07-06|16:50:10.419] barrier is locked, can't service request operator=ip path=/health
E[2023-07-06|16:50:12.420] barrier is locked, can't service request operator=ip path=/health
E[2023-07-06|16:50:14.422] barrier is locked, can't service request operator=ip path=/health
E[2023-07-06|16:50:16.423] barrier is locked, can't service request operator=ip path=/health
E[2023-07-06|16:50:17.215] observation stopped operator=ip err="context deadline exceeded"
I[2023-07-06|16:50:17.215] associated provider operator=ip addr=akash133fr5pdkl9xk8gg79um3zc84eunymluvqsvfz8
I[2023-07-06|16:50:17.215] fetching existing IP passthroughs operator=ip
E[2023-07-06|16:50:18.425] barrier is locked, can't service request operator=ip path=/health
E[2023-07-06|16:50:20.426] barrier is locked, can't service request operator=ip path=/health
E[2023-07-06|16:50:22.428] barrier is locked, can't service request operator=ip path=/health
E[2023-07-06|16:50:24.430] barrier is locked, can't service request operator=ip path=/health
E[2023-07-06|16:50:26.431] barrier is locked, can't service request operator=ip path=/health
E[2023-07-06|16:50:28.432] barrier is locked, can't service request operator=ip path=/health
E[2023-07-06|16:50:30.433] barrier is locked, can't service request operator=ip path=/health
E[2023-07-06|16:50:32.435] barrier is locked, can't service request operator=ip path=/health
E[2023-07-06|16:50:34.436] barrier is locked, can't service request operator=ip path=/health
E[2023-07-06|16:50:36.437] barrier is locked, can't service request operator=ip path=/health
E[2023-07-06|16:50:38.438] barrier is locked, can't service request operator=ip path=/health
E[2023-07-06|16:50:40.439] barrier is locked, can't service request operator=ip path=/health
E[2023-07-06|16:50:42.441] barrier is locked, can't service request operator=ip path=/health
E[2023-07-06|16:50:44.442] barrier is locked, can't service request operator=ip path=/health
E[2023-07-06|16:50:46.443] barrier is locked, can't service request operator=ip path=/health
E[2023-07-06|16:50:48.445] barrier is locked, can't service request operator=ip path=/health
E[2023-07-06|16:50:50.446] barrier is locked, can't service request operator=ip path=/health
E[2023-07-06|16:50:52.448] barrier is locked, can't service request operator=ip path=/health
E[2023-07-06|16:50:54.449] barrier is locked, can't service request operator=ip path=/health
E[2023-07-06|16:50:56.450] barrier is locked, can't service request operator=ip path=/health
Workaround
The workaround is to bounce the ip-operator (akash-provider does NOT have to be restarted; it automatically recovers once the ip-operator recovers):
$ kubectl -n akash-services delete pod akash-ip-operator-5d858f8fc8-d7kbj
pod "akash-ip-operator-5d858f8fc8-d7kbj" deleted
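If you'd rather not look up the pod name, restarting the deployment works too (this assumes the ip-operator runs as a Deployment named akash-ip-operator, which the pod name above suggests):

```shell
# Bounce the ip-operator without knowing the current pod name
kubectl -n akash-services rollout restart deployment akash-ip-operator
```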
$ kubectl -n akash-services logs akash-ip-operator-5d858f8fc8-n6llv
I[2023-07-06|17:45:45.263] using in cluster kube config operator=ip
D[2023-07-06|17:45:45.268] service being found via autodetection operator=ip service=metal-lb
I[2023-07-06|17:45:45.268] using in cluster kube config operator=ip
I[2023-07-06|17:45:45.268] clients operator=ip kube="kube client 0xc000ff5f20 ns=lease" metallb="metal LB client 0xc0010421e0"
I[2023-07-06|17:45:45.268] HTTP listening operator=ip address=0.0.0.0:8086
I[2023-07-06|17:45:45.268] ip operator start operator=ip
I[2023-07-06|17:45:45.269] associated provider operator=ip addr=akash133fr5pdkl9xk8gg79um3zc84eunymluvqsvfz8
I[2023-07-06|17:45:45.269] fetching existing IP passthroughs operator=ip
I[2023-07-06|17:45:45.272] dns discovery success operator=ip cmp=service-discovery-agent addrs="[{Target:controller.metallb-system.svc.cluster.local. Port:7472 Priority:0 Weight:100}]" portName=monitoring service-name=controller namespace=metallb-system
D[2023-07-06|17:45:45.272] satisfying pending requests operator=ip cmp=service-discovery-agent qty=1
I[2023-07-06|17:45:45.274] ip address inventory operator=ip in-use=1 available=3
I[2023-07-06|17:45:45.274] starting observation operator=ip
I[2023-07-06|17:45:45.276] starting ip passthrough watch operator=ip client=kube resourceVersion=22730
I[2023-07-06|17:45:45.277] barrier can now be passed operator=ip
D[2023-07-06|17:45:45.277] apply event operator=ip event-type=add lease=akash1h24fljt7p0nh82cq0za0uhsct3sfwsfu9w3c9h/598/1/1/akash133fr5pdkl9xk8gg79um3zc84eunymluvqsvfz8
I[2023-07-06|17:45:45.277] connecting operator=ip lease=akash1h24fljt7p0nh82cq0za0uhsct3sfwsfu9w3c9h/598/1/1/akash133fr5pdkl9xk8gg79um3zc84eunymluvqsvfz8 service=supermario externalPort=80
D[2023-07-06|17:45:45.277] ip passthrough has changed, applying operator=ip lease=akash1h24fljt7p0nh82cq0za0uhsct3sfwsfu9w3c9h/598/1/1/akash133fr5pdkl9xk8gg79um3zc84eunymluvqsvfz8
D[2023-07-06|17:45:45.277] Updating ip passthrough operator=ip lease=akash1h24fljt7p0nh82cq0za0uhsct3sfwsfu9w3c9h/598/1/1/akash133fr5pdkl9xk8gg79um3zc84eunymluvqsvfz8
D[2023-07-06|17:45:45.278] metal LB config change event operator=ip client=metallb event-type=ADDED
D[2023-07-06|17:45:45.279] creating metal-lb service operator=ip client=metallb service=supermario port=8080 external-port=80 sharing-key=akash1h24fljt7p0nh82cq0za0uhsct3sfwsfu9w3c9h-ip-superippersist exists=true
I[2023-07-06|17:45:45.282] update complete operator=ip lease=akash1h24fljt7p0nh82cq0za0uhsct3sfwsfu9w3c9h/598/1/1/akash133fr5pdkl9xk8gg79um3zc84eunymluvqsvfz8
I[2023-07-06|17:45:45.284] ip address inventory operator=ip in-use=1 available=3
I[2023-07-06|17:45:45.286] ip address inventory operator=ip in-use=1 available=3
D[2023-07-06|17:45:47.486] apply event operator=ip event-type=delete lease=akash1h24fljt7p0nh82cq0za0uhsct3sfwsfu9w3c9h/598/1/1/akash133fr5pdkl9xk8gg79um3zc84eunymluvqsvfz8
I[2023-07-06|17:45:47.506] ip address inventory operator=ip in-use=0 available=3
Additional ideas
Until this gets fixed in the ip-operator, we can leverage the livenessProbe/readinessProbe so the pod restarts when the ip-operator hasn't been ready/functioning for longer than 10 minutes or so.
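As a sketch of that idea: the operator already serves /health on port 8086 and answers 503 while the barrier is locked (see the logs above), so a livenessProbe pointed at it would make kubelet restart the pod on its own. The deployment name and container index here are assumptions; tune periodSeconds × failureThreshold to taste (30 s × 20 ≈ 10 minutes):

```shell
# Hypothetical patch: add a livenessProbe to the first container of the
# akash-ip-operator deployment. /health returns 503 while the barrier is
# locked, so 20 consecutive failures (~10 min) trigger a pod restart.
kubectl -n akash-services patch deployment akash-ip-operator --type=json -p='[
  {"op": "add",
   "path": "/spec/template/spec/containers/0/livenessProbe",
   "value": {"httpGet": {"path": "/health", "port": 8086},
             "periodSeconds": 30,
             "failureThreshold": 20}}
]'
```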