andy108369 opened this issue 9 months ago
"Fixed" is in quotes since that's only a workaround: `akash-provider` should fail to deploy an app that can't get its leased IPs, and it should also not get stuck because of that condition (i.e., it should self-recover).
After investigating the Europlots provider, I found a deployment (created ~20 hours ago) with 3 leased IPs that had never gotten the IPs assigned. This was breaking the `inventory-service` in `akash-provider`. To resolve it, I closed that deployment's bid and then had to bounce the `akash-provider` pod for it to start bidding again.

Logs: https://transfer.sh/xmB3HiBXTG/provider-is-bidding-again-dseq-13002853.log
- `inventory-service` breaks when the deployment's service can't get an IP assigned (IP leasing)
- `akash-provider` doesn't self-recover after that condition is fixed (i.e., after the offending deployment is closed)

```
root@control1:~# kubectl get events -A --sort-by='.metadata.creationTimestamp' | grep -Ev 'scan-vuln|akash-provider|operator'
NAMESPACE                                       LAST SEEN   TYPE      REASON             OBJECT                   MESSAGE
fseibv1hr266elp9crb06v2s2nb4s96qq387nhv56fdgq   42m         Warning   AllocationFailed   service/app-ip-80-tcp    Failed to allocate IP for "fseibv1hr266elp9crb06v2s2nb4s96qq387nhv56fdgq/app-ip-80-tcp": no available IPs
fseibv1hr266elp9crb06v2s2nb4s96qq387nhv56fdgq   42m         Warning   AllocationFailed   service/app-ip-443-tcp   Failed to allocate IP for "fseibv1hr266elp9crb06v2s2nb4s96qq387nhv56fdgq/app-ip-443-tcp": no available IPs
fseibv1hr266elp9crb06v2s2nb4s96qq387nhv56fdgq   42m         Warning   AllocationFailed   service/app-ip-22-tcp    Failed to allocate IP for "fseibv1hr266elp9crb06v2s2nb4s96qq387nhv56fdgq/app-ip-22-tcp": no available IPs

root@control1:~# kubectl -n lease get manifest fseibv1hr266elp9crb06v2s2nb4s96qq387nhv56fdgq
NAME                                            AGE
fseibv1hr266elp9crb06v2s2nb4s96qq387nhv56fdgq   20h

root@control1:~# kubectl get manifest -A --sort-by='.metadata.creationTimestamp' | tail -2
lease   ne6ubpns0mj0kb6l268fotikc272h8eibilpknbehsgrm   22h
lease   fseibv1hr266elp9crb06v2s2nb4s96qq387nhv56fdgq   20h   <<<<<<<<<<<< THIS one

root@control1:~# kubectl -n lease get manifest fseibv1hr266elp9crb06v2s2nb4s96qq387nhv56fdgq -o yaml | grep -i -A5 ip
    ip: haproxy_ip_server-3
    port: 22
    proto: TCP
  - endpoint_sequence_number: 1
    external_port: 80
    global: true
--
    ip: haproxy_ip_server-3
    port: 80
    proto: TCP
  - endpoint_sequence_number: 1
    external_port: 443
    global: true
--
    ip: haproxy_ip_server-3
    port: 443
    proto: TCP
  image: ubuntu:latest
  name: app
  resources:
root@control1:~#

root@control1:~# kubectl -n fseibv1hr266elp9crb06v2s2nb4s96qq387nhv56fdgq get all
NAME                       READY   STATUS    RESTARTS   AGE
pod/app-794bbc75dc-gq54z   1/1     Running   0          20h

NAME                     TYPE           CLUSTER-IP      EXTERNAL-IP   PORT(S)                      AGE
service/app              ClusterIP      10.233.19.168   <none>        80/TCP                       20h
service/app-ip-22-tcp    LoadBalancer   10.233.27.253   <pending>     22:31562/TCP                 20h
service/app-ip-443-tcp   LoadBalancer   10.233.27.174   <pending>     443:31410/TCP                20h
service/app-ip-80-tcp    LoadBalancer   10.233.40.120   <pending>     80:32591/TCP                 20h
service/app-np           NodePort       10.233.43.87    <none>        22:30649/TCP,443:30161/TCP   20h

NAME                  READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/app   1/1     1            1           20h

NAME                             DESIRED   CURRENT   READY   AGE
replicaset.apps/app-794bbc75dc   1         1         1       20h
root@control1:~#

root@control1:~# kubectl -n metallb-system logs metallb-controller-7f6b8b7fdd-v95lp --tail=100 | grep error
{"caller":"service.go:140","error":"no available IPs","level":"error","msg":"IP allocation failed","op":"allocateIPs","ts":"2023-09-28T08:19:36Z"}
{"caller":"service.go:140","error":"no available IPs","level":"error","msg":"IP allocation failed","op":"allocateIPs","ts":"2023-09-28T08:19:36Z"}
{"caller":"service.go:140","error":"no available IPs","level":"error","msg":"IP allocation failed","op":"allocateIPs","ts":"2023-09-28T09:18:23Z"}
{"caller":"service.go:140","error":"no available IPs","level":"error","msg":"IP allocation failed","op":"allocateIPs","ts":"2023-09-28T09:18:23Z"}
{"caller":"service.go:140","error":"no available IPs","level":"error","msg":"IP allocation failed","op":"allocateIPs","ts":"2023-09-28T09:18:23Z"}
root@control1:~#
```
FWIW, the live `IPAddressPool` reports the `194.28.98.219-194.28.98.220` range, while the provider's `metallb-config.yaml` has `194.28.98.218-194.28.98.220`. `194.28.98.219` has already been allocated to another app:
```
root@control1:~# kubectl get svc -A -l akash.network | grep LoadBalancer | column -t
9a2cmcvit3r2bm1dn6vna8dm82dr887fm69atjdfcnp4c  app-ip-3333-tcp  LoadBalancer  10.233.18.185  194.28.98.219  3333:31042/TCP  24h
9a2cmcvit3r2bm1dn6vna8dm82dr887fm69atjdfcnp4c  app-ip-8585-tcp  LoadBalancer  10.233.58.174  194.28.98.219  8585:31162/TCP  24h
fm8d7v04vkef5iqu2mpfbt07dm1ilp5930rnj2umg08k8  app-ip-1414-tcp  LoadBalancer  10.233.58.8    194.28.98.220  1414:31711/TCP  3d23h
fm8d7v04vkef5iqu2mpfbt07dm1ilp5930rnj2umg08k8  app-ip-1515-tcp  LoadBalancer  10.233.44.173  194.28.98.220  1515:32491/TCP  3d23h
fseibv1hr266elp9crb06v2s2nb4s96qq387nhv56fdgq  app-ip-22-tcp    LoadBalancer  10.233.27.253  <pending>      22:31562/TCP    20h
fseibv1hr266elp9crb06v2s2nb4s96qq387nhv56fdgq  app-ip-443-tcp   LoadBalancer  10.233.27.174  <pending>      443:31410/TCP   20h
fseibv1hr266elp9crb06v2s2nb4s96qq387nhv56fdgq  app-ip-80-tcp    LoadBalancer  10.233.40.120  <pending>      80:32591/TCP    20h
root@control1:~#
```
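The stuck services can be spotted mechanically from the output above. A minimal sketch (a hypothetical helper, not part of `akash-provider`) that flags `LoadBalancer` services stuck in `<pending>`:

```python
# Hypothetical helper (not part of akash-provider): flag LoadBalancer services
# that never received an external IP, given `kubectl get svc` output lines.
def pending_load_balancers(lines):
    pending = []
    for line in lines:
        fields = line.split()
        # Expected columns: NAMESPACE NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
        if len(fields) >= 5 and fields[2] == "LoadBalancer" and fields[4] == "<pending>":
            pending.append((fields[0], fields[1]))
    return pending

sample = [
    "9a2cmcvit3r2bm1dn6vna8dm82dr887fm69atjdfcnp4c app-ip-3333-tcp LoadBalancer 10.233.18.185 194.28.98.219 3333:31042/TCP 24h",
    "fseibv1hr266elp9crb06v2s2nb4s96qq387nhv56fdgq app-ip-22-tcp LoadBalancer 10.233.27.253 <pending> 22:31562/TCP 20h",
]
print(pending_load_balancers(sample))
# → [('fseibv1hr266elp9crb06v2s2nb4s96qq387nhv56fdgq', 'app-ip-22-tcp')]
```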
```
root@control1:~# kubectl -n metallb-system get pods
NAME                                  READY   STATUS    RESTARTS       AGE
metallb-controller-7f6b8b7fdd-v95lp   1/1     Running   0              28d
metallb-speaker-25s86                 1/1     Running   1 (140d ago)   184d
metallb-speaker-dkhb4                 1/1     Running   8 (25d ago)    184d
metallb-speaker-r4lrx                 1/1     Running   4 (25d ago)    184d
metallb-speaker-tm5d5                 1/1     Running   15 (25d ago)   184d
metallb-speaker-v5vh9                 1/1     Running   2 (70d ago)    184d

root@control1:~# kubectl -n metallb-system get svc
NAME                      TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)    AGE
controller                ClusterIP   10.233.14.11    <none>        7472/TCP   184d
metallb-webhook-service   ClusterIP   10.233.49.167   <none>        443/TCP    184d

root@control1:~# curl -s 10.233.14.11:7472/metrics | grep ^metal
metallb_allocator_addresses_in_use_total{pool="default"} 2
metallb_allocator_addresses_total{pool="default"} 2
metallb_k8s_client_config_loaded_bool 1
metallb_k8s_client_config_stale_bool 0
metallb_k8s_client_update_errors_total 3
metallb_k8s_client_updates_total 14520
root@control1:~#

root@control1:~# kubectl -n metallb-system get IPAddressPool default -o json | jq -r '.spec'
{
  "addresses": [
    "194.28.98.219-194.28.98.220"
  ],
  "autoAssign": true,
  "avoidBuggyIPs": false
}
```
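The metrics tell the whole story: the pool holds 2 addresses and both are in use, while the stuck deployment leased 3 more IPs. A quick sanity check of the arithmetic (the range parsing mirrors MetalLB's inclusive `start-end` notation; the counts come from the metrics above):

```python
import ipaddress

def pool_size(addr_range: str) -> int:
    """Count of IPs in an inclusive MetalLB 'start-end' address range."""
    start, end = addr_range.split("-")
    return int(ipaddress.ip_address(end)) - int(ipaddress.ip_address(start)) + 1

total = pool_size("194.28.98.219-194.28.98.220")   # live IPAddressPool
in_use = 2    # metallb_allocator_addresses_in_use_total{pool="default"}
requested = 3 # IPs leased by the stuck deployment
print(f"total={total} in_use={in_use} free={total - in_use} requested={requested}")
# → total=2 in_use=2 free=0 requested=3
```

With zero free addresses, every one of the three allocations fails with "no available IPs", exactly as the controller logs show.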
…based on the provider's `metallb-config.yaml`:

```
root@control1:~# cat provider/metallb-config.yaml
---
apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
  name: default
  namespace: metallb-system
spec:
  addresses:
  - 194.28.98.218-194.28.98.220
  autoAssign: true
  avoidBuggyIPs: false
---
apiVersion: metallb.io/v1beta1
kind: L2Advertisement
metadata:
  creationTimestamp: null
  name: l2advertisement1
  namespace: metallb-system
spec:
  ipAddressPools:
  - default
```
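The live pool and the config file disagree by exactly one address. A small sketch (using the standard `ipaddress` module) that expands both ranges and shows what is missing from the applied pool:

```python
import ipaddress

def expand(addr_range: str) -> set:
    """Expand an inclusive 'start-end' range into a set of IP strings."""
    start, end = (int(ipaddress.ip_address(a)) for a in addr_range.split("-"))
    return {str(ipaddress.ip_address(i)) for i in range(start, end + 1)}

configured = expand("194.28.98.218-194.28.98.220")  # provider/metallb-config.yaml
live = expand("194.28.98.219-194.28.98.220")        # applied IPAddressPool
print(sorted(configured - live))
# → ['194.28.98.218']
```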
Potentially related to the unattended upgrades previously enabled on the provider, or to the networking-related issues that have also been observed.

I'm going to wrap https://github.com/akash-network/support/issues/129#issuecomment-1738943519 into a troubleshooting doc.
This issue is solved. Andrey will add documentation for this.
Oct 31:

- Europlots provider v0.4.6 (`akash18ga02jzaq8cw52anyhzkwta5wygufgu6zsz6xc`)
- RPC node is 0.26.1 (we have tried different RPC nodes too)

Complete logs: https://transfer.sh/KVMK4nhuW3/provider-not-bidding-dseq-13002055.log
Additional notes
It looks like the last time the provider created a bid was before the mainnet upgrade (0.24 => 0.26) https://www.mintscan.io/akash/tx/6CF9A4C689CB378572AE161F8F8E15059A9C08AF38882182C50B37567E330FBC
The RPC looks good; the provider withdraws from the leases and detects the order request, but doesn't go further (it neither checks whether it has enough resources to host the deployment nor submits a bid).
Did `cmp=inventory-service` fall off? The next messages, such as `reservation requested` & `reservation count`, should be coming from `cmp=inventory-service`, but there are none. Here is the expected workflow (according to the logs) I took from the Hurricane provider:

The `cluster resources dump` message appears only once. Also, messages such as `cluster resources dump` (also from the `cmp=inventory-service` component) would normally appear in the logs frequently (every few minutes) when looking at other clusters, but this provider had them only once (when `provider-services` started).