Status: Closed (andy108369 closed this issue 2 months ago)
The current livenessProbe checks help by restarting the provider whenever this issue is detected, so the manifest can still be sent to the provider whenever users want to deploy something.
Liveness checks involved:
https://github.com/akash-network/helm-charts/blob/provider-9.2.5/charts/akash-provider/templates/statefulset.yaml#L261-L270
https://github.com/akash-network/helm-charts/blob/provider-9.2.5/charts/akash-provider/scripts/liveness_checks.sh#L12-L16
Running the liveness check manually confirms the issue:
$ kubectl -n akash-services describe pod akash-provider-0 | grep -i liveness
Liveness: exec [sh -c /scripts/liveness_checks.sh] delay=240s timeout=60s period=60s #success=1 #failure=3
Normal Killing 13m (x4 over 53m) kubelet Container provider failed liveness probe, will be restarted
Warning Unhealthy 6m7s (x14 over 55m) kubelet Liveness probe failed: api /status check failed
$ time kubectl -n akash-services exec -ti akash-provider-0 -- bash -x /scripts/liveness_checks.sh
Defaulted container "provider" out of: provider, init (init)
+ set -o pipefail
+ openssl x509 -in /config/provider.pem -checkend 3600 -noout
+ timeout 30s curl -o /dev/null -fsk https://127.0.0.1:8443/status
+ echo 'api /status check failed'
api /status check failed
+ exit 1
command terminated with exit code 1
real 0m31.510s
user 0m0.086s
sys 0m0.022s
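For reference, the failing step can be sketched as a small function (reconstructed from the trace above; `status_check` is a hypothetical name, and the real script hardcodes `https://127.0.0.1:8443/status` and a 30s limit):

```shell
# Sketch of the failing liveness step, based on the trace above.
# status_check URL LIMIT: probe URL, treating anything slower than
# LIMIT as a failure.
status_check() {
  url=$1; limit=$2
  timeout "$limit" curl -o /dev/null -fsk "$url" || {
    # `timeout` exits 124 when curl is killed for exceeding the limit,
    # which distinguishes a hung endpoint from a fast HTTP error.
    echo "api /status check failed"
    return 1
  }
}
```

The `real 0m31.510s` in the trace suggests curl ran for the full 30s `timeout` limit, i.e. the endpoint was hanging rather than returning an HTTP error quickly.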
Additionally, I have attached the complete logs for the provider pods, taken with the --previous
argument to ensure the logs are from the pods that stopped answering over the status (8443 /status & 8444 gRPC) endpoints.
SW versions:
$ kubectl -n akash-services get pods -o custom-columns='NAME:.metadata.name,IMAGE:.spec.containers[*].image'
NAME IMAGE
akash-node-1-0 ghcr.io/akash-network/node:0.32.2
akash-provider-0 ghcr.io/akash-network/provider:0.5.12
operator-hostname-74744f497c-kn5v7 ghcr.io/akash-network/provider:0.5.12
operator-inventory-6985f746d4-jfpt2 ghcr.io/akash-network/provider:0.5.12
operator-inventory-hardware-discovery-node1 ghcr.io/akash-network/provider:0.5.12
operator-inventory-hardware-discovery-node2 ghcr.io/akash-network/provider:0.5.12
operator-inventory-hardware-discovery-node3 ghcr.io/akash-network/provider:0.5.12
operator-inventory-hardware-discovery-node4 ghcr.io/akash-network/provider:0.5.12
operator-inventory-hardware-discovery-node5 ghcr.io/akash-network/provider:0.5.12
operator-inventory-hardware-discovery-node6 ghcr.io/akash-network/provider:0.5.12
operator-inventory-hardware-discovery-node7 ghcr.io/akash-network/provider:0.5.12
operator-inventory-hardware-discovery-node8 ghcr.io/akash-network/provider:0.5.12
h100.mon.obl.akash.pub
NAME READY STATUS RESTARTS AGE
akash-provider-0 1/1 Running 52 (3m27s ago) 9h
a100.mon.obl.akash.pub
NAME READY STATUS RESTARTS AGE
akash-provider-0 1/1 Running 13 (10m ago) 9h
sg.lnlm.akash.pub
NAME READY STATUS RESTARTS AGE
akash-provider-0 1/1 Running 5 (5h35m ago) 28h
provider-02.sandbox-01.aksh.pw
NAME READY STATUS RESTARTS AGE
akash-provider-0 1/1 Running 3 (4h10m ago) 28h
pdx.nb.akash.pub
NAME READY STATUS RESTARTS AGE
akash-provider-0 1/1 Running 2 (6h2m ago) 19h
ty.lneq.akash.pub
NAME READY STATUS RESTARTS AGE
akash-provider-0 1/1 Running 2 (10h ago) 28h
sg.lneq.akash.pub
NAME READY STATUS RESTARTS AGE
akash-provider-0 1/1 Running 2 (18h ago) 28h
au.lneq.akash.pub
NAME READY STATUS RESTARTS AGE
akash-provider-0 1/1 Running 1 (9h ago) 28h
akash.pro
NAME READY STATUS RESTARTS AGE
akash-provider-0 1/1 Running 1 (28h ago) 28h
hk.lneq.akash.pub
NAME READY STATUS RESTARTS AGE
akash-provider-0 1/1 Running 1 (25h ago) 28h
sl.lneq.akash.pub
NAME READY STATUS RESTARTS AGE
akash-provider-0 1/1 Running 0 28h
hurricane.akash.pub
NAME READY STATUS RESTARTS AGE
akash-provider-0 1/1 Running 0 28h
yvr.nb.akash.pub
NAME READY STATUS RESTARTS AGE
akash-provider-0 1/1 Running 0 28h
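To compare restart counts across the hosts above at a glance, the pasted `kubectl get pods` output can be ranked with a small helper (`rank_restarts` is a hypothetical name; a sketch, not part of the charts):

```shell
# rank_restarts: read standard `kubectl get pods` lines
# (NAME READY STATUS RESTARTS AGE) on stdin and print
# "<restarts> <pod-name>", highest restart count first.
rank_restarts() {
  awk '$1 != "NAME" { print $4, $1 }' | sort -rn
}

# Usage sketch against one host:
#   kubectl -n akash-services get pods | rank_restarts
```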
Providers running 0.5.12 can be downgraded to 0.5.11 using these commands:
cd provider
helm -n akash-services upgrade akash-hostname-operator akash/akash-hostname-operator --set image.tag=0.5.11
helm -n akash-services upgrade inventory-operator akash/akash-inventory-operator --set image.tag=0.5.11
helm -n akash-services upgrade akash-ip-operator akash/akash-ip-operator --set provider_address=<SET-your-provider-address-HERE> --set image.tag=0.5.11
helm upgrade akash-provider akash/provider -n akash-services -f provider.yaml --set bidpricescript="$(cat price_script_generic.sh | openssl base64 -A)" --set image.tag=0.5.11
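After the downgrade it may be worth verifying that every provider image actually carries the expected tag. A minimal sketch (`verify_tag` is a hypothetical helper; it reads the `NAME IMAGE` listing shown earlier on stdin and prints any mismatching lines):

```shell
# verify_tag TAG: succeed only if no ghcr.io/akash-network/provider
# image on stdin carries a tag other than TAG; mismatches are printed.
verify_tag() {
  ! grep 'ghcr.io/akash-network/provider' | grep -v ":$1\$"
}

# Usage sketch:
#   kubectl -n akash-services get pods \
#     -o custom-columns='NAME:.metadata.name,IMAGE:.spec.containers[*].image' \
#     | verify_tag 0.5.11
```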
I've noticed this primarily happening on the *.mon.obl.akash.pub providers, most frequently on the h100.mon.obl.akash.pub provider.

Additional observations while requests to the :8443/status & :8444 gRPC endpoints hang:

Logs:
h100.mon.obl.akash.pub.provider.log
h100.mon.obl.akash.pub.deployment-operator-inventory.log