Status: Closed (andy108369 closed this issue 2 months ago)
The current livenessProbe checks help by restarting the provider whenever this issue is detected, so the manifest can still be sent to the provider whenever users want to deploy something.
Liveness checks involved:
https://github.com/akash-network/helm-charts/blob/provider-9.2.5/charts/akash-provider/templates/statefulset.yaml#L261-L270
https://github.com/akash-network/helm-charts/blob/provider-9.2.5/charts/akash-provider/scripts/liveness_checks.sh#L12-L16
Running the liveness check manually confirms the issue:
$ kubectl -n akash-services describe pod akash-provider-0 | grep -i liveness
Liveness: exec [sh -c /scripts/liveness_checks.sh] delay=240s timeout=60s period=60s #success=1 #failure=3
Normal Killing 13m (x4 over 53m) kubelet Container provider failed liveness probe, will be restarted
Warning Unhealthy 6m7s (x14 over 55m) kubelet Liveness probe failed: api /status check failed
$ time kubectl -n akash-services exec -ti akash-provider-0 -- bash -x /scripts/liveness_checks.sh
Defaulted container "provider" out of: provider, init (init)
+ set -o pipefail
+ openssl x509 -in /config/provider.pem -checkend 3600 -noout
+ timeout 30s curl -o /dev/null -fsk https://127.0.0.1:8443/status
+ echo 'api /status check failed'
api /status check failed
+ exit 1
command terminated with exit code 1
real 0m31.510s
user 0m0.086s
sys 0m0.022s
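For reference, the failing step can be sketched as a small function (reconstructed from the trace above; `status_check` is a hypothetical name, and the real script hardcodes `https://127.0.0.1:8443/status` and a 30s limit):

```shell
# Sketch of the failing liveness step, based on the trace above.
# status_check URL LIMIT: probe URL, treating anything slower than
# LIMIT as a failure.
status_check() {
  url=$1; limit=$2
  timeout "$limit" curl -o /dev/null -fsk "$url" || {
    # `timeout` exits 124 when curl is killed for exceeding the limit,
    # which distinguishes a hung endpoint from a fast HTTP error.
    echo "api /status check failed"
    return 1
  }
}
```

The `real 0m31.510s` in the trace suggests curl ran for the full 30s `timeout` limit, i.e. the endpoint was hanging rather than returning an HTTP error quickly.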
Additionally, I have attached the complete logs for the provider pods, taken with the --previous
argument to ensure the logs are from the pods that stopped answering over the status (8443 /status & 8444 gRPC) endpoints.
SW versions:
$ kubectl -n akash-services get pods -o custom-columns='NAME:.metadata.name,IMAGE:.spec.containers[*].image'
NAME IMAGE
akash-node-1-0 ghcr.io/akash-network/node:0.32.2
akash-provider-0 ghcr.io/akash-network/provider:0.5.12
operator-hostname-74744f497c-kn5v7 ghcr.io/akash-network/provider:0.5.12
operator-inventory-6985f746d4-jfpt2 ghcr.io/akash-network/provider:0.5.12
operator-inventory-hardware-discovery-node1 ghcr.io/akash-network/provider:0.5.12
operator-inventory-hardware-discovery-node2 ghcr.io/akash-network/provider:0.5.12
operator-inventory-hardware-discovery-node3 ghcr.io/akash-network/provider:0.5.12
operator-inventory-hardware-discovery-node4 ghcr.io/akash-network/provider:0.5.12
operator-inventory-hardware-discovery-node5 ghcr.io/akash-network/provider:0.5.12
operator-inventory-hardware-discovery-node6 ghcr.io/akash-network/provider:0.5.12
operator-inventory-hardware-discovery-node7 ghcr.io/akash-network/provider:0.5.12
operator-inventory-hardware-discovery-node8 ghcr.io/akash-network/provider:0.5.12
h100.mon.obl.akash.pub
NAME READY STATUS RESTARTS AGE
akash-provider-0 1/1 Running 52 (3m27s ago) 9h
a100.mon.obl.akash.pub
NAME READY STATUS RESTARTS AGE
akash-provider-0 1/1 Running 13 (10m ago) 9h
sg.lnlm.akash.pub
NAME READY STATUS RESTARTS AGE
akash-provider-0 1/1 Running 5 (5h35m ago) 28h
provider-02.sandbox-01.aksh.pw
NAME READY STATUS RESTARTS AGE
akash-provider-0 1/1 Running 3 (4h10m ago) 28h
pdx.nb.akash.pub
NAME READY STATUS RESTARTS AGE
akash-provider-0 1/1 Running 2 (6h2m ago) 19h
ty.lneq.akash.pub
NAME READY STATUS RESTARTS AGE
akash-provider-0 1/1 Running 2 (10h ago) 28h
sg.lneq.akash.pub
NAME READY STATUS RESTARTS AGE
akash-provider-0 1/1 Running 2 (18h ago) 28h
au.lneq.akash.pub
NAME READY STATUS RESTARTS AGE
akash-provider-0 1/1 Running 1 (9h ago) 28h
akash.pro
NAME READY STATUS RESTARTS AGE
akash-provider-0 1/1 Running 1 (28h ago) 28h
hk.lneq.akash.pub
NAME READY STATUS RESTARTS AGE
akash-provider-0 1/1 Running 1 (25h ago) 28h
sl.lneq.akash.pub
NAME READY STATUS RESTARTS AGE
akash-provider-0 1/1 Running 0 28h
hurricane.akash.pub
NAME READY STATUS RESTARTS AGE
akash-provider-0 1/1 Running 0 28h
yvr.nb.akash.pub
NAME READY STATUS RESTARTS AGE
akash-provider-0 1/1 Running 0 28h
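To compare restart counts across the hosts above at a glance, the pasted `kubectl get pods` output can be ranked with a small helper (`rank_restarts` is a hypothetical name; a sketch, not part of the charts):

```shell
# rank_restarts: read standard `kubectl get pods` lines
# (NAME READY STATUS RESTARTS AGE) on stdin and print
# "<restarts> <pod-name>", highest restart count first.
rank_restarts() {
  awk '$1 != "NAME" { print $4, $1 }' | sort -rn
}

# Usage sketch against one host:
#   kubectl -n akash-services get pods | rank_restarts
```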
Providers running 0.5.12 can be downgraded to 0.5.11 using these commands:
cd provider
helm -n akash-services upgrade akash-hostname-operator akash/akash-hostname-operator --set image.tag=0.5.11
helm -n akash-services upgrade inventory-operator akash/akash-inventory-operator --set image.tag=0.5.11
helm -n akash-services upgrade akash-ip-operator akash/akash-ip-operator --set provider_address=<SET-your-provider-address-HERE> --set image.tag=0.5.11
helm upgrade akash-provider akash/provider -n akash-services -f provider.yaml --set bidpricescript="$(cat price_script_generic.sh | openssl base64 -A)" --set image.tag=0.5.11
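After the downgrade it may be worth verifying that every provider image actually carries the expected tag. A minimal sketch (`verify_tag` is a hypothetical helper; it reads the `NAME IMAGE` listing shown earlier on stdin and prints any mismatching lines):

```shell
# verify_tag TAG: succeed only if no ghcr.io/akash-network/provider
# image on stdin carries a tag other than TAG; mismatches are printed.
verify_tag() {
  ! grep 'ghcr.io/akash-network/provider' | grep -v ":$1\$"
}

# Usage sketch:
#   kubectl -n akash-services get pods \
#     -o custom-columns='NAME:.metadata.name,IMAGE:.spec.containers[*].image' \
#     | verify_tag 0.5.11
```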
I've noticed this primarily happening on the *.mon.obl.akash.pub providers, most frequently on the h100.mon.obl.akash.pub provider.

Additional observations while requests to the :8443/status & :8444 gRPC endpoints hang:

Logs:
h100.mon.obl.akash.pub.provider.log
h100.mon.obl.akash.pub.deployment-operator-inventory.log