akash-network / support

Akash Support and Issue Tracking

provider `0.5.12`: 8443/status & 8444 grpc endpoints sporadically falling off #214

Closed: andy108369 closed this issue 2 months ago

andy108369 commented 2 months ago

I've noticed this happening primarily on *.mon.obl.akash.pub providers, most frequently on the h100.mon.obl.akash.pub provider.

Additional observations while requests to :8443/status & 8444 grpc endpoints hang:

Logs

logs - before provider & operator-inventory operators were restarted

h100.mon.obl.akash.pub.provider.log

h100.mon.obl.akash.pub.deployment-operator-inventory.log

andy108369 commented 2 months ago

FYI: Current workaround

The current livenessProbe checks restart the provider whenever this issue is detected, so manifests can still be sent to the provider when users want to deploy something.

Liveness checks involved:
https://github.com/akash-network/helm-charts/blob/provider-9.2.5/charts/akash-provider/templates/statefulset.yaml#L261-L270
https://github.com/akash-network/helm-charts/blob/provider-9.2.5/charts/akash-provider/scripts/liveness_checks.sh#L12-L16
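For readers who don't want to open the chart, here is a minimal sketch of what such a check can look like. It is reconstructed from the commands visible in the probe trace in this comment, not copied from liveness_checks.sh. The function names, the environment variables, and the "certificate check failed" message are hypothetical, and the sketch uses curl's own --max-time instead of the `timeout 30s` wrapper.

```shell
# Sketch only: defaults mirror the values seen in the probe trace.
CERT_PATH="${CERT_PATH:-/config/provider.pem}"
STATUS_URL="${STATUS_URL:-https://127.0.0.1:8443/status}"
STATUS_TIMEOUT="${STATUS_TIMEOUT:-30}"

check_cert() {
  # Fail if the provider certificate expires within the next hour (3600 s).
  if ! openssl x509 -in "$CERT_PATH" -checkend 3600 -noout >/dev/null 2>&1; then
    echo "certificate check failed"   # hypothetical message
    return 1
  fi
}

check_status() {
  # -f: fail on HTTP errors, -s: silent, -k: skip TLS verification.
  # --max-time aborts the request if the endpoint hangs.
  if ! curl --max-time "$STATUS_TIMEOUT" -o /dev/null -fsk "$STATUS_URL"; then
    echo "api /status check failed"
    return 1
  fi
}

run_liveness_checks() {
  check_cert && check_status
}
```

Calling `run_liveness_checks || exit 1` from a probe script would make any failed check count as a probe failure, since an exec probe fails on a non-zero exit code.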

Running the liveness check manually confirms the issue:

$ kubectl -n akash-services describe pod akash-provider-0 | grep -i liveness
    Liveness:  exec [sh -c /scripts/liveness_checks.sh] delay=240s timeout=60s period=60s #success=1 #failure=3
  Normal   Killing    13m (x4 over 53m)    kubelet  Container provider failed liveness probe, will be restarted
  Warning  Unhealthy  6m7s (x14 over 55m)  kubelet  Liveness probe failed: api /status check failed
$ time kubectl -n akash-services exec -ti akash-provider-0 -- bash -x /scripts/liveness_checks.sh
Defaulted container "provider" out of: provider, init (init)
+ set -o pipefail
+ openssl x509 -in /config/provider.pem -checkend 3600 -noout
+ timeout 30s curl -o /dev/null -fsk https://127.0.0.1:8443/status
+ echo 'api /status check failed'
api /status check failed
+ exit 1
command terminated with exit code 1

real    0m31.510s
user    0m0.086s
sys     0m0.022s

andy108369 commented 2 months ago

affected providers

I've also attached the complete provider pod logs, taken with the --previous argument so that they come from the pods that stopped answering on the status endpoints (8443/status & 8444 grpc).
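In case it helps others collect the same data, this is roughly how logs of the previous (restarted) container can be saved. It is a sketch: the function name and the output file name are hypothetical, and the namespace, pod, and container names are the ones that appear elsewhere in this thread.

```shell
# Save the logs of the previous container instance, i.e. the one that was
# running before the liveness probe restarted it.
collect_previous_logs() {
  ns="${1:-akash-services}"
  pod="${2:-akash-provider-0}"
  out="${3:-${pod}.previous.log}"   # hypothetical file name
  kubectl -n "$ns" logs "$pod" -c provider --previous > "$out"
}
```

Running `collect_previous_logs` with no arguments would write akash-provider-0.previous.log to the current directory. The previous logs are only available while the kubelet still keeps the terminated container around, so it is best to grab them soon after a restart.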

SW versions:

$ kubectl -n akash-services get pods -o custom-columns='NAME:.metadata.name,IMAGE:.spec.containers[*].image'
NAME                                          IMAGE
akash-node-1-0                                ghcr.io/akash-network/node:0.32.2
akash-provider-0                              ghcr.io/akash-network/provider:0.5.12
operator-hostname-74744f497c-kn5v7            ghcr.io/akash-network/provider:0.5.12
operator-inventory-6985f746d4-jfpt2           ghcr.io/akash-network/provider:0.5.12
operator-inventory-hardware-discovery-node1   ghcr.io/akash-network/provider:0.5.12
operator-inventory-hardware-discovery-node2   ghcr.io/akash-network/provider:0.5.12
operator-inventory-hardware-discovery-node3   ghcr.io/akash-network/provider:0.5.12
operator-inventory-hardware-discovery-node4   ghcr.io/akash-network/provider:0.5.12
operator-inventory-hardware-discovery-node5   ghcr.io/akash-network/provider:0.5.12
operator-inventory-hardware-discovery-node6   ghcr.io/akash-network/provider:0.5.12
operator-inventory-hardware-discovery-node7   ghcr.io/akash-network/provider:0.5.12
operator-inventory-hardware-discovery-node8   ghcr.io/akash-network/provider:0.5.12

no issues on these providers, yet :crossed_fingers:

andy108369 commented 2 months ago

Providers running 0.5.12 can downgrade to 0.5.11 using these commands:

cd provider
helm -n akash-services upgrade akash-hostname-operator akash/akash-hostname-operator --set image.tag=0.5.11
helm -n akash-services upgrade inventory-operator akash/akash-inventory-operator --set image.tag=0.5.11
helm -n akash-services upgrade akash-ip-operator akash/akash-ip-operator --set provider_address=<SET-your-provider-address-HERE> --set image.tag=0.5.11

helm upgrade akash-provider akash/provider -n akash-services -f provider.yaml --set bidpricescript="$(cat price_script_generic.sh | openssl base64 -A)" --set image.tag=0.5.11
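
After the downgrade, one way to confirm that every provider image picked up the new tag is to reuse the custom-columns query from the "SW versions" comment and filter it. This is a sketch: the function name is hypothetical, and it assumes all provider components use an image whose name contains "/provider:".

```shell
# Print the pods whose provider image tag differs from the expected one.
# Prints nothing when every provider image matches.
list_mismatched_images() {
  expected="$1"
  ns="${2:-akash-services}"
  kubectl -n "$ns" get pods --no-headers \
    -o custom-columns='NAME:.metadata.name,IMAGE:.spec.containers[*].image' \
    | awk -v tag="$expected" '
        index($2, "/provider:") {
          n = split($2, parts, ":")          # the tag is the last ":" field
          if (parts[n] != tag) print $1, $2
        }'
}
```

For example, `list_mismatched_images 0.5.11` should print nothing once all provider and operator pods have been recreated with the 0.5.11 image.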