akash-network / support

Akash Support and Issue Tracking

provider stops responding over 8443/status, 8444 sporadically (either immediately after start or after some time) #190

Closed andy108369 closed 3 months ago

andy108369 commented 3 months ago

Since being upgraded from 0.4.8 to 0.5.4, the Hurricane provider sporadically stops responding on 8443/status and 8444, either immediately after start or after some time.

NOTE: AKASH_IP_OPERATOR=false and the akash-ip-operator helm chart is not installed. IP Leasing would normally be enabled, but I disabled it because I initially thought it was causing the problem.

nvidia-device-plugin-0.14.5     0.14.5
akash-node-9.0.0                0.30.0
provider-9.1.0                  0.5.4
akash-hostname-operator-9.0.5   0.5.4
akash-inventory-operator-9.0.5  0.5.4
ingress-nginx-4.10.0            1.10.0
rook-ceph-v1.12.4               v1.12.4
rook-ceph-cluster-v1.12.4       v1.12.4

Logs

Workarounds

I've set up an automatic restart of the provider pod that triggers when the livenessProbe cannot get data from 8443/status, etc.
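For illustration, a check of this kind could look like the sketch below. It is not the probe actually deployed on the Hurricane provider: the function name `check_status`, the `STATUS_URL` and `TIMEOUT_SECONDS` variables, the loopback address, and the 5 second timeout are all assumptions, as is the use of `-k` on the grounds that the provider certificate is not in the local trust store.

```shell
#!/bin/sh
# Hypothetical liveness check for the provider status endpoint.
# All names and defaults here are assumptions, not taken from the provider chart.
STATUS_URL="${STATUS_URL:-https://127.0.0.1:8443/status}"
TIMEOUT_SECONDS="${TIMEOUT_SECONDS:-5}"

# Returns 0 if the URL answers with a successful HTTP status within the timeout,
# non-zero otherwise (connection refused, timeout, HTTP error).
check_status() {
    # -k: skip certificate verification (assumed: cert not in local trust store)
    # -s: silent, -f: treat HTTP errors as failures, -o /dev/null: discard the body
    curl -ksf --max-time "$TIMEOUT_SECONDS" -o /dev/null "$1"
}
```

A livenessProbe exec command could then run `check_status "$STATUS_URL" || exit 1`, so that a hung endpoint makes the probe fail and the kubelet restarts the container once the probe's failure threshold is reached.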

Will keep monitoring the akash-provider pod restart count.

Additional notes

Since we upgraded the providers from 0.4.8 to 0.5.4, I have not observed this issue on any provider other than Hurricane.

andy108369 commented 3 months ago

No restarts or issues since the provider was last started (26 hours of uptime). I'll let it run like this over the weekend and will then re-enable IP Leasing.

chainzero commented 3 months ago

Awaiting further testing by @andy108369 prior to further investigation

andy108369 commented 3 months ago

Re-enabled IP Leasing:

  1. provider.yaml

    ipoperator: true

  2. installed metallb chart and applied the config

    helm upgrade --install metallb metallb/metallb -n metallb-system --version 0.14.3
    kubectl apply -f metallb-config.yaml

  3. installed akash-ip-operator chart

    helm upgrade --install akash-ip-operator akash/akash-ip-operator -n akash-services --set provider_address=akash15tl6v6gd0nte0syyxnv57zmmspgju4c3xfmdhk

andy108369 commented 3 months ago

I can no longer see this issue with provider 0.5.11, so I'm closing it.