IP Leases: the IP operator does not come online automatically after node restart

shimpa1 commented 1 year ago

On my test provider:

single bare metal node build
built using helm-charts
using helm-based RPC node
After the worker node restarts, the RPC node is in the catching up: true state (as expected), and the provider pod waits for the RPC node to reach the catching up: false state. Meanwhile, the IP operator pod waits for the provider pod.

Once the RPC node catches up with the top of the chain, the provider pod starts; however, the IP operator pod does not recover. The provider keeps logging:

I[2023-02-21|17:43:25.749] check result                                 cmp=provider operator=ip status=503
E[2023-02-21|17:43:25.749] not yet ready                                cmp=provider cmp=waiter waitable="<*operatorclients.ipOperatorClient 0xc0018bacc0>" error="ip operator is not yet alive"
I[2023-02-21|17:43:27.751] check result                                 cmp=provider operator=ip status=503
E[2023-02-21|17:43:27.751] not yet ready                                cmp=provider cmp=waiter waitable="<*operatorclients.ipOperatorClient 0xc0018bacc0>" error="ip operator is not yet alive"

Manually restarting the IP operator pod works.

Perhaps implement a probe of some sort to check the status of the provider pod before starting the IP operator pod.
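A minimal sketch of what such a startup gate could look like, assuming the dependency can be checked with some callable that returns True once it is ready. The function name `wait_until_ready` and its parameters are hypothetical and are not part of the Akash provider or operator code; the 2-second default interval simply mirrors the spacing of the log lines above.

```python
import time
from typing import Callable


def wait_until_ready(
    check: Callable[[], bool],
    timeout_s: float,
    interval_s: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll `check` until it returns True or `timeout_s` elapses.

    `check` is a hypothetical readiness test (for example, an HTTP GET
    against the provider's status endpoint). Returns True if the
    dependency became ready in time, False otherwise.
    """
    deadline = clock() + timeout_s
    while True:
        if check():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

The `sleep` and `clock` parameters exist only so the loop can be exercised without real delays. In a pod, the same idea would more likely be expressed as an init container or a startup probe than as application code.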

cheers,

Shimpa

andy108369 commented 1 year ago

Ideally that should be done on the provider side, so it can detect when the IP operator recovers.

Until then, we can see whether a livenessProbe/readinessProbe could be leveraged, so that the pod restarts once the IP operator has not been ready/functioning (from the provider's point of view) for longer than 10 minutes or so.
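The "restart only after about 10 minutes of being unhealthy" rule can be sketched as follows. This is an illustration under stated assumptions, not the operator's actual logic: the class name `StalenessTracker` is hypothetical, the 600-second threshold comes from the "10 minutes or so" above, 503 is the status seen in the logs, and treating 200 as healthy is an assumption.

```python
import time
from typing import Callable


class StalenessTracker:
    """Decide liveness from a stream of health-check status codes.

    The pod is considered live until the check has been failing
    continuously for `max_unhealthy_s` seconds.
    """

    def __init__(
        self,
        max_unhealthy_s: float = 600.0,  # "10 minutes or so"
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._max_unhealthy_s = max_unhealthy_s
        self._clock = clock
        # Start the grace period at construction time.
        self._last_healthy = clock()

    def observe(self, status_code: int) -> bool:
        """Record one result; return True while the pod should stay up."""
        now = self._clock()
        if status_code == 200:  # assumption: 200 means healthy
            self._last_healthy = now
            return True
        # Unhealthy (e.g. 503): stay live only within the grace period.
        return (now - self._last_healthy) < self._max_unhealthy_s
```

In Kubernetes itself, a comparable grace period can usually be approximated without custom code by choosing a livenessProbe's periodSeconds and failureThreshold so that their product is roughly 600 seconds.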

andy108369 commented 1 year ago

Moved to https://github.com/akash-network/support/issues/105

akash-network / support

IP Leases: the IP operator does not come online automatically after node restart #76