cybertec-postgresql / vip-manager

Manages a virtual IP based on state kept in etcd or Consul
BSD 2-Clause "Simplified" License

Handle etcd leader changes #228

Closed: lukasertl closed this issue 5 months ago

lukasertl commented 5 months ago

This is somewhat related to #208.

When the etcd leader restarts, vip-manager decides to remove the VIP:

May 22 08:21:31 hostname vip-manager[4212]: 2024/05/22 08:21:31 IP address 10.x.y.z/16 state is true, desired true
May 22 08:21:41 hostname vip-manager[4212]: 2024/05/22 08:21:41 IP address 10.x.y.z/16 state is true, desired true
May 22 08:21:43 hostname vip-manager[4212]: {"level":"warn","ts":"2024-05-22T08:21:43.182166+0200","logger":"etcd-client","caller":"v3@v3.5.13/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0001ba5a0/vvhu4255.power.inet:2379","attempt":0,"error":"rpc error: code = Unavailable desc = etcdserver: leader changed"}
May 22 08:21:43 hostname vip-manager[4212]: {"level":"error","ts":"2024-05-22T08:21:43.189178+0200","logger":"etcd-client","caller":"v3@v3.5.13/retry_interceptor.go:114","msg":"clientv3/retry_interceptor: getToken failed","error":"etcdserver: leader changed","stacktrace":"go.etcd.io/etcd/client/v3.(*Client).streamClientInterceptor.func1\n\t/home/runner/go/pkg/mod/go.etcd.io/etcd/client/v3@v3.5.13/retry_interceptor.go:114\ngoogle.golang.org/grpc.(*ClientConn).NewStream\n\t/home/runner/go/pkg/mod/google.golang.org/grpc@v1.63.2/stream.go:167\ngo.etcd.io/etcd/api/v3/etcdserverpb.(*watchClient).Watch\n\t/home/runner/go/pkg/mod/go.etcd.io/etcd/api/v3@v3.5.13/etcdserverpb/rpc.pb.go:6690\ngo.etcd.io/etcd/client/v3.(*watchGrpcStream).openWatchClient\n\t/home/runner/go/pkg/mod/go.etcd.io/etcd/client/v3@v3.5.13/watch.go:1004\ngo.etcd.io/etcd/client/v3.(*watchGrpcStream).newWatchClient\n\t/home/runner/go/pkg/mod/go.etcd.io/etcd/client/v3@v3.5.13/watch.go:901\ngo.etcd.io/etcd/client/v3.(*watchGrpcStream).run\n\t/home/runner/go/pkg/mod/go.etcd.io/etcd/client/v3@v3.5.13/watch.go:661"}
May 22 08:21:43 hostname vip-manager[4212]: 2024/05/22 08:21:43 etcd watcher returned error: etcdserver: leader changed
May 22 08:21:43 hostname vip-manager[4212]: 2024/05/22 08:21:43 IP address 10.x.y.z/16 state is true, desired false
May 22 08:21:43 hostname vip-manager[4212]: 2024/05/22 08:21:43 Removing address 10.x.y.z/16 on ens192
May 22 08:21:43 hostname vip-manager[4212]: 2024/05/22 08:21:43 IP address 10.x.y.z/16 state is false, desired false
May 22 08:21:51 hostname vip-manager[4212]: 2024/05/22 08:21:51 IP address 10.x.y.z/16 state is false, desired false

This happens with vip-manager 2.4.0.

pashagolub commented 5 months ago

And that's exactly what we expect, no?

lukasertl commented 5 months ago

No, we don't expect that. This is not a change in Patroni leadership, but in etcd leadership.

pashagolub commented 5 months ago

Oh, I see! Thanks. Will check this.

pashagolub commented 5 months ago

Would you please check whether the PR works for you?

Thanks in advance!

lukasertl commented 5 months ago

Hi Pavlo,

I'm afraid this is not the correct fix. If I trigger the leader change situation with a patched vip-manager, it leaves the VIP setup intact, but the process then spins on the CPU.

I tried to find out what happens here, and I suspect that the select statement in the watch() function doesn't block anymore, so it runs in an uncontrolled infinite loop. My guess is that at this point the etcd Watch isn't valid anymore and needs to be set up from scratch.

This is confirmed by the fact that if I switch Patroni roles in this situation, the (non-broken) vip-manager on the new primary adds the VIP to the interface, but the (broken) vip-manager on the old primary doesn't remove it.
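The suspected mechanism can be illustrated in plain Go. A receive from a closed channel returns immediately with the zero value, so a select that keeps a closed channel as one of its cases never blocks. If the etcd watch channel is closed after the error (an assumption here, not verified against vip-manager's code), that would explain both the CPU spin and the missing updates. The sketch below uses only the standard library. `event`, `closedAfter`, `watchLoop`, and `runDemo` are hypothetical names, and the channels merely simulate a watch that delivers some events and then dies. It checks the `ok` flag of the receive and sets the watch up again, with a short pause, instead of looping on the dead channel.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// event is a hypothetical stand-in for a watch response.
type event struct{ leader string }

// closedAfter simulates a watch that delivers evs and then breaks:
// the returned channel is closed once the events are buffered.
func closedAfter(evs ...event) <-chan event {
	ch := make(chan event, len(evs))
	for _, e := range evs {
		ch <- e
	}
	close(ch)
	return ch
}

// watchLoop consumes events until ctx is cancelled. When the current
// channel is closed (ok == false) it waits for backoff and opens a new
// watch. It returns how many watches were opened.
func watchLoop(ctx context.Context, open func() <-chan event, handle func(event), backoff time.Duration) int {
	opened := 0
	for {
		ch := open()
		opened++
	receive:
		for {
			select {
			case <-ctx.Done():
				return opened
			case ev, ok := <-ch:
				if !ok {
					// Closed channel: without this check the select
					// would fire on every iteration and never block.
					break receive
				}
				handle(ev)
			}
		}
		// Pause before re-creating the watch to avoid a tight retry loop.
		select {
		case <-ctx.Done():
			return opened
		case <-time.After(backoff):
		}
	}
}

// runDemo feeds two short-lived simulated watches into watchLoop and
// cancels the context when a third one is requested.
func runDemo() ([]string, int) {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	batches := [][]event{{{leader: "node1"}}, {{leader: "node2"}}}
	var seen []string
	open := func() <-chan event {
		if len(batches) == 0 {
			cancel() // nothing left to simulate, stop the loop
			return closedAfter()
		}
		b := batches[0]
		batches = batches[1:]
		return closedAfter(b...)
	}
	opened := watchLoop(ctx, open, func(e event) { seen = append(seen, e.leader) }, time.Millisecond)
	return seen, opened
}

func main() {
	seen, opened := runDemo()
	fmt.Println(seen, opened)
}
```

With a real etcd client, `open` would create a fresh watch on the leader key. Whether a current read of the key is also needed after re-opening, to catch changes missed while the watch was down, depends on the client and is not covered here.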