Etcd node failover - Githubissues

007hacky007 commented 4 years ago

Added client fail-over. Simple yet working solution to transparently handle failed etcd node by re-trying request on the another available etcd host.

007hacky007 commented 4 years ago

Thank you for the input! I was actually thinking of multiple ways implementing this and went with the most independent version, but I'll implement it as ClientInterface, no problem. As for switching clients without changing the code, I believe it's possible even with current implementation when it just calls the Client class which is ClientInterface compatible. As for the usage in the ShardedClient... I believe you either want to use ShardedClient or FailoverClient, can't think of why whould you want to use multi-client inside another multi-client. But you are probably right, implementing ClientInterface is probably cleaner and correct solution, so I'm going to change it.

As for the Exceptions - thats also what I was thinking about, however it's definitely not only UnavailableException for sure. For timeout you get DeadlineExceededException and for the connection error (i.e. host completely down or network error maybe) you get UnavailableException. So there are at least two of them which you want retry. But I guess we can with these two and add more in the future if needed.

I'll add custom Exception for that. Good point.

Again, thank you for your input regarding etcd experiences! I'll implement per node retry counter which will precede the hold-off time penalty.

Damn, you are right about balancing part.

matthi4s commented 4 years ago

You can replace a Client with a FailoverClient, but with typed arguments or return type that won't work without implementing the ClientInterface.

A FailoverClient inside a ShardedClient makes sense if you have multiple clusters (each with one FailoverClient) and distribute access to keys in those clusters via a ShardedClient.

I haven't seen a DeadlineExceededException yet and I'm not really sure when it occurs, but it looks like it makes sense to retry it.

007hacky007 commented 4 years ago

DeadlineExceededException occurs with newly implemented timeout, when it's reached. Tested and it occurs consistently on request timeout.

By the way, is there any particular reason for client hashing in the ShardedClient? I mean what is the advantage of sending same key to the same etcd server? I believe it has some reason, but I could not think of any. It should not make a difference which etcd node handles the request.

matthi4s commented 4 years ago

Etcd clustering has no sharding, it replicates all keys to all nodes. More nodes only brings more reliability, but not more performance, which is why 3 or 5 nodes are recommended (according to https://kubernetes.io/docs/tasks/administer-cluster/configure-upgrade-etcd/#scaling-up-etcd-clusters). Scaling beyond that only works with independent etcd clusters and client side sharding, which is what the ShardedClient does.

007hacky007 commented 4 years ago

But only single node is in the write mode at the time. Others are simply passing the requests to the current master, so I still can't see any gain from hashing based on key.

matthi4s commented 4 years ago

If the performance of a single cluster isn't enough, you have to create another independent cluster and in order to always send requests to the same cluster for the same key, client-side sharding is required.

007hacky007 commented 4 years ago

Ah, makes sense now! I did not think in the scale of multiple clusters. I was thinking of just multiple etcd nodes.

With multiple etcd cluster it makes sense.

Thanks for explaining it!

007hacky007 commented 4 years ago

Hopefully fixed. I've also added protected method isLocalCall, so the fail counter is reset only when actually calling remote etcd endpoint and not when creating objects locally (like in case of getCompare method).

Also a little bit of hardening of the refreshLease method included.

007hacky007 commented 4 years ago

Thank you for your valuable input. Changed according to your comment.

aternosorg / php-etcd

Etcd node failover #6