Etcd does not respect retries if the endpoint(s) cannot be reached

omgrr commented 7 years ago

It looks like if the etcd cluster is unreachable then the client spins with Failed to connect to endpoint 'example.com:2379'. I think the desired behavior would be it should try each endpoint once, and if there is only one endpoint, it should raise immediately.

It also looks like there is retry parameter that a user could use to adjust this further.

There are two issues as I see it

The infinite recursion
The retry parameter isn't used / cannot be set

I can totally take a stab at it if you like! Just let me know, thanks!

davissp14 commented 7 years ago

The infinite recursion In the event users were using this client in their application, I want to avoid raising any exceptions that would force users to have to restart their application to re-establish a connection to Etcd. At least by default.

I think the desired behavior would be it should try each endpoint once, and if there is only one endpoint, it should raise immediately.

I think this is probably use-case driven. Also we should be careful there, as most users will issue a rolling restart across their nodes when performing maintenance / change versions. In this case, we would want retry the set of endpoints multiple times.

The retry parameter isn't used / cannot be set Totally for exposing this!

mgates commented 7 years ago

I think it makes sense to try endpoints multiple times, but it should maybe still respect the 'retry' count, right now it's ignored except for auth errors?

Also, any thoughts about switching from recursion to using 'retry'?

davissp14 commented 7 years ago

There's two things here. Retries per endpoint and how many times we would want to cycle through all endpoints before exploding, right?

mgates commented 7 years ago

I think they're different. One is that the retries value is ignored for some errors an not set-able. I think we for sure what to change that. I think the other is how we want to conceptualize retries if there are multiple servers:

have two values, one for cycles and one for retries for other errors
count one cycle as one retry
count once connection attempt as one retry

I don't really have a preference, as long as there is some way of controlling the number of retrys if the server is down. I think that the answer to the second problem needs to be figured out as part of making non-auth errors respect the retries argument.

davissp14 commented 7 years ago

I could see some people wanting to prioritize certain nodes over others, especially with gRPC proxies which do some caching. Some people may also want to prioritize the leader due to latency concerns. This is probably separate functionality that we can build on later, but may be worth considering. With that being said, I think having two values may make the most sense. Thoughts?

omgrr commented 7 years ago

Sorry for going quiet here! I'm a little confused on how the two values would interact with each other.

If I have,

cycles set to 2
retries set to 5
And 3 endpoints

Does it retry on each endpoint 5 times and will cycle through all of the endpoints 2 times? So 30 calls?

Or would it cycle through the endpoints, up to the retry limit. So 5 calls. This would be my personal preferred approach, but kind of defeats the purpose of a cycle value.

davissp14 commented 7 years ago

Or would it cycle through the endpoints, up to the retry limit. So 5 calls.

In a 3 node cluster, this would result in 2 of the 3 endpoints retrying twice and 1 endpoint retrying once, right? Or did I misunderstand that?

In anycase, it may be worth seeing how other people are solving this type of thing. Sorry I haven't been super responsive, work + moving is kicking my ass.

omgrr commented 7 years ago

This would result in 2 of the 3 endpoints retrying twice and 1 endpoint retrying once, right? Or did I misunderstand that?

Yup, thats right. Maybe thats strange though?

May be worth seeing how other people are solving this.

Yea, agreed. I'll look around some. Particularly interested in how the proxy does this.

cosgroveb commented 6 years ago

It seems like retry policies are coming as an official part of the protocol that this library could make-use of in the future: https://github.com/grpc/proposal/blob/master/A6-client-retries.md#maximum-number-of-retries

Something to keep an eye on at least.

davissp14 / etcdv3-ruby

Etcd does not respect retries if the endpoint(s) cannot be reached #107