Closed: smatsson closed this issue 11 months ago
Hi @smatsson, retrying the operation should result in the connection being established to the new leader and succeeding. Locally I sometimes have to retry twice (offhand I'm actually not certain why that is), but are you seeing it continue to fail after lots of retries?
Something that has changed since v5 is that the v5 clients could be configured to automatically issue lots of retries, but in the new gRPC clients the approach has been to leave retries up to the user. It's possible that we might refine this in the future, though.
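For contrast, the kind of v5-era configuration being referred to looked roughly like this (builder method names recalled from the 5.x `EventStore.ClientAPI` TCP client, so treat the exact calls as approximate):

```csharp
// Rough sketch of v5-era automatic retries (EventStore.ClientAPI, TCP client).
// Method names recalled from the 5.x API -- treat as approximate.
using System;
using EventStore.ClientAPI;

var settings = ConnectionSettings.Create()
    .LimitRetriesForOperationTo(10)   // or .KeepRetrying() for unbounded retries
    .Build();

var connection = EventStoreConnection.Create(settings, new Uri("tcp://localhost:1113"));
await connection.ConnectAsync();
```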
Hi @timothycoleman and thanks for the reply!
From a user perspective I assumed that the client would do automatic retries since we feed it all the nodes, but on the other hand I can definitely see the complexity this brings to the client code. Doing our own retries is OK with us 👍
From our logs it seems that our retry does not connect to the new leader. I can only guess that this is because the exception isn't treated as a "node is down" kind of exception, so potentially an additional retry here would solve the issue? Are there any other inputs to how the client determines which node to connect to that we need to take into consideration?
I looked a bit more carefully into why it might fail twice. Essentially, when the client loses the connection to the leader, it starts picking nodes from the connection string to ask which nodes are in the cluster and which is the leader. It can be the case that the node it picks hasn't realised that the leader has gone down yet, resulting in the client attempting to connect to the old leader again. I expect we can refine this process, but it does stabilize on the new leader. A rough sketch of the idea is below.
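A conceptual sketch of that rediscovery step (this is not the actual client internals; `GetGossipAsync` and its return type are hypothetical names used purely to illustrate why a second retry can be needed):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Net;
using System.Threading.Tasks;

// Conceptual illustration only -- not the real client code. The randomly
// picked gossip seed may not have noticed the leader change yet, so the
// first rediscovery can still point at the dead node; a later attempt
// (against a more up-to-date seed) then finds the new leader.
async Task<DnsEndPoint> DiscoverLeaderAsync(IReadOnlyList<DnsEndPoint> seeds)
{
    var seed = seeds[Random.Shared.Next(seeds.Count)];      // any node from the connection string
    var gossip = await GetGossipAsync(seed);                // hypothetical gossip request
    return gossip.Members.First(m => m.IsLeader).EndPoint;  // may still be the old leader!
}
```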
Since you're using the client as a singleton (which is good), you should find that either the retry or subsequent calls to your Append(...) method start succeeding. If you're seeing that subsequent calls to Append start succeeding, then it's likely that just retrying some more would allow the failed write to succeed. If, however, subsequent calls to Append all continue to fail, then something is going wrong that we haven't identified yet.
Reconnection is triggered on a NotLeaderException and also on an RpcException with StatusCode Unavailable, so it looks like it should be reconnecting.
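In caller code, those two triggers translate into a retry filter along these lines (a minimal sketch, assuming the official `EventStore.Client` and `Grpc.Core` packages; the stream name, events, retry count and delay are placeholders):

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;
using EventStore.Client;
using Grpc.Core;

// Minimal sketch: retry an append only on the exceptions that also
// trigger the client's rediscovery of the leader.
async Task AppendWithRetryAsync(EventStoreClient client, string stream,
    IReadOnlyList<EventData> events, int maxAttempts = 3)
{
    for (var attempt = 1; ; attempt++)
    {
        try
        {
            await client.AppendToStreamAsync(stream, StreamState.Any, events);
            return;
        }
        catch (Exception ex) when (attempt < maxAttempts &&
            (ex is NotLeaderException ||
             ex is RpcException { StatusCode: StatusCode.Unavailable }))
        {
            // Give the cluster a moment to elect and gossip the new leader.
            await Task.Delay(TimeSpan.FromMilliseconds(500 * attempt));
        }
    }
}
```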
Thank you very much for investigating further. I put together a small console app to test how reconnection behaves when doing retries, and as you pointed out, the client reconnects to the new leader once the cluster is done with leader election. This can of course vary depending on network latency between the nodes etc., but in my test the new leader was connected to sooner or later (most often sooner). In our real code we opted to do three retries at delays of 500 ms, 2 sec and 5 sec, as sketched below.
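For reference, that schedule might be implemented along these lines (illustrative only, our actual code differs; `client`, `stream` and `events` are assumed in scope, with the same usings as the sketch above):

```csharp
// Illustrative version of the schedule described above:
// three retries at 500 ms, 2 s and 5 s.
var delays = new[]
{
    TimeSpan.FromMilliseconds(500),
    TimeSpan.FromSeconds(2),
    TimeSpan.FromSeconds(5),
};

for (var attempt = 0; ; attempt++)
{
    try
    {
        await client.AppendToStreamAsync(stream, StreamState.Any, events);
        break;
    }
    catch (Exception ex) when (attempt < delays.Length &&
        (ex is NotLeaderException ||
         ex is RpcException { StatusCode: StatusCode.Unavailable }))
    {
        await Task.Delay(delays[attempt]); // 500 ms, then 2 s, then 5 s
    }
}
```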
A friendly request would be to mention somewhere in the docs that we need to do our own retries, as this wasn't clear.
Once again, thank you for your help and time!
You're welcome, and thanks for the feedback! We're actually just beginning a review of the documentation in general, so I'll make sure this gets passed on.
Hi! We have run into an odd case where the client does not fall back to the new leader when restarting the current leader node. I'm not really sure if this is due to the server, the client or our code, so any help would be appreciated.
We have a cluster running ES 23.6.0. The cluster consists of three Windows machines: one leader, two followers. EventStore.Client.Grpc.Streams 23.1.0 is used as the client. When applying Windows updates we start with a follower and wait for ES to be completely up and running before moving on to the next node. This means waiting for it to sync with the other nodes and for index rebuilding + chunk verification to finish.
Restarting each follower works just as it should. Once we restart the leader node, some requests to our API fail with an RpcException indicating that the leader node cannot be connected to, does not respond, etc. This seems reasonable, as the leader machine is restarting at that point. What I'm curious about is why the client does not reconnect to the new leader instead of retrying the old, now down, leader over and over again. I would assume that since we feed the client the complete connection string (containing all nodes), it would switch to another node when it determines that the current node is down? We previously ran ES 5.x for many years and never had any similar issues.
Connection string looks like this:
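(The actual string was omitted from this thread; an esdb connection string for a three-node cluster generally has this shape, with the hostnames below being placeholders:)

```
esdb://node1.example.com:2113,node2.example.com:2113,node3.example.com:2113?tls=true
```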
Our code looks like this. The EventStore class is injected as a singleton.
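(The original snippet was also omitted; a minimal sketch of what such a singleton wrapper might look like, with hypothetical names throughout:)

```csharp
using System.Text;
using System.Threading.Tasks;
using EventStore.Client;

// Hypothetical reconstruction of the kind of singleton wrapper described
// above -- the real class was not preserved in this thread.
public class EventStore
{
    private readonly EventStoreClient _client;

    public EventStore(string connectionString) =>
        _client = new EventStoreClient(EventStoreClientSettings.Create(connectionString));

    public async Task Append(string stream, string eventType, string json)
    {
        var eventData = new EventData(Uuid.NewUuid(), eventType, Encoding.UTF8.GetBytes(json));
        await _client.AppendToStreamAsync(stream, StreamState.Any, new[] { eventData });
    }
}
```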
We can see in our logs that we logged "GrpcException: Retry Append..." and that both the first call and the retry fail for similar reasons. For example:
Followed by, on the retry:
In one or two entries, we also got the following on a retry:
```
Grpc.Core.RpcException: Status(StatusCode="Unavailable", Detail="Server not ready")
```
As pointed out in the beginning, I'm not sure if this is an issue with ES or our code. Any help is greatly appreciated 🙂