EventStore / EventStoreDB-Client-Dotnet-Legacy


Unrecoverable NotAuthenticatedException during cluster upgrade #25

Open megakid opened 2 years ago

megakid commented 2 years ago

Describe the bug

We cannot reproduce this reliably, but when upgrading our 3 node UAT clusters from V5 to V21, we noticed that some of our services, which we expected to reconnect automatically (as with a master failover), started spamming the logs heavily and showing high CPU usage.

It seems the client-side EventStoreConnection gets into a state in which the connection is marked as not authenticated (although the credentials did not change during the cluster rollout). From this state the connection object is unrecoverable and needs recreating. We did this with a service restart (everything works after a restart).

We have noticed this behaviour in more than one service and across a couple of our clusters. An educated guess is that about 10% of the ES clients connected to the clusters we upgraded have suffered this issue, with the other 90% reconnecting perfectly and continuing to subscribe/read/append to streams.

To Reproduce

Steps to reproduce the behavior:

  1. Service running with persistent subscriptions
  2. Upgrade the 3 node cluster to V21 by (as per the v5 -> v21 upgrade notice) shutting down all nodes, then rolling out the v21 nodes + config (keeping the credentials the same)
  3. See that most of the time the clients re-establish the connection, but in a minority of cases they get into a client-side auth state which prevents recovery.

Expected behavior

Clients reconnect without auth issues.

Actual behavior

As above.

Config/Logs/Screenshots

Stack traces are from a few common operations:

EventStore.ClientAPI.Exceptions.NotAuthenticatedException: Not Authenticated
   at async Task<WriteResult> EventStore.ClientAPI.Internal.EventStoreNodeConnection.AppendToStreamAsync(string stream, long expectedVersion, IEnumerable<EventData> events, UserCredentials userCredentials)
EventStore.ClientAPI.Exceptions.NotAuthenticatedException: Not Authenticated
   at async Task<EventStorePersistentSubscriptionBase> EventStore.ClientAPI.EventStorePersistentSubscriptionBase.Start()

EventStore details

megakid commented 2 years ago

We think this is likely because we haven't set the RetryAuthenticationOnTimeout flag. I do think that if DefaultUserCredentials are set, the connection state should not be allowed to proceed to ConnectingPhase.Identification unless ConnectingPhase.Authentication completes successfully.
Not asserting that means that transient errors (e.g. a timeout) that aren't surfaced to user code, except via the AuthenticationFailed event, are silently ignored and cause unexpected, unrecoverable behaviour for the lifetime of the EventStore client object. The addition of RetryAuthenticationOnTimeout seems to mitigate one failure mode but, if I understand the current code correctly, if the server responds with NotAuthenticated, the client still continues to connect.
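
To make the proposal concrete, here is a minimal sketch of the gating rule. It is a simplified stand-in model, not the client's actual state machine: the `Phase` and `AuthOutcome` enums and the `ConnectGate` class are hypothetical names invented for illustration, and `Reconnecting` is only assumed to be a sensible fallback. The "lenient" method mirrors the behaviour described above as I understand it (without RetryAuthenticationOnTimeout), and the "strict" one shows the suggested change.

```csharp
using System;

// Hypothetical, simplified stand-ins; not the library's real types.
public enum Phase { Authentication, Identification, Reconnecting }

public enum AuthOutcome { Authenticated, NotAuthenticated, TimedOut }

public static class ConnectGate
{
    // Behaviour as described in this issue: whatever the outcome of the
    // authentication step, the handshake moves on to Identification.
    public static Phase NextPhaseLenient(AuthOutcome outcome, bool hasDefaultCredentials)
    {
        return Phase.Identification;
    }

    // Suggested behaviour: when default credentials are configured, only a
    // successful authentication may advance to Identification. Any other
    // outcome sends the connection back to reconnect and try again.
    public static Phase NextPhaseStrict(AuthOutcome outcome, bool hasDefaultCredentials)
    {
        if (!hasDefaultCredentials)
        {
            // No credentials were supplied, so there is nothing to assert.
            return Phase.Identification;
        }

        return outcome == AuthOutcome.Authenticated
            ? Phase.Identification
            : Phase.Reconnecting;
    }
}
```

With this rule, `ConnectGate.NextPhaseStrict(AuthOutcome.NotAuthenticated, true)` yields `Phase.Reconnecting`, whereas the lenient variant yields `Phase.Identification` and leaves the connection object in the unauthenticated state for the rest of its lifetime.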