Open megakid opened 2 years ago
We think this is likely because we haven't set the RetryAuthenticationOnTimeout flag. I do think that if DefaultUserCredentials are set, the connection state should not be allowed to proceed to ConnectingPhase.Identification unless ConnectingPhase.Authentication completes successfully.
Not asserting that means transient errors (e.g. a timeout) that aren't surfaced to user code - except via the AuthenticationFailed event - are silently ignored and cause unexpected, unrecoverable behaviour for the lifetime of the EventStore client object. The addition of RetryAuthenticationOnTimeout seems to mitigate one failure mode but, if I understand the current code correctly, if the server responds with NotAuthenticated, the client still continues to connect.
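The invariant proposed above can be sketched as a small state machine. This is a minimal, language-agnostic illustration in Python, not the actual client internals: only ConnectingPhase, DefaultUserCredentials, RetryAuthenticationOnTimeout, and NotAuthenticated come from the issue; the class, enum members, and method names are invented for the example.

```python
from enum import Enum, auto

class ConnectingPhase(Enum):
    CONNECTION_ESTABLISHING = auto()
    AUTHENTICATION = auto()
    IDENTIFICATION = auto()
    CONNECTED = auto()

class AuthResult(Enum):
    SUCCESS = auto()
    TIMED_OUT = auto()          # transient; a retry may succeed
    NOT_AUTHENTICATED = auto()  # definitive rejection from the server

class Connection:
    """Hypothetical model of the client's connecting-phase handshake."""

    def __init__(self, has_default_credentials, retry_auth_on_timeout=True):
        self.has_default_credentials = has_default_credentials
        self.retry_auth_on_timeout = retry_auth_on_timeout
        self.phase = ConnectingPhase.AUTHENTICATION

    def on_auth_result(self, result):
        # Proposed invariant: when DefaultUserCredentials are set, only a
        # successful authentication may advance the phase to Identification.
        if not self.has_default_credentials or result is AuthResult.SUCCESS:
            self.phase = ConnectingPhase.IDENTIFICATION
        elif result is AuthResult.TIMED_OUT and self.retry_auth_on_timeout:
            # Transient failure: stay in Authentication and retry,
            # rather than silently proceeding.
            self.phase = ConnectingPhase.AUTHENTICATION
        else:
            # NotAuthenticated (or a timeout without retry enabled) should
            # fail the handshake visibly instead of continuing to connect.
            raise ConnectionError(
                "authentication failed; refusing to proceed to Identification")
```

Under this model the reported bug corresponds to the else-branch being absent: a NotAuthenticated response would fall through and the connection would continue to Identification anyway.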
Describe the bug
We cannot reproduce this reliably, but when upgrading our 3-node UAT clusters from V5 to V21 we noticed that some of our services - which we expected to reconnect automatically (as with a master failover) - started spamming logs heavily, consuming high CPU, etc.
It seems the client-side EventStoreConnection gets into a state where the connection is marked as not authenticated (although the credentials did not change during the cluster rollout). From this state the connection object is unrecoverable and needs recreating; we did this via a service restart (everything works after a restart).
We have noticed this behaviour in more than one service and across a couple of our clusters. An educated guess is that 10% of the ES clients we performed the cluster upgrade on suffered this issue, with the other 90% reconnecting perfectly and continuing to subscribe/read/append to streams.
To Reproduce
Steps to reproduce the behavior:
Expected behavior
Clients reconnect without auth issues.
Actual behavior
As above.
Config/Logs/Screenshots
Stack traces are from a few common operations:
EventStore details