How to use for master failover in high availability scenario on a master/replica sentinel environment?

So we have a Redis setup with 1 master, 2 slaves and 3 sentinels (one on each machine).

In StackExchange.Redis configuration, all 3 addresses are specified separated by a ',' as mentioned on the Configuration documentation page.

Here's the configuration:

            var settings = ConfigurationOptions.Parse("10.255.0.20,10.255.0.21,10.255.0.22");
            settings.ConnectRetry = 3;
            settings.ConnectTimeout = 10000;
            settings.SyncTimeout = 20000;

            Multiplexer = ConnectionMultiplexer.Connect(settings);

Redis by itself seems to work fine. When we shut down the Redis master, I can clearly see with redis-cli that the sentinels do their job and promote a slave to master to take over.

I also see that doing a ConnectionMultiplexer.Connect() will find the proper active server among the 3 to connect to when starting.

Unfortunately, when the server it is connected to is shut down while the application is running, the operation seems to time out and throw an exception, which in my case crashes the process. Is the exception "normal", needing to be caught and retried manually, or should the library automatically try to connect to another server when the selected one shuts down?

I see in the changelog that Sentinel support was added in version 2.1.0, but there's no documentation for it. What exactly does it do? Does it remove the need to write all 3 addresses in the config? Is it necessary for high availability usage, or was that working previously anyway? Is there any different function or configuration to adopt in the code to use it?

Just a bit of documentation on how this should be set up would be nice.

Here's the exception call stack in the Windows event viewer:

Description: The process was terminated due to an unhandled exception.
Exception Info: System.Net.Sockets.SocketException
   at Pipelines.Sockets.Unofficial.Internal.Throw.Socket(Int32)
   at Pipelines.Sockets.Unofficial.SocketAwaitableEventArgs.GetResult()
   at Pipelines.Sockets.Unofficial.SocketConnection+<DoReceiveAsync>d__74.MoveNext()

Exception Info: Pipelines.Sockets.Unofficial.ConnectionResetException
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.IO.Pipelines.PipeCompletion.ThrowLatchedException()
   at System.IO.Pipelines.Pipe.GetReadResult(System.IO.Pipelines.ReadResult ByRef)
   at System.IO.Pipelines.Pipe.GetReadAsyncResult()
   at StackExchange.Redis.PhysicalConnection+<ReadFromPipe>d__110.MoveNext()

Exception Info: StackExchange.Redis.RedisConnectionException

Exception Info: System.AggregateException

Exception Info: StackExchange.Redis.RedisConnectionException
   at StackExchange.Redis.ConnectionMultiplexer.ExecuteSyncImpl[[System.Int64, mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089]](StackExchange.Redis.Message, StackExchange.Redis.ResultProcessor`1<Int64>, StackExchange.Redis.ServerEndPoint)
   at StackExchange.Redis.RedisBase.ExecuteSync[[System.Int64, mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089]](StackExchange.Redis.Message, StackExchange.Redis.ResultProcessor`1<Int64>, StackExchange.Redis.ServerEndPoint)
   at MySystem.Server.Security.Membership.Sessions.Remove(System.String)
   at System.EventHandler.Invoke(System.Object, System.EventArgs)
   at System.ServiceModel.Channels.CommunicationObject.OnClosed()

I see that the "Configuration" documentation page has been updated since I opened this issue, with details on how to connect to the Sentinel directly by specifying serviceName, which detects all connected servers automatically instead of requiring multiple IPs to be set up:

var settings = ConfigurationOptions.Parse("localhost,serviceName=mymaster");
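For the three-machine setup described at the top of this issue, the same Sentinel-based configuration could be written out roughly as follows. This is only a sketch under assumptions: it supposes each machine runs its sentinel on the default port 26379, that the master group is named mymaster in sentinel.conf, and that AbortOnConnectFail = false is acceptable for the application. None of those details are confirmed in this thread.

```csharp
using StackExchange.Redis;

var settings = new ConfigurationOptions
{
    // Name of the monitored master group (assumed to match sentinel.conf).
    ServiceName = "mymaster",
    // Assumption: keep reconnecting in the background rather than failing at startup.
    AbortOnConnectFail = false,
    ConnectRetry = 3,
    ConnectTimeout = 10000,
    SyncTimeout = 20000,
};

// Assumption: one sentinel per machine, each listening on the default port 26379.
settings.EndPoints.Add("10.255.0.20", 26379);
settings.EndPoints.Add("10.255.0.21", 26379);
settings.EndPoints.Add("10.255.0.22", 26379);

// With ServiceName set, the library is expected to ask the sentinels
// for the current master and connect to it.
var multiplexer = ConnectionMultiplexer.Connect(settings);
```

Listing all three sentinels, rather than only localhost, should let the initial lookup succeed even when the local sentinel is the one that is down.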

This is great, but when it comes to the behavior during failover it doesn't seem to change anything.

I did multiple tests, and I see that when killing the master, any read operation keeps succeeding as long as there is a replica. But any write operation (and, strangely, subsequent reads too) triggers an exception until the failover is finished and the replica has successfully switched to master, which can take some time. This delay is governed by the down-after-milliseconds parameter in the sentinel.conf file. After the failover finishes (30 seconds in my case), everything starts working again.

Now, it is totally normal that write operations fail when no master is active. My question remains: what are the best practices, and how should the library be used so that failover is handled as transparently as possible? Should every read/write operation in our application be wrapped in a try/catch retry mechanism so that no request is missed, or is there a way to configure StackExchange.Redis to do that by itself? I see the "ReconnectRetryPolicy" configuration section, but that only covers reconnection to the Redis instances, which already works fine. It does not make the get/set operations themselves retry until that reconnection succeeds.

I keep digging.

I believe what I'm asking for should theoretically be handled by the syncTimeout/asyncTimeout configuration options. By setting an operation timeout longer than the sentinel down-detection timeout, a failover should be handled transparently, and the operation should only time out if the failover takes longer, which would mean no instance is really accessible anymore. Right?

Unfortunately, it doesn't seem to work. My operation doesn't wait for the configured timeout; the exception gets thrown as soon as the request occurs. And looking at the exception raised when I shut down the master, it's not a TimeoutException.

On the first operation after the master shuts down, I receive:

StackExchange.Redis.RedisConnectionException: An unknown error occurred when writing the message
   at StackExchange.Redis.ConnectionMultiplexer.ExecuteSyncImpl[T](Message message, ResultProcessor`1 processor, ServerEndPoint server) in /_/src/StackExchange.Redis/ConnectionMultiplexer.cs:line 2803
   at StackExchange.Redis.RedisTransaction.Execute(CommandFlags flags) in /_/src/StackExchange.Redis/RedisTransaction.cs:line 53

Then, on any subsequent operation, I receive:

StackExchange.Redis.RedisConnectionException: No connection is active/available to service this operation: EXEC; An existing connection was forcibly closed by the remote host, mc: 1/1/0, mgr: 10 of 10 available, clientName: VD-RD-5, IOCP: (Busy=0,Free=1000,Min=8,Max=1000), WORKER: (Busy=1,Free=32766,Min=8,Max=32767), v: 2.1.55.31085

So it doesn't seem to wait and retry finding a master until the configured timeout expires; it throws immediately when it sees that no master is available.

If anyone could step in and provide any information, it would be greatly appreciated.

We do not currently provide an inbuilt retry mechanism for operations. Our view is that it is too risky in the general case, as we cannot make many assumptions about your intent, expectations, and allowable outcomes. Some operations are relatively "safe" to blindly retry; some could be very misleading, and some could be catastrophic. Retrying something when you cannot know the outcome of the original: is hard.

(In reply to Dunge's comment above, posted Wed, 17 Jun 2020, 21:52.)

Thanks for the reply.

So in other words, it's up to the application to decide what to do in case of failure depending on how critical the data is: whether to retry or drop the operation, and how to do so (blocking the thread or fire and forget).

I asked this because other Redis client libraries, ioredis for example, seem to offer this feature (I haven't tried it) via retryStrategy, autoResendUnfulfilledCommands and enableOfflineQueue.

In any case, I lean toward implementing a retry mechanism in my application and wrapping every request in it, with a configurable number of retries or elapsed time. My question is: is there a list of exception types that should be caught, instead of naively catching everything? Would catching only RedisConnectionException be okay, or are there others to look for? Also, is it possible to receive this exception even though the server actually processed the operation successfully, so that my mechanism would send it twice? (This wouldn't be good.)

Thanks

RedisServerException actively came from the server and should be safe to interpret as "the server got it and didn't process it", but anything else can only mean "we don't know". The message might never have left your machine. It might have left your machine and died en-route. It might have reached the server but not been processed. It might have reached the server and been processed. We just don't know.
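Given that caveat, an application-side retry is only reasonable for operations that are safe to repeat. A minimal sketch of such a wrapper follows; RetryHelper and its parameters are hypothetical names for illustration, not part of StackExchange.Redis. It is generic over the exception type, so the application chooses which failures count as retryable, and it blocks the calling thread between attempts.

```csharp
using System;
using System.Threading;

// Hypothetical helper, not part of StackExchange.Redis.
// Retries an operation when it throws TException, up to maxAttempts attempts,
// sleeping for `delay` between attempts. Only use it for operations that are
// safe to run more than once, since a failed attempt may still have reached the server.
public static class RetryHelper
{
    public static T Execute<TException, T>(
        Func<T> operation, int maxAttempts, TimeSpan delay)
        where TException : Exception
    {
        if (operation == null) throw new ArgumentNullException(nameof(operation));
        if (maxAttempts < 1) throw new ArgumentOutOfRangeException(nameof(maxAttempts));

        for (int attempt = 1; ; attempt++)
        {
            try
            {
                return operation();
            }
            catch (TException) when (attempt < maxAttempts)
            {
                // Retryable failure with attempts remaining: wait, then loop.
                // On the final attempt the filter is false, so the exception propagates.
                Thread.Sleep(delay);
            }
        }
    }
}
```

With StackExchange.Redis this might be called as RetryHelper.Execute<RedisConnectionException, bool>(() => db.KeyDelete(key), maxAttempts: 5, delay: TimeSpan.FromSeconds(2)), where db is an IDatabase. Deleting a key is a reasonable candidate because repeating it leaves the same end state; an increment or a list push would not be.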

Thanks again! To be honest, from an external perspective both of your answers don't really inspire confidence in the stability of it all, but I appreciate you telling it how it is. I will move ahead with implementing a system to protect our operations, splitting the critical ones from the non-critical ones, and then see if our service can survive a failover.

I'll now close the ticket since it was more of a question than a work item for you.


Is there any solution for this issue?

StackExchange / StackExchange.Redis

How to use for master failover in high availability scenario on a master/replica sentinel environment? #1478