StackExchange / StackExchange.Redis

General purpose redis client
https://stackexchange.github.io/StackExchange.Redis/
Other
5.86k stars 1.51k forks source link

Sentinel +switch-master announcement missed if master dies while connected #2487

Open Kilowhisky opened 1 year ago

Kilowhisky commented 1 year ago

I'm testing Redis, StackExchange.Redis, and Redis-Sentinel in order to setup a appropriate HA environment.

I've noticed that when i stand up a sentinel cluster as described in these docs: https://hub.docker.com/r/bitnami/redis-sentinel/ and i shutdown the master after a multiplexer connection has been successfully established by a client using StackExchange.Redis. The multiplexer fails to realize that the failover has occurred and continues to queue messages for the now non-existent leader.

Funny thing is If i manually trigger a failover using the sentinel-cli with SENTINEL FAILOVER mymaster the multiplexer realizes the change and appropriately updates the leader.

I looked into the code and noticed that the +switch-master subscription hook is never fired when the primary Redis is shutdown even if the sentinel associated is still up and running. In the situation where i manually trigger the failover, the hook properly fires.

https://github.com/StackExchange/StackExchange.Redis/blob/f6171a19a0be078c6528b4631d42dfa4adcc8564/src/StackExchange.Redis/ConnectionMultiplexer.Sentinel.cs#L33-L62

I've confirmed that the sentinel is still issuing the +switch-master command in both scenarios.

I've tried multiple ways to setup the multiplexer, here is my last iteration.

var opts = new ConfigurationOptions
            {
                SyncTimeout = 60000,
                AbortOnConnectFail = false,
                Ssl = false,
                Password = config.Password,
                ReconnectRetryPolicy = new LinearRetry(2000),
                ConnectRetry = 30,
                ServiceName = "mymaster",
                ConfigCheckSeconds = 10
            };
            opts.EndPoints.Add("redis-sentinel-1", 26379);
            opts.EndPoints.Add("redis-sentinel-2", 26379);
            opts.EndPoints.Add("redis-sentinel-3", 26379);

            var log = new LoggerWriter(_log, LogLevel.Information);

            //_multiplexer = await ConnectionMultiplexer.ConnectAsync(opts, log);
            _multiplexerSentinel = await ConnectionMultiplexer.SentinelConnectAsync(opts, log);
            _multiplexer = _multiplexerSentinel.GetSentinelMasterConnection(opts, log);

            _multiplexerSentinel.ConfigurationChanged += (s, e) =>
            {
                _log.LogError($"Sentinel configuration changed (Sentinel): {e.EndPoint}");
                _log.LogInformation($"New server status: {_multiplexerSentinel.GetStatus()}");
            };
            _multiplexerSentinel.ConnectionFailed += (s, e) =>
            {
                _log.LogError($"Connection failed (Sentinel): {e.ConnectionType}, {e.FailureType}, {e.EndPoint}, {e.Exception}");
                _log.LogInformation($"New server status: {_multiplexerSentinel.GetStatus()}");
            };
            _multiplexerSentinel.ConnectionRestored += (s, e) =>
            {
                _log.LogError($"Connection restored (Sentinel): {e.ConnectionType}, {e.FailureType}, {e.EndPoint}, {e.Exception}");
                _log.LogInformation($"New server status: {_multiplexerSentinel.GetStatus()}");
            };
            _multiplexerSentinel.ConfigurationChangedBroadcast += (s, e) =>
            {
                _log.LogError($"Connection changed via broadcast (Sentinel): {e.EndPoint}");
                _log.LogInformation($"New server status: {_multiplexerSentinel.GetStatus()}");
            };

            _multiplexer.ConfigurationChanged += (s, e) =>
            {
                _log.LogError($"Configuration changed: {e.EndPoint}");
                _log.LogInformation($"New server status: {_multiplexer.GetStatus()}");
                //var info = _multiplexer.GetServers().Where(x => x.IsConnected).Select(x => x.InfoRaw()).ToArray();
               // _multiplexer.Configure(new LoggerWriter(_log, LogLevel.Information));
                //_log.LogInformation($"Server info: {info}");
                // _multiplexer.PublishReconfigure();
            };
            _multiplexer.ConnectionFailed += (s, e) =>
            {
                _log.LogError($"Connection failed: {e.ConnectionType}, {e.FailureType}, {e.EndPoint}, {e.Exception}");
                _log.LogInformation($"New server status: {_multiplexer.GetStatus()}");
            };
            _multiplexer.ConnectionRestored += (s, e) =>
            {
                _log.LogError($"Connection restored: {e.ConnectionType}, {e.FailureType}, {e.EndPoint}, {e.Exception}");
                _log.LogInformation($"New server status: {_multiplexer.GetStatus()}");
            };
            _multiplexer.ConfigurationChangedBroadcast += (s, e) =>
            {
                _log.LogError($"Connection changed via broadcast: {e.EndPoint}");
                _log.LogInformation($"New server status: {_multiplexer.GetStatus()}");
            };
Kilowhisky commented 1 year ago

This workaround works but its more than a bit hacky...

My failover period is 20 seconds so 30 should be enough time to elect a new leader.

            var reconfiguring = false;
            _multiplexer.ConnectionFailed += (s, e) =>
            {
                _log.LogError($"Connection failed: {e.ConnectionType}, {e.FailureType}, {e.EndPoint}, {e.Exception}");

                _ = Task.Run(async () =>
                {
                    try
                    {
                        if (reconfiguring) return;
                        reconfiguring = true;
                        await Task.Delay(TimeSpan.FromSeconds(30));
                        await _multiplexer.ReconfigureAsync("connection_lost");
                        await _multiplexer.PublishReconfigureAsync();
                        await Task.Delay(TimeSpan.FromSeconds(5));
                        _log.LogInformation($"Reconfigured status: {_multiplexer.GetStatus()}");
                    }finally { reconfiguring = false; }
                });
            };
NickCraver commented 1 year ago

The intended usage for Sentinel today is to connect to the sentinel endpoints with a service name, and the ConnectionMultiplexer you get back will be connected to the current primary. When the primary switches, it'll follow (there are 2 multiplexers under the covers, with 1 listening to Sentinel).

It looks like you may have tried this given the commented out line in code there looking to do this. What was or wasn't happening with that setup and nothing else?