StackExchange / StackExchange.Redis

General purpose redis client
https://stackexchange.github.io/StackExchange.Redis/
Other
5.89k stars 1.51k forks source link

Reconfigure() does not reassign queue to new primary if a change occurs #2488

Closed Kilowhisky closed 1 year ago

Kilowhisky commented 1 year ago

I am trying to use StackExchange.Redis with Redis Sentinel in a HA environment.

I've encountered a scenario where the outgoing message queue is not transferred to the new primary. This causes all messages in that queue to eventually timeout with the following...

"System.AggregateException", "exception_message": "One or more errors occurred. (The message timed out in the backlog attempting to send because no connection became available (30000ms) - 
Last Connection Exception: It was not possible to connect to the redis server(s). 
ConnectTimeout, command=UNKNOWN, timeout: 30000, inst: 0, qu: 1, qs: 0, aw: False, bw: CheckingForTimeout, rs: NotStarted, ws: Initializing, last-in: 0, cur-in: 0, sync-ops: 0, async-ops: 184010, serverEndpoint: 172.19.0.2:9851, conn-sec: n/a, aoc: 0, mc: 1/1/0, mgr: 9 of 10 available, clientName: 298b0cdbaabc(SE.Redis-v2.6.114.25774), IOCP: (Busy=0,Free=1000,Min=1,Max=1000), WORKER: (Busy=12,Free=32755,Min=24,Max=32767), POOL: (Threads=33,QueuedItems=23,CompletedItems=478572), v: 2.6.114.25774 
(Please take a look at this article for some common client-side issues that can cause timeouts: https://stackexchange.github.io/StackExchange.Redis/Timeouts))

At the exact time of the error, this is the status of the Multiplexer: It clearly shows that there is a connected master ready and waiting to accept the message.

Multiplexer status is:
 22:37:38.4063: Endpoint Summary:\n
22:37:38.4064:   172.19.0.2:9851: Standalone v1.31.0, primary; keep-alive: 00:01:00; int: Connecting; sub: Connecting; not in use: DidNotRespond\n
22:37:38.4064:   172.19.0.2:9851: int ops=265098, qu=0, qs=0, qc=0, wr=0, socks=25; sub ops=9, qu=0, qs=0, qc=0, wr=0, socks=25\n
22:37:38.4064:   172.19.0.2:9851: Circular op-count snapshot; int: 265098 (0.00 ops/s; spans 10s); sub: 9 (0.00 ops/s; spans 10s)\n22:37:38.4064:   172.19.0.3:9851: Standalone v1.31.0, primary; keep-alive: 00:01:00; int: ConnectedEstablished; sub: ConnectedEstablished, 1 active\n
22:37:38.4064:   172.19.0.3:9851: int ops=22007, qu=0, qs=4, qc=0, wr=0, socks=1, non-pref=2; sub ops=10, qu=0, qs=0, qc=0, wr=0, subs=1, socks=1\n
22:37:38.4064:   172.19.0.3:9851: Circular op-count snapshot; int: 13329+912+602+642+918+946+872+970+1131+951+734=22007 (867.80 ops/s; spans 10s); sub: 10 (0.00 ops/s; spans 10s)\n22:37:38.4064:   172.19.0.4:9851: Standalone v1.31.0, replica; keep-alive: 00:01:00; int: ConnectedEstablished; sub: ConnectedEstablished, 1 active\n
22:37:38.4064:   172.19.0.4:9851: int ops=29, qu=0, qs=0, qc=0, wr=0, socks=1; sub ops=10, qu=0, qs=0, qc=0, wr=0, subs=1, socks=1\n
22:37:38.4064:   172.19.0.4:9851: Circular op-count snapshot; int: 22+7=29 (0.70 ops/s; spans 10s); sub: 10 (0.00 ops/s; spans 10s)\n22:37:38.4064: Sync timeouts: 0; async timeouts: 0; fire and forget: 0; last heartbeat: 0s ago\n

All my writes have the DemandMaster flag set.

Here is my configuration:

var opts = new ConfigurationOptions
            {
                SyncTimeout = 30000,
                AbortOnConnectFail = false,
                Ssl = false,
                Password = config.Password,
                //ReconnectRetryPolicy = new LinearRetry(2000),
                ConnectRetry = 5,
                ServiceName = "tile38",
                //ConfigCheckSeconds = 10
            };
            opts.EndPoints.Add("tile38_1_sentinel", 26379);
            opts.EndPoints.Add("tile38_2_sentinel", 26379);
            opts.EndPoints.Add("tile38_3_sentinel", 26379);

            var log = new RedisLogWriter(_log, LogLevel.Trace);

            _multiplexerSentinel = await ConnectionMultiplexer.SentinelConnectAsync(opts, log);
            _multiplexerSentinel.ConfigurationChanged += (s, e) => LogMultiplexerEvent("ConfigurationChanged", s, e);
            _multiplexerSentinel.ConnectionFailed += (s, e) => LogMultiplexerEvent("ConnectionFailed", s, e);
            _multiplexerSentinel.ConnectionRestored += (s, e) => LogMultiplexerEvent("ConnectionRestored", s, e);
            _multiplexerSentinel.ConfigurationChangedBroadcast += (s, e) => LogMultiplexerEvent("ConfigurationChangedBroadcast", s, e);

            _multiplexer = _multiplexerSentinel.GetSentinelMasterConnection(opts, log);
            _multiplexer.ConfigurationChanged += (s, e) => LogMultiplexerEvent("ConfigurationChanged", s, e);
            _multiplexer.ConnectionFailed += (s, e) => LogMultiplexerEvent("ConnectionFailed", s, e);
            _multiplexer.ConnectionRestored += (s, e) => LogMultiplexerEvent("ConnectionRestored", s, e);
            _multiplexer.ConfigurationChangedBroadcast += (s, e) => LogMultiplexerEvent("ConfigurationChangedBroadcast", s, e);

I'm having to manually reconfigure the connection due to this issue: https://github.com/StackExchange/StackExchange.Redis/issues/2487

Here is my reconfiguration code:

public async Task DelayReconfigure()
{
    try
    {
        if (_isReconfiguring) return;
        _isReconfiguring = true;

        await Task.Delay(TimeSpan.FromSeconds(30));

        await _multiplexer.ReconfigureAsync("connection_lost");
        await _multiplexerSentinel.ReconfigureAsync("connection_lost");
        await _multiplexer.PublishReconfigureAsync();
        await _multiplexerSentinel.PublishReconfigureAsync();

        await Task.Delay(TimeSpan.FromSeconds(5));
        _log.LogInformation($"Reconfigured status: {_multiplexer.GetStatus()}");
        _log.LogInformation($"Reconfigured status (Sentinel): {_multiplexerSentinel.GetStatus()}");
    }
    finally { _isReconfiguring = false; }
}
Kilowhisky commented 1 year ago

I believe the issue occurs because of this https://github.com/StackExchange/StackExchange.Redis/blob/f6171a19a0be078c6528b4631d42dfa4adcc8564/src/StackExchange.Redis/ServerSelectionStrategy.cs#L168-L175

Redis servers connected using the Sentinel have a ServerType of Standalone. Which means TryResend always fails.

There probably needs to be a new ServerType of SentinelMember or something.

NickCraver commented 1 year ago

You're correct this wouldn't work - we won't try re-sending across different endpoints. This, in a race, can have very unintended and impossible to debug consequences, so we fail to the user and let them retry if wanted.

Overall, it's probably doubtful we'll add much more support (if any) for Sentinel behaviors because they are so complex, layered, and require more instances. The last I'm aware, Redis itself isn't developing this any further either, with cluster being the intended path going forward. Apologies this isn't working, but the complexity involved to support it already has a high cost. Changing how this works (it swaps out the endpoint entirely on a connection shift today) is a huge change we're just unlikely to make.

NickCraver commented 1 year ago

Closing this out to tidy up

ecooke-macu commented 12 months ago

@NickCraver do you have any sources for your statement about Sentinel no longer being actively developed by Redis? I can't seem to find anything about that. We're about to start implementing HA in Redis and were seriously looking at Sentinel due to the ease of configuration. We also heavily use the StackExchange.Redis library and this bug is unfortunate. Does this same thing happen when using their cluster type of HA? Is that fully supported by StackExchange.Redis?