Context:
We have a cluster of 3 masters and 3 replicas (one replica per master). We are simulating a situation in which part of the cluster is down (in this example we shut down the 3 replicas). Our .NET clients have all 6 nodes in their configuration, and the option "resolveDns=true" is set. (resolveDns is not crucial here; the nodes know each other by IP address, so I set it to avoid duplicate connections by hostname and by IP after discovery.)
Nodes that are down are unresponsive, so they exhaust the default connection timeout of 5000 milliseconds. They exhaust even much higher connection timeouts (tens of seconds). In previous versions of StackExchange.Redis (<2.2.xx) this behavior caused problems: after .Connect(), some of the nodes were randomly unselectable with the reason DidNotRespond. Upgrading to 2.6.xx mostly solved this (for which I'm glad) thanks to the queueing feature. However, the underlying issue is still there and can cause problems in some scenarios, which is why I'm posting this. I found one use case in which it manifests, but you might be aware of more.
Issue:
The first iteration of connecting to nodes takes 5 seconds; it connects to the 3 healthy nodes but fails to connect to the remaining 3 unresponsive ones.
The second iteration issues PINGs to the healthy nodes because of sendTracerIfConnected: true in server.OnConnectedAsync(log, sendTracerIfConnected: true, autoConfigureIfConnected: reconfigureAll);
Since the first iteration exhausted the timeout, the second iteration exits immediately without waiting for those PINGs to complete.
For a brief moment after .Connect(), some of the healthy nodes are unselectable even though their connections succeeded in the first iteration.
This causes issues in some scenarios (e.g. SET with BacklogPolicy = FailFast).
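One way to observe the window described above might be to print each endpoint's state immediately after .Connect() returns. This is only a minimal sketch: ClusterProbe, BuildOptions and PrintEndpointState are hypothetical names, the hosts and password are the same placeholders used in the reproduction, and it is an assumption that IServer.IsConnected reflects the transient "unselectable" state at all.

```csharp
using System;
using System.Net;
using StackExchange.Redis;

// Hypothetical diagnostic helper, not part of StackExchange.Redis.
public static class ClusterProbe
{
    // Builds the same options as the reproduction (placeholder hosts and password).
    public static ConfigurationOptions BuildOptions()
    {
        var options = ConfigurationOptions.Parse(
            "server1:6116,server2:6116,server3:6116,server4:6116,server5:6116,server6:6116,password=some_password,resolveDns=true");
        options.BacklogPolicy = BacklogPolicy.FailFast;
        return options;
    }

    // Prints one line per known endpoint with its current connection state.
    // Assumption: IsConnected is a useful proxy for "selectable" right after Connect().
    public static void PrintEndpointState(IConnectionMultiplexer connection)
    {
        foreach (EndPoint endpoint in connection.GetEndPoints())
        {
            IServer server = connection.GetServer(endpoint);
            Console.WriteLine($"{endpoint}: connected={server.IsConnected}, replica={server.IsReplica}");
        }
    }
}
```

Calling ClusterProbe.PrintEndpointState(connection) right after ConnectionMultiplexer.Connect(ClusterProbe.BuildOptions()) and again a moment later would show whether the reported state of the healthy masters changes shortly after connecting.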
Reproduction code:
var options = ConfigurationOptions.Parse("server1:6116,server2:6116,server3:6116,server4:6116,server5:6116,server6:6116,password=some_password,resolveDns=true");
options.BacklogPolicy = BacklogPolicy.FailFast;
var connection = ConnectionMultiplexer.Connect(options, new LoggerWriter());
var status = connection.GetStatus();
connection.GetDatabase(0).StringSet("test2", "test2");
Exception:
No connection (requires writable - not eligible for replica) is active/available to service this operation: SET test2; It was not possible to connect to the redis server(s). Error connecting right now. To allow this multiplexer to continue retrying until it's able to connect, use abortConnect=false in your connection string or AbortOnConnectFail=false; in your code. ConnectTimeout, mc: 1/1/0, mgr: 10 of 10 available, clientName: MACHINENAME(SE.Redis-v2.6.122.38350), PerfCounterHelperkeyHashSlot: 8899, IOCP: (Busy=0,Free=1000,Min=16,Max=1000), WORKER: (Busy=0,Free=32767,Min=16,Max=32767), POOL: (Threads=23,QueuedItems=0,CompletedItems=526,Timers=17), v: 2.6.122.38350
I opened PR #2525, which seems to fix the problem in my case.
Additionally, I found two more scenarios in which this issue manifests:
when commands are executed in NoRedirect mode (which, as I learned, is the default in the Hangfire.Pro.Redis package)
when commands are executed in a transaction (MULTI). In that case the "MOVED ..." error is hidden behind "StackExchange.Redis.RedisServerException: EXECABORT Transaction discarded because of previous errors.", so I assume StackExchange.Redis is not aware of it and cannot requeue the command.
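The two scenarios above could be exercised roughly as follows. This is a sketch under assumptions: AdditionalScenarios and its method names are hypothetical, the key and value are reused from the reproduction, and whether either call actually fails depends on hitting the brief window right after .Connect().

```csharp
using System;
using StackExchange.Redis;

// Hypothetical helper illustrating the two additional scenarios.
public static class AdditionalScenarios
{
    // Scenario 1: the command is sent with CommandFlags.NoRedirect,
    // so a MOVED reply is not followed to another node.
    public static void SetWithNoRedirect(IDatabase db)
    {
        db.StringSet("test2", "test2", flags: CommandFlags.NoRedirect);
    }

    // Scenario 2: the command is queued in a transaction (MULTI/EXEC).
    // Returns true when the transaction was committed.
    public static bool SetInTransaction(IDatabase db)
    {
        ITransaction transaction = db.CreateTransaction();
        // The task completes only after Execute(); it is intentionally not awaited here.
        _ = transaction.StringSetAsync("test2", "test2");
        return transaction.Execute();
    }
}
```

Both methods would be called with connection.GetDatabase(0) immediately after .Connect(), in the same way as the StringSet call in the reproduction.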