StackExchange / StackExchange.Redis

General purpose redis client
https://stackexchange.github.io/StackExchange.Redis/

Connection established but 'The specified endpoint is not defined' #2728

Open Timmoth opened 1 month ago

Timmoth commented 1 month ago

I'm running a three-node redis:7.2-alpine cluster on Kubernetes: 1 master, 2 replicas, and 3 sentinels. My config is here

In .NET I am using this code to connect:

    var sentinelConfig = new ConfigurationOptions
    {
        AbortOnConnectFail = false,
        AllowAdmin = true,
        ConnectTimeout = 5000,
        ConnectRetry = 10,
        // Setting ServiceName makes Connect go through Sentinel to find the primary.
        ServiceName = "mymaster",
        Proxy = Proxy.None,
        Ssl = false,
        KeepAlive = 10,
        ResolveDns = true,
        SyncTimeout = 5000,
        TieBreaker = "",
        Password = redisSettings.Password
    };

    // The configured endpoints are the sentinels, not the Redis nodes themselves.
    foreach (var sentinel in redisSettings.Sentinels)
    {
        sentinelConfig.EndPoints.Add(sentinel.Host, sentinel.Port);
    }

    // Console.Out receives the connection log shown below.
    var redis = ConnectionMultiplexer.Connect(sentinelConfig, Console.Out);
    services.AddSingleton<IConnectionMultiplexer>(redis);

This works fine against a Redis cluster running in Docker Compose, and it has also worked on and off in the k8s cluster. When it doesn't work, the endpoint summary still looks correct. As far as I can tell from the logs, it has connected to the sentinels and resolved the correct IP and port for each Redis endpoint; the exception thrown is the only thing that seems out of place:

06:42:16.9712: All 3 available tasks completed cleanly, IOCP: (Busy=0,Free=1000,Min=50,Max=1000), WORKER: (Busy=1,Free=32766,Min=50,Max=32767), POOL: (Threads=11,QueuedItems=0,CompletedItems=131,Timers=2
06:42:16.9714: Endpoint summary:
06:42:16.9716:   10.244.1.68:6379: Endpoint is (Interactive: ConnectedEstablished, Subscription: ConnectedEstablished)
06:42:16.9717:   10.244.0.85:6379: Endpoint is (Interactive: ConnectedEstablished, Subscription: ConnectedEstablished)
06:42:16.9718:   10.244.0.174:6379: Endpoint is (Interactive: ConnectedEstablished, Subscription: ConnectedEstablished)
06:42:16.9719: Task summary:
06:42:16.9720:   10.244.1.68:6379: Returned with success as Standalone primary (Source: Connection race)
06:42:16.9723:   10.244.0.85:6379: Returned with success as Standalone replica (Source: Already connected)
06:42:16.9724:   10.244.0.174:6379: Returned with success as Standalone replica (Source: Already connected)
06:42:16.9725: Election summary:
06:42:16.9727:   Election: Single primary detected: 10.244.1.68:6379
06:42:16.9728: 10.244.1.68:6379: Clearing as RedundantPrimary
06:42:16.9729: Endpoint Summary:
06:42:16.9731:   10.244.1.68:6379: Standalone v7.2.5, primary; 16 databases; keep-alive: 00:00:10; int: ConnectedEstablished; sub: ConnectedEstablished, 1 active
06:42:16.9732:   10.244.1.68:6379: int ops=13, qu=0, qs=0, qc=0, wr=0, socks=1; sub ops=7, qu=0, qs=0, qc=0, wr=0, subs=1, socks=1
06:42:16.9733:   10.244.1.68:6379: Circular op-count snapshot; int: 0+13=13 (1.30 ops/s; spans 10s); sub: 0+7=7 (0.70 ops/s; spans 10s)
06:42:16.9735:   10.244.0.85:6379: Standalone v7.2.5, replica; 16 databases; keep-alive: 00:00:10; int: ConnectedEstablished; sub: ConnectedEstablished, 1 active
06:42:16.9736:   10.244.0.85:6379: int ops=14, qu=0, qs=0, qc=0, wr=0, socks=1; sub ops=7, qu=0, qs=0, qc=0, wr=0, subs=1, socks=1
06:42:16.9738:   10.244.0.85:6379: Circular op-count snapshot; int: 0+14=14 (1.40 ops/s; spans 10s); sub: 0+7=7 (0.70 ops/s; spans 10s)
06:42:16.9739:   10.244.0.174:6379: Standalone v7.2.5, replica; 16 databases; keep-alive: 00:00:10; int: ConnectedEstablished; sub: ConnectedEstablished, 1 active
06:42:16.9741:   10.244.0.174:6379: int ops=14, qu=0, qs=0, qc=0, wr=0, socks=1; sub ops=7, qu=0, qs=0, qc=0, wr=0, subs=1, socks=1
06:42:16.9742:   10.244.0.174:6379: Circular op-count snapshot; int: 0+14=14 (1.40 ops/s; spans 10s); sub: 0+7=7 (0.70 ops/s; spans 10s)
06:42:16.9744: Sync timeouts: 0; async timeouts: 0; fire and forget: 0; last heartbeat: -1s ago
06:42:16.9745: Starting heartbeat...
06:42:16.9747: Total connect time: 35 ms
Unhandled exception. System.ArgumentException: The specified endpoint is not defined (Parameter 'endpoint')
   at StackExchange.Redis.ConnectionMultiplexer.GetServer(EndPoint endpoint, Object asyncState) in /_/src/StackExchange.Redis/ConnectionMultiplexer.cs:line 1247
   at StackExchange.Redis.ConnectionMultiplexer.GetSentinelMasterConnection(ConfigurationOptions config, TextWriter log) in /_/src/StackExchange.Redis/ConnectionMultiplexer.Sentinel.cs:line 237
   at StackExchange.Redis.ConnectionMultiplexer.SentinelPrimaryConnect(ConfigurationOptions configuration, TextWriter log) in /_/src/StackExchange.Redis/ConnectionMultiplexer.Sentinel.cs:line 134
   at StackExchange.Redis.ConnectionMultiplexer.Connect(ConfigurationOptions configuration, TextWriter log) in /_/src/StackExchange.Redis/ConnectionMultiplexer.cs:line 685

This suggests something might be wrong with my config, but the fact that it has worked on the cluster and consistently works locally has me confused.

Does anyone have any ideas, or could anyone give me some direction for troubleshooting?

NickCraver commented 3 weeks ago

This looks like Sentinel is not returning a valid endpoint (or one we recognize) when asked what the master is.

If you connect up directly and query sentinel master mymaster, what do you get back?
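
For anyone who would rather run that check from .NET than from redis-cli, here is a rough sketch of one way it might be done. It assumes the same ConfigurationOptions as in the original post (sentinel endpoints plus ServiceName). SentinelProbe, AdvertisedAddress and Dump are made-up names for illustration, not part of the library. It prints what each connected sentinel reports for the primary so the address can be compared with the endpoints in the connection log.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using StackExchange.Redis;

// Hypothetical helper, not part of StackExchange.Redis.
public static class SentinelProbe
{
    // Extracts "ip:port" from the field/value pairs of a SENTINEL MASTER reply.
    public static string AdvertisedAddress(IEnumerable<KeyValuePair<string, string>> info)
    {
        var fields = info.ToDictionary(kv => kv.Key, kv => kv.Value);
        fields.TryGetValue("ip", out var ip);
        fields.TryGetValue("port", out var port);
        return $"{ip}:{port}";
    }

    // Connects to the sentinels only and prints what each one says about the primary.
    public static void Dump(ConfigurationOptions options, string serviceName)
    {
        using var sentinels = ConnectionMultiplexer.SentinelConnect(options, Console.Out);
        foreach (var endpoint in sentinels.GetEndPoints())
        {
            var server = sentinels.GetServer(endpoint);
            if (!server.IsConnected) continue; // skip sentinels we could not reach

            // Equivalent of: SENTINEL GET-MASTER-ADDR-BY-NAME <serviceName>
            var primary = server.SentinelGetMasterAddressByName(serviceName);
            Console.WriteLine($"{endpoint} get-master-addr-by-name: {primary}");

            // Equivalent of: SENTINEL MASTER <serviceName>
            var info = server.SentinelMaster(serviceName);
            Console.WriteLine($"{endpoint} master ip/port: {AdvertisedAddress(info)}");
        }
    }
}
```

Calling SentinelProbe.Dump(sentinelConfig, "mymaster") before the failing Connect call should show whether the reported address is one of the three endpoints listed in the endpoint summary.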

Tasteful commented 1 week ago

@NickCraver We have identified something similar; see https://github.com/samcook/RedLock.net/issues/112#issuecomment-2187152737

There are cases where Sentinel returns IP addresses that are no longer part of the cluster. The connection multiplexer handles these correctly and aborts them during initialization, but IConnectionMultiplexer.GetEndPoints() still includes them, and calling IConnectionMultiplexer.GetServer(endPoint) for an endpoint that never received an answer throws the ArgumentException.

Is the expectation that IConnectionMultiplexer.GetEndPoints() should return all entries that sentinel knows about?
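
Until that is clarified, calling code that enumerates endpoints could guard against the behavior described above. The following is only a sketch of such a guard, assuming GetServer throws ArgumentException for endpoints the multiplexer does not define, as in the stack trace earlier in this issue. EndpointHelpers, TryGetServer and GetResolvableServers are illustrative names, not library API. It does not help when the exception is thrown inside Connect itself, as in the original report.

```csharp
using System;
using System.Collections.Generic;
using System.Net;
using StackExchange.Redis;

// Hypothetical helper, not part of StackExchange.Redis.
public static class EndpointHelpers
{
    // Returns the IServer for an endpoint, or null when the multiplexer
    // reports "The specified endpoint is not defined".
    public static IServer? TryGetServer(IConnectionMultiplexer muxer, EndPoint endpoint)
    {
        try
        {
            return muxer.GetServer(endpoint);
        }
        catch (ArgumentException)
        {
            return null;
        }
    }

    // Walks GetEndPoints() and keeps only the endpoints that resolve to a server.
    public static List<IServer> GetResolvableServers(IConnectionMultiplexer muxer)
    {
        var servers = new List<IServer>();
        foreach (var endpoint in muxer.GetEndPoints())
        {
            var server = TryGetServer(muxer, endpoint);
            if (server is null)
            {
                Console.Error.WriteLine($"Skipping undefined endpoint {endpoint}");
                continue;
            }
            servers.Add(server);
        }
        return servers;
    }
}
```

Callers that need a live connection would probably also want to check IServer.IsConnected on each returned server, since an endpoint can be defined without being reachable.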