StackExchange / StackExchange.Redis

General purpose redis client
https://stackexchange.github.io/StackExchange.Redis/

Sentinel Connect over TLS works then switches to IP causing failure #2617

Closed joshbartley closed 6 months ago

joshbartley commented 6 months ago

I have a new Redis cluster using Sentinel, set up to allow only TLS. The certificates are wildcards issued by a public CA and expire in 6 months.

I've set the following options on all the servers.

sentinel.conf (hostnames changed to example.com)

tls-port 26379
tls-replication yes
tls-cert-file "xxxx.crt"
tls-key-file "xxxx.key"
tls-ca-cert-file "xxxxxx.crt"
tls-auth-clients no

sentinel resolve-hostnames yes
sentinel announce-ip "redi1.example.com"
sentinel announce-hostnames yes

redis.conf (hostnames changed to example.com)

tls-port 6379
cluster-announce-hostname "redi1.example.com"
replica-announce-ip "redi1.example.com" 
tls-cert-file "xxxx.crt"
tls-key-file "xxxx.key"

From the ILoggerFactory I get success messages at first.

info: StackExchange.Redis.ConnectionMultiplexer[0]
      TLS connection established successfully using protocol: Tls12
info: StackExchange.Redis.ConnectionMultiplexer[0]
      redi1.example.com:6379/Interactive: Connected

Then things take a turn and I'm not sure where the code is getting the IPs instead of the hostnames.

info: StackExchange.Redis.ConnectionMultiplexer[0]
        192.168.2.103:6379: Endpoint is (Interactive: Connecting, Subscription: Connecting)

AuthenticationFailure on 192.168.2.103:6379/Interactive, Initializing/NotStarted, last: NONE, origin: ConnectedAsync, outstanding: 0, last-read: 0s ago, last-write: 0s ago, keep-alive: 60s, state: Connecting, mgr: 10 of 10 available, last-heartbeat: never, last-mbeat: 0s ago, global: 0s ago, v: 2.7.10.12442
StackExchange.Redis.RedisConnectionException: AuthenticationFailure on 192.168.2.103:6379/Interactive, Initializing/NotStarted, last: NONE, origin: ConnectedAsync, outstanding: 0, last-read: 0s ago, last-write: 0s ago, keep-alive: 60s, state: Connecting, mgr: 10 of 10 available, last-heartbeat: never, last-mbeat: 0s ago, global: 0s ago, v: 2.7.10.12442
 ---> System.Security.Authentication.AuthenticationException: The remote certificate was rejected by the provided RemoteCertificateValidationCallback.

I connected to Sentinel and ran the commands below. Both return hostnames rather than IPs. The exception is SENTINEL get-master-addr-by-name, which returns an IP, but I think it's supposed to?

SENTINEL replicas
SENTINEL sentinels
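
The same check can be made from the client side. Below is a minimal sketch that asks Sentinel which address it reports for the primary, using StackExchange.Redis. It assumes Sentinel accepts TLS on port 26379 as in the sentinel.conf above, that the service name is redis1 (the serviceName used in this issue's connection strings), and that Sentinel itself requires no credentials. It has not been run against the setup described here.

```csharp
using System;
using System.Net;
using StackExchange.Redis;

// Assumption: a Sentinel node reachable over TLS at this host and port.
const string sentinelHost = "redi1.example.com";
const int sentinelPort = 26379;

var options = new ConfigurationOptions { Ssl = true, AbortOnConnectFail = false };
options.EndPoints.Add(sentinelHost, sentinelPort);

// Connect to Sentinel itself rather than to the data nodes it manages.
using var sentinel = ConnectionMultiplexer.SentinelConnect(options);
IServer server = sentinel.GetServer(sentinelHost, sentinelPort);

// Equivalent to: SENTINEL get-master-addr-by-name redis1
EndPoint? primary = server.SentinelGetMasterAddressByName("redis1");

// Report whether Sentinel announced the primary by IP or by hostname.
string description = primary switch
{
    IPEndPoint ip => $"IP address {ip.Address}:{ip.Port}",
    DnsEndPoint dns => $"hostname {dns.Host}:{dns.Port}",
    null => "no primary reported",
    _ => primary.ToString() ?? "unknown",
};
Console.WriteLine($"Sentinel reports the primary as {description}");
```

If this prints an IP address, the client would be expected to connect to that IP, which a wildcard hostname certificate cannot match.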

If I set the connection string to the one below, everything seems to work.

redi1.example.com,ssl=true,serviceName=redis1,user=app,password=xxxxxxxxxxxxxxxxxxxx

If I add the other nodes in, this is when errors start to occur:

redi1.example.com,redi2.example.com,redi3.example.com,ssl=true,serviceName=redis1,user=app,password=xxxxxxxxxxxxxxxxxxxx

It is very likely I have some config line messed up on the Redis server side, but I don't know where to find it or how those IPs are being used instead of the hostnames.

Test code, using .NET 8:

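using System;
using System.Threading.Tasks;
using Microsoft.Extensions.Logging;
using StackExchange.Redis;

internal class Program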
{
    static async Task Main(string[] args)
    {
        ILoggerFactory loggerFactory = LoggerFactory.Create(builder =>
        {
            builder
                .AddFilter("Microsoft", LogLevel.Warning)
                .AddFilter("System", LogLevel.Warning)
                .AddFilter("LoggingConsoleApp.Program", LogLevel.Debug)
                .SetMinimumLevel(LogLevel.Debug)
                .AddConsole();
        });

        string connectionString = args[0];

        var connection = ConnectionMultiplexer.Connect(connectionString, x =>
        {
            x.LoggerFactory = loggerFactory;
            // Print the presented certificate and accept it only when there are no policy errors.
            x.CertificateValidation += (sender, certificate, chain, sslPolicyErrors) =>
            {
                Console.WriteLine($"Certificate {certificate?.Subject}");
                return sslPolicyErrors == System.Net.Security.SslPolicyErrors.None;
            };
        });
        var redis = connection.GetDatabase();

        long i = 0;

        while(true) 
        {
            Console.WriteLine($"Begin loop {i}");

            try
            { 
                string oldKey = await redis.StringGetAsync("TestKey");

                await redis.StringSetAsync("TestKey", i.ToString());
                Console.WriteLine($"Successfully set TestKey with value {i} with old key {oldKey}");
            }
            catch (Exception ex)
            {
                Console.WriteLine(ex);
            }
            i++;
            await Task.Delay(5_000);
        }
    }
}
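
To see why a handshake is rejected, the validation callback can print the policy errors along with the subject. The sketch below is a variant of the callback in the test code above and is not a change made in this thread. It takes the connection string from the first command-line argument, as the test code does. When the client connects by IP and the certificate only covers hostnames, the printed flags would typically include RemoteCertificateNameMismatch, though that was not confirmed against this setup.

```csharp
using System;
using System.Net.Security;
using StackExchange.Redis;

// Assumption: args[0] holds the connection string, as in the test program.
var options = ConfigurationOptions.Parse(args[0]);

options.CertificateValidation += (sender, certificate, chain, sslPolicyErrors) =>
{
    // sslPolicyErrors is a flags enum; printing it shows every failed check by name.
    Console.WriteLine($"Certificate {certificate?.Subject}: {sslPolicyErrors}");
    return sslPolicyErrors == SslPolicyErrors.None;
};

using var connection = ConnectionMultiplexer.Connect(options);
Console.WriteLine($"Connected: {connection.IsConnected}");
```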

KendoSai commented 6 months ago

Have you run these commands for redi2.example.com and redi3.example.com yet?

replica-announce-ip <hostname>
sentinel announce-ip <hostname> 

P.S.
You should use hostnames everywhere and avoid mixing hostnames and IP addresses. To do that, set replica-announce-ip and sentinel announce-ip on all Redis and Sentinel instances, respectively. https://redis.io/docs/management/sentinel/#ip-addresses-and-dns-names

joshbartley commented 6 months ago

In the sentinel config, every server has:

sentinel resolve-hostnames yes
sentinel announce-ip "redi1.example.com"
sentinel announce-hostnames yes

Also, at the bottom of sentinel.conf, the sentinel known-replica and sentinel known-sentinel entries use hostnames, not IPs. replica-announce-ip is already set in redis.conf, as shown in the example above.

joshbartley commented 6 months ago

I think I figured it out, though reproducing it will be tough.

  1. Set up the Redis cluster with Sentinel using IPs, and start the cluster.
  2. Set up the cluster with TLS and switch everything over to hostnames.

I think what happened is that a Redis primary had already been picked, and it was recorded under the old IP. Since a failover never happened, Sentinel never updated the primary to the hostname. I forced a failover, and hostnames started to come back from the SENTINEL get-master-addr-by-name command. I rotated the primary through every host, verified that all of them now come back as hostnames, and was able to drop the ssl override. Looks like this was a Redis issue caused by the order of operations, not a client issue. Apologies for that.
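
For anyone hitting the same thing, here is a rough sketch of forcing the failover from StackExchange.Redis and checking the reported address before and after. It assumes a Sentinel reachable over TLS at redi1.example.com:26379, the service name redis1, and no Sentinel credentials. The five-second wait is an arbitrary guess, not a documented value. This is not the exact procedure used above.

```csharp
using System;
using System.Threading.Tasks;
using StackExchange.Redis;

// Assumption: Sentinel host, port, and service name as used elsewhere in this issue.
const string sentinelHost = "redi1.example.com";
const int sentinelPort = 26379;
const string serviceName = "redis1";

var options = new ConfigurationOptions { Ssl = true, AbortOnConnectFail = false };
options.EndPoints.Add(sentinelHost, sentinelPort);

using var sentinel = ConnectionMultiplexer.SentinelConnect(options);
IServer server = sentinel.GetServer(sentinelHost, sentinelPort);

// Address Sentinel currently reports (SENTINEL get-master-addr-by-name).
Console.WriteLine($"Before failover: {server.SentinelGetMasterAddressByName(serviceName)}");

// Equivalent to: SENTINEL failover redis1
server.SentinelFailover(serviceName);

// Give Sentinel a moment to promote a replica; the delay is a guess.
await Task.Delay(TimeSpan.FromSeconds(5));

Console.WriteLine($"After failover: {server.SentinelGetMasterAddressByName(serviceName)}");
```

Repeating this until every host has been primary once corresponds to the rotation described above.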

KendoSai commented 6 months ago

Glad you found a way to solve it.