StackExchange / StackExchange.Redis

General-purpose Redis client
https://stackexchange.github.io/StackExchange.Redis/

Random timeout issues with Azure Redis during re-authentication #2655

Closed Tasin5541 closed 4 months ago

Tasin5541 commented 4 months ago

We are using Azure Caching for Redis, and we're getting random timeout issues during re-authentication. The issue gets resolved after restarting the App Service.

Re-authentication of connection to 'Unspecified/********.redis.cache.windows.net:6380' failed:
StackExchange.Redis.RedisConnectionException: The message timed out in the backlog attempting to send because no connection
became available (60000ms) - Last Connection Exception: It was not possible to connect to the redis server(s). ConnectTimeout, 
command=AUTH, timeout: 60000, inst: 0, qu: 0, qs: 0, aw: False, bw: CheckingForTimeout, rs: ReadAsync, ws: Flushed, in: 0, last-in: 0, 
cur-in: 0, sync-ops: 0, async-ops: 99, serverEndpoint: ********.redis.cache.windows.net:6380, conn-sec: 0.34, aoc: 0, mc: 1/1/0, 
mgr: 9 of 10 available, clientName: f670cb663336(SE.Redis-v2.7.4.20928), IOCP: (Busy=0,Free=1000,Min=500,Max=1000), WORKER: 
(Busy=84,Free=32683,Min=500,Max=32767), POOL: (Threads=246,QueuedItems=0,CompletedItems=292541785,Timers=2208), v: 
2.7.4.20928 (Please take a look at this article for some common client-side issues that can cause timeouts: 
https://stackexchange.github.io/StackExchange.Redis/Timeouts)

This is a .NET 6 application, and we're using Microsoft.Azure.StackExchangeRedis (version 2.0.0) with the ConfigureForAzureWithTokenCredentialAsync method for the connection. The underlying StackExchange.Redis version is 2.7.4.

philon-msft commented 4 months ago

Looks like this cache is running at its maximum of 2k client connections, so some instances of SE.Redis are unable to restore connections and execute any commands. The client app may have a connection leak, but if the high number of connections is expected, I'd recommend scaling up to a larger cache SKU that can handle more concurrent connections. There also appears to be a high rate of connections being created/closed per second, which is driving server load to 100%. To resolve that, be sure to reuse long-lived connections rather than creating and closing them frequently.
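As context for the "reuse long-lived connections" advice: the usual pattern with SE.Redis is a single lazily created multiplexer shared for the life of the process. A minimal sketch of that wiring (the endpoint name and class name are illustrative, not from this thread):

```csharp
using System;
using StackExchange.Redis;

public static class RedisConnection
{
    // Lazy<T> guarantees the multiplexer is created once, on first use,
    // and then shared for the lifetime of the process.
    private static readonly Lazy<ConnectionMultiplexer> _lazy =
        new(() =>
        {
            var options = ConfigurationOptions.Parse("contoso.redis.cache.windows.net:6380");
            options.AbortOnConnectFail = false; // retry in the background instead of failing startup
            return ConnectionMultiplexer.Connect(options);
        });

    public static ConnectionMultiplexer Instance => _lazy.Value;
}
```

Callers then use `RedisConnection.Instance.GetDatabase()` rather than constructing a new `ConnectionMultiplexer` per operation, which keeps the Connected Clients count flat.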

Tasin5541 commented 4 months ago

@philon-msft is there a doc that would help us understand what each value in the error means? How do we know that the cache is running at the max 2k client connections?

philon-msft commented 4 months ago

See Explanation for abbreviations appearing in exception messages to decode those values. I'm on the Azure Cache team, so I was able to check the connection metrics for your cache. Looks like metrics returned to healthy levels ~4hrs ago. Have the exceptions also stopped?

Tasin5541 commented 4 months ago

The exceptions stopped after restarting the app. Had the same situation 2 days ago.

Could you point out specifically where we should be looking on Azure for possible causes of these exceptions?

philon-msft commented 4 months ago

Root cause is the connection leak and create/close churn, which is most likely caused by a bug in the client app. To check cache metrics for signs of overload, see Monitor Azure Cache for Redis.

For more tips on troubleshooting see: https://learn.microsoft.com/azure/azure-cache-for-redis/cache-troubleshoot-timeouts

Tasin5541 commented 4 months ago

Since only one app is connecting to the Redis cache server, does that mean we should see only one client connection in the graph?

I've been looking into the connection process samples from MS and found this one, which seems a bit old and makes no mention of usage with DefaultAzureCredential. The sample here does show how the connection can be created, but lacks a proper example of how it could be used in a .NET API project as a singleton service.

NickCraver commented 4 months ago

@Tasin5541 When are you connecting in the application? We also see this when people create connections per scope, e.g. per request, when the multiplexer is intended to live for the life of the application and be shared.
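For ASP.NET Core apps, the per-request pitfall described here usually comes from registering the multiplexer (or a service that creates one) as scoped or transient. A sketch of the singleton registration that matches the intended lifetime (the endpoint name is a placeholder, not from this thread):

```csharp
using StackExchange.Redis;

// In Startup.ConfigureServices / Program.cs:
services.AddSingleton<IConnectionMultiplexer>(sp =>
{
    var options = ConfigurationOptions.Parse("contoso.redis.cache.windows.net:6380");
    options.AbortOnConnectFail = false; // reconnect in the background rather than throwing at startup
    return ConnectionMultiplexer.Connect(options);
});

// Consumers take IConnectionMultiplexer via constructor injection and call
// GetDatabase() per operation; GetDatabase() is cheap and needs no caching.
```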

Tasin5541 commented 4 months ago

@NickCraver we have a singleton service that handles the connection. This service is called at the start of our app to cache some objects. After that, the service is used on a per-request basis.

This is part of the service that handles the connection:

public RedisService()
{
    if (_useCache)
    {
        // Note: blocking on .Result in a constructor risks deadlocks and
        // thread-pool starvation; an async factory or lazy initialization is safer.
        _connectionMultiplexer = CreateConnection(_authenticationType).Result;
    }
}

private async Task<ConnectionMultiplexer> CreateConnection(string option)
{
    ConnectionMultiplexer connectionMultiplexer = _connectionMultiplexer;
    ConfigurationOptions configurationOptions;
    int connectionTimeoutInMilliSeconds = 60000;

    if (_connectionMultiplexer is null)
    {
        try
        {
            switch (option)
            {
                case "TokenCredential":
                    configurationOptions = await ConfigurationOptions.Parse($"{_connectionString}:6380").ConfigureForAzureWithTokenCredentialAsync(_principalId, new DefaultAzureCredential());
                    configurationOptions.AbortOnConnectFail = false;
                    configurationOptions.AsyncTimeout = connectionTimeoutInMilliSeconds;
                    configurationOptions.SyncTimeout = connectionTimeoutInMilliSeconds;
                    LogTokenEvents(configurationOptions);

                    connectionMultiplexer = await ConnectionMultiplexer.ConnectAsync(configurationOptions);
                    break;

                default:
                    connectionMultiplexer = null;
                    break;
            }
        }
        catch (Exception ex)
        {
            // Swallowing the exception here hides connection failures from callers.
        }
    }

    return connectionMultiplexer;
}

Then on startup:

services.AddSingleton<RedisService>();

public void Configure(IServiceProvider serviceProvider)
{
    serviceProvider.GetRequiredService<RedisService>().CachObjects();
}

Tasin5541 commented 4 months ago

It turns out the RedisService we were using for the connection had been converted to a scoped service at some point. Registering it as a singleton fixed the issue.

The Connected Clients count was hitting the max because a new connection was created for every scope, and the Microsoft.Azure.StackExchangeRedis library kept each one alive on its own.

[screenshot: Connected Clients metric]