StackExchange / StackExchange.Redis

General purpose redis client
https://stackexchange.github.io/StackExchange.Redis/

ConnectionImpl task.Result blocking on heavy load #2679

Open ryno1234 opened 6 months ago

ryno1234 commented 6 months ago

We're having issues, particularly when our server (IIS / ASP.NET 8.0) is under heavy load, where creation of the ConnectionMultiplexer blocks many other threads. Below is a screenshot from a memory dump of our production server showing that this line of code is blocking 1,700+ other threads.

(screenshot: memory dump with 1,700+ threads blocked on `task.Result`)

We can seemingly work around this issue by preventing traffic from reaching our server, starting our IIS site, requesting a page that interacts with Redis to "warm up" the Redis connection, and then re-enabling traffic. After that, everything seems to work fine. But if traffic hits the server out of the gate during the site's startup (such as after a new deployment), creation of the multiplexer blocks about 50% of the time (sometimes it blocks, sometimes it doesn't; it all depends on the traffic pattern).

I'm not sure if this is thread starvation or what. We're not using ConfigureAwait(false), since we're not on .NET Framework.

Here's our connection string configuration:

    redactedhost:6379,syncTimeout=5000,allowAdmin=False,connectTimeout=5000,ssl=False,abortConnect=False,connectRetry=5

We create a singleton multiplexer through this startup code:

    public static IServiceCollection AddRedisCacheService(this IServiceCollection services)
    {
        services.TryAddSingleton<ICache, RedisCache>();

        // From AddStackExchangeRedisExtensions
        services.TryAddSingleton<IRedisClientFactory, RedisClientFactory>();
        services.TryAddSingleton<ISerializer>(new NewtonsoftSerializer(new JsonSerializerSettings { TypeNameHandling = TypeNameHandling.Auto }));

        services.TryAddSingleton<IRedisConnectionPoolManager, RedisConnectionPoolManager>();

        services.TryAddSingleton((provider) => provider
            .GetRequiredService<IRedisClientFactory>()
            .GetDefaultRedisClient());

        services.TryAddSingleton((provider) => provider
             // Block path is through these calls
            .GetRequiredService<IRedisClientFactory>()  
            .GetDefaultRedisClient()
            .GetDefaultDatabase());

        // Register a 'Configuration Factory' for extensions
        services.AddSingleton<IEnumerable<RedisConfiguration>>(sp => [sp.GetRequiredService<RedisConfiguration>()]);

        return services;
    }
mgravell commented 6 months ago

first thought: if this is becoming a blockage: how many multiplexers are you creating? Usually, they're shared and aggressively reused (concurrently), with a long lifetime - often process-long. You shouldn't usually need or want multiple.
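The shared-multiplexer pattern described here can be sketched as follows. This is illustrative only (the host name, option string, and registration shape are assumptions, not from this thread) and depends on the StackExchange.Redis and Microsoft.Extensions.DependencyInjection packages:

```csharp
// Illustrative sketch: one process-wide multiplexer, shared by all callers.
using Microsoft.Extensions.DependencyInjection;
using StackExchange.Redis;

public static class RedisServiceCollectionExtensions
{
    public static IServiceCollection AddSingleRedisMultiplexer(this IServiceCollection services)
    {
        // Created once, then reused concurrently for the life of the process.
        services.AddSingleton<IConnectionMultiplexer>(_ =>
            ConnectionMultiplexer.Connect("localhost:6379,abortConnect=False"));

        // IDatabase is a cheap pass-through object; resolving it from the
        // shared multiplexer does not create additional connections.
        services.AddSingleton<IDatabase>(sp =>
            sp.GetRequiredService<IConnectionMultiplexer>().GetDatabase());

        return services;
    }
}
```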

However, indeed; looking closely, it looks to me like we're missing an else here; if we don't connect within the timeout, we shouldn't be touching task.Result - I wonder if this should simply be:

- if (!task.Result) throw ...
+ else if (!task.Result) throw ...

Does that make sense, folks? /cc @NickCraver @philon-msft

ryno1234 commented 6 months ago

@mgravell, I updated my question to include context as to how we're creating our Multiplexer. We should only be creating 1.

mgravell commented 6 months ago

Hard to say, really, since none of those types are from this library, but the mention of RedisConnectionPoolManager fills me with dread; in most scenarios, you don't need a pool. It does at least explain multiple blocks here: it strongly suggests a bunch of multiplexers (again, not my type, so I can't comment). But I'm working on a PR; you've called my attention to something that smells wrong.

ryno1234 commented 6 months ago

@mgravell, you're right! Looks like we're creating 5 multiplexers... This is part of an extension library https://github.com/imperugo/StackExchange.Redis.Extensions ... I will look further into this part.

(screenshot: connection pool configured with 5 multiplexers)

mgravell commented 6 months ago

OK; I will withhold comment on the external library, as it has nothing to do with me ;)

PR in progress - I agree there's a problem there

mgravell commented 6 months ago

What I will say: the architecture of the current IO loop means there is an unnecessary bottleneck in some of the processing, and that might today be helped by using a pool. But I'm actively working on a significant overhaul of the IO loop which completely removes this blockage, so a scenario that benefits from pooling today may well not benefit in v3 (ETA: 1 month).

ryno1234 commented 6 months ago

First off, I really appreciate your quick response earlier. That really helped us refocus. Thank you.

For what it's worth, we updated the connection pool to just 1 multiplexer (again, I realize this isn't your type) and still experienced the blocking on that task.Result.

If we assumed that the other library that is facilitating this connection isn't the problem, what would cause our code to deadlock like this?

Put another way: although you did in fact find an issue in the code, it obviously isn't prevalent, otherwise you would have heard a LOT of people talking about it. So what could we be doing that is so different from everyone else that causes the deadlock?

mgravell commented 6 months ago

Hard to say without being able to repro it on demand; however, this scenario applies specifically to the situation where we're trying to connect to a server and it hasn't replied promptly, so it could be something related to server unavailability or network instability. There is an accidental sync-over-async there which we are working on resolving (but in reviewing it yesterday, we identified a few knock-on things that we want to understand fully before changing it). The fact that the other library is using Connect rather than ConnectAsync is a contributing factor here.

IMO the "real" solution: use ConnectAsync instead of Connect.
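A minimal sketch of that suggestion, assuming the connection can be awaited once during startup before traffic is served (the surrounding method shape is an assumption; `ConnectionMultiplexer.ConnectAsync` is the real API):

```csharp
// Sketch: await the connection once at startup instead of letting the first
// request trigger a synchronous Connect (and its sync-over-async path) under load.
using System.Threading.Tasks;
using Microsoft.Extensions.DependencyInjection;
using StackExchange.Redis;

public static class RedisStartup
{
    public static async Task AddRedisAsync(IServiceCollection services)
    {
        IConnectionMultiplexer muxer = await ConnectionMultiplexer.ConnectAsync(
            "localhost:6379,abortConnect=False,connectTimeout=5000");

        // Register the already-connected instance as the process-wide singleton.
        services.AddSingleton(muxer);
    }
}
```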

mmriis commented 4 months ago

We encountered the same issue today, also using the mentioned extensions NuGet package.

@ryno1234 What solution did you end up with?

ryno1234 commented 4 months ago

@mmriis, we removed the usage of https://github.com/imperugo/StackExchange.Redis.Extensions and went directly to using this library instead. We utilized aspects of the extensions library in our own code base (copy / paste / modify), namely the serialization aspects.

Our main issue was race conditions with the connection, so we aggressively put a lock(){} statement around the connection, knowing darn well we were blocking. We also did this within the factory that resolves the multiplexer, because we use the multiplexer in our dependency injection.

mmriis commented 4 months ago

@ryno1234 Thank you. I was thinking we would do the same. Serialization/deserialization is pretty much the only thing we use from that library, so easy to recreate.

I don't understand why you would need the lock around the factory. Wouldn't the ConnectionMultiplexer be a singleton, with connections established on app startup?

ryno1234 commented 4 months ago

@mmriis, whoops, you're right. I think the lock concept was something I was playing around with initially. I forgot that we moved away from that, for the reason you mentioned.

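The lock-free shape discussed above can be sketched with a `Lazy<Task<...>>`, so at most one connection attempt runs and callers await instead of blocking. The class and member names here are illustrative, not from the thread:

```csharp
using System;
using System.Threading.Tasks;
using StackExchange.Redis;

public static class RedisConnection
{
    // Lazy<T>'s default thread-safety mode (ExecutionAndPublication) guarantees
    // the factory runs at most once, with no explicit lock in calling code.
    private static readonly Lazy<Task<ConnectionMultiplexer>> _connection =
        new(() => ConnectionMultiplexer.ConnectAsync("localhost:6379,abortConnect=False"));

    // Callers await the shared task rather than blocking on .Result.
    public static Task<ConnectionMultiplexer> GetAsync() => _connection.Value;
}
```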

shwallmic commented 2 months ago

@mgravell My team started seeing this issue where the client is blocked on task.Result. This sometimes happens when our Redis server is under extreme load (100% CPU). Given that the client config supports a connect timeout, we expected the connection attempt to stop blocking after the timeout, but in our case the thread is blocked indefinitely, causing unexpected issues in our service. I see the suggestion above to use ConnectAsync, but I wanted to check in and see whether a fix for this issue is still planned.