StackExchange / StackExchange.Redis

General purpose redis client
https://stackexchange.github.io/StackExchange.Redis/

ConnectionMultiplexer holding references to 4+ GB memory #2307

Open RamjotSingh opened 2 years ago

RamjotSingh commented 2 years ago

We have been having a few instances of our machines running low on memory, so we have been doing a bunch of investigation. In one of the investigations, something peculiar showed up which I could not make a whole lot of sense of.

Here is what dotMemory shows for a specific memory dump.

We are aware of the string problem shown here and are working on fixing it (it doesn't concern caching). The concerning part is the 4.04 GB retained by ConnectionMultiplexer.

[image: dotMemory retained-memory snapshot] https://user-images.githubusercontent.com/13517857/202769497-fee8a4e5-57c8-41cf-98de-3761d576096a.png

I tried looking deeper into it, and it looks like the Redis SDK holds a long Queue of messages:

[image: long message queue held by the multiplexer] https://user-images.githubusercontent.com/13517857/202768673-d97f9df4-e0ea-4e9e-930e-d67f7c68e2d5.png

I wanted to understand whether this is expected or whether we are seeing/hitting a bug somewhere, either in the way we are set up or in the way the SDK behaves. I'd also like to understand how to interpret this retained-memory snapshot in general.

We are using the .NET Core 3.1.x Redis caching extension library, which uses StackExchange.Redis 2.0.593.
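
(For reference, the registration for that extension typically looks like the sketch below, assuming the Microsoft.Extensions.Caching.StackExchangeRedis package is used; the connection string and instance name are placeholders, not our actual configuration.)

    using Microsoft.Extensions.DependencyInjection;

    // Sketch: typical registration of the .NET Core Redis caching extension,
    // which pulls in StackExchange.Redis transitively. Values are placeholders.
    public void ConfigureServices(IServiceCollection services)
    {
        services.AddStackExchangeRedisCache(options =>
        {
            options.Configuration = "localhost:6379";
            options.InstanceName = "SampleInstance";
        });
    }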

mgravell commented 2 years ago

I'm very surprised by a queue of that length. Does the tooling indicate which field this is rooted by? I'm presuming this is the "sent, pending result" queue, but that really shouldn't be 2M long! We could perhaps put a max size on this backlog queue. Any idea what was happening more holistically here? High throughput? Disconnect? Clues...
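
(Aside for readers on newer releases: ConfigurationOptions exposes a BacklogPolicy that can fail commands fast instead of queueing them while disconnected. A rough sketch of the idea, not applicable to 2.0.593:)

    using StackExchange.Redis;

    // Sketch: bounding backlog behaviour on newer StackExchange.Redis versions.
    // Commands fail immediately while disconnected instead of piling up in memory.
    var options = ConfigurationOptions.Parse("localhost:6379");
    options.BacklogPolicy = BacklogPolicy.FailFast;
    var muxer = await ConnectionMultiplexer.ConnectAsync(options);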


RamjotSingh commented 2 years ago

Honestly, I am not sure and am kind of stumped. I do have access to the individual values too (since this comes from a memory dump). Is there anything in particular I could look for? I had a hunch that maybe the value of the fetched key was pushed to the LOH, but that isn't the case either (confirmed in code too; it should never go to the LOH).

NickCraver commented 2 years ago

Is there any chance you have a timeout error when this happens? The text in that might help us advise quickly.

RamjotSingh commented 2 years ago

I looked a bit deeper and don't really see any timeouts. I am assuming I need to look in a window of days; I do see a few failures.

The general trend is "No connection is available to service this operation."

Sample error

Exception 'No connection is available to service this operation: HMGET <RedactedKey>; IOCP: (Busy=1,Free=999,Min=400,Max=1000), WORKER: (Busy=183,Free=32584,Min=400,Max=32767), Local-CPU: n/a'. StackTrace 'StackExchange.Redis.RedisConnectionException: No connection is available to service this operation: HMGET <RedactedKey>; IOCP: (Busy=1,Free=999,Min=400,Max=1000), WORKER: (Busy=183,Free=32584,Min=400,Max=32767), Local-CPU: n/a',

NickCraver commented 2 years ago

@RamjotSingh 183 busy worker threads indicates thread pool overload, which would result in a ton of backup happening. Are you doing synchronous calls overall? (I'm about to add a new stat for this.) In any case: I can tell from the info here that this is a very old version of the library - you'll want to upgrade, because a lot of changes have been made and we can't really solve anything for an old version. The debug info available will be far greater as well.
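
(For readers hitting the same symptom, the sketch below contrasts the synchronous and asynchronous call shapes: sync calls block a thread-pool thread for the whole round-trip, which is what piles threads up under load. Key and field names here are hypothetical.)

    using System.Threading.Tasks;
    using StackExchange.Redis;

    static class Example
    {
        public static async Task QueryAsync(IConnectionMultiplexer muxer)
        {
            IDatabase db = muxer.GetDatabase();

            // Synchronous: blocks a thread-pool thread for the whole round-trip.
            RedisValue syncValue = db.HashGet("user:123", "name");

            // Asynchronous: the thread returns to the pool while the request is in flight.
            RedisValue asyncValue = await db.HashGetAsync("user:123", "name");
        }
    }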

RamjotSingh commented 2 years ago

We run at a rather high requests-per-second rate, so 183 worker threads isn't that high a value for us (our minimum limit is 400).
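
(For reference, that minimum is typically raised via ThreadPool.SetMinThreads; a one-line sketch matching the Min=400 shown in the error text above.)

    using System.Threading;

    // Sketch: raising the thread-pool minimums, matching Min=400 in the error above.
    ThreadPool.SetMinThreads(workerThreads: 400, completionPortThreads: 400);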

Regarding the SDK: yes, as mentioned, we don't pull StackExchange.Redis directly but pull it in through the .NET Core 3.1.x caching extension.

That being said, if we were to upgrade from 2.0.593 to the latest version, should we anticipate or be on the lookout for any breaking changes?

NickCraver commented 2 years ago

@RamjotSingh can you quantify "rather high" here? What numbers are we talking about on the machine, and how many cores?

To answer your other question: there are no breaking changes known - we try to follow SemVer here.

RamjotSingh commented 2 years ago

These are typically 16-core machines taking about 400-500 requests in parallel per machine, but each machine is running a lot more services (and doing a lot of processing too). We can approximate that a particular machine running this particular process would be at about 100 requests per second.

I am upgrading us to the latest SDK version.

NickCraver commented 2 years ago

@RamjotSingh Your thread count is extremely high, likely due to either too much load on the CPU in general or synchronous calls that are stalled waiting on a lock or an I/O operation to complete. I'd recommend using your memory dump to see where all those threads are in their stacks. This looks like a pile-up from every standpoint.
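
(One way to watch for such a pile-up at runtime, rather than only in a dump, is to log the thread-pool counters that .NET Core 3.0+ exposes; a generic sketch, not something the library provides.)

    using System;
    using System.Threading;

    static class ThreadPoolMonitor
    {
        // Sketch: periodically log thread-pool pressure (ThreadCount and
        // PendingWorkItemCount are available on .NET Core 3.0 and later).
        public static Timer Start() => new Timer(_ =>
        {
            ThreadPool.GetAvailableThreads(out int worker, out int iocp);
            Console.WriteLine(
                $"ThreadCount={ThreadPool.ThreadCount}, " +
                $"PendingWorkItems={ThreadPool.PendingWorkItemCount}, " +
                $"WorkerAvailable={worker}, IocpAvailable={iocp}");
        }, null, dueTime: 0, period: 5000);
    }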

RamjotSingh commented 1 year ago

I looked at the memory dump and only 13 threads were active (our minimum is set to 400, so .NET won't clean them up). I did not see any other stalling, etc. We are upgrading the package to the latest version because we saw this same thing happen on a bunch of memory dumps (not always, but in multiple cases GBs' worth of references are kept).

ShaneCourtrille commented 1 year ago

This might be related to #2070. Have you looked at a memory dump to see if you have a large number of RawResult[] objects taking up live and dead memory?

ppilev commented 1 month ago

Hello @NickCraver,

I've encountered the same problem because I was using the StackExchange.Redis.Extensions.Core library, which uses IConnectionMultiplexer and creates a connection pool.

Now I was able to reproduce the problem with a very simple app where I directly created 25 instances of IConnectionMultiplexer and stored them in a static list:

var conn = ConnectionMultiplexer.Connect("localhost");

Then I started making requests using each connection:

    for (int i = 0; i < 1000; i++)
    {
        var db = connections[i % DbCounts].GetDatabase();
        db.HashSet("test", $"value-{i}", i);
        var result = db.HashGet("test", $"value-{i}");
    }
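
(Putting the two snippets together, a minimal self-contained version of the repro might look like the sketch below; the connections list and DbCounts constant are assumptions matching the snippets above.)

    using System.Collections.Generic;
    using StackExchange.Redis;

    static class Repro
    {
        // Held in a static list so the multiplexers (and everything they root) stay alive.
        private const int DbCounts = 25;
        private static readonly List<IConnectionMultiplexer> connections = new();

        public static void Run()
        {
            for (int c = 0; c < DbCounts; c++)
            {
                connections.Add(ConnectionMultiplexer.Connect("localhost"));
            }

            for (int i = 0; i < 1000; i++)
            {
                var db = connections[i % DbCounts].GetDatabase();
                db.HashSet("test", $"value-{i}", i);
                var result = db.HashGet("test", $"value-{i}");
            }
        }
    }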

I collected a dump after executing the above code a few times:

Statistics:
          MT Count  TotalSize Class Name
7ffe3f730990   400 91,760,000 StackExchange.Redis.RawResult[]
Total 400 objects, 91,760,000 bytes

Here is the gcroot for one of the StackExchange.Redis.RawResult[] objects:

> gcroot 01e1e6771498
HandleTable:
    000001dbb5f511f8 (strong handle)
          -> 01e236311018     System.Object[]
          -> 01dfb6490c78     System.Collections.Generic.List<StackExchange.Redis.IConnectionMultiplexer> (static variable: System.String[].Summaries)
          -> 01dff6422598     StackExchange.Redis.IConnectionMultiplexer[]
          -> 01df765891b8     StackExchange.Redis.ConnectionMultiplexer
          -> 01df76589330     System.Collections.Hashtable
          -> 01df76589378     System.Collections.Hashtable+bucket[]
          -> 01df765896b8     StackExchange.Redis.ServerEndPoint
          -> 01df76589960     StackExchange.Redis.PhysicalBridge
          -> 01df76589ef0     StackExchange.Redis.PhysicalConnection
          -> 01df76589fb8     Pipelines.Sockets.Unofficial.Arenas.Arena<StackExchange.Redis.RawResult>
          -> 01e1e6771498     StackExchange.Redis.RawResult[]

Thread 100dc:
    8c5fff640 7ffe9bcfbccf System.Threading.PortableThreadPool+GateThread.GateThreadStart() [/_/src/libraries/System.Private.CoreLib/src/System/Threading/PortableThreadPool.GateThread.cs @ 54]
        rbp-98: 00000008c5fff6e8
          -> 01e236311018     System.Object[]
          -> 01dfb6490c78     System.Collections.Generic.List<StackExchange.Redis.IConnectionMultiplexer> (static variable: System.String[].Summaries)
          -> 01dff6422598     StackExchange.Redis.IConnectionMultiplexer[]
          -> 01df765891b8     StackExchange.Redis.ConnectionMultiplexer
          -> 01df76589330     System.Collections.Hashtable
          -> 01df76589378     System.Collections.Hashtable+bucket[]
          -> 01df765896b8     StackExchange.Redis.ServerEndPoint
          -> 01df76589960     StackExchange.Redis.PhysicalBridge
          -> 01df76589ef0     StackExchange.Redis.PhysicalConnection
          -> 01df76589fb8     Pipelines.Sockets.Unofficial.Arenas.Arena<StackExchange.Redis.RawResult>
          -> 01e1e6771498     StackExchange.Redis.RawResult[]

Found 2 unique roots.

Honestly, I can't tell whether that's the correct way of using multiplexing, but I was using that library to address some client timeouts which occurred otherwise.

Regards,

NickCraver commented 1 month ago

Are we saying with 25 multiplexers we have a total of 400 objects totaling 91.7 MB? Depending on what's in your hash, and what's in progress, that doesn't seem excessive (whereas gigabytes would raise eyebrows).

Generally I wouldn't suggest that many multiplexed connections; most use cases top out at the usefulness of just a few.
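
(For reference, the commonly documented pattern is a single shared multiplexer, typically held via Lazy<ConnectionMultiplexer>; a minimal sketch with a placeholder connection string.)

    using System;
    using StackExchange.Redis;

    public static class RedisConnection
    {
        // One multiplexer shared across the application; it is thread-safe and
        // designed to be reused rather than created per operation.
        private static readonly Lazy<ConnectionMultiplexer> _lazy =
            new(() => ConnectionMultiplexer.Connect("localhost:6379"));

        public static ConnectionMultiplexer Instance => _lazy.Value;
    }

    // Usage:
    // var db = RedisConnection.Instance.GetDatabase();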