StackExchange / StackExchange.Redis

General purpose redis client
https://stackexchange.github.io/StackExchange.Redis/

2.6.104 Change allows for synchronous writes to flood thread pool and timeout, 2.6.96 does not #2565

Closed: supjohn1 closed this issue 5 months ago

supjohn1 commented 10 months ago

We have multiple places where our system times out after moving to >= 2.6.104. I focused on a Parallel.ForEach that writes 28k strings into Redis. On <= 2.6.96 these writes complete in under 10 seconds without any timeouts. Starting with 2.6.104 the same writes deadlock at some point (after roughly 5-10k strings), time out, and the connection disposes itself.

I wrote a test to replicate this; the meat of it is below. I cannot tell whether this is a symptom of the problem or simply a red herring for a different problem, but I was able to determine that if I comment out either Interlocked.Increment(ref _backlogCurrentEnqueued) or Interlocked.Decrement(ref _backlogCurrentEnqueued), the writes finish and do not time out.

https://github.com/StackExchange/StackExchange.Redis/blame/2.6.104/src/StackExchange.Redis/PhysicalBridge.cs#L824

The Increment line was added as part of 2.6.104. I did try several alternatives to commenting out those two lines, such as different lock options or an Interlocked read instead of a Volatile read, but none of those changes resolved the timeout. At this point I am at a loss: the changes that should fix a deadlock around _backlogCurrentEnqueued don't seem to, so I wonder whether I have the wrong problem/solution.
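
For context, the shape of the pattern I'm describing is roughly the following. This is purely an illustration of an Interlocked-maintained backlog counter, not the actual PhysicalBridge code; every name here other than _backlogCurrentEnqueued is invented for the example.

```csharp
using System.Collections.Concurrent;
using System.Threading;

// Illustrative only: an enqueue/dequeue pair keeps a counter in sync, and other
// code gates on that counter. Removing either the Increment or the Decrement
// makes the counter drift, which changes when (or whether) anything waiting on
// it ever observes an empty backlog.
internal sealed class BacklogCounterSketch
{
    private readonly ConcurrentQueue<string> _backlog = new();
    private int _backlogCurrentEnqueued;

    public void Enqueue(string message)
    {
        _backlog.Enqueue(message);
        Interlocked.Increment(ref _backlogCurrentEnqueued);
    }

    public bool TryDequeue(out string? message)
    {
        if (_backlog.TryDequeue(out message))
        {
            Interlocked.Decrement(ref _backlogCurrentEnqueued);
            return true;
        }
        return false;
    }

    // Anything that waits for the backlog to drain keys off this counter.
    public bool HasPendingWork => Volatile.Read(ref _backlogCurrentEnqueued) > 0;
}
```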
We don't want to move forward without understanding how we should change our code to resolve this, or whether there is a fix that can be made in StackExchange.Redis.

Appreciate any thoughts and attention

```csharp
Parallel.ForEach(stringList, st =>
{
    scopedDb.StringSet(st.Key, st.Value, expiry: expirationTime, flags: StackExchange.Redis.CommandFlags.None);
    RedisValue what = scopedDb.StringGet(st.Key);
    Console.WriteLine(what.ToString());
});
```
NickCraver commented 10 months ago

Do you have a full repro here? It's worth noting that by default the console logger is going to be blocking in ways that are super unfriendly, so I'd need to see your full test to advise.
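
For example, a quick way to take the console out of the picture while measuring (just a sketch built on the snippet above, reusing scopedDb, stringList, and expirationTime from your test; not a diagnosis) is to count the round-trips instead of writing every value to stdout:

```csharp
// Same loop as above, but without a Console.WriteLine per item, since console
// output is synchronized and can serialize the parallel workers.
// Assumes System.Threading and StackExchange.Redis usings plus the variables
// defined in the test.
int verified = 0;
Parallel.ForEach(stringList, st =>
{
    scopedDb.StringSet(st.Key, st.Value, expiry: expirationTime, flags: CommandFlags.None);
    RedisValue what = scopedDb.StringGet(st.Key);
    if (what == st.Value)
    {
        Interlocked.Increment(ref verified);
    }
});
Console.WriteLine($"Verified {verified} of {stringList.Count} round-trips.");
```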

supjohn1 commented 10 months ago

@NickCraver Thank you for the quick reply. I tried to push my test but I lack permission. This is the test I have in AggressiveTests.cs; lengthEachString and numStrings can be tweaked to make it time out or pass. Let me know if there is a better way to get this to you or if you have questions.

```csharp
[Fact]
public void ParallelStringsetDeadlockTest()
{
    // Get the current settings.
    //ThreadPool.GetMinThreads(out int minWorker, out int minIOC);
    //Debug.WriteLine($"*** {minWorker} {minIOC}");

    const int Muxers = 1;
    // Actual data sizes are lengthEachString = ~3000 and numStrings = 28040.
    // Minimum level to cause deadlock on my system: lengthEachString = 1500, numStrings = 10000.
    int lengthEachString = 3000, numStrings = 28040;

    System.Collections.Generic.List<System.Collections.Generic.KeyValuePair<string, string>> stringList = new System.Collections.Generic.List<System.Collections.Generic.KeyValuePair<string, string>>();

    var muxers = new IConnectionMultiplexer[Muxers];
    try
    {
        muxers[0] = Create();

        for (int i = 0; i < numStrings; i++)
        {
            System.Text.StringBuilder lineString = new System.Text.StringBuilder();
            for (int j = 0; j < lengthEachString; j++)
            {
                lineString.Append("a");
            }

            stringList.Add(new System.Collections.Generic.KeyValuePair<string, string>($"MyString{i}", lineString.ToString()));
        }

        var timespan = DateTime.Now.AddSeconds(60);
        var expirationTime = timespan.Subtract(DateTime.Now);

        var scopedDb = muxers[0].GetDatabase();
        scopedDb.StringSet("string1", "string1", expiry: expirationTime, flags: StackExchange.Redis.CommandFlags.None);
        //RedisValue what = scopedDb.StringGet("string1");
        //Console.WriteLine(what.ToString());
        Parallel.ForEach(stringList, st =>
        {
            scopedDb.StringSet(st.Key, st.Value, expiry: expirationTime, flags: StackExchange.Redis.CommandFlags.None);
            RedisValue what = scopedDb.StringGet(st.Key);
            Console.WriteLine(what.ToString());
        });
    }
    catch (Exception e)
    {
        Console.WriteLine(e.Message);
    }
    finally
    {
        for (int i = 0; i < muxers.Length; i++)
        {
            try { muxers[i]?.Dispose(); }
            catch { /* Don't care */ }
        }
    }
}
```
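
For completeness, one knob that can be poked alongside this test (a sketch only; the commented-out ThreadPool.GetMinThreads lines above were heading this way, and the value 200 is arbitrary) is whether raising the worker-thread minimum changes the behavior:

```csharp
// Inspect and optionally raise the minimum worker/IOCP thread counts before the
// parallel writes, to see whether thread-pool ramp-up is a factor.
// 200 is an arbitrary experimental value, not a recommendation.
ThreadPool.GetMinThreads(out int minWorker, out int minIoc);
Console.WriteLine($"Min worker threads: {minWorker}, min IOCP threads: {minIoc}");
ThreadPool.SetMinThreads(Math.Max(minWorker, 200), minIoc);
```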

NickCraver commented 10 months ago

@supjohn1 Thanks a ton for the repro - I'm slammed at the moment but will try to repro this week and report back.

supjohn1 commented 10 months ago

Really appreciate you taking the time.

adamjez commented 9 months ago

Hey guys, I suspect I'm facing the same issue. I was unable to identify which specific version introduced the odd behavior, but after upgrading from 2.6.86 to 2.6.122 I'm seeing performance degradation.

The application runs without any problem for some time, until something happens and then:

We're running in a multi-instance environment, and at least 3 instances were affected at the same moment. The application is heavily dependent on Redis and does not use any other storage.

I've got a timeout exception from one of the affected instances, but I don't think it contains any interesting information:

```text
Timeout awaiting response (outbound=23KiB, inbound=782KiB, 518ms elapsed, timeout is 400ms),
command=HMGET, next: HMGET CatalogProduct:156289, inst: 0, qu: 0, qs: 270, aw: False, bw: SpinningDown,
rs: DequeueResult, ws: Idle, in: 0, in-pipe: 77112, out-pipe: 0, last-in: 0, cur-in: 14533, sync-ops: 0, async-ops: 18552056,
serverEndpoint: xxx, conn-sec: 14875.07, aoc: 0, mc: 1/1/0, mgr: 9 of 10 available, clientName: product-api-9965766c5-vbgdm,
IOCP: (Busy=0,Free=1000,Min=1,Max=1000), WORKER: (Busy=3,Free=32764,Min=32,Max=32767),
POOL: (Threads=18,QueuedItems=19,CompletedItems=29219203,Timers=32), v: 2.6.122.38350
```
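
For reference, the "timeout is 400ms" in that message is the configured operation timeout, which can be set through ConfigurationOptions roughly like this (a sketch with a placeholder endpoint, not our exact configuration; raising the timeout only buys headroom during spikes, it doesn't explain the queueing):

```csharp
// Placeholder endpoint and illustrative values only.
var options = new ConfigurationOptions
{
    EndPoints = { "redis-host:6379" }, // placeholder
    SyncTimeout = 400,   // ms budget for synchronous operations
    AsyncTimeout = 400,  // ms budget for async operations
};
var muxer = ConnectionMultiplexer.Connect(options);
```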

I'm sorry that I've got no repro for this behavior. Should I create a new issue?

supjohn1 commented 8 months ago

@NickCraver I know you are very busy and I don't want to just call you out; hopefully there is someone else who helps with these topics. Just curious whether anyone has had a chance to take a look at this?

supjohn1 commented 5 months ago

I'll close this. We are moving toward the async APIs, which does seem to mitigate the timeouts on the newer version.
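
For anyone who finds this later, the direction we're taking is roughly the following. It's a sketch rather than our production code; it reuses scopedDb, stringList, and expirationTime from the test above, and the concurrency cap of 64 is arbitrary.

```csharp
// Bounded-concurrency async writes: awaiting StringSetAsync releases the thread
// instead of blocking it the way the synchronous StringSet in Parallel.ForEach did.
// Assumes System.Linq, System.Threading, System.Threading.Tasks, and
// StackExchange.Redis usings plus the variables from the test above.
var throttle = new SemaphoreSlim(64); // arbitrary cap on in-flight operations
var tasks = stringList.Select(async st =>
{
    await throttle.WaitAsync();
    try
    {
        await scopedDb.StringSetAsync(st.Key, st.Value, expiry: expirationTime,
                                      flags: CommandFlags.None);
    }
    finally
    {
        throttle.Release();
    }
}).ToList();

await Task.WhenAll(tasks);
```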