StackExchange / StackExchange.Redis

General purpose redis client
https://stackexchange.github.io/StackExchange.Redis/

StringGetAsync() hangs on SocketManager.WriteAllQueues()+0x72 #597

Closed: galiniliev closed this issue 6 years ago

galiniliev commented 7 years ago

Hi, we are using the 1.1.608 release and occasionally we see calls to StringGetAsync() getting blocked and timing out the WebAPI calls. Looking at the call stacks, we see the following:

    StackExchange_Redis_StrongName!StackExchange.Redis.SocketManager.WriteAllQueues()+0x72
    0a 000007f97a95674e : 0000000ecbfc64c0 0000000ecbfc62b0 0000000000000000 0000000000000000 : StackExchange_Redis_StrongName!StackExchange.Redis.SocketManager+<>c.<.cctor>b__49_0(System.Object)+0x38
    0b 000007f97a9565e7 : 0000000000000000 0000000000000000 0000000000000000 0000000000000000 : mscorlib_ni!System.Threading.ExecutionContext.RunInternal(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object, Boolean)+0x15e
    0c 000007f97a9565a2 : 0000000ecbfc65b0 0000000000000000 0000000000000000 0000000000000000 : mscorlib_ni!System.Threading.ExecutionContext.Run(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object, Boolean)+0x17
    0d 000007f97b32e24d : 000000131a8bf438 000007f97bac5552 0000000000000012 000000131a8bf588 : mscorlib_ni!System.Threading.ExecutionContext.Run(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object)+0x52
    0e 000007f97bac6a53 : 0000000ecbfc65d8 0000000ecbfc62b0 0000000000000000 0000000000000000 : mscorlib_ni!System.Threading.ThreadHelper.ThreadStart(System.Object)+0x5d

Looking at SocketManager, there is a Monitor.Wait(writeQueue), which mixes the sync and async patterns. Our code is fully async. Is anyone else seeing the same problem?
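
For reference, the pattern being described is roughly a classic Monitor-based producer/consumer writer loop. A minimal illustrative sketch (not the library's actual code):

    using System.Collections.Generic;
    using System.Threading;

    // Illustrative only: a dedicated writer thread parks on Monitor.Wait
    // until a producer pulses it awake.
    class WriterLoopSketch
    {
        private readonly Queue<string> writeQueue = new Queue<string>();

        // Runs on a dedicated writer thread.
        public void WriteAllQueues()
        {
            while (true)
            {
                string item;
                lock (writeQueue)
                {
                    while (writeQueue.Count == 0)
                        Monitor.Wait(writeQueue); // blocks the thread until a Pulse
                    item = writeQueue.Dequeue();
                }
                Flush(item); // write to the socket outside the lock
            }
        }

        // Called by (possibly async) producers.
        public void Enqueue(string item)
        {
            lock (writeQueue)
            {
                writeQueue.Enqueue(item);
                Monitor.Pulse(writeQueue); // wake the writer thread
            }
        }

        private void Flush(string item) { /* socket write elided */ }
    }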

niemyjski commented 7 years ago

I think I'm seeing this as well but have been unable to track it down. I wish this client would just remove all sync support.

niemyjski commented 7 years ago

I think I'm hitting this as well. I have nothing in my dmp file other than a ton of idle waiting threads, and then this:

>   mscorlib.dll!System.Threading.Monitor.Wait(object obj, int millisecondsTimeout, bool exitContext)   Unknown
    StackExchange.Redis.dll!StackExchange.Redis.SocketManager.WriteAllQueues()  Unknown
    StackExchange.Redis.dll!StackExchange.Redis.SocketManager..cctor.AnonymousMethod__49_0(object context)  Unknown
    mscorlib.dll!System.Threading.ThreadHelper.ThreadStart_Context(object state)    Unknown
    mscorlib.dll!System.Threading.ExecutionContext.RunInternal(System.Threading.ExecutionContext executionContext, System.Threading.ContextCallback callback, object state, bool preserveSyncCtx)   Unknown
    mscorlib.dll!System.Threading.ExecutionContext.Run(System.Threading.ExecutionContext executionContext, System.Threading.ContextCallback callback, object state, bool preserveSyncCtx)   Unknown
    mscorlib.dll!System.Threading.ExecutionContext.Run(System.Threading.ExecutionContext executionContext, System.Threading.ContextCallback callback, object state) Unknown
    mscorlib.dll!System.Threading.ThreadHelper.ThreadStart(object obj)  Unknown

NickCraver commented 7 years ago

@niemyjski there should be 1 SocketManager per ConnectionMultiplexer (by default); are you dealing with many connections? The manager can also be shared between many multiplexers, if you want.
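
A minimal sketch of sharing one SocketManager across multiplexers (1.x-era API as I understand it; the connection strings are placeholders):

    using StackExchange.Redis;

    static class SharedManagerExample
    {
        // Sketch: one SocketManager serving two multiplexers.
        public static (ConnectionMultiplexer, ConnectionMultiplexer) ConnectBoth()
        {
            var sharedManager = new SocketManager("shared");

            var optionsA = ConfigurationOptions.Parse("redis-a:6379");
            optionsA.SocketManager = sharedManager;

            var optionsB = ConfigurationOptions.Parse("redis-b:6379");
            optionsB.SocketManager = sharedManager;

            return (ConnectionMultiplexer.Connect(optionsA),
                    ConnectionMultiplexer.Connect(optionsB));
        }
    }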

GePerez commented 7 years ago

Hello folks: I believe we are also hitting this issue (we are using version 1.2.0). Memory dumps of machines that have leaked memory show nothing but hundreds of threads on:

     35  Id: 235c.2220 Suspend: 0 Teb: 00007ff5`ff4a0000 Unfrozen
    Child-SP          RetAddr           Call Site
    00 00000060`63f0dd48 00007ff9`e7f9ccb4 ntdll!ZwWaitForSingleObject+0xa
    01 00000060`63f0dd50 00007ff9`e7f9aec5 mswsock!SockWaitForSingleObject+0x65
    02 00000060`63f0ddd0 00007ff9`e99dffd3 mswsock!WSPSelect+0x79c
    03 (Inline Function) --------`-------- ws2_32!DPROVIDER::WSPSelect+0x35
    04 00000060`63f0df70 00007ff9`8436b931 ws2_32!select+0x1db
    05 00000060`63f0e060 00007ff9`8436a28e search!DomainBoundILStubClass.IL_STUB_PInvoke(Int32, IntPtr[], IntPtr[], IntPtr[], TimeValue ByRef)+0xe1
    06 00000060`63f0e150 00007ff9`8436999e StackExchange_Redis!StackExchange.Redis.SocketManager.ReadImpl()+0x7fe
    07 00000060`63f0e3c0 00007ff9`e13d2d45 StackExchange_Redis!StackExchange.Redis.SocketManager.Read()+0x5e
    08 00000060`63f0e410 00007ff9`e13d2ab9 mscorlib_ni!System.Threading.ExecutionContext.RunInternal(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object, Boolean)+0x285
    09 00000060`63f0e570 00007ff9`e13d2a97 mscorlib_ni!System.Threading.ExecutionContext.Run(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object, Boolean)+0x9
    0a 00000060`63f0e5a0 00007ff9`e1ca3a8d mscorlib_ni!System.Threading.ExecutionContext.Run(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object)+0x57
    0b 00000060`63f0e5f0 00007ff9`e294ab53 mscorlib_ni!System.Threading.ThreadHelper.ThreadStart(System.Object)+0x5d
    0c 00000060`63f0e640 00007ff9`e294aa3e clr!CallDescrWorkerInternal+0x83
    0d 00000060`63f0e680 00007ff9`e294b1d2 clr!CallDescrWorkerWithHandler+0x4a
    0e 00000060`63f0e6c0 00007ff9`e2a98199 clr!MethodDescCallSite::CallTargetWorker+0x251
    0f (Inline Function) --------`-------- clr!MethodDescCallSite::Call+0xc
    10 00000060`63f0e870 00007ff9`e294c481 clr!ThreadNative::KickOffThread_Worker+0x105
    11 00000060`63f0ead0 00007ff9`e294c408 clr!ManagedThreadBase_DispatchInner+0x2d
    12 00000060`63f0eb10 00007ff9`e294c379 clr!ManagedThreadBase_DispatchMiddle+0x6c
    13 00000060`63f0ec10 00007ff9`e294c4bb clr!ManagedThreadBase_DispatchOuter+0x75
    14 00000060`63f0eca0 00007ff9`e2a9807a clr!ManagedThreadBase_FullTransitionWithAD+0x2f
    15 (Inline Function) --------`-------- clr!ManagedThreadBase::KickOff+0x20
    16 00000060`63f0ed00 00007ff9`e2a07de6 clr!ThreadNative::KickOffThread+0xd2
    17 00000060`63f0edd0 00007ff9`e91713d2 clr!Thread::intermediateThreadProc+0x7d
    18 00000060`63f0f910 00007ff9`eb6554e4 kernel32!BaseThreadInitThunk+0x22
    19 00000060`63f0f940 00000000`00000000 ntdll!RtlUserThreadStart+0x34

and

     34  Id: 235c.1fa8 Suspend: 0 Teb: 00007ff5`ff4a2000 Unfrozen
    Child-SP          RetAddr           Call Site
    00 00000060`63ecded8 00007ff9`e8ba13ed ntdll!ZwWaitForMultipleObjects+0xa
    01 00000060`63ecdee0 00007ff9`e2b1f7da KERNELBASE!WaitForMultipleObjectsEx+0xe1
    02 00000060`63ece1c0 00007ff9`e2b1f5e6 clr!WaitForMultipleObjectsEx_SO_TOLERANT+0x62
    03 (Inline Function) --------`-------- clr!Thread::DoAppropriateAptStateWait+0x45
    04 00000060`63ece220 00007ff9`e2b1f3fd clr!Thread::DoAppropriateWaitWorker+0x1d0
    05 00000060`63ece320 00007ff9`e2b1f6a9 clr!Thread::DoAppropriateWait+0x7d
    06 00000060`63ece3a0 00007ff9`e2a991f3 clr!CLREventBase::WaitEx+0xc1
    07 (Inline Function) --------`-------- clr!CLREventBase::Wait+0x1f
    08 (Inline Function) --------`-------- clr!Thread::Wait+0x1f
    09 00000060`63ece430 00007ff9`e2a991c1 clr!Thread::Block+0x27
    0a 00000060`63ece460 00007ff9`e2a98f81 clr!SyncBlock::Wait+0x170
    0b (Inline Function) --------`-------- clr!ObjHeader::Wait+0x25
    0c (Inline Function) --------`-------- clr!Object::Wait+0x25
    0d 00000060`63ece5a0 00007ff9`84356843 clr!ObjectNative::WaitTimeout+0xdd
    0e 00000060`63ece720 00007ff9`84356625 StackExchange_Redis!StackExchange.Redis.SocketManager.WriteAllQueues()+0x83
    0f 00000060`63ece790 00007ff9`e13d2d45 StackExchange_Redis!StackExchange.Redis.SocketManager+<>c.<.cctor>b__49_0(System.Object)+0x35
    10 00000060`63ece7d0 00007ff9`e13d2ab9 mscorlib_ni!System.Threading.ExecutionContext.RunInternal(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object, Boolean)+0x285
    11 00000060`63ece930 00007ff9`e13d2a97 mscorlib_ni!System.Threading.ExecutionContext.Run(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object, Boolean)+0x9
    12 00000060`63ece960 00007ff9`e1ca3a8d mscorlib_ni!System.Threading.ExecutionContext.Run(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object)+0x57
    13 00000060`63ece9b0 00007ff9`e294ab53 mscorlib_ni!System.Threading.ThreadHelper.ThreadStart(System.Object)+0x5d
    14 00000060`63ecea00 00007ff9`e294aa3e clr!CallDescrWorkerInternal+0x83
    15 00000060`63ecea40 00007ff9`e294b1d2 clr!CallDescrWorkerWithHandler+0x4a
    16 00000060`63ecea80 00007ff9`e2a98199 clr!MethodDescCallSite::CallTargetWorker+0x251
    17 (Inline Function) --------`-------- clr!MethodDescCallSite::Call+0xc
    18 00000060`63ecec30 00007ff9`e294c481 clr!ThreadNative::KickOffThread_Worker+0x105
    19 00000060`63ecee90 00007ff9`e294c408 clr!ManagedThreadBase_DispatchInner+0x2d
    1a 00000060`63eceed0 00007ff9`e294c379 clr!ManagedThreadBase_DispatchMiddle+0x6c
    1b 00000060`63ecefd0 00007ff9`e294c4bb clr!ManagedThreadBase_DispatchOuter+0x75
    1c 00000060`63ecf060 00007ff9`e2a9807a clr!ManagedThreadBase_FullTransitionWithAD+0x2f
    1d (Inline Function) --------`-------- clr!ManagedThreadBase::KickOff+0x20
    1e 00000060`63ecf0c0 00007ff9`e2a07de6 clr!ThreadNative::KickOffThread+0xd2
    1f 00000060`63ecf190 00007ff9`e91713d2 clr!Thread::intermediateThreadProc+0x7d
    20 00000060`63ecfc50 00007ff9`eb6554e4 kernel32!BaseThreadInitThunk+0x22
    21 00000060`63ecfc80 00000000`00000000 ntdll!RtlUserThreadStart+0x34

As @galiniliev suggested, this seems to be a problem with the mixing of sync and async code. In an attempt to avoid the issue, we tried telling the TPL not to schedule the continuation of the StringGetAsync(key) task on the same thread by calling it with .ConfigureAwait(false); unfortunately, even after this change we are still hitting the issue.

Is there tracing we could turn on to help you get to the bottom of this? Is there anything you'd recommend we do to work around the problem?

Any help would be greatly appreciated. Thanks

NickCraver commented 7 years ago

I'm curious: why does everyone here think this is a sync/async issue? The thread repeatedly posted here is another thread servicing the network socket. Notice how your async method isn't in the call stack? That's because it doesn't call into synchronous code. It simply adds to the transmit queue and returns. Your await will be fulfilled when a message returns from redis and we complete the original task, ultimately via TryComplete() on the original task here.
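
Conceptually, the flow is something like this simplified sketch (illustrative only, not the library's actual code):

    using System.Collections.Concurrent;
    using System.Threading.Tasks;

    // The async call enqueues a request and returns a task backed by a
    // TaskCompletionSource; a dedicated reader thread completes that task
    // when the reply comes back from the server.
    class PipelineSketch
    {
        private readonly ConcurrentQueue<(string Command, TaskCompletionSource<string> Pending)>
            writeQueue = new ConcurrentQueue<(string, TaskCompletionSource<string>)>();

        public Task<string> StringGetAsync(string key)
        {
            var pending = new TaskCompletionSource<string>();
            writeQueue.Enqueue(("GET " + key, pending)); // no blocking; returns immediately
            return pending.Task;                         // this is what the caller awaits
        }

        // Runs on the dedicated socket-reader thread when a reply arrives:
        private void OnReply(TaskCompletionSource<string> pending, string reply)
            => pending.TrySetResult(reply); // fulfils the caller's await
    }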

So far, all the symptoms posted simply point to a response from redis taking time. Does anyone have numbers for latency, ops, etc. to their redis instance?

@GePerez

> we have tried telling the TPL to not schedule the continuation of the StringGetAsync(key) task on the same thread by updating our code to call it with .ConfigureAwait(false)

I'd like to clarify for this and others: that's not what .ConfigureAwait(false) does. It doesn't affect which thread the continuation runs on; that's your continuation options (which are often just suggestions). When calling .ConfigureAwait(false), what you're saying is: don't capture the synchronization context (to be more lightweight, with less to restore upon continuation).
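
For illustration (a minimal sketch; the method and parameter names here are hypothetical):

    using StackExchange.Redis;
    using System.Threading.Tasks;

    static class ConfigureAwaitExample
    {
        public static async Task<string> GetAsync(IDatabase db, string key)
        {
            // Opts out of capturing the current SynchronizationContext; it
            // does NOT choose which thread the continuation runs on.
            RedisValue value = await db.StringGetAsync(key).ConfigureAwait(false);

            // From here we may be on any thread-pool thread; nothing marshals
            // us back to the original context (e.g. an ASP.NET request context).
            return value;
        }
    }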

GePerez commented 7 years ago

Thank you for the quick response @NickCraver. I'm not sure I understand this statement:

> It simply adds to the transmit queue and returns. Your await will be fulfilled when a message returns from redis.

Our code retrieves an item from the cache by calling:

    string serializedObject = await ConnectionMultiplexer.GetDatabase().StringGetAsync(key);

If the above does not somehow create a thread with the stack:

     34  Id: 235c.1fa8 Suspend: 0 Teb: 00007ff5`ff4a2000 Unfrozen
    Child-SP          RetAddr           Call Site
    00 00000060`63ecded8 00007ff9`e8ba13ed ntdll!ZwWaitForMultipleObjects+0xa
    01 00000060`63ecdee0 00007ff9`e2b1f7da KERNELBASE!WaitForMultipleObjectsEx+0xe1
    02 00000060`63ece1c0 00007ff9`e2b1f5e6 clr!WaitForMultipleObjectsEx_SO_TOLERANT+0x62
    03 (Inline Function) --------`-------- clr!Thread::DoAppropriateAptStateWait+0x45
    04 00000060`63ece220 00007ff9`e2b1f3fd clr!Thread::DoAppropriateWaitWorker+0x1d0
    05 00000060`63ece320 00007ff9`e2b1f6a9 clr!Thread::DoAppropriateWait+0x7d
    06 00000060`63ece3a0 00007ff9`e2a991f3 clr!CLREventBase::WaitEx+0xc1
    07 (Inline Function) --------`-------- clr!CLREventBase::Wait+0x1f
    08 (Inline Function) --------`-------- clr!Thread::Wait+0x1f
    09 00000060`63ece430 00007ff9`e2a991c1 clr!Thread::Block+0x27
    0a 00000060`63ece460 00007ff9`e2a98f81 clr!SyncBlock::Wait+0x170
    0b (Inline Function) --------`-------- clr!ObjHeader::Wait+0x25
    0c (Inline Function) --------`-------- clr!Object::Wait+0x25
    0d 00000060`63ece5a0 00007ff9`84356843 clr!ObjectNative::WaitTimeout+0xdd
    0e 00000060`63ece720 00007ff9`84356625 StackExchange_Redis!StackExchange.Redis.SocketManager.WriteAllQueues()+0x83
    0f 00000060`63ece790 00007ff9`e13d2d45 StackExchange_Redis!StackExchange.Redis.SocketManager+<>c.<.cctor>b__49_0(System.Object)+0x35

What usage of Redis should we look for in our code to determine the source of all those threads?

With regards to this:

> So far, all the symptoms posted simply point to a response from redis taking time

Do you have any recommendations for how to configure the Redis client side to time out after some time?

Today, the call to StringGetAsync is wrapped in a TimeoutAfter API as described at https://blogs.msdn.microsoft.com/pfxteam/2011/11/10/crafting-a-task-timeoutafter-method (roughly the pattern sketched below).
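
A minimal sketch of that pattern, assuming a WhenAny-based variant of the linked post (not our exact code):

    using System;
    using System.Threading;
    using System.Threading.Tasks;

    static class TaskTimeoutExtensions
    {
        // Race the task against a delay; note this abandons the pending
        // redis call on timeout rather than cancelling it.
        public static async Task<T> TimeoutAfter<T>(this Task<T> task, TimeSpan timeout)
        {
            using (var cts = new CancellationTokenSource())
            {
                var winner = await Task.WhenAny(task, Task.Delay(timeout, cts.Token));
                if (winner != task)
                    throw new TimeoutException();

                cts.Cancel();      // stop the timer for the losing delay task
                return await task; // propagate the result or exception
            }
        }
    }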

Unfortunately that does not seem to work, which is why we started suspecting some sort of deadlock. Thanks a lot for your help.

NickCraver commented 7 years ago

@GePerez perhaps I misunderstood; if you're actually seeing hundreds of those threads, I need to ask different questions to identify the problem.

How are you creating the multiplexer(s)? Are you sharing 1 for the application, or creating many?

niemyjski commented 7 years ago

@NickCraver sorry for the late response. I'm seeing this exact behavior. Yes, we are creating a single multiplexer (https://github.com/exceptionless/Exceptionless/blob/master/src/Exceptionless.Insulation/Bootstrapper.cs#L47) and we are cleaning up after ourselves. We've had multiple people code-review our implementation, and I'm not sure what we could be doing that would cause this, but it's definitely an issue I've been unable to track down further than this.

https://github.com/FoundatioFx/Foundatio.Redis

niemyjski commented 7 years ago

Also, it only seems to happen in production and very rarely in dev, so it's really hard to track down on my side. We are using Azure premium redis, and I haven't seen any metrics telling me I'm hitting redis timeouts (via redsmin).

mgravell commented 6 years ago

Closing; please see #871 (WriteAllQueues literally doesn't exist in 2.0)