dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License
15.11k stars 4.7k forks source link

Deadlock in WinHttpResponseStream #17844

Closed ericeil closed 4 years ago

ericeil commented 8 years ago

While investigating dotnet/runtime#17744, I saw one System.Net.Http test run deadlock in some cancellation token registration management code.

One thread was unregistering from a cancellation token:

00 ntdll!ZwDelayExecution
01 KERNELBASE!SleepEx
02 CoreCLR!Thread::UserSleep
03 CoreCLR!ThreadNative::Sleep
04 System_Private_CoreLib_ni!System.Threading.Thread.Sleep(Int32)
05 System_Private_CoreLib_ni!System.Threading.SpinWait.SpinOnce()
06 System_Private_CoreLib_ni!System.Threading.CancellationTokenSource.WaitForCallbackToComplete(System.Threading.CancellationCallbackInfo)
07 system_net_http!System.Net.Http.WinHttpRequestState.DisposeCtrReadFromResponseStream()
08 system_net_http!System.Net.Http.WinHttpRequestCallback.OnRequestReadComplete(System.Net.Http.WinHttpRequestState, UInt32)
09 system_net_http!System.Net.Http.WinHttpRequestCallback.RequestCallback(IntPtr, System.Net.Http.WinHttpRequestState, UInt32, IntPtr, UInt32)
0a system_net_http!System.Net.Http.WinHttpRequestCallback.WinHttpCallback(IntPtr, IntPtr, UInt32, IntPtr, UInt32)
0b system_console!DomainBoundILStubClass.IL_STUB_ReversePInvoke(Int64, Int64, Int32, Int64, Int32)
0c CoreCLR!UMThunkStub
0d winhttp!HTTP_REQUEST_HANDLE_OBJECT::_SafeAppCallback
0e winhttp!HTTP_REQUEST_HANDLE_OBJECT::_ControlledAppCallback
0f winhttp!HTTP_REQUEST_HANDLE_OBJECT::IndicateCompletionStatusCommon
10 winhttp!HTTP_REQUEST_HANDLE_OBJECT::IndicateCompletionStatusInline
11 winhttp!WinHttpReadData
12 system_console!DomainBoundILStubClass.IL_STUB_PInvoke(SafeWinHttpHandle, IntPtr, UInt32, IntPtr)
13 system_net_http!System.Net.Http.WinHttpResponseStream+<>c__DisplayClass17_0.<ReadAsync>b__0(System.Threading.Tasks.Task`1<Int32>)
14 System_Private_CoreLib_ni!System.Threading.Tasks.Task.Execute()
15 System_Private_CoreLib_ni!System.Threading.ExecutionContext.Run(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object)
16 System_Private_CoreLib_ni!System.Threading.Tasks.Task.ExecuteWithThreadLocal(System.Threading.Tasks.Task ByRef)
17 System_Private_CoreLib_ni!System.Threading.Tasks.Task.ExecuteEntry(Boolean)
18 System_Private_CoreLib_ni!System.Threading.ThreadPoolWorkQueue.Dispatch()
19 CoreCLR!CallDescrWorkerInternal
1a CoreCLR!MethodDescCallSite::CallTargetWorker
1b CoreCLR!MethodDescCallSite::Call_RetBool
1c CoreCLR!QueueUserWorkItemManagedCallback
1d CoreCLR!ManagedThreadBase_DispatchInner
1e CoreCLR!ManagedThreadBase_DispatchMiddle
1f CoreCLR!ManagedThreadBase_DispatchOuter
20 CoreCLR!ManagedThreadBase_FullTransitionWithAD
21 CoreCLR!ManagedThreadBase::ThreadPool
22 CoreCLR!ManagedPerAppDomainTPCount::DispatchWorkItem
23 CoreCLR!ThreadpoolMgr::ExecuteWorkRequest
24 CoreCLR!ThreadpoolMgr::WorkerThreadStart
25 CoreCLR!Thread::intermediateThreadProc
26 KERNEL32!BaseThreadInitThunk
27 ntdll!RtlUserThreadStart

So, CancellationTokenSource.WaitForCallbackToComplete is waiting for an already-executing registered callback. This appears to be running on the following stack:

00 ntdll!ZwWaitForMultipleObjects
01 KERNELBASE!WaitForMultipleObjectsEx
02 CoreCLR!WaitForMultipleObjectsEx_SO_TOLERANT
03 CoreCLR!Thread::DoAppropriateAptStateWait
04 CoreCLR!Thread::DoAppropriateWaitWorker
05 CoreCLR!Thread::DoAppropriateWait
06 CoreCLR!CLREventBase::WaitEx
07 CoreCLR!CLREventBase::Wait
08 CoreCLR!AwareLock::EnterEpilogHelper
09 CoreCLR!AwareLock::EnterEpilog
0a CoreCLR!SyncBlock::EnterMonitor
0b CoreCLR!ObjHeader::EnterObjMonitor
0c CoreCLR!Object::EnterObjMonitor
0d CoreCLR!JITutil_MonEnterWorker
0e system_net_http!System.Net.Http.WinHttpResponseStream.CancelPendingResponseStreamReadOperation()
0f system_net_http!System.Net.Http.WinHttpResponseStream+<>c.<ReadAsync>b__17_1(System.Object)
10 System_Private_CoreLib_ni!System.Threading.ExecutionContext.Run(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object)
11 System_Private_CoreLib_ni!System.Threading.CancellationTokenSource.ExecuteCallbackHandlers(Boolean)
12 System_Private_CoreLib_ni!System.Threading.CancellationTokenSource.Cancel()
13 system_net_http_functional_tests!System.Net.Http.Functional.Tests.ResponseStreamTest+<ReadAsStreamAsync_Cancel_TaskIsCanceled>d__5.MoveNext()
14 System_Private_CoreLib_ni!System.Threading.ExecutionContext.Run(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object)
15 System_Private_CoreLib_ni!System.Runtime.CompilerServices.AsyncMethodBuilderCore+MoveNextRunner.RunWithDefaultContext()
16 xunit_execution_dotnet!Xunit.Sdk.AsyncTestSyncContext+<>c__DisplayClass7_0.<Post>b__1(System.Object)
17 xunit_execution_dotnet!Xunit.Sdk.MaxConcurrencySyncContext.RunOnSyncContext(System.Threading.SendOrPostCallback, System.Object)
18 System_Private_CoreLib_ni!System.Threading.ExecutionContext.Run(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object)
19 xunit_execution_dotnet!Xunit.Sdk.MaxConcurrencySyncContext.WorkerThreadProc()
1a xunit_execution_dotnet!Xunit.Sdk.XunitWorkerThread+<>c.<QueueUserWorkItem>b__5_0(System.Object)
1b System_Private_CoreLib_ni!System.Threading.Tasks.Task.Execute()
1c System_Private_CoreLib_ni!System.Threading.ExecutionContext.Run(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object)
1d System_Private_CoreLib_ni!System.Threading.Tasks.Task.ExecuteWithThreadLocal(System.Threading.Tasks.Task ByRef)
1e System_Private_CoreLib_ni!System.Threading.Tasks.Task.ExecuteEntry(Boolean)
1f System_Private_CoreLib_ni!System.Threading.ExecutionContext.Run(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object)
20 CoreCLR!CallDescrWorkerInternal
21 CoreCLR!MethodDescCallSite::CallTargetWorker
22 CoreCLR!MethodDescCallSite::Call
23 CoreCLR!ThreadNative::KickOffThread_Worker
24 CoreCLR!ManagedThreadBase_DispatchInner
25 CoreCLR!ManagedThreadBase_DispatchMiddle
26 CoreCLR!ManagedThreadBase_DispatchOuter
27 CoreCLR!ManagedThreadBase_FullTransitionWithAD
28 CoreCLR!ManagedThreadBase::KickOff
29 CoreCLR!ThreadNative::KickOffThread
2a CoreCLR!Thread::intermediateThreadProc
2b KERNEL32!BaseThreadInitThunk
2c ntdll!RtlUserThreadStart

WinHttpResponseStream.CancelPendingResponseStreamReadOperation is trying to acquire a lock, which is already held by the first thread - the one waiting for the cancellation callback to complete. So, classic deadlock.

I can't reproduce this reliably, but I found it while running System.Net.Http.Functional.Tests under WinDBG, which may alter timing enough to make this more likely to repro.

davidsh commented 8 years ago

cc: @stephentoub @CIPop

EmmaZhu commented 7 years ago

Has this been fixed? If yes, could you share the package version?

davidsh commented 7 years ago

Has this been fixed? If yes, could you share the package version?

This issue is still open. It has not been fixed yet.

karelz commented 7 years ago

Next step:

baal2000 commented 7 years ago

@karelz @davidsh

Comment: System.Net.Http + System.Net.Http.WinHttpHandler NuGet v. 4.3.0 - based custom server runs out of IO Completion threads intermittently (at least once a week) due to this exact issue. At the point of failure, the process dump contains 1500+ threads in WinHttpRequestState.DisposeCtrReadFromResponseStream() along with a similar count of threads in WinHttpResponseStream.CancelPendingResponseStreamReadOperation().

In particular

Question: WinHttpResponseStream.ReadAsync and WinHttpRequestState implementation in master branch is different from release/1.1.0. Has the deadlock been addressed there?

neilcawse commented 7 years ago

This issue is impacting us in production. Any meaningful systems with lots of async connections are likely to also be impacted. Case has been open 8 months, a long time if dotnet core is being pushed for use in serious systems. If you are stuck maybe we can try help build a repo.does seem to be load related.

davidsh commented 7 years ago

Please submit a repro and/or memory dump. Otherwise, it will be very difficult to diagnose. Thx.

baal2000 commented 7 years ago

@davidsh @karelz

Please follow up @EmmaZhu in the reference to [CoreCLR][KnownIssue]Hang in when cancelling a request due a dead lock in .Net Core framework dotnet/corefx#47.

baal2000 commented 7 years ago

@davidsh

Might help: the parsed dump showing all active threads with call stack and thread Id at the moment of the process failure.

Let me know if there are any questions.

karelz commented 7 years ago

@baal2000 your reference to @EmmaZhu does not contain any repro or signs of ability to get repro. Or am I missing something? Let's see if your parsed dump helps. From its brief inspection I have a suspicion that it won't be enough for us to root-cause - @davidsh can you please take a look? If it doesn't, we might need (heap) dump (or even better a repro) to have a chance to make further progress (and even that is not guarantee it will be sufficient).

@neilcawse while the issue is opened for 8 months, it had only 2 hits until a week ago (moreover one of them only under debugger, i.e. it's suspicious/uncertain). Given that there was no repro or any sign of high impact under load, it so no big surprise that nobody focused on it so far as it was not actionable. With the additional info from you and @baal2000 in last 2 days, we now understand that it has non-trivial impact. So the next step is to get a repro and/or full dump (if the info above isn't sufficient).

To get a repro we might try to stress the code path and/or leverage the tests that hit it - the original one from @ericeil or dotnet/runtime#17744 (which may have or may not have been a dupe).

cc @stephentoub

davidsh commented 7 years ago

Let's see if your parsed dump helps. From its brief inspection I have a suspicion that it won't be enough for us to root-cause - @davidsh can you please take a look? If it doesn't, we might need (heap) dump (or even better a repro) to have a chance to make further progress (and even that is not guarantee it will be sufficient).

I would need a heap dump (*.dmp file ) and/or a repro to diagnose this.

karelz commented 7 years ago

@baal2000 @neilcawse @EmmaZhu is it something that you can provide? Otherwise, we are stuck I am afraid :(

baal2000 commented 7 years ago

@karelz @davidsh

Not exactly a repro but a summary of what our code does to get into the trouble state:

The significant thread players are:

  1. Suspected Async http request thread:
System.Threading.Thread.Sleep(Int32)
System.Threading.SpinWait.SpinOnce()
**System.Threading.CancellationTokenSource.WaitForCallbackToComplete**(System.Threading.CancellationCallbackInfo)
System.Threading.CancellationTokenRegistration.Dispose()
System.Net.Http.WinHttpRequestState.DisposeCtrReadFromResponseStream()
System.Net.Http.WinHttpRequestCallback.OnRequestReadComplete(System.Net.Http.WinHttpRequestState, UInt32)
System.Net.Http.WinHttpRequestCallback.RequestCallback(IntPtr, System.Net.Http.WinHttpRequestState, UInt32, IntPtr, UInt32)
System.Net.Http.WinHttpRequestCallback.WinHttpCallback(IntPtr, IntPtr, UInt32, IntPtr, UInt32)
DomainBoundILStubClass.IL_STUB_ReversePInvoke(Int64, Int64, Int32, Int64, Int32)
DomainBoundILStubClass.IL_STUB_PInvoke(SafeWinHttpHandle, IntPtr, UInt32, IntPtr)
System.Net.Http.WinHttpResponseStream+<>c__DisplayClass17_0.<ReadAsync>b__0(System.Threading.Tasks.Task`1<Int32>)
System.Threading.Tasks.Task.Execute()
System.Threading.ExecutionContext.RunInternal(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object, Boolean)
System.Threading.ExecutionContext.Run(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object, Boolean)
System.Threading.Tasks.Task.ExecuteWithThreadLocal(System.Threading.Tasks.Task ByRef)
System.Threading.Tasks.Task.ExecuteEntry(Boolean)
System.Threading.ThreadPoolWorkQueue.Dispatch()
  1. Async task IO completion thread with some low-level TPL-related scaffolding removed (can be found inside the full parsed dump):
**System.Net.Http.WinHttpResponseStream.CancelPendingResponseStreamReadOperation()**
System.Threading.ExecutionContext.RunInternal(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object, Boolean)
System.Threading.ExecutionContext.Run(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object, Boolean)
System.Threading.ExecutionContext.Run(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object)
System.Threading.CancellationCallbackInfo.ExecuteCallback()
System.Threading.CancellationTokenSource.CancellationCallbackCoreWork(System.Threading.CancellationCallbackCoreWorkArguments)
System.Threading.CancellationTokenSource.ExecuteCallbackHandlers(Boolean)
System.Threading.CancellationTokenSource.NotifyCancellation(Boolean)
System.Threading.CancellationTokenSource.LinkedTokenCancelDelegate(System.Object)
System.Threading.CancellationCallbackInfo.ExecuteCallback()
System.Threading.CancellationTokenSource.CancellationCallbackCoreWork(System.Threading.CancellationCallbackCoreWorkArguments)
System.Threading.CancellationTokenSource.ExecuteCallbackHandlers(Boolean)
System.Threading.CancellationTokenSource.NotifyCancellation(Boolean)
System.Threading.CancellationTokenSource.LinkedTokenCancelDelegate(System.Object)
System.Threading.CancellationCallbackInfo.ExecuteCallback()
System.Threading.CancellationTokenSource.CancellationCallbackCoreWork(System.Threading.CancellationCallbackCoreWorkArguments)
System.Threading.CancellationTokenSource.ExecuteCallbackHandlers(Boolean)
System.Threading.CancellationTokenSource.NotifyCancellation(Boolean)
System.Threading.CancellationTokenSource.Cancel()
FooCo.TaskHelper.GetAnyTaskResult[[System.__Canon, mscorlib]](System.__Canon, System.Collections.Generic.IList`1<System.Threading.Tasks.Task`1<System.__Canon>>, Int32, System.Action, System.Action`1<System.Func`1<System.String>>)
FooCo.TaskHelper.GetAnyTaskResult[[System.__Canon, mscorlib]](System.__Canon, System.Collections.Generic.IList`1<System.Threading.Tasks.Task`1<System.__Canon>>, Int32, System.Action, System.Action`1<System.Func`1<System.String>>)
FooCo.CustomLib.Web.WebMethodsServers.FindServer(System.String, System.String)
FooCo.CustomLib.Web.WebMethodsServers.FindLoginInFederation(System.String, System.String, Boolean, System.Func`3<FooCo.CustomLib.Web.WebServerInvoker,System.Threading.CancellationToken,System.Threading.Tasks.Task`1<FooCo.CustomLib.ObjectModel.LoginResu
FooCo.CustomLib.Web.WebMethodsServers.LoginGlobally(System.String, System.String, System.String, System.String, System.String)
FooCo.CustomLib.Web.WebMethods.Authenticate(FooCo.CustomLib.Web.RequestResponse, System.String, System.String, System.String, System.String, Boolean, Boolean)
System.Reflection.RuntimeMethodInfo.UnsafeInvokeInternal(System.Object, System.Object[], System.Object[])
System.Reflection.RuntimeMethodInfo.Invoke(System.Object, System.Reflection.BindingFlags, System.Reflection.Binder, System.Object[], System.Globalization.CultureInfo)
FooCo.CustomLib.Web.APIV1Base+<>c__DisplayClass18_0.<GetMethodResult>b__1()
FooCo.CustomLib.Web.APIV1Base.GetMethodResult(FooCo.WebServer.IHttpRequest, FooCo.WebServer.HttpResponse, FooCo.CustomLib.ObjectModel.Credentials, Newtonsoft.Json.Linq.JObject, System.Reflection.MethodInfo, System.Object, FooCo.CustomLib.IProfiler, S
FooCo.CustomLib.Web.APIV1.ProcessRequest(FooCo.WebServer.IHttpRequest, FooCo.CustomLib.Web.APIV1Base.Process(FooCo.WebServer.IHttpRequest, FooCo.WebServer.HttpResponse)
FooCo.CustomLib.CustomLibWebServerProcessConnection+<ProcessConnectionAsync>d__25.MoveNext()
FooCo.WebServer.HttpRequest+<Parse>d__61.MoveNext()
FooCo.WebServer.HttpRequest+<TryParseRequestInternal>d__83.MoveNext()
FooCo.WebServer.HttpRequest+<ReadAllHeaders>d__72.MoveNext()
FooCo.WebServer.ConnectionStream+<ReadAsync>d__36.MoveNext()
FooCo.Tasks.TaskExtension+<TryWaitAsync>d__5`1[[System.Int32, mscorlib]].MoveNext()
FooCo.Tasks.TaskExtension+<TryWaitAsync>d__0.MoveNext()
FooCo.Tasks.TaskExtension+<ExecuteTimedTaskAsync>d__25`1[[System.__Canon, mscorlib]].MoveNext()
System.Threading.ExecutionContext.RunInternal(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object, Boolean)
System.Threading.Tasks.Task.FinishContinuations()
System.Threading.Tasks.TaskFactory`1+FromAsyncTrimPromise`1[[System.Int32, mscorlib],[System.__Canon, mscorlib]].CompleteFromAsyncResult(System.IAsyncResult)
System.Net.LazyAsyncResult.Complete(IntPtr)
System.Threading.ExecutionContext.Run(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object)
System.Net.ContextAwareResult.Complete(IntPtr)
System.Net.LazyAsyncResult.ProtectedInvokeCallback(System.Object, IntPtr)
System.Net.Sockets.BaseOverlappedAsyncResult.CompletionPortCallback(UInt32, UInt32, System.Threading.NativeOverlapped*)
System.Threading._IOCompletionCallback.PerformIOCompletionCallback(UInt32, UInt32, System.Threading.NativeOverlapped*)
baal2000 commented 7 years ago

Just a quick glance at the code related to disposing of WinHttpResponseStream._requestHandle field: is disposing of _requestHandle field done consistently?

While volatile reassignment is dangerous, Dispose might not. The base SafeHandle looks like a thread safe method by design: msdn.microsoft.com

"... Dispose ... allows the resources to be freed. This might not happen immediately if other threads are using the same instance of the safe handle, but will happen as soon as that is no longer the case."

baal2000 commented 7 years ago

@davidsh FYI: We could not reproduce the deadlock in a test environment. We could not provide with the dump file either, as it contains production data. As a workaround, we refactored our code to reduce Task nesting and avoid sharing of cancellation tokens between nested tasks (with HttpClient.PostAsync being the inner-most Task). The workaround has not yet been tested in production.

baal2000 commented 7 years ago

On a side note:

@davidsh

Take another look at the infinite loop in CancellationTokenSource.WaitForCallbackToComplete(), please. The logic assumes a callback is fast in the best case scenario and never blocks indefinitely in the worst case. That is proven to be an incorrect assumption, even without being able to reproduce. Making the code to timeout or being able to detect the deadlock some other way would help uncovering potential silent deadlock failures.

            while (ExecutingCallback == callbackInfo)
            {
                sw.SpinOnce();  //spin as we assume callback execution is fast and that this situation is rare.
            }

Thanks!

caesar-chen commented 7 years ago

TLDR: This deadlock problem has been solved

First, the higher level representation of the problem:

Thread A: lock (_state.Lock) { if (!Interop.WinHttp.WinHttpReadData(…) {} -> complete read, immediately dispose the CancellationTokenRegistration() object -> state.DisposeCtrReadFromResponseStream(); -> wait for a callback to complete -> while (ExecutingCallback == callbackinfo) {spin()} }

Thread B: Executing the callback, need to acquire the lock to complete. The lock is held by thread A

private void CancelPendingResponseStreamReadOperation() { lock (_state.Lock) {…} }

Use WinDBG to look into the call stack of Thread A(~10s) and B(~6s) to prove the issue:

Thread A (~10s): Dispose the registration & Unregister the callback -> the target callback is the callback executing in Thread B; wait for it to complete.

0000002af5ffe440 00007ffb838f7950 System.Threading.SpinWait.SpinOnce() PARAMETERS: this () = 0x0000002af5ffe4a0 LOCALS:

0000002af5ffe480 00007ffb83929e09 System.Threading.CancellationTokenSource.WaitForCallbackToComplete(System.Threading.CancellationCallbackInfo) PARAMETERS: this () = **0x000001c3add73888 (The CancellationTokenSource)** callbackInfo () = **0x000001c3ade01cf0 (Callback info)** LOCALS: 0000002af5ffe4c0 00007ffb253d7a82 System.Net.Http.WinHttpRequestState.**DisposeCtrReadFromResponseStream()** [c:\src\corefx\src\System.Net.Http.WinHttpHandler\src\System\Net\Http\WinHttpRequestState.cs @ 192] PARAMETERS: this (0x0000002af5ffe550) = 0x000001c3add73db0 LOCALS: 0x0000002af5ffe518 = 0x000001c3ade01cf0 ^ | | triger callback | 0000002af5ffeb80 00007ffb253d7539 System.Net.Http.WinHttpResponseStream+c__DisplayClass17_0.b__0(System.Threading.Tasks.Task`1) [c:\src\corefx\src\System.Net.Http.WinHttpHandler\src\System\Net\Http\WinHttpResponseStream.cs @ 156] PARAMETERS: this (0x0000002af5ffec90) = 0x000001c3ade01b50 previousTask (0x0000002af5ffec98) = 0x000001c3ade01bf8 LOCALS: 0x0000002af5ffec6c = 0x0000000000000000 0x0000002af5ffec68 = 0x0000000000000000 0x0000002af5ffec64 = 0x0000000000000146 0x0000002af5ffec60 = 0x0000000000000146 0x0000002af5ffec5c = 0x0000000000000000 0x0000002af5ffec10 = **0x000001c3add73ea8 (acquire the lock)** 0x0000002af5ffec50 = 0x0000000000000001 0x0000002af5ffec4c = 0x0000000000000000 ### Analysis of Thread A: 0:010> !DumpObj /d **000001c3add73888** Name: System.Threading.CancellationTokenSource MethodTable: 00007ffb83bae2a0 EEClass: 00007ffb833eef20 Size: 64(0x40) bytes File: c:\src\corefx\bin\tests\Windows_NT.AnyCPU.Debug\System.Net.Http.Functional.Tests\netcoreapp1.0\System.Private.CoreLib.ni.dll Fields: MT Field Offset Type VT Attr Value Name 00007ffb83bb0658 4000f8b 8 ....ManualResetEvent 0 instance 0000000000000000 m_kernelEvent 00007ffb83b8dd88 4000f8c 10 ...ivate.CoreLib]][] 0 instance 000001c3ade01d20 m_registeredCallbacksLists 00007ffb83bb1bd0 4000f91 28 System.Int32 1 instance 2 m_state 00007ffb83bb1bd0 4000f92 2c System.Int32 1 instance 6 m_threadIDExecutingCallbacks 00007ffb83b9b8a0 4000f93 30 System.Boolean 1 instance 0 m_disposed 00007ffb83bb14a8 4000f94 18 ...ationCallbackInfo 0 instance **000001c3ade01cf0 m_executingCallback** 00007ffb83babb60 4000f95 20 ...m.Threading.Timer 0 instance 0000000000000000 m_timer 00007ffb83bae2a0 4000f88 780 ...lationTokenSource 0 shared static _staticSource_Set >> Domain:Value 000001c3abc73ea0:NotInit << 00007ffb83bae2a0 4000f89 788 ...lationTokenSource 0 shared static _staticSource_NotCancelable >> Domain:Value 000001c3abc73ea0:NotInit << 00007ffb83bb1bd0 4000f8a 818 System.Int32 1 shared static s_nLists >> Domain:Value 000001c3abc73ea0:NotInit << 00007ffb83b56838 4000f96 790 ...ing.TimerCallback 0 shared static s_timerCallback >> Domain:Value 000001c3abc73ea0:NotInit << Besides, **this.ExecutingCallback == callbackInfo** will always true (Because Thread B cannot complete the callback, since it needs the **lock** to continue), and While (ExecutingCallback==callbackInfo) { sw.SpinOnce(); } Will hang. ### Thread B (~6s): executing a callback 0000002af5bfe890 00007ffb254798f2 System.Net.Http.WinHttpResponseStream.CancelPendingResponseStreamReadOperation() [c:\src\corefx\src\System.Net.Http.WinHttpHandler\src\System\Net\Http\WinHttpResponseStream.cs @ 257] PARAMETERS: this (0x0000002af5bfe980) = 0x000001c3addff5a0 LOCALS: 0x0000002af5bfe938 = **0x000001c3add73ea8 (need to get this lock to continue)** 0x0000002af5bfe958 = 0x0000000000000000 0x0000002af5bfe954 = 0x0000000000000000 0000002af5bfe980 00007ffb2547984e System.Net.Http.WinHttpResponseStream+c.b__17_1(System.Object) [c:\src\corefx\src\System.Net.Http.WinHttpHandler\src\System\Net\Http\WinHttpResponseStream.cs @ 176] PARAMETERS: this (0x0000002af5bfe9c0) = 0x000001c3ad955ea0 s (0x0000002af5bfe9c8) = 0x000001c3addff5a0 . . . 0000002af5bfea30 00007ffb83929b28 System.Threading.CancellationTokenSource.ExecuteCallbackHandlers(Boolean) PARAMETERS: this (0x0000002af5bfead0) = **0x000001c3add73888 (The CancellationTokenSource)** throwOnFirstException (0x0000002af5bfead8) = 0x0000000000000000 LOCALS: 0x0000002af5bfea70 = 0x0000000000000000 0x0000002af5bfea68 = 0x000001c3ade01d20 0x0000002af5bfea8c = 0x0000000000000002 0x0000002af5bfea60 = 0x000001c3ade01d78 0x0000002af5bfea88 = 0x0000000000000000 ### Analysis of Thread B: **000001c3add73ea8: is the lock** 0:006> !DumpObj /d 000001c3add73ea8 Name: System.Object MethodTable: 00007ffb83b99e68 EEClass: 00007ffb833e60d8 Size: 24(0x18) bytes File: c:\src\corefx\bin\tests\Windows_NT.AnyCPU.Debug\System.Net.Http.Functional.Tests\netcoreapp1.0\System.Private.CoreLib.ni.dll Object Fields: None ## Explanation: Look into the new code, this block https://github.com/dotnet/corefx/blob/master/src/System.Net.Http.WinHttpHandler/src/System/Net/Http/WinHttpResponseStream.cs#L229 is still wrapped with the lock. But when read complete, we don't dispose the CancellationTokenRegistration() object right away: https://github.com/dotnet/corefx/commit/48214d726de9e1a150fe5b041832c41655b69815#diff-360813f86b82b20371773665ca498ac4L152 Instead, we dispose it at the very late finally {} block, which has no lock around it, causes no trouble. Old approach: cts = token.Register(), dispose immediately when read completes https://github.com/dotnet/corefx/commit/48214d726de9e1a150fe5b041832c41655b69815#diff-3c168e775481d5df21012e1f7377b077L175 Now: Dispose at the very end with no critical section and no race condition https://github.com/dotnet/corefx/commit/48214d726de9e1a150fe5b041832c41655b69815#diff-3c168e775481d5df21012e1f7377b077R238
karelz commented 7 years ago

@schen381 I am the TL;DR crowd: So how was it solved? Did PRs in the meantime (the 2 you referenced?) fix the problem?

caesar-chen commented 7 years ago

@karelz Yes, this PR: https://github.com/dotnet/corefx/pull/13631 solves the problem. This is the specific commit https://github.com/dotnet/corefx/pull/13631/commits/48214d726de9e1a150fe5b041832c41655b69815 that helps.

karelz commented 7 years ago

@schen381 thanks! That's the information I was looking for -- which change fixed it as a side-effect :).

karelz commented 7 years ago

cc @stephentoub

cjmairair commented 7 years ago

I didn't read everything in detail, but I have a similar problem. I don't know if it's the same. And it boils down to me thinking this:

WaitForCallbackToComplete inside CancellationTokenRegistration is just not right.

You don't just do a wait secretly, without the API user knowing about it. And this is the wrong place to wait.

Just this simple code will cause a deadlock.

 var cancelReg = cancelToken.Register(() =>
                    {
                        // maybe dispose stuff...
                        cancelReg.Dispose();
                    });
`

Because you are disposing the CancellationTokenRegistration thing inside the callback. But this is a normal thing to do.

This just shouldn't happen. cancelReg.Dispose just needs to detach the callback even if it is running.

karelz commented 7 years ago

We believe it is fixed in .NET Core 2.0. If you have similar problem on .NET Core 2.0, please file a new issue. We generally don't monitor old ones ...