dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License

Test suite hangs after upgrading to 2.1.0-rc1 #26204

Closed (ppekrol closed this issue 4 years ago)

ppekrol commented 6 years ago

Hi all,

After upgrading our repository from 2.1.0-preview2-26406-04 to 2.1.0-rc1 (and later to 2.1.0-rtm-26516-03), we started to experience test suite hangs.

This is the result of dotnet --info on one of our CI (Jenkins) agents:

.NET Core SDK (reflecting any global.json):
 Version:   2.1.300-rc1-008673
 Commit:    f5e3ddbe73

Runtime Environment:
 OS Name:     Windows
 OS Version:  6.3.9600
 OS Platform: Windows
 RID:         win81-x64
 Base Path:   C:\Program Files\dotnet\sdk\2.1.300-rc1-008673\

Host (useful for support):
  Version: 2.1.0-rtm-26516-03
  Commit:  caa7b7e2ba

.NET Core SDKs installed:
  1.0.0-preview1-002702 [C:\Program Files\dotnet\sdk]
  1.0.0-preview2-003121 [C:\Program Files\dotnet\sdk]
  1.0.0-preview2-003131 [C:\Program Files\dotnet\sdk]
  1.0.0-preview2-1-003177 [C:\Program Files\dotnet\sdk]
  1.0.0 [C:\Program Files\dotnet\sdk]
  1.0.1 [C:\Program Files\dotnet\sdk]
  1.0.4 [C:\Program Files\dotnet\sdk]
  2.0.0 [C:\Program Files\dotnet\sdk]
  2.0.2 [C:\Program Files\dotnet\sdk]
  2.0.3 [C:\Program Files\dotnet\sdk]
  2.1.3 [C:\Program Files\dotnet\sdk]
  2.1.4 [C:\Program Files\dotnet\sdk]
  2.1.100 [C:\Program Files\dotnet\sdk]
  2.1.101 [C:\Program Files\dotnet\sdk]
  2.1.104 [C:\Program Files\dotnet\sdk]
  2.1.105 [C:\Program Files\dotnet\sdk]
  2.1.200 [C:\Program Files\dotnet\sdk]
  2.1.300-preview2-008533 [C:\Program Files\dotnet\sdk]
  2.1.300-rc1-008673 [C:\Program Files\dotnet\sdk]

.NET Core runtimes installed:
  Microsoft.AspNetCore.All 2.1.0-preview2-final [C:\Program Files\dotnet\shared\Microsoft.AspNetCore.All]
  Microsoft.AspNetCore.All 2.1.0-rc1-final [C:\Program Files\dotnet\shared\Microsoft.AspNetCore.All]
  Microsoft.AspNetCore.App 2.1.0-preview2-final [C:\Program Files\dotnet\shared\Microsoft.AspNetCore.App]
  Microsoft.AspNetCore.App 2.1.0-rc1-final [C:\Program Files\dotnet\shared\Microsoft.AspNetCore.App]
  Microsoft.NETCore.App 1.0.0-rc2-3002702 [C:\Program Files\dotnet\shared\Microsoft.NETCore.App]
  Microsoft.NETCore.App 1.0.0 [C:\Program Files\dotnet\shared\Microsoft.NETCore.App]
  Microsoft.NETCore.App 1.0.1 [C:\Program Files\dotnet\shared\Microsoft.NETCore.App]
  Microsoft.NETCore.App 1.0.4 [C:\Program Files\dotnet\shared\Microsoft.NETCore.App]
  Microsoft.NETCore.App 1.0.5 [C:\Program Files\dotnet\shared\Microsoft.NETCore.App]
  Microsoft.NETCore.App 1.1.0 [C:\Program Files\dotnet\shared\Microsoft.NETCore.App]
  Microsoft.NETCore.App 1.1.1 [C:\Program Files\dotnet\shared\Microsoft.NETCore.App]
  Microsoft.NETCore.App 1.1.2 [C:\Program Files\dotnet\shared\Microsoft.NETCore.App]
  Microsoft.NETCore.App 2.0.0 [C:\Program Files\dotnet\shared\Microsoft.NETCore.App]
  Microsoft.NETCore.App 2.0.3 [C:\Program Files\dotnet\shared\Microsoft.NETCore.App]
  Microsoft.NETCore.App 2.0.4 [C:\Program Files\dotnet\shared\Microsoft.NETCore.App]
  Microsoft.NETCore.App 2.0.5 [C:\Program Files\dotnet\shared\Microsoft.NETCore.App]
  Microsoft.NETCore.App 2.0.6 [C:\Program Files\dotnet\shared\Microsoft.NETCore.App]
  Microsoft.NETCore.App 2.0.7 [C:\Program Files\dotnet\shared\Microsoft.NETCore.App]
  Microsoft.NETCore.App 2.1.0-preview2-26406-04 [C:\Program Files\dotnet\shared\Microsoft.NETCore.App]
  Microsoft.NETCore.App 2.1.0-rc1 [C:\Program Files\dotnet\shared\Microsoft.NETCore.App]
  Microsoft.NETCore.App 2.1.0-rtm-26515-03 [C:\Program Files\dotnet\shared\Microsoft.NETCore.App]
  Microsoft.NETCore.App 2.1.0-rtm-26516-03 [C:\Program Files\dotnet\shared\Microsoft.NETCore.App]

To install additional .NET Core runtimes or SDKs:
  https://aka.ms/dotnet-download

Nothing in our code was changed apart from the runtime and package upgrade.

We are experiencing this only on Windows machines (WS2012R2, WS2016, W10), and it happens most often on the CI machines, which are less powerful (Hyper-V hosted) than the development ones.

It is worth noting that the test suite passes (does not hang) on Linux (Ubuntu 16.04), so this might be related to the Windows implementation.

The hang is observed in https://github.com/ppekrol/ravendb/blob/f82e6e883311765082c1a5bc102298636aab0b53/src/Raven.Client/Util/AsyncHelpers.cs#L176, which we find baffling, since this code has not changed in a while and we do not see how the state where _done == false, _items.IsEmpty == true, and _workItemsWaiting is not set could be reached.
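
For anyone not familiar with that file: it implements the usual "run an async method synchronously" message-loop pattern. A minimal sketch of the relevant pieces (the field names are taken from the linked AsyncHelpers.cs; everything else is simplified and illustrative, not the exact source):

    using System;
    using System.Collections.Concurrent;
    using System.Threading;
    using System.Threading.Tasks;

    internal static class AsyncHelpersSketch
    {
        public static void RunSync(Func<Task> task)
        {
            var oldContext = SynchronizationContext.Current;
            var synch = new ExclusiveSynchronizationContext();
            SynchronizationContext.SetSynchronizationContext(synch);
            synch.Post(async _ =>
            {
                try { await task(); }
                finally { synch.EndMessageLoop(); }
            }, null);
            synch.BeginMessageLoop();   // the stack above shows this call hanging inside BeginMessageLoop
            SynchronizationContext.SetSynchronizationContext(oldContext);
        }

        private sealed class ExclusiveSynchronizationContext : SynchronizationContext
        {
            private readonly ConcurrentQueue<Tuple<SendOrPostCallback, object>> _items =
                new ConcurrentQueue<Tuple<SendOrPostCallback, object>>();
            private readonly AutoResetEvent _workItemsWaiting = new AutoResetEvent(false);
            private bool _done;

            public override void Post(SendOrPostCallback d, object state)
            {
                _items.Enqueue(Tuple.Create(d, state));
                _workItemsWaiting.Set();
            }

            public void EndMessageLoop() => Post(_ => _done = true, null);

            public void BeginMessageLoop()
            {
                while (!_done)
                {
                    if (_items.TryDequeue(out var item))
                    {
                        item.Item1(item.Item2);
                    }
                    else
                    {
                        // If the awaited task never completes, nothing ever calls Post again,
                        // so the loop parks here with _done == false, _items empty and
                        // _workItemsWaiting unset -- the state we are observing.
                        _workItemsWaiting.WaitOne();
                    }
                }
            }
        }
    }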

Were there any changes between 2.1.0-preview2-26406-04 and 2.1.0-rc1 that could cause that?

Steps to reproduce:

Unfortunately it hangs randomly, so it can take a couple of runs to reproduce (it happens more often on slower machines). The whole SlowTests suite takes ~30 min on an i7 machine with 16 GB RAM.

I've created a dump and uploaded it here: https://drive.google.com/file/d/1epTMzfVOQj2U6A95NMxyPIKcRsa2WOQx/view?usp=sharing

The stack that I'm talking about:

    [External Code] 
    [Managed to Native Transition]  
>   Raven.Client.dll!Raven.Client.Util.AsyncHelpers.ExclusiveSynchronizationContext.BeginMessageLoop() Line 176 C#
    Raven.Client.dll!Raven.Client.Util.AsyncHelpers.RunSync(System.Func<System.Threading.Tasks.Task> task) Line 63  C#
    Raven.Client.dll!Raven.Client.Documents.Session.DocumentSession.SaveChanges() Line 90   C#
    SlowTests.dll!SlowTests.SlowTests.Issues.RavenDB_1280.Referenced_Docs_Are_Indexed_During_Heavy_Writing.AnonymousMethod__0(int i) Line 29    C#
    System.Threading.Tasks.Parallel.dll!System.Threading.Tasks.Parallel.ForWorker.AnonymousMethod__1(ref System.Threading.Tasks.RangeWorker currentWorker, int timeout, out bool replicationDelegateYieldedBeforeCompletion)    Unknown
    System.Threading.Tasks.Parallel.dll!System.Threading.Tasks.TaskReplicator.Replica<System.Threading.Tasks.RangeWorker>.ExecuteAction(out bool yieldedBeforeCompletion)   Unknown
    System.Threading.Tasks.Parallel.dll!System.Threading.Tasks.TaskReplicator.Replica.Execute() Unknown
    System.Threading.Tasks.Parallel.dll!System.Threading.Tasks.TaskReplicator.Replica..ctor.AnonymousMethod__4_0(object s)  Unknown
    System.Private.CoreLib.dll!System.Threading.ExecutionContext.RunInternal(System.Threading.ExecutionContext executionContext, System.Threading.ContextCallback callback, object state)   Unknown
    System.Private.CoreLib.dll!System.Threading.Tasks.Task.ExecuteWithThreadLocal(ref System.Threading.Tasks.Task currentTaskSlot)  Unknown
    System.Private.CoreLib.dll!System.Threading.Tasks.ThreadPoolTaskScheduler.TryExecuteTaskInline(System.Threading.Tasks.Task task, bool taskWasPreviouslyQueued)  Unknown
    System.Private.CoreLib.dll!System.Threading.Tasks.TaskScheduler.TryRunInline(System.Threading.Tasks.Task task, bool taskWasPreviouslyQueued)    Unknown
    System.Private.CoreLib.dll!System.Threading.Tasks.Task.InternalRunSynchronously(System.Threading.Tasks.TaskScheduler scheduler, bool waitForCompletion) Unknown
    System.Threading.Tasks.Parallel.dll!System.Threading.Tasks.TaskReplicator.Run<System.Threading.Tasks.RangeWorker>(System.Threading.Tasks.TaskReplicator.ReplicatableUserAction<System.Threading.Tasks.RangeWorker> action, System.Threading.Tasks.ParallelOptions options, bool stopOnFirstFailure) Unknown
    System.Threading.Tasks.Parallel.dll!System.Threading.Tasks.Parallel.ForWorker<object>(int fromInclusive, int toExclusive, System.Threading.Tasks.ParallelOptions parallelOptions, System.Action<int> body, System.Action<int, System.Threading.Tasks.ParallelLoopState> bodyWithState, System.Func<int, System.Threading.Tasks.ParallelLoopState, object, object> bodyWithLocal, System.Func<object> localInit, System.Action<object> localFinally) Unknown
    System.Threading.Tasks.Parallel.dll!System.Threading.Tasks.Parallel.For(int fromInclusive, int toExclusive, System.Threading.Tasks.ParallelOptions parallelOptions, System.Action<int> body)    Unknown
    SlowTests.dll!SlowTests.SlowTests.Issues.RavenDB_1280.Referenced_Docs_Are_Indexed_During_Heavy_Writing() Line 41    C#
    [Native to Managed Transition]  
    [Managed to Native Transition]  
    xunit.execution.dotnet.dll!Xunit.Sdk.TestInvoker<System.__Canon>.CallTestMethod(object testClassInstance) Line 150  C#
    xunit.execution.dotnet.dll!Xunit.Sdk.TestInvoker<Xunit.Sdk.IXunitTestCase>.InvokeTestMethodAsync.AnonymousMethod__1() Line 257  C#
    xunit.execution.dotnet.dll!Xunit.Sdk.ExecutionTimer.AggregateAsync(System.Func<System.Threading.Tasks.Task> asyncAction) Line 48    C#
    xunit.core.dll!Xunit.Sdk.ExceptionAggregator.RunAsync(System.Func<System.Threading.Tasks.Task> code) Line 90    C#
    xunit.execution.dotnet.dll!Xunit.Sdk.TestInvoker<Xunit.Sdk.IXunitTestCase>.InvokeTestMethodAsync(object testClassInstance) Line 239 C#
    xunit.execution.dotnet.dll!Xunit.Sdk.TestInvoker<Xunit.Sdk.IXunitTestCase>.RunAsync.AnonymousMethod__47_0() Line 206    C#
    [External Code] 
    [Resuming Async Method] 
    System.Private.CoreLib.dll!System.Threading.ExecutionContext.RunInternal(System.Threading.ExecutionContext executionContext, System.Threading.ContextCallback callback, object state)   Unknown
    System.Private.CoreLib.dll!System.Runtime.CompilerServices.AsyncTaskMethodBuilder<decimal>.AsyncStateMachineBox<Xunit.Sdk.TestInvoker<Xunit.Sdk.IXunitTestCase>.<<RunAsync>b__47_0>d>.MoveNext()    Unknown
    System.Private.CoreLib.dll!System.Threading.Tasks.AwaitTaskContinuation.RunOrScheduleAction(System.Runtime.CompilerServices.IAsyncStateMachineBox box, bool allowInlining)  Unknown
    System.Private.CoreLib.dll!System.Threading.Tasks.Task.RunContinuations(object continuationObject)  Unknown
    System.Private.CoreLib.dll!System.Threading.Tasks.Task.ExecuteWithThreadLocal(ref System.Threading.Tasks.Task currentTaskSlot)  Unknown
    System.Private.CoreLib.dll!System.Threading.ThreadPoolWorkQueue.Dispatch()  Unknown
    [Native to Managed Transition]  
stephentoub commented 6 years ago

we do not see how the state where _done == false, _items.IsEmpty == true, and _workItemsWaiting is not set could be reached

Isn't that possible if you processed the last item in the queue, and the task from that item never completes?

ppekrol commented 6 years ago

@stephentoub True, but I would hope to see that in the Parallel Stacks then.

stephentoub commented 6 years ago

@stephentoub True, but I would hope to see that in the Parallel Stacks then.

Why? That would only show operations with active frames on threads... it's very easy to come up with cases where that's not the case.

ppekrol commented 6 years ago

In this particular case that would mean a sudden change of behavior between 2.1.0-preview2-26406-04 (and also 2.0 and 1.x, since the code is old) and 2.1.0-rc1.

stephentoub commented 6 years ago

Not necessarily. For example, a performance improvement or regression between the versions could expose a race condition in the code and cause a task that previously completed to no longer complete.

I suggest you gather more details about the hang, e.g. what task(s) were being awaited when this hangs.

ppekrol commented 6 years ago

I think it is clearly visible in the dump. The test that hung in this particular run was 'RavenDB_1280_ReOpen.Can_Index_With_Missing_LoadDocument_References', and it failed while saving the documents (HTTP POST) to the database (in parallel). We are using SyncSession there, which uses AsyncHelpers.RunSync. And this is what hangs.
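
The call path in the stack trace above boils down to this kind of sync-over-async bridge (a simplified sketch; the field name and lambda body are illustrative, not the exact RavenDB source):

    // DocumentSession.SaveChanges() blocks on the async implementation through
    // AsyncHelpers.RunSync, which is the BeginMessageLoop frame seen in the stack trace.
    public void SaveChanges()
    {
        AsyncHelpers.RunSync(() => _asyncSessionOperations.SaveChangesAsync());
    }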

Please note that this does not hang on Linux.

stephentoub commented 6 years ago

And this is what hangs.

Right, so I'm suggesting the problem isn't the SynchronizationContext, but rather that you're observing the SyncCtx in that state because its message loop is still waiting for an asynchronous operation to complete. I opened your dump in windbg with sos and found the relevant async state machine (or at least one such state machine... it might not be the exact same one):

0:054> !DumpObj /d 00000087bb2c7970
Name:        System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1+AsyncStateMachineBox`1[[System.Threading.Tasks.VoidTaskResult, System.Private.CoreLib],[Raven.Client.Util.AsyncHelpers+<>c__DisplayClass2_1+<<RunSync>b__0>d, Raven.Client]]
MethodTable: 00007ffd41842ab0
EEClass:     00007ffd4183d758
Size:        136(0x88) bytes
File:        C:\Program Files\dotnet\shared\Microsoft.NETCore.App\2.1.0-rtm-26516-03\System.Private.CoreLib.dll
Fields:
              MT    Field   Offset                 Type VT     Attr            Value Name
00007ffd9f501968  4000dbb       30         System.Int32  1 instance                0 m_taskId
00007ffd9f500ad8  4000dbc        8      System.Delegate  0 instance 0000000000000000 m_action
00007ffd9f500410  4000dbd       10        System.Object  0 instance 0000000000000000 m_stateObject
00007ffd9f509b58  4000dbe       18 ...sks.TaskScheduler  0 instance 0000000000000000 m_taskScheduler
00007ffd9f501968  4000dbf       34         System.Int32  1 instance         33555456 m_stateFlags
00007ffd9f500410  4000dcf       20        System.Object  0 instance 0000000000000000 m_continuationObject
00007ffd9f509f58  4000dd4       28 ...tingentProperties  0 instance 0000000000000000 m_contingentProperties
00007ffd9f501968  4000dba      8b4         System.Int32  1   shared           static s_taskIdCounter
                                 >> Domain:Value  00000086dcba4050:NotInit  <<
00007ffd9f500410  4000dd0      858        System.Object  0   shared           static s_taskCompletionSentinel
                                 >> Domain:Value  00000086dcba4050:NotInit  <<
00007ffd9f4ffac0  4000dd1      8b8       System.Boolean  1   shared           static s_asyncDebuggingEnabled
                                 >> Domain:Value  00000086dcba4050:NotInit  <<
00007ffd9eb1ef50  4000dd2      860 ...Private.CoreLib]]  0   shared           static s_currentActiveTasks
                                 >> Domain:Value  00000086dcba4050:NotInit  <<
00007ffd9f500410  4000dd3      868        System.Object  0   shared           static s_activeTasksLock
                                 >> Domain:Value  00000086dcba4050:NotInit  <<
00007ffd9f4dcc68  4000dd5      870 ...Private.CoreLib]]  0   shared           static s_taskCancelCallback
                                 >> Domain:Value  00000086dcba4050:NotInit  <<
00007ffd9f50a898  4000dd6      878 ...Tasks.TaskFactory  0   shared           static <Factory>k__BackingField
                                 >> Domain:Value  00000086dcba4050:NotInit  <<
00007ffd9f508250  4000dd7      880 ...eading.Tasks.Task  0   shared           static <CompletedTask>k__BackingField
                                 >> Domain:Value  00000086dcba4050:NotInit  <<
00007ffd9f4ddb68  4000dd8      888 ...Private.CoreLib]]  0   shared           static s_IsExceptionObservedByParentPredicate
                                 >> Domain:Value  00000086dcba4050:NotInit  <<
00007ffd9f4ca5e0  4000dd9      890 ...g.ContextCallback  0   shared           static s_ecCallback
                                 >> Domain:Value  00000086dcba4050:NotInit  <<
00007ffd9f4e30c8  4000dda      898 ...Private.CoreLib]]  0   shared           static s_IsTaskContinuationNullPredicate
                                 >> Domain:Value  00000086dcba4050:NotInit  <<
00007ffd9f508250  4000db8       38 ...eading.Tasks.Task  0   shared         TLstatic t_currentTask
    >> Thread:Value <<
00007ffd9f50c518  4000db9       40 ....Tasks.StackGuard  0   shared         TLstatic t_stackGuard
    >> Thread:Value <<
00007ffd9f4cb230  4000d86       38 ...ks.VoidTaskResult  1 instance 00000087bb2c79a8 m_result
00007ffd9f50b8b0  4000d87        0 ...Private.CoreLib]]  0   shared           static s_Factory
                                 >> Domain:Value  00000086dcba4050:NotInit  <<
00007ffd9f4fa470  4000d88        8 ...Private.CoreLib]]  0   shared           static TaskWhenAnyCast
                                 >> Domain:Value  00000086dcba4050:NotInit  <<
00007ffd9f4c4bd8  4001cda       40        System.Action  0 instance 0000000000000000 _moveNextAction
00007ffd40d6b8c0  4001cdb       50 ...+<<RunSync>b__0>d  1 instance 00000087bb2c79c0 StateMachine
00007ffd9f5007a0  4001cdc       48 ....ExecutionContext  0 instance 00000086de90a008 Context
00007ffd9f4ca5e0  4001cd9        8 ...g.ContextCallback  0   static  <no information>

Looking at the state machine:

0:054> !DumpVC /d 00007ffd40d6b8c0 00000087bb2c79c0
Name:        Raven.Client.Util.AsyncHelpers+<>c__DisplayClass2_1+<<RunSync>b__0>d
MethodTable: 00007ffd40d6b8c0
EEClass:     00007ffd40d5acc8
Size:        64(0x40) bytes
File:        C:\work\ravendb_ppekrol\test\SlowTests\bin\Release\netcoreapp2.1\Raven.Client.dll
Fields:
              MT    Field   Offset                 Type VT     Attr            Value Name
00007ffd9f501968  400114e        8         System.Int32  1 instance                0 <>1__state
00007ffd9f5141f8  400114f       10 ...VoidMethodBuilder  1 instance 00000087bb2c79d0 <>t__builder
00007ffd40d69680  4001150        0 ...__DisplayClass2_1  0 instance 00000087bb2c6ff0 <>4__this
00007ffd9f529e28  4001151       20 ...iguredTaskAwaiter  1 instance 00000087bb2c79e0 <>u__1

We can see that its <>1__state field is 0, which means we're currently suspended at the first await, waiting for that task to complete. Drilling into that, we can see it's waiting for a RequestExecutor.ExecuteAsync to complete:

0:054> !DumpVC /d 00007ffd9f529e28 00000087bb2c79e0
Name:        System.Runtime.CompilerServices.ConfiguredTaskAwaitable+ConfiguredTaskAwaiter
MethodTable: 00007ffd9f529e28
EEClass:     00007ffd9ebe7ee8
Size:        32(0x20) bytes
File:        C:\Program Files\dotnet\shared\Microsoft.NETCore.App\2.1.0-rtm-26516-03\System.Private.CoreLib.dll
Fields:
              MT    Field   Offset                 Type VT     Attr            Value Name
00007ffd9f508250  4001cee        0 ...eading.Tasks.Task  0 instance 00000087bb2c7818 m_task
00007ffd9f4ffac0  4001cef        8       System.Boolean  1 instance                0 m_continueOnCapturedContext
0:054> !DumpObj /d 00000087bb2c7818
Name:        System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1+AsyncStateMachineBox`1[[System.Threading.Tasks.VoidTaskResult, System.Private.CoreLib],[Raven.Client.Http.RequestExecutor+<ExecuteAsync>d__75`1[[Raven.Client.Json.BatchCommandResult, Raven.Client]], Raven.Client]]

Continuing to drill in, we see we're waiting on the preferredTask, which is an HttpClient request to http://127.0.0.1:57235/databases/Referenced_Docs_Are_Indexed_During_Heavy_Writing_2284/bulk_docs, and looking at the associated handler, it appears to have been custom-configured with a 12-hour timeout. So from what I see, this is simply an HTTP request that hasn't completed and has been given 12 hours to time out... I would look into your code to see why that is.
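
(For readers following the dumps above: the compiler-generated <>1__state field is -1 while the method is executing, 0 while it is suspended at the first await, 1 at the second, and so on. A purely illustrative example, not RavenDB code:)

    using System.Net.Http;
    using System.Threading.Tasks;

    internal static class StateNumberingExample
    {
        // While this method is suspended at the first await, its boxed state machine's
        // <>1__state field is 0; at the second await it would be 1. The exact numbering
        // is a C# compiler implementation detail, but it is stable enough to read dumps by.
        public static async Task SendTwoRequestsAsync(HttpClient client)
        {
            await client.GetAsync("http://example.com/first").ConfigureAwait(false);   // <>1__state == 0 while awaiting
            await client.GetAsync("http://example.com/second").ConfigureAwait(false);  // <>1__state == 1 while awaiting
        }
    }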

As an aside, I also noticed you're using WinHttpHandler under the HttpClientHandler. Since SocketsHttpHandler is the default in 2.1, that means you've explicitly opted in to using WinHttpHandler? It looks like you did so in your AppContext settings. I'd be interested in understanding why that is.
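
(For reference, the opt-out typically looks like the following; where exactly it is set in the RavenDB code base isn't shown here, so treat the snippet as illustrative:)

    using System;
    using System.Net.Http;

    internal static class HandlerSelection
    {
        public static HttpClient CreateLegacyHandlerClient()
        {
            // Setting this switch to false before the first HttpClientHandler is created
            // makes HttpClientHandler use the pre-2.1 Windows handler (WinHttpHandler-based)
            // instead of the new SocketsHttpHandler default in .NET Core 2.1.
            AppContext.SetSwitch("System.Net.Http.UseSocketsHttpHandler", false);
            return new HttpClient(new HttpClientHandler());
        }
    }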

ayende commented 6 years ago

Thank you for the detailed investigation.

We are using WinHttpHandler because of this issue: https://github.com/dotnet/corefx/issues/29453. It has been fixed, but we haven't yet rolled back the workaround because the RC bits still contain the issue.

The problem is that this hangs mysteriously. We have also seen it in some other cases with a similar stack trace. It seems that in certain cases HTTP requests just never end, or wait for their full timeout. We are pretty sure that on the server side we are serving the request properly and that it isn't lost.

ppekrol commented 6 years ago

@stephentoub We have created another dump (much simpler). You are correct that we are waiting indefinitely in AsyncHelpers for the task to finish, but we noticed the following stack in that dump:

    ntdll.dll!00007ffdce5e0a9a()    Unknown
    KERNELBASE.dll!00007ffdcba7121a()   Unknown
    [Managed to Native Transition]  
    System.Private.CoreLib.dll!System.Threading.SpinWait.SpinOnce(int sleep1Threshold)  Unknown
    System.Net.Sockets.dll!System.Net.Sockets.SocketAsyncEventArgs.CompleteCore.__CompleteCoreSpin|172_0() Line 1138    C#
    System.Net.Sockets.dll!System.Net.Sockets.SocketAsyncEventArgs.Complete() Line 448  C#
>   System.Net.Sockets.dll!System.Net.Sockets.Socket.SendAsync(System.Net.Sockets.SocketAsyncEventArgs e) Line 4187 C#
    Microsoft.AspNetCore.Server.Kestrel.Transport.Sockets.dll!Microsoft.AspNetCore.Server.Kestrel.Transport.Sockets.Internal.SocketSender.SendAsync(System.ReadOnlyMemory<byte> memory) Unknown
    Microsoft.AspNetCore.Server.Kestrel.Transport.Sockets.dll!Microsoft.AspNetCore.Server.Kestrel.Transport.Sockets.Internal.SocketSender.SendAsync(System.Buffers.ReadOnlySequence<byte> buffers)  Unknown
    Microsoft.AspNetCore.Server.Kestrel.Transport.Sockets.dll!Microsoft.AspNetCore.Server.Kestrel.Transport.Sockets.Internal.SocketConnection.ProcessSends()    Unknown
    System.Private.CoreLib.dll!System.Threading.ExecutionContext.RunInternal(System.Threading.ExecutionContext executionContext, System.Threading.ContextCallback callback, object state)   Unknown
    System.Private.CoreLib.dll!System.Runtime.CompilerServices.AsyncTaskMethodBuilder<System.Threading.Tasks.VoidTaskResult>.AsyncStateMachineBox<Microsoft.AspNetCore.Server.Kestrel.Transport.Sockets.Internal.SocketConnection.<ProcessSends>d__25>.MoveNext()   Unknown
    Microsoft.AspNetCore.Server.Kestrel.Transport.Sockets.dll!Microsoft.AspNetCore.Server.Kestrel.Transport.Sockets.Internal.IOQueue.DoWork()   Unknown
    System.Private.CoreLib.dll!System.Threading.ExecutionContext.RunInternal(System.Threading.ExecutionContext executionContext, System.Threading.ContextCallback callback, object state)   Unknown
    System.Private.CoreLib.dll!System.Threading.ThreadPoolWorkQueue.Dispatch()  Unknown
    [Native to Managed Transition]  
    kernel32.dll!00007ffdcdf913d2() Unknown
    ntdll.dll!00007ffdce5654f4()    Unknown

It looks like we are looping here https://github.com/dotnet/corefx/blob/4ca09c7662197c36acb4abb91959f8baf3c88cd4/src/System.Net.Sockets/src/System/Net/Sockets/SocketAsyncEventArgs.Windows.cs#L1138-L1141

because of a socket error caught here https://github.com/dotnet/corefx/blob/4ca09c7662197c36acb4abb91959f8baf3c88cd4/src/System.Net.Sockets/src/System/Net/Sockets/Socket.cs#L4221-L4230

The error is:

{System.ObjectDisposedException}
    Data: {System.Collections.ListDictionaryInternal}
    HResult: -2146232798
    HelpLink: null
    InnerException: null
    Message: "Cannot access a disposed object.\r\nObjectDisposed_ObjectName_Name"
    ObjectName: "System.Net.Sockets.Socket"
    Source: "System.Net.Sockets"
    StackTrace: "   at System.Net.Sockets.SafeCloseSocket.GetOrAllocateThreadPoolBoundHandle(Boolean trySkipCompletionPortOnSuccess) in E:\\A\\_work\\2\\s\\corefx\\src\\Common\\src\\System\\Net\\SafeCloseSocket.Windows.cs:line 41\r\n   at System.Net.Sockets.Socket.GetOrAllocateThreadPoolBoundHandleSlow() in E:\\A\\_work\\2\\s\\corefx\\src\\System.Net.Sockets\\src\\System\\Net\\Sockets\\Socket.Windows.cs:line 297\r\n   at System.Net.Sockets.SocketAsyncEventArgs.DoOperationSendSingleBuffer(SafeCloseSocket handle) in E:\\A\\_work\\2\\s\\corefx\\src\\System.Net.Sockets\\src\\System\\Net\\Sockets\\SocketAsyncEventArgs.Windows.cs:line 571\r\n   at System.Net.Sockets.Socket.SendAsync(SocketAsyncEventArgs e) in E:\\A\\_work\\2\\s\\corefx\\src\\System.Net.Sockets\\src\\System\\Net\\Sockets\\Socket.cs:line 4192"
    TargetSite: {System.Threading.ThreadPoolBoundHandle GetOrAllocateThreadPoolBoundHandle(Boolean)}

Pasting StackTrace for readability:

   at System.Net.Sockets.SafeCloseSocket.GetOrAllocateThreadPoolBoundHandle(Boolean trySkipCompletionPortOnSuccess) in E:\A\_work\2\s\corefx\src\Common\src\System\Net\SafeCloseSocket.Windows.cs:line 41
   at System.Net.Sockets.Socket.GetOrAllocateThreadPoolBoundHandleSlow() in E:\A\_work\2\s\corefx\src\System.Net.Sockets\src\System\Net\Sockets\Socket.Windows.cs:line 297
   at System.Net.Sockets.SocketAsyncEventArgs.DoOperationSendSingleBuffer(SafeCloseSocket handle) in E:\A\_work\2\s\corefx\src\System.Net.Sockets\src\System\Net\Sockets\SocketAsyncEventArgs.Windows.cs:line 571
   at System.Net.Sockets.Socket.SendAsync(SocketAsyncEventArgs e) in E:\A\_work\2\s\corefx\src\System.Net.Sockets\src\System\Net\Sockets\Socket.cs:line 4192
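
In other words, the shape of the problem seems to be roughly this (a heavily simplified sketch, not the actual corefx code):

    using System;
    using System.Threading;

    internal sealed class CompletionSpinSketch
    {
        // Set only once the async I/O has actually been pended with the OS.
        private volatile bool _operationPended;

        public void Send(bool socketAlreadyDisposed)
        {
            try
            {
                if (socketAlreadyDisposed)
                    throw new ObjectDisposedException("System.Net.Sockets.Socket");

                _operationPended = true;   // normal path: a completion callback will fire later
            }
            catch (ObjectDisposedException)
            {
                Complete();                // error path completes synchronously...
            }
        }

        private void Complete()
        {
            // ...but the completion path spin-waits for state that only the pended I/O
            // would have produced, so when the send failed up front the loop never
            // exits -- matching the SpinOnce frame in the stack above.
            var spinner = new SpinWait();
            while (!_operationPended)
                spinner.SpinOnce();
        }
    }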

Dump is available here: https://drive.google.com/file/d/1PSiKHq4MFzHrJICxhgOSbZdz15ZzQn-Q/view?usp=sharing

Could this be the cause? We decreased the HttpClient timeout from 12 h to 5 min, but we did not receive any timeouts.

ppekrol commented 6 years ago

Interesting: switching back to SocketsHttpHandler does not cause any hangs, so it must be related to the communication between WinHttpHandler and Kestrel.
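
(Switching back here means removing the AppContext override mentioned earlier, or pinning the handler explicitly, roughly:)

    using System.Net.Http;

    internal static class ExplicitSocketsHandler
    {
        // With the AppContext override removed, new HttpClientHandler() defaults to
        // SocketsHttpHandler on .NET Core 2.1; it can also be selected explicitly:
        public static HttpClient Create() => new HttpClient(new SocketsHttpHandler());
    }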

stephentoub commented 6 years ago

I've opened https://github.com/dotnet/corefx/issues/29785 for the sockets issue. That's a regression in 2.1, introduced in https://github.com/dotnet/corefx/pull/27728.

I'm not sure that's the whole issue, though. I know Kestrel had some issues related to request shutdown; it's possible you're hitting one of those, too, and something about the WinHttpHandler timing/patterns/etc. is triggering it. @halter73, does this look familiar?

I'm going to close this issue, though, as the sockets issue is now tracked separately, and if there's another issue here it would seem to be on the ASP.NET side of things.

Thanks for raising it.