dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License
14.95k stars 4.65k forks source link

Process crash when System.IO.Pipes.PipeCompletionSource.Cancel() throws ObjectDisposedExeption #39902

Open AArnott opened 4 years ago

AArnott commented 4 years ago

Description

In switching our process from .NET Framework to .NET Core, it (nearly?) always crashes on shutdown. The crashing exception is shown below, which reveals what appears to be an attempt to cancel async I/O. The CancellationToken registration callback calls CancelIoEx after checking that the handle is valid. CancelIoEx then throws because the handle is disposed. Perhaps a race condition between completion of the async I/O and its cancellation?

https://github.com/dotnet/runtime/blob/5d1af65dc66d289d64e54814a4d5e91412b75fe4/src/libraries/System.IO.Pipes/src/System/IO/Pipes/PipeCompletionSource.cs#L145

Instead of a crash, I expect canceling async I/O to not throw ObjectDisposedException and instead quietly cancel or complete.

The exception thrown becomes a crash because I'm calling CancellationTokenSource.Cancel() on a threadpool thread, such that when one of the cancellation callbacks throws, it propagates through my frame and to the threadpool itself.

Configuration

The dump came from .NET Core runtime 5.0.0-preview.7.20364.11 @Commit: 53976d38b1bd6917b8fa4d1dd4f009728ece3adb x64 process.

Regression?

Yes. The .NET Framework version of this process never crashed in this way as far as we've noticed.

Other information

dump file

0:014> !pe
Exception object: 00000215232d5fe8
Exception type:   System.AggregateException
Message:          One or more errors occurred.
InnerException:   System.AggregateException, Use !PrintException 00000215232BC008 to see more.
StackTrace (generated):
    SP               IP               Function
    000000586987EE20 00007FFF28C91573 System_IO_Pipelines!System.IO.Pipelines.PipeCompletionCallbacks.Execute()+0x133
    000000586987EE80 00007FFF28C90EDF System_IO_Pipelines!System.IO.Pipelines.Pipe+<>c.<.cctor>b__90_2(System.Object)+0x2f
    000000586987EEB0 00007FFF2871DEF4 System_IO_Pipelines!System.IO.Pipelines.ThreadPoolScheduler+<>c.<Schedule>b__0_0(System.Object)+0x74
    000000586987EF00 00007FFF8500887A System_Private_CoreLib!System.Threading.QueueUserWorkItemCallbackDefaultContext.Execute()+0x1a
    000000586987EF30 00007FFF84F17420 System_Private_CoreLib!System.Threading.ThreadPoolWorkQueue.Dispatch()+0x1d0
    000000586987EFE0 00007FFF84F1723B System_Private_CoreLib!System.Threading._ThreadPoolWaitCallback.PerformWaitCallback()+0xb

StackTraceString: <none>
HResult: 80131500
0:014> !PrintException /d 00000215232BC008
Exception object: 00000215232bc008
Exception type:   System.AggregateException
Message:          One or more errors occurred.
InnerException:   System.ObjectDisposedException, Use !PrintException 00000215232BBB20 to see more.
StackTrace (generated):
    SP               IP               Function
    000000586987EC90 00007FFF84EF2529 System_Private_CoreLib!System.Threading.CancellationTokenSource.ExecuteCallbackHandlers(Boolean)+0x2b9
    000000586987ED20 00007FFF84EF2247 System_Private_CoreLib!System.Threading.CancellationTokenSource.NotifyCancellation(Boolean)+0x47
    000000586987ED60 00007FFF84EF21F3 System_Private_CoreLib!System.Threading.CancellationTokenSource.Cancel()+0x13
    000000586987ED90 00007FFF28C917C3 Nerdbank_Streams!Nerdbank.Streams.PipeExtensions+<>c.<UsePipeReader>b__19_0(System.Exception, System.Object)+0x33
    000000586987EDC0 00007FFF28C91636 System_IO_Pipelines!System.IO.Pipelines.PipeCompletionCallbacks.Execute(System.IO.Pipelines.PipeCompletionCallback, System.Collections.Generic.List`1<System.Exception> ByRef)+0x56

StackTraceString: <none>
HResult: 80131500
0:014> !PrintException /d 00000215232BBB20
Exception object: 00000215232bbb20
Exception type:   System.ObjectDisposedException
Message:          Safe handle has been closed.
InnerException:   <none>
StackTrace (generated):
    SP               IP               Function
    000000586987EA80 00007FFF84EDF171 System_Private_CoreLib!System.Runtime.InteropServices.SafeHandle.DangerousAddRef(Boolean ByRef)+0xa1
    000000586987EAC0 00007FFF84EDF0A6 System_Private_CoreLib!System.StubHelpers.StubHelpers.SafeHandleAddRef(System.Runtime.InteropServices.SafeHandle, Boolean ByRef)+0x16
    000000586987EAF0 0000000000000000 System_IO_Pipes!Interop+Kernel32.CancelIoEx(System.Runtime.InteropServices.SafeHandle, System.Threading.NativeOverlapped*)+0x1
    000000586987EBC0 00007FFF28C91F47 System_IO_Pipes!System.IO.Pipes.PipeCompletionSource`1[[System.Int32, System.Private.CoreLib]].Cancel()+0x57
    000000586987EC10 00007FFF28C9195F System_IO_Pipes!System.IO.Pipes.PipeCompletionSource`1+<>c[[System.Int32, System.Private.CoreLib]].<RegisterForCancellation>b__14_0(System.Object)+0x2f
    000000586987EC40 00007FFF850013EA System_Private_CoreLib!System.Threading.CancellationTokenSource+CallbackNode.ExecuteCallback()+0x7a
    000000586987EC90 00007FFF84EF24B9 System_Private_CoreLib!System.Threading.CancellationTokenSource.ExecuteCallbackHandlers(Boolean)+0x249

StackTraceString: <none>
HResult: 80131622
danmoseley commented 4 years ago

Pipes is part of IO. Pipelines has its own label.

@AArnott if there's a way to set up a repro that would probably make it more likely we can investigate this soon.

AArnott commented 4 years ago

These folks have the repro: @genlu @RyanToth3 @tmat Can one of you share it?

genlu commented 4 years ago

@RyanToth3 I think you have a simpler repro with ServiceHub tests? Could you please share that? Otherwise, following this instruction would repro it. (basically this runs Roslyn OOP on .NetCore servicehub)

  1. Go here https://devdiv.visualstudio.com/DevDiv/_packaging?_a=feed&feed=VSIDE-RealSigned-Release%40Local, and search for “microsoft.servicehub.host.clr”.

  2. Download the host nuget package. The newest version of the package should be fine as long as you’re using a recent-ish build of VS.

  3. Unzip the package and under the tools directory, copy the coreClr directory to your VS install under Common7/ServiceHub/Hosts.

  4. Open Common7\ServiceHub\Hosts\coreClr\ServiceHub.Host.CLR.runtimeconfig.json, and change it to targeting 3.1

    {
    "runtimeOptions": {
    "tfm": "netcoreapp3.1",
    "framework": {
      "name": "Microsoft.NETCore.App",
      "version": "3.1.0"
    }
    }
    }
  5. Install this Roslyn vsix https://microsoft-my.sharepoint.com/:u:/p/gel/Ea411yRcX31Pq3P331UR2L0B28X3TIbHKxt1WW7Ywc3VNw?e=oDXjqW

  6. Set these environmental variables to control the behavior of Roslyn remote service:

    • RoslynServiceHubCore : set to no empty to use .net core servicehub
    • RoslynServiceHubCoreDebug : set to no empty to launch debugger and break before we start initializing remote service. This would help you to observe the error under debugger.
  7. Launch VS, open a C# project and wait a bit for servicehub to spin-up.

RyanToth3 commented 4 years ago

Here's an app that repros the issue: https://microsoft-my.sharepoint.com/:u:/p/rytoth/EfzfFjfQd8FLu2_dUrdoCpUBxc0LQUV8Raij9OsIgJeyuw

Just run "dotnet ServiceHub.Sample.NugetClientApp.dll" nd you should see the error printed to the console.

ericstj commented 4 years ago

Perhaps a race condition between completion of the async I/O and its cancellation?

That doesn't agree with the frequency you are seeing this.

I see the code you pointed out only checks if the handle IsInvalid not closed. I see that the ReleaseHandle method doesn't actually clear the handle: https://github.com/dotnet/runtime/blob/6072e4d3a7a2a1493f514cdf4be75a3d56580e84/src/libraries/System.IO.Pipes/src/Microsoft/Win32/SafeHandles/SafePipeHandle.Windows.cs#L16

If I had to guess this isn't a race with completion, it's just that you are calling cancel on pipe that was disposed. I haven't dug into the dumps or repro yet, but that's my hunch.

carlossanlop commented 4 years ago

@AArnott did you get a chance to check if the pipe was being disposed before calling cancel as Eric suggested?

AArnott commented 4 years ago

@carlossanlop I didn't interpret @ericstj's comment as a suggestion toward me as much as speculation as to the conditions leading to the failure for consideration in a fix. It's quite possible that we close a pipe and then cancel a token that was used in I/O that was pending on that pipe. It's likely that races exist where the two steps happen concurrently, even if in this rather deterministic repro it's consistently ordered as dispose-then-cancel. Investigating the order this happens on our side would take a few hours at least. So if either of you two confirm it would help in your investigation and fix, we can pay that cost. But even if a race isn't responsible for this crash, the code does seem to be vulnerable to the race condition as well as simply sequentially executed dispose-and-cancel. I hope both can be fixed, if I'm right.