dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License
15.22k stars 4.72k forks source link

SafeHandle use-after-dispose in FileSystemWatcher on OSX #30056

Closed stephentoub closed 3 years ago

stephentoub commented 5 years ago

OSX.1013.Amd64.Open-x64:Debug https://mc.dot.net/#/user/dotnet-bot/pr~2Fdotnet~2Fcorefx~2Frefs~2Fpull~2F38953~2Fmerge/test~2Ffunctional~2Fcli~2Finnerloop~2F/20190626.70/workItem/System.IO.FileSystem.Watcher.Tests/wilogs

Unhandled exception. System.NullReferenceException: Object reference not set to an instance of an object.
   at Interop.RunLoop.CFRunLoopRun()
   at System.IO.FileSystemWatcher.RunningInstance.StaticWatcherRunLoopManager.WatchForFileSystemEventsThreadStart(Object args) in /_/src/System.IO.FileSystem.Watcher/src/System/IO/FileSystemWatcher.OSX.cs:line 247
   at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state) in /_/src/System.Private.CoreLib/shared/System/Threading/ExecutionContext.cs:line 201
Fatal error. Internal CLR error. (0x80131506)
   at Interop+RunLoop.CFRunLoopRun()
   at Interop+RunLoop.CFRunLoopRun()
   at System.IO.FileSystemWatcher+RunningInstance+StaticWatcherRunLoopManager.WatchForFileSystemEventsThreadStart(System.Object)
   at System.Threading.ExecutionContext.RunInternal(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object)

cc: @JeremyKuhne, @carlossanlop

Runfo Tracking Issue: system.io.filesystem.watcher.tests work item Build Definition Kind Run Name Console Core Dump Test Results Run Client
1082967 runtime PR 51078 net6.0-OSX-Debug-x64-CoreCLR_release-OSX.1014.Amd64.Open console.log core dump runclient.py
1082899 runtime PR 51099 net6.0-Linux-Release-arm-CoreCLR_checked-(Alpine.312.Arm32.Open)Ubuntu.1804.ArmArch.Open@mcr.microsoft.com/dotnet-buildtools/prereqs:alpine-3.12-helix-arm32v7-20200908125213-5bece88 console.log core dump
1082624 runtime PR 51090 net6.0-OSX-Debug-x64-CoreCLR_checked-OSX.1013.Amd64.Open console.log core dump runclient.py
1080937 runtime Rolling net5.0-Linux-Release-arm64-Mono_release-(Ubuntu.1804.ArmArch.Open)Ubuntu.1804.ArmArch.Open@mcr.microsoft.com/dotnet-buildtools/prereqs:ubuntu-16.04-helix-arm64v8-bfcd90a-20200127194925 console.log runclient.py
1072066 runtime PR 50364 net6.0-Linux-Release-arm-CoreCLR_checked-(Alpine.312.Arm32.Open)Ubuntu.1804.ArmArch.Open@mcr.microsoft.com/dotnet-buildtools/prereqs:alpine-3.12-helix-arm32v7-20200908125213-5bece88 console.log core dump runclient.py
1072066 runtime PR 50364 net6.0-Linux-Release-arm-CoreCLR_checked-(Ubuntu.1804.Arm32.Open)Ubuntu.1804.Armarch.Open@mcr.microsoft.com/dotnet-buildtools/prereqs:ubuntu-18.04-helix-arm32v7-bfcd90a-20200121150440 console.log core dump
1071508 runtime PR 50364 net6.0-Linux-Release-arm-CoreCLR_checked-(Alpine.312.Arm32.Open)Ubuntu.1804.ArmArch.Open@mcr.microsoft.com/dotnet-buildtools/prereqs:alpine-3.12-helix-arm32v7-20200908125213-5bece88 console.log core dump
1071508 runtime PR 50364 net6.0-Linux-Release-arm-CoreCLR_checked-(Ubuntu.1804.Arm32.Open)Ubuntu.1804.Armarch.Open@mcr.microsoft.com/dotnet-buildtools/prereqs:ubuntu-18.04-helix-arm32v7-bfcd90a-20200121150440 console.log core dump
1069555 runtime Rolling net6.0-windows-Release-x64-CoreCLR_checked-Windows.10.Amd64.Open console.log runclient.py
1069328 runtime Rolling net6.0-windows-Release-x86-CoreCLR_release-Windows.10.Amd64.ServerRS5.Open console.log runclient.py
1068131 runtime PR 50257 net6.0-OSX-Debug-x64-CoreCLR_checked-OSX.1013.Amd64.Open console.log core dump runclient.py
1066426 runtime PR 50364 net6.0-Linux-Release-arm-CoreCLR_checked-(Alpine.312.Arm32.Open)Ubuntu.1804.ArmArch.Open@mcr.microsoft.com/dotnet-buildtools/prereqs:alpine-3.12-helix-arm32v7-20200908125213-5bece88 console.log core dump
1066426 runtime PR 50364 net6.0-Linux-Release-arm-CoreCLR_checked-(Ubuntu.1804.Arm32.Open)Ubuntu.1804.Armarch.Open@mcr.microsoft.com/dotnet-buildtools/prereqs:ubuntu-18.04-helix-arm32v7-bfcd90a-20200121150440 console.log core dump
1066285 runtime PR 50507 net6.0-OSX-Debug-x64-CoreCLR_checked-OSX.1013.Amd64.Open console.log runclient.py
1065570 runtime PR 50483 net6.0-OSX-Debug-x64-CoreCLR_release-OSX.1014.Amd64.Open console.log core dump runclient.py
1065167 runtime Rolling net6.0-OSX-Release-x64-CoreCLR_release-OSX.1015.Amd64.Open console.log core dump runclient.py
1063683 runtime PR 48527 net6.0-OSX-Debug-x64-CoreCLR_release-OSX.1014.Amd64.Open console.log core dump runclient.py
1059736 runtime PR 50232 net6.0-OSX-Debug-x64-CoreCLR_checked-OSX.1013.Amd64.Open console.log core dump runclient.py
1059736 runtime PR 50232 net6.0-windows-Debug-x64-CoreCLR_checked-Windows.10.Amd64.Open console.log runclient.py
1059736 runtime PR 50232 net6.0-Linux-Debug-x64-CoreCLR_checked-Ubuntu.1804.Amd64.Open console.log core dump runclient.py
1059736 runtime PR 50232 net6.0-Linux-Debug-x64-CoreCLR_checked-(Alpine.312.Amd64.Open)ubuntu.1604.amd64.open@mcr.microsoft.com/dotnet-buildtools/prereqs:alpine-3.12-helix-20200602002622-e06dc59 console.log core dump runclient.py
1058168 runtime PR 50280 net6.0-OSX-Debug-x64-CoreCLR_release-OSX.1014.Amd64.Open console.log core dump runclient.py
1057844 runtime PR 50275 net6.0-OSX-Debug-x64-CoreCLR_release-OSX.1014.Amd64.Open console.log core dump runclient.py
1055598 runtime Rolling net6.0-OSX-Release-x64-CoreCLR_release-OSX.1014.Amd64.Open console.log core dump runclient.py
1050603 runtime PR 49990 net6.0-windows-Debug-x64-CoreCLR_checked-Windows.10.Amd64.Open console.log
1050314 runtime PR 48601 net6.0-windows-Debug-x64-CoreCLR_checked-Windows.10.Amd64.Open console.log runclient.py
1049560 runtime Rolling net6.0-OSX-Release-x64-CoreCLR_release-OSX.1014.Amd64.Open console.log core dump runclient.py
1047468 runtime PR 49545 net6.0-OSX-Debug-x64-CoreCLR_checked-OSX.1013.Amd64.Open console.log core dump runclient.py
1047249 runtime Rolling net6.0-OSX-Release-x64-CoreCLR_release-OSX.1013.Amd64.Open console.log core dump runclient.py
1046026 runtime PR 49740 net6.0-Linux-Debug-x64-mono_interpreter_release-Debian.9.Amd64.Open console.log core dump
1046026 runtime PR 49740 net6.0-Linux-Debug-x64-Mono_release-(Centos.8.Amd64.Open)Ubuntu.1604.Amd64.Open@mcr.microsoft.com/dotnet-buildtools/prereqs:centos-8-helix-20201229003624-c1bf759 console.log core dump
1046026 runtime PR 49740 net6.0-Linux-Debug-x64-Mono_release-RedHat.7.Amd64.Open console.log core dump runclient.py
1046026 runtime PR 49740 net6.0-Linux-Debug-x64-Mono_release-(Debian.10.Amd64.Open)ubuntu.1604.amd64.open@mcr.microsoft.com/dotnet-buildtools/prereqs:debian-10-helix-amd64-bfcd90a-20200121150006 console.log core dump
1046026 runtime PR 49740 net6.0-Linux-Debug-x64-Mono_release-Ubuntu.1604.Amd64.Open console.log core dump
1046026 runtime PR 49740 net6.0-Linux-Debug-x64-Mono_release-Ubuntu.1804.Amd64.Open console.log core dump
1046026 runtime PR 49740 net6.0-Linux-Debug-x64-Mono_release-SLES.15.Amd64.Open console.log core dump
1046026 runtime PR 49740 net6.0-Linux-Debug-x64-Mono_release-(Fedora.30.Amd64.Open)ubuntu.1604.amd64.open@mcr.microsoft.com/dotnet-buildtools/prereqs:fedora-30-helix-20200512010621-4f8cef7 console.log core dump
1046026 runtime PR 49740 net6.0-Linux-Debug-arm64-Mono_release-(Ubuntu.1804.ArmArch.Open)Ubuntu.1804.ArmArch.Open@mcr.microsoft.com/dotnet-buildtools/prereqs:ubuntu-16.04-helix-arm64v8-20210106155927-56c6673 console.log core dump
1046026 runtime PR 49740 net6.0-OSX-Debug-x64-Mono_release-OSX.1014.Amd64.Open console.log core dump
1046026 runtime PR 49740 net6.0-OSX-Debug-x64-Mono_release-OSX.1015.Amd64.Open console.log core dump
1045010 runtime PR 48369 net6.0-OSX-Debug-x64-CoreCLR_checked-OSX.1013.Amd64.Open console.log runclient.py
1044737 runtime PR 49373 net6.0-OSX-Debug-x64-CoreCLR_release-OSX.1015.Amd64.Open console.log core dump runclient.py
1044726 runtime PR 49532 net6.0-OSX-Debug-x64-CoreCLR_release-OSX.1014.Amd64.Open console.log core dump runclient.py
1043361 runtime PR 49747 net6.0-OSX-Debug-x64-CoreCLR_checked-OSX.1013.Amd64.Open console.log core dump runclient.py
1042998 runtime PR 49741 net6.0-OSX-Debug-x64-CoreCLR_release-OSX.1014.Amd64.Open console.log core dump runclient.py
1041823 runtime PR 49251 net6.0-windows-Release-x86-CoreCLR_checked-Windows.10.Amd64.Open console.log runclient.py
1039858 runtime PR 43706 net6.0-Linux-Release-arm-CoreCLR_checked-(Alpine.312.Arm32.Open)Ubuntu.1804.ArmArch.Open@mcr.microsoft.com/dotnet-buildtools/prereqs:alpine-3.12-helix-arm32v7-20200908125213-5bece88 console.log core dump
1039858 runtime PR 43706 net6.0-Linux-Release-arm-CoreCLR_checked-(Ubuntu.1804.Arm32.Open)Ubuntu.1804.Armarch.Open@mcr.microsoft.com/dotnet-buildtools/prereqs:ubuntu-18.04-helix-arm32v7-bfcd90a-20200121150440 console.log core dump
1039185 runtime Rolling net6.0-OSX-Release-x64-CoreCLR_release-OSX.1014.Amd64.Open console.log core dump runclient.py
1038808 runtime PR 49613 net6.0-OSX-Debug-x64-CoreCLR_release-OSX.1014.Amd64.Open console.log core dump runclient.py
1037652 runtime PR 49529 net6.0-OSX-Debug-x64-CoreCLR_checked-OSX.1013.Amd64.Open console.log core dump runclient.py
1037058 runtime PR 41097 net6.0-OSX-Debug-x64-CoreCLR_release-OSX.1015.Amd64.Open console.log core dump runclient.py
Build Result Summary Day Hit Count Week Hit Count Month Hit Count
3 4 33
stephentoub commented 5 years ago

cc: @MaximLipnin Maybe related to https://github.com/dotnet/corefx/pull/33865

danmoseley commented 5 years ago

@stephentoub do you know what could cause a NRE in a native call into macOS? Is the interop layer translating some kind of errno returned from the OS here?

Then I see we get a COR_E_EXECUTIONENGINE while trying to deal with the exception.

stephentoub commented 5 years ago

Not sure; @jkoritzinsky, @AaronRobinsonMSFT, ideas?

AaronRobinsonMSFT commented 5 years ago

@danmosemsft and @stephentoub The COR_E_EXECUTIONENGINE can be caused by an invalid P/Invoke stub or invalid translation as well. I did recently fix a few issues in macOS regarding race conditions in shutdown - https://github.com/dotnet/coreclr/pull/24847. How often is this happening? https://github.com/dotnet/corefx/pull/33865 does look promising too. Can someone point to the P/Invoke that is causing this issue?

stephentoub commented 5 years ago

@AaronRobinsonMSFT, I only saw it that once. This is the P/Invoke: https://github.com/dotnet/corefx/blob/2aa2518a47c541148e2dc3f9d040942e44395acd/src/Common/src/Interop/OSX/Interop.RunLoop.cs#L33-L35 where CoreFoundationLibrary is: https://github.com/dotnet/corefx/blob/2aa2518a47c541148e2dc3f9d040942e44395acd/src/Common/src/Interop/OSX/Interop.Libraries.cs#L9

danmoseley commented 5 years ago

For future reference, it seems console output cannot directly be queried in Kusto. The best we can easily do is look for 134

WorkItems 
| where Started > now(-10d)  
| where FriendlyName == "System.IO.FileSystem.Watcher.Tests"
| where ExitCode == 134
| join kind= inner (
   Jobs  | where Started > now(-10d) | project  QueueName , JobId, Build, Type, Source 
) on JobId
| summarize count() by ExitCode, QueueName
| order by count_ desc

ExitCode QueueName count_ 134 Ubuntu.1604.Arm32.Open 28 134 Ubuntu.1604.Arm64.Docker.Open 25 134 Ubuntu.1804.Amd64.Open 6 134 OSX.1013.Amd64.Open 1


compared to ~5000 with 0 exit code.
AaronRobinsonMSFT commented 5 years ago

@danmosemsft Do we know what version of coreclr is being used to run the tests themselves? Is it possible the coreclr being used in this instance is older than the date at which https://github.com/dotnet/coreclr/pull/24847 was checked in?

stephentoub commented 5 years ago

We'd have to dig to see which exact version, but this was 5 days ago, and corefx ingested a new coreclr 5 days ago, 6 days ago, and 7 days ago, so it definitely has that change from end of May.

danmoseley commented 5 years ago

@mjanecke can you tell from the output at https://mc.dot.net/#/user/dotnet-bot/pr~2Fdotnet~2Fcorefx~2Frefs~2Fpull~2F38953~2Fmerge/test~2Ffunctional~2Fcli~2Finnerloop~2F/20190626.70/workItem/System.IO.FileSystem.Watcher.Tests/wilogs whether a dump got uploaded? I don't see the usual dumpling song and dance.

missymessa commented 5 years ago

@danmosemsft, for that test, it looks like the failure happened on a Mac, which I don't think @epananth has finished implementing for dump collection yet.

danmoseley commented 5 years ago

@mjanecke good point, and last I heard we did not have a way to create a dump on Mac (that could be debugged with managed debuggers). @mikem8361 any update on that? Is our only option to catch this under an attached debugger?

(It's quite possible this isn't Mac specific actually but I'm interested anyway)

mikem8361 commented 5 years ago

The best way would be to catch this under a managed debugger. There is no “createdump” facility on macOS. You can setup your macOS to generate system core dumps (ulimit -c unlimited, etc.) but there a limitation in lldb that makes SOS on coredumps on macOS difficult. See https://github.com/dotnet/diagnostics/blob/master/documentation/debugging-coredump.md#launch-lldb-under-macos.

mikem

danmoseley commented 5 years ago

Since this is clearly mac specific we can confidently say it has only happened 1 time in 10 days.

danmoseley commented 5 years ago

No more hits in the last week. Query:

WorkItems 
| where Started > now(-10d)  
| where FriendlyName == "System.IO.FileSystem.Watcher.Tests"
| where ExitCode == 134
| join kind= inner (
   Jobs  | where Started > now(-10d) | project  QueueName , JobId, Build, Type, Source 
) on JobId
| summarize count() by ExitCode, QueueName
| order by count_ desc
       
  ExitCode QueueName count_
  134 Ubuntu.1604.Arm32.Open 3
carlossanlop commented 5 years ago

Seems to be infrastructure related. Moving to 5.0.

jkotas commented 4 years ago

Failed in dotnet/corefx#42512:

  Starting:    System.IO.FileSystem.Watcher.Tests (parallel test collections = on, max threads = 4)
Unhandled exception. System.ObjectDisposedException: Safe handle has been closed.
Object name: 'SafeHandle'.
   at System.Runtime.InteropServices.SafeHandle.DangerousAddRef(Boolean& success) in /_/src/System.Private.CoreLib/shared/System/Runtime/InteropServices/SafeHandle.cs:line 156
   at System.StubHelpers.StubHelpers.SafeHandleAddRef(SafeHandle pHandle, Boolean& success) in /_/src/System.Private.CoreLib/src/System/StubHelpers.cs:line 1744
   at Interop.EventStream.FSEventStreamUnscheduleFromRunLoop(SafeEventStreamHandle streamRef, IntPtr runLoop, SafeCreateHandle runLoopMode)
   at System.IO.FileSystemWatcher.RunningInstance.StaticWatcherRunLoopManager.UnscheduleFromRunLoop(SafeEventStreamHandle eventStream) in /_/src/System.IO.FileSystem.Watcher/src/System/IO/FileSystemWatcher.OSX.cs:line 215
   at System.IO.FileSystemWatcher.RunningInstance.CancellationCallback() in /_/src/System.IO.FileSystem.Watcher/src/System/IO/FileSystemWatcher.OSX.cs:line 276
   at System.IO.FileSystemWatcher.RunningInstance.FileSystemEventCallback(IntPtr streamRef, IntPtr clientCallBackInfo, IntPtr numEvents, Byte** eventPaths, FSEventStreamEventFlags* eventFlags, UInt64* eventIds) in /_/src/System.IO.FileSystem.Watcher/src/System/IO/FileSystemWatcher.OSX.cs:line 370
   at Interop.RunLoop.CFRunLoopRun()
   at System.IO.FileSystemWatcher.RunningInstance.StaticWatcherRunLoopManager.WatchForFileSystemEventsThreadStart(Object args) in /_/src/System.IO.FileSystem.Watcher/src/System/IO/FileSystemWatcher.OSX.cs:line 249
   at System.Threading.ThreadHelper.ThreadStart_Context(Object state) in /_/src/System.Private.CoreLib/src/System/Threading/Thread.CoreCLR.cs:line 50
   at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state) in /_/src/System.Private.CoreLib/shared/System/Threading/ExecutionContext.cs:line 201
Fatal error. Internal CLR error. (0x80131506)
   at Interop+RunLoop.CFRunLoopRun()
   at Interop+RunLoop.CFRunLoopRun()
   at System.IO.FileSystemWatcher+RunningInstance+StaticWatcherRunLoopManager.WatchForFileSystemEventsThreadStart(System.Object)
   at System.Threading.ThreadHelper.ThreadStart_Context(System.Object)
   at System.Threading.ExecutionContext.RunInternal(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object)
   at System.Threading.ThreadHelper.ThreadStart(System.Object)
danmoseley commented 4 years ago

@wfurt

wfurt commented 4 years ago

I will take a look. We still have few flaky watcher tests on OSX and my plan was to run whole set in loop and chase them down.

ericstj commented 4 years ago

Failed again in https://github.com/dotnet/runtime/pull/136. Same stack. Here's a dump: https://helix.dot.net/api/2019-06-17/jobs/b40ed2a0-5a9f-4d81-9bb5-62900426af75/workitems/System.IO.FileSystem.Watcher.Tests/files/core.18074 (it's 3.4 GB! get it while its hot / still available)

wfurt commented 4 years ago

The core file is incomplete or corrupted:

(lldb) target create --core "/Users/furt/Downloads/core.18074"
warning: (x86_64) /Users/furt/Downloads/core.18074 load command 3055 LC_SEGMENT_64 has a fileoff + filesize (0xda14f000) that extends beyond the end of the file (0xda14e000), the segment will be truncated to match
warning: (x86_64) /Users/furt/Downloads/core.18074 load command 3056 LC_SEGMENT_64 has a fileoff (0xda14f000) that extends beyond the end of the file (0xda14e000), ignoring this section
Core file '/Users/furt/Downloads/core.18074' (x86_64) was loaded.
(lldb) plugin load /Users/furt/github/diagnostics/artifacts/bin/OSX.x64.Debug/libsosplugin.dylib
(lldb) clrstack
Failed to load data access module, 0x80004005
Can not load or initialize libmscordaccore.dylib. The target runtime may not be initialized.
ClrStack  failed

We should never limit core size as that makes them unusable. Also it is full of zeros and compressed size is only ~ 400M. We should not dump them raw to storage if we care about size. Should we open new issue to make core dumps on OSX usable @ericstj ? And if so I'm not what is correct repo and team to work on this.

wfurt commented 4 years ago

cc: @epananth for core file part.

ericstj commented 4 years ago

I'm guessing https://github.com/dotnet/arcade and tag @mjanecke @alexperovich @ChadNedzlek

mikem8361 commented 4 years ago

The clrstack error message means that SOS can’t find the right version of the DAC. Try enabling download support with “setsymbolserver -ms” and SOS will attempt to download the correct version. There still may be problems with the dump and there still is the swift lldb bug with coredump thread ids (see https://github.com/dotnet/diagnostics/blob/master/documentation/debugging-coredump.md#launch-lldb-under-macos) for more information.

mikem

From: Tomas Weinfurt notifications@github.com Sent: Wednesday, November 20, 2019 11:57 AM To: dotnet/corefx corefx@noreply.github.com Cc: Mike McLaughlin mikem@microsoft.com; Mention mention@noreply.github.com Subject: Re: [dotnet/corefx] Unhandled NullReferenceException on macOS from FileSystemWatcher (#38966)

The core file is incomplete or corrupted:

(lldb) target create --core "/Users/furt/Downloads/core.18074"

warning: (x86_64) /Users/furt/Downloads/core.18074 load command 3055 LC_SEGMENT_64 has a fileoff + filesize (0xda14f000) that extends beyond the end of the file (0xda14e000), the segment will be truncated to match

warning: (x86_64) /Users/furt/Downloads/core.18074 load command 3056 LC_SEGMENT_64 has a fileoff (0xda14f000) that extends beyond the end of the file (0xda14e000), ignoring this section

Core file '/Users/furt/Downloads/core.18074' (x86_64) was loaded.

(lldb) plugin load /Users/furt/github/diagnostics/artifacts/bin/OSX.x64.Debug/libsosplugin.dylib

(lldb) clrstack

Failed to load data access module, 0x80004005

Can not load or initialize libmscordaccore.dylib. The target runtime may not be initialized.

ClrStack failed

We should never limit core size as that makes them unusable. Also it is full of zeros and compressed size is only ~ 400M. We should not dump them raw to storage if we care about size. Should we open new issue to make core dumps on OSX usable @ericstjhttps://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fericstj&data=02%7C01%7Cmikem%40microsoft.com%7C1e551ace51504e4a546e08d76df3c47a%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C637098766046674512&sdata=IcmY6MYuoioMkZ2AV23AjSKwGhC7xilyYVvIAMROb2k%3D&reserved=0 ? And if so I'm not what is correct repo and team to work on this.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHubhttps://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fdotnet%2Fcorefx%2Fissues%2F38966%3Femail_source%3Dnotifications%26email_token%3DACPYK22FC5YXARV7BNZRDN3QUWI7VA5CNFSM4H3WOBPKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEEULIOQ%23issuecomment-556315706&data=02%7C01%7Cmikem%40microsoft.com%7C1e551ace51504e4a546e08d76df3c47a%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C637098766046684509&sdata=ZB6bXCVnZMW9GNoWvxZefDG%2FlImrIvfo%2FBb9skLB8U0%3D&reserved=0, or unsubscribehttps://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fnotifications%2Funsubscribe-auth%2FACPYK26PPOSJ4WJ7H2KS4HDQUWI7VANCNFSM4H3WOBPA&data=02%7C01%7Cmikem%40microsoft.com%7C1e551ace51504e4a546e08d76df3c47a%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C637098766046684509&sdata=qJsX4rmbe0WtXxtMqU4e6MVRpZ%2BWHZCBQz%2FR27SXRHs%3D&reserved=0.

xieofxie commented 4 years ago

@jkotas, hi, I met the same issue as you posted on macOS (on Windows it works fine..). Is it being fixed? Thanks~

wfurt commented 4 years ago

no. If you have repro, please share it @xieofxie .

xieofxie commented 4 years ago

@wfurt It is generated by a validation in azure devops for the pr above.. I don't know how to reproduce it steadily..

jkotas commented 4 years ago

Hit by https://github.com/dotnet/runtime/pull/2107. From https://helix.dot.net/api/2019-06-17/jobs/003fe36f-508e-40d9-9a7e-f35e3a3aea65/workitems/System.IO.FileSystem.Watcher.Tests/files/console.ce2942ec.log :

/tmp/helix/working/AC3109A7/p/dotnet exec --runtimeconfig System.IO.FileSystem.Watcher.Tests.runtimeconfig.json --depsfile System.IO.FileSystem.Watcher.Tests.deps.json xunit.console.dll System.IO.FileSystem.Watcher.Tests.dll -xml testResults.xml -nologo -nocolor -notrait category=IgnoreForCI -notrait category=OuterLoop -notrait category=failing -notrait category=nonnetcoreapptests -notrait category=nonosxtests 
popd
===========================================================================================================
/private/tmp/helix/working/AC3109A7/w/A3D808CF/e /private/tmp/helix/working/AC3109A7/w/A3D808CF/e
  Discovering: System.IO.FileSystem.Watcher.Tests (method display = ClassAndMethod, method display options = None)
  Discovered:  System.IO.FileSystem.Watcher.Tests (found 151 of 184 test cases)
  Starting:    System.IO.FileSystem.Watcher.Tests (parallel test collections = on, max threads = 4)
Unhandled exception. System.ObjectDisposedException: Safe handle has been closed.
Object name: 'SafeHandle'.
   at System.Runtime.InteropServices.SafeHandle.DangerousAddRef(Boolean& success) in /_/src/libraries/System.Private.CoreLib/src/System/Runtime/InteropServices/SafeHandle.cs:line 149
   at System.StubHelpers.StubHelpers.SafeHandleAddRef(SafeHandle pHandle, Boolean& success) in /_/src/coreclr/src/System.Private.CoreLib/src/System/StubHelpers.cs:line 1751
   at Interop.EventStream.FSEventStreamUnscheduleFromRunLoop(SafeEventStreamHandle streamRef, IntPtr runLoop, SafeCreateHandle runLoopMode)
   at System.IO.FileSystemWatcher.RunningInstance.StaticWatcherRunLoopManager.UnscheduleFromRunLoop(SafeEventStreamHandle eventStream) in /_/src/libraries/System.IO.FileSystem.Watcher/src/System/IO/FileSystemWatcher.OSX.cs:line 220
   at System.IO.FileSystemWatcher.RunningInstance.CancellationCallback() in /_/src/libraries/System.IO.FileSystem.Watcher/src/System/IO/FileSystemWatcher.OSX.cs:line 278
   at System.IO.FileSystemWatcher.RunningInstance.FileSystemEventCallback(IntPtr streamRef, IntPtr clientCallBackInfo, IntPtr numEvents, Byte** eventPaths, FSEventStreamEventFlags* eventFlags, UInt64* eventIds) in /_/src/libraries/System.IO.FileSystem.Watcher/src/System/IO/FileSystemWatcher.OSX.cs:line 370
   at Interop.RunLoop.CFRunLoopRun()
   at System.IO.FileSystemWatcher.RunningInstance.StaticWatcherRunLoopManager.WatchForFileSystemEventsThreadStart(ManualResetEventSlim runLoopStarted, SafeEventStreamHandle eventStream) in /_/src/libraries/System.IO.FileSystem.Watcher/src/System/IO/FileSystemWatcher.OSX.cs:line 251
   at System.IO.FileSystemWatcher.RunningInstance.StaticWatcherRunLoopManager.<>c.<ScheduleEventStream>b__3_0(Object args) in /_/src/libraries/System.IO.FileSystem.Watcher/src/System/IO/FileSystemWatcher.OSX.cs:line 204
   at System.Threading.ThreadHelper.ThreadStart_Context(Object state) in /_/src/coreclr/src/System.Private.CoreLib/src/System/Threading/Thread.CoreCLR.cs:line 47
   at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state) in /_/src/libraries/System.Private.CoreLib/src/System/Threading/ExecutionContext.cs:line 201
Fatal error. Internal CLR error. (0x80131506 (COR_E_EXECUTIONENGINE))
danmoseley commented 4 years ago

@jkotas just curious as an aside, is it expected that the CLR should exit with Internal CLR error. (0x80131506 (COR_E_EXECUTIONENGINE)) when user code has simply thrown an exception on a background thread?

jkotas commented 4 years ago

That is not expected either. It may be either side-effect of the original problem; or a different bug.

wfurt commented 4 years ago

It seems like we did not collect core file despite 'Core dumped' message.

cc: @epananth @MattGal in case I'm mistaken.

MattGal commented 4 years ago

It seems like we did not collect core file despite 'Core dumped' message.

cc: @epananth @MattGal in case I'm mistaken.

From the log for the above work item, there wasn't a file found on the cores folder when the work item finished.

2020-02-24T18:57:27.719Z    INFO    io(27)  copy_tree_to    b"Copying '/tmp/helix/working/AC3109A7/w/A3D808CF/u/zh-Hant/Microsoft.VisualStudio.TestPlatform.ObjectModel.resources.dll' to '/tmp/helix/working/AC3109A7/w/A3D808CF/e/zh-Hant/Microsoft.VisualStudio.TestPlatform.ObjectModel.resources.dll'"
2020-02-24T18:57:27.720Z    INFO    executor(109)   run Successfully downloaded work item payload
2020-02-24T18:57:27.720Z    INFO    executor(631)   _execute_command    running $HELIX_CORRELATION_PAYLOAD/scripts/27f91e9998934da69568877a9a85d01a/execute.sh in /tmp/helix/working/AC3109A7/w/A3D808CF/e max 900 seconds
2020-02-24T18:57:38.713Z    INFO    executor(608)   _dump_file_upload   No dump files to upload
2020-02-24T18:57:38.715Z    INFO    executor(721)   _upload_files   File ready to be upload : console.ce2942ec.log
2020-02-24T18:57:38.715Z    INFO    executor(690)   _upload_single_file Uploading the file console.ce2942ec.log

We can investigate the specific machine in question (dci-mac-build-012) but if dumps work on it for unhandled exceptions we have to consider the other possibilities.

MattGal commented 4 years ago

I spent some time checking this out with @epananth and it seems the helix-runner user just didn't have permissions to the /cores folder on that machine. We're checking it out now.

epananth commented 4 years ago

@Wfurt After updating the permission, we are able to create core dumps in the cores folder. 10.13 did not come with /cores directory so we had to manually add it and give the right permissions to it. Unfortunately this did not have the right permission,.I think that is the reason this failed.

wfurt commented 4 years ago

Thanks for the update @MattGal & @epananth. I'm positive there will be crashes in future :)

epananth commented 4 years ago

Also spoke to @ilyas1974 he said that this particular machine dci-mac-build-012, was recently updated, so chances are one might have missed some set up scripts. But we do make that change for all 10.13 machines.

ilya-bin commented 4 years ago

I think you are confusing me with someone else, sorry i don't host mac-build machines.

On 25 Feb 2020, at 05:57, Epsitha Ananth notifications@github.com wrote:

 Also spoke to @ilya-bin he said that this particular machine dci-mac-build-012, was recently updated, so chances are one might have missed some set up scripts. But we do make that change for all 10.13 machines.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub, or unsubscribe.

danmoseley commented 4 years ago

Should be @ilyas1974 sorry @ilya-bin

jaredpar commented 4 years ago

Haven't seen this in last 100 builds. Removing "blocking-clean-ci". If this comes up again we should consider disabling until we get to the bottom to help get PRs / CI off the floor.

safern commented 4 years ago

Just happened again: https://dev.azure.com/dnceng/public/_build/results?buildId=562826&view=logs

/private/tmp/helix/working/B7480941/w/A8CA0943/e /private/tmp/helix/working/B7480941/w/A8CA0943/e
  Discovering: System.IO.FileSystem.Watcher.Tests (method display = ClassAndMethod, method display options = None)
  Discovered:  System.IO.FileSystem.Watcher.Tests (found 151 of 184 test cases)
  Starting:    System.IO.FileSystem.Watcher.Tests (parallel test collections = on, max threads = 4)
Unhandled exception. System.ObjectDisposedException: Safe handle has been closed.
Object name: 'SafeHandle'.
   at System.Runtime.InteropServices.SafeHandle.DangerousAddRef(Boolean& success) in /_/src/libraries/System.Private.CoreLib/src/System/Runtime/InteropServices/SafeHandle.cs:line 149
   at System.StubHelpers.StubHelpers.SafeHandleAddRef(SafeHandle pHandle, Boolean& success) in /_/src/coreclr/src/System.Private.CoreLib/src/System/StubHelpers.cs:line 1747
   at Interop.EventStream.FSEventStreamUnscheduleFromRunLoop(SafeEventStreamHandle streamRef, IntPtr runLoop, SafeCreateHandle runLoopMode)
   at System.IO.FileSystemWatcher.RunningInstance.StaticWatcherRunLoopManager.UnscheduleFromRunLoop(SafeEventStreamHandle eventStream) in /_/src/libraries/System.IO.FileSystem.Watcher/src/System/IO/FileSystemWatcher.OSX.cs:line 220
   at System.IO.FileSystemWatcher.RunningInstance.CancellationCallback() in /_/src/libraries/System.IO.FileSystem.Watcher/src/System/IO/FileSystemWatcher.OSX.cs:line 278
   at System.IO.FileSystemWatcher.RunningInstance.FileSystemEventCallback(IntPtr streamRef, IntPtr clientCallBackInfo, IntPtr numEvents, Byte** eventPaths, FSEventStreamEventFlags* eventFlags, UInt64* eventIds) in /_/src/libraries/System.IO.FileSystem.Watcher/src/System/IO/FileSystemWatcher.OSX.cs:line 370
   at Interop.RunLoop.CFRunLoopRun()
   at System.IO.FileSystemWatcher.RunningInstance.StaticWatcherRunLoopManager.WatchForFileSystemEventsThreadStart(ManualResetEventSlim runLoopStarted, SafeEventStreamHandle eventStream) in /_/src/libraries/System.IO.FileSystem.Watcher/src/System/IO/FileSystemWatcher.OSX.cs:line 251
   at System.IO.FileSystemWatcher.RunningInstance.StaticWatcherRunLoopManager.<>c.<ScheduleEventStream>b__3_0(Object args) in /_/src/libraries/System.IO.FileSystem.Watcher/src/System/IO/FileSystemWatcher.OSX.cs:line 204
   at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state) in /_/src/libraries/System.Private.CoreLib/src/System/Threading/ExecutionContext.cs:line 201
Fatal error. Internal CLR error. (0x80131506 (COR_E_EXECUTIONENGINE))
Repeat 2 times:
--------------------------------
   at Interop+RunLoop.CFRunLoopRun()
--------------------------------
   at System.IO.FileSystemWatcher+RunningInstance+StaticWatcherRunLoopManager.WatchForFileSystemEventsThreadStart(System.Threading.ManualResetEventSlim, Microsoft.Win32.SafeHandles.SafeEventStreamHandle)
   at System.IO.FileSystemWatcher+RunningInstance+StaticWatcherRunLoopManager+<>c.<ScheduleEventStream>b__3_0(System.Object)
   at System.Threading.ExecutionContext.RunInternal(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object)

This time there is a core dump: https://helix.dot.net/api/2019-06-17/jobs/b62f6bde-4acb-47a1-8375-a2905024d639/workitems/System.IO.FileSystem.Watcher.Tests/files/core.34665

danmoseley commented 4 years ago

@wfurt seems you were looking for a dump here?

wfurt commented 4 years ago

I did. I can see handle closed but that does not tell me why. We can put in some bandaide but it would be nice to figure out what is really going on. I was not able to reproduce but I put in only small effort.

ViktorHofer commented 4 years ago

Happened again in https://dev.azure.com/dnceng/public/_build/results?buildId=589913&view=ms.vss-test-web.build-test-results-tab&runId=18593304&resultId=172079&paneView=attachments

Configuration: netcoreapp5.0-OSX-Debug-x64-CoreCLR_checked-OSX.1013.Amd64.Open

  Discovering: System.IO.FileSystem.Watcher.Tests (method display = ClassAndMethod, method display options = None)
  Discovered:  System.IO.FileSystem.Watcher.Tests (found 151 of 184 test cases)
  Starting:    System.IO.FileSystem.Watcher.Tests (parallel test collections = on, max threads = 4)
Unhandled exception. System.ObjectDisposedException: Safe handle has been closed.
Object name: 'SafeHandle'.
   at System.Runtime.InteropServices.SafeHandle.DangerousAddRef(Boolean& success) in /_/src/libraries/System.Private.CoreLib/src/System/Runtime/InteropServices/SafeHandle.cs:line 149
   at System.StubHelpers.StubHelpers.SafeHandleAddRef(SafeHandle pHandle, Boolean& success) in /_/src/coreclr/src/System.Private.CoreLib/src/System/StubHelpers.cs:line 1758
   at Interop.EventStream.FSEventStreamUnscheduleFromRunLoop(SafeEventStreamHandle streamRef, IntPtr runLoop, SafeCreateHandle runLoopMode)
   at System.IO.FileSystemWatcher.RunningInstance.StaticWatcherRunLoopManager.UnscheduleFromRunLoop(SafeEventStreamHandle eventStream) in /_/src/libraries/System.IO.FileSystem.Watcher/src/System/IO/FileSystemWatcher.OSX.cs:line 220
   at System.IO.FileSystemWatcher.RunningInstance.CancellationCallback() in /_/src/libraries/System.IO.FileSystem.Watcher/src/System/IO/FileSystemWatcher.OSX.cs:line 278
   at System.IO.FileSystemWatcher.RunningInstance.FileSystemEventCallback(IntPtr streamRef, IntPtr clientCallBackInfo, IntPtr numEvents, Byte** eventPaths, FSEventStreamEventFlags* eventFlags, UInt64* eventIds) in /_/src/libraries/System.IO.FileSystem.Watcher/src/System/IO/FileSystemWatcher.OSX.cs:line 370

console.352e39f2.log: https://helix.dot.net/api/2019-06-17/jobs/55d612f0-35cc-408f-b401-7a7e0a381703/workitems/System.IO.FileSystem.Watcher.Tests/files/console.352e39f2.log

core.84604: https://helix.dot.net/api/2019-06-17/jobs/55d612f0-35cc-408f-b401-7a7e0a381703/workitems/System.IO.FileSystem.Watcher.Tests/files/core.84604

jaredpar commented 4 years ago

FWIW @ViktorHofer PR is the only time I see this failure in the last 100 builds. So it does look fairly infrequent but also seems like we have a fix in the works here.

eerhardt commented 4 years ago

I got this same error on my PR today - re-opening the issue to track the CI failure.

https://dev.azure.com/dnceng/public/_build/results?buildId=651821&view=logs&jobId=e36ba317-696d-52ae-17e4-c1550b11ed0e&j=e36ba317-696d-52ae-17e4-c1550b11ed0e

https://helix.dot.net/api/2019-06-17/jobs/24fa8408-1806-4bc5-bfa3-647246eb9e5f/workitems/System.IO.FileSystem.Watcher.Tests/console

/private/tmp/helix/working/A517094C/w/B80409AD/e /private/tmp/helix/working/A517094C/w/B80409AD/e
  Discovering: System.IO.FileSystem.Watcher.Tests (method display = ClassAndMethod, method display options = None)
  Discovered:  System.IO.FileSystem.Watcher.Tests (found 150 of 183 test cases)
  Starting:    System.IO.FileSystem.Watcher.Tests (parallel test collections = on, max threads = 4)
Unhandled exception. System.ObjectDisposedException: Safe handle has been closed.
Object name: 'SafeHandle'.
   at System.Runtime.InteropServices.SafeHandle.DangerousAddRef(Boolean& success) in /_/src/libraries/System.Private.CoreLib/src/System/Runtime/InteropServices/SafeHandle.cs:line 149
   at System.StubHelpers.StubHelpers.SafeHandleAddRef(SafeHandle pHandle, Boolean& success) in /_/src/coreclr/src/System.Private.CoreLib/src/System/StubHelpers.cs:line 1758
   at Interop.EventStream.FSEventStreamUnscheduleFromRunLoop(SafeEventStreamHandle streamRef, IntPtr runLoop, SafeCreateHandle runLoopMode)
   at System.IO.FileSystemWatcher.RunningInstance.StaticWatcherRunLoopManager.UnscheduleFromRunLoop(SafeEventStreamHandle eventStream) in /_/src/libraries/System.IO.FileSystem.Watcher/src/System/IO/FileSystemWatcher.OSX.cs:line 226
   at System.IO.FileSystemWatcher.RunningInstance.CleanupEventStream() in /_/src/libraries/System.IO.FileSystem.Watcher/src/System/IO/FileSystemWatcher.OSX.cs:line 282
   at System.IO.FileSystemWatcher.RunningInstance.FileSystemEventCallback(IntPtr streamRef, IntPtr clientCallBackInfo, IntPtr numEvents, Byte** eventPaths, FSEventStreamEventFlags* eventFlags, UInt64* eventIds) in /_/src/libraries/System.IO.FileSystem.Watcher/src/System/IO/FileSystemWatcher.OSX.cs:line 358
./RunTests.sh: line 161: 69523 Abort trap: 6           (core dumped) "$RUNTIME_PATH/dotnet" exec --runtimeconfig System.IO.FileSystem.Watcher.Tests.runtimeconfig.json --depsfile System.IO.FileSystem.Watcher.Tests.deps.json xunit.console.dll System.IO.FileSystem.Watcher.Tests.dll -xml testResults.xml -nologo -nocolor -notrait category=IgnoreForCI -notrait category=OuterLoop -notrait category=failing $RSP_FILE
/private/tmp/helix/working/A517094C/w/B80409AD/e
----- end Tue May 19 13:26:13 PDT 2020 ----- exit code 134 ----------------------------------------------------------
stephentoub commented 4 years ago

@jaredpar, do you know if we have seen this at all in recent history other than in @eerhardt's example?

wfurt commented 4 years ago

@alnikola may have the answer as well. We can probably just count all unfinished test runs since I'm not awake of any other issue causing crashes.

stephentoub commented 4 years ago

From looking at the code, I have a hypothesis for how this is happening. Different reason from the issue I previously fixed, but same symptom.

When the FileSystemWatcher is started on OSX, it creates a new instance of RunningInstance and Starts it: https://github.com/dotnet/runtime/blob/8f929a123fcd9b9c8d7841a26ff87e4843c8ffa5/src/libraries/System.IO.FileSystem.Watcher/src/System/IO/FileSystemWatcher.OSX.cs#L52-L53 The Start method creates this event stream SafeHandle and stores it into a field on the instance: https://github.com/dotnet/runtime/blob/8f929a123fcd9b9c8d7841a26ff87e4843c8ffa5/src/libraries/System.IO.FileSystem.Watcher/src/System/IO/FileSystemWatcher.OSX.cs#L311 But... it appears nothing is rooting this RunningInstance instance. That _eventStream is passed to ScheduleEventStream: https://github.com/dotnet/runtime/blob/8f929a123fcd9b9c8d7841a26ff87e4843c8ffa5/src/libraries/System.IO.FileSystem.Watcher/src/System/IO/FileSystemWatcher.OSX.cs#L324 but if the event loop is already running, servicing another FileSystemWatcher, then nothing will root _eventStream, either: https://github.com/dotnet/runtime/blob/8f929a123fcd9b9c8d7841a26ff87e4843c8ffa5/src/libraries/System.IO.FileSystem.Watcher/src/System/IO/FileSystemWatcher.OSX.cs#L196-L202 So, we could end up in a situation where the FileSystemWatcher gets dropped, and with nothing rooting the SafeHandle, it gets finalized. But then the FileSystemEventCallback associated with this RunningInstance is invoked (which is dubious if nothing is keeping alive the delegate that was handed out to the native function), it can't find the watcher in its weak reference, and it proceeeds to call CleanupEventStream: https://github.com/dotnet/runtime/blob/8f929a123fcd9b9c8d7841a26ff87e4843c8ffa5/src/libraries/System.IO.FileSystem.Watcher/src/System/IO/FileSystemWatcher.OSX.cs#L356-L360 which will in turn try to use the _eventStream: https://github.com/dotnet/runtime/blob/8f929a123fcd9b9c8d7841a26ff87e4843c8ffa5/src/libraries/System.IO.FileSystem.Watcher/src/System/IO/FileSystemWatcher.OSX.cs#L269-L286 It doesn't matter that my previous fix successfully ensures that Close can be called on the SafeHandle only by the thread that successfully swapped it out of _eventStream, because the race is actually with finalization. Hooray for resurrection :(

That's my working theory, at least. The fix would be to root things appropriately with a GCHandle, and ensure as well that something with a finalizer is able to free that GCHandle (i.e. the GCHandle doesn't end up indirectly rooting itself). I'm slightly skeptical of my theory only because I'd have expected things to go bad in multiple ways, and not just with regards to this SafeHandle be used after disposal, but maybe it just takes so many things to be lined up correctly that we just don't see it that often.

MattGal commented 4 years ago

@jaredpar, do you know if we have seen this at all in recent history other than in @eerhardt's example?

I took a peek:

WorkItems
| where Started > ago(30d)
| where QueueName =~ "osx.1013.amd64.open"
| where FriendlyName  == "System.IO.FileSystem.Watcher.Tests"
| where ExitCode == 134

If we assume all SIGABRTs are this failure (which is worst-case), and checking out Kusto, this has only happened 15 times (out of ~5600 times this specific test has run in the past 30 days across pretty much every machine in OSX 10.13, there's never more than 2 in a single day. Whatever it is, it's able to repro on nearly every machine and is always around.

wfurt commented 4 years ago

would it make sense to add some temporary state variable so we can track operations and verify the theory @stephentoub ? With two crashed per day we should be able to get core file and verification within reasonable time.

alnikola commented 4 years ago

@MattGal @wfurt @stephentoub I ran the above query, checked console logs for a few result rows and found out at least 2 types of failures:

  1. System.ObjectDisposedException: Safe handle has been closed mentioned above
  2. Some Debug.Assert failure in System.Text.RegularExpressions.RegexCharClass

That means there is more than 1 cause for the given ExitCode on osx.1013.amd64.open