Closed: stephentoub closed this issue 3 years ago
cc: @MaximLipnin Maybe related to https://github.com/dotnet/corefx/pull/33865
@stephentoub do you know what could cause an NRE in a native call on macOS? Is the interop layer translating some kind of errno returned from the OS here?
Then I see we get a COR_E_EXECUTIONENGINE while trying to deal with the exception.
Not sure; @jkoritzinsky, @AaronRobinsonMSFT, ideas?
@danmosemsft and @stephentoub The COR_E_EXECUTIONENGINE can be caused by an invalid P/Invoke stub or an invalid translation as well. I did recently fix a few issues on macOS regarding race conditions during shutdown - https://github.com/dotnet/coreclr/pull/24847. How often is this happening? https://github.com/dotnet/corefx/pull/33865 does look promising too. Can someone point to the P/Invoke that is causing this issue?
@AaronRobinsonMSFT, I only saw it that once. This is the P/Invoke: https://github.com/dotnet/corefx/blob/2aa2518a47c541148e2dc3f9d040942e44395acd/src/Common/src/Interop/OSX/Interop.RunLoop.cs#L33-L35 where CoreFoundationLibrary is: https://github.com/dotnet/corefx/blob/2aa2518a47c541148e2dc3f9d040942e44395acd/src/Common/src/Interop/OSX/Interop.Libraries.cs#L9
For future reference, it seems console output cannot be queried directly in Kusto. The best we can easily do is look for exit code 134:
WorkItems
| where Started > now(-10d)
| where FriendlyName == "System.IO.FileSystem.Watcher.Tests"
| where ExitCode == 134
| join kind= inner (
Jobs | where Started > now(-10d) | project QueueName , JobId, Build, Type, Source
) on JobId
| summarize count() by ExitCode, QueueName
| order by count_ desc
| ExitCode | QueueName | count_ |
| --- | --- | --- |
| 134 | Ubuntu.1604.Arm32.Open | 28 |
| 134 | Ubuntu.1604.Arm64.Docker.Open | 25 |
| 134 | Ubuntu.1804.Amd64.Open | 6 |
| 134 | OSX.1013.Amd64.Open | 1 |
compared to ~5000 runs with a 0 exit code.
@danmosemsft Do we know what version of coreclr is being used to run the tests themselves? Is it possible the coreclr being used in this instance is older than the date at which https://github.com/dotnet/coreclr/pull/24847 was checked in?
We'd have to dig to see which exact version, but this was 5 days ago, and corefx ingested a new coreclr 5 days ago, 6 days ago, and 7 days ago, so it definitely has that change from end of May.
@mjanecke can you tell from the output at https://mc.dot.net/#/user/dotnet-bot/pr~2Fdotnet~2Fcorefx~2Frefs~2Fpull~2F38953~2Fmerge/test~2Ffunctional~2Fcli~2Finnerloop~2F/20190626.70/workItem/System.IO.FileSystem.Watcher.Tests/wilogs whether a dump got uploaded? I don't see the usual dumpling song and dance.
@danmosemsft, for that test, it looks like the failure happened on a Mac, which I don't think @epananth has finished implementing for dump collection yet.
@mjanecke good point, and last I heard we did not have a way to create a dump on Mac (that could be debugged with managed debuggers). @mikem8361 any update on that? Is our only option to catch this under an attached debugger?
(It's quite possible this isn't Mac specific actually but I'm interested anyway)
The best way would be to catch this under a managed debugger. There is no "createdump" facility on macOS. You can set up your macOS to generate system core dumps (ulimit -c unlimited, etc.), but there is a limitation in lldb that makes SOS on core dumps on macOS difficult. See https://github.com/dotnet/diagnostics/blob/master/documentation/debugging-coredump.md#launch-lldb-under-macos.
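As a sketch of that setup (assuming admin access on the Mac; the /cores directory and its permissions come up again later in this thread):

```shell
# Sketch: enable system core dumps on macOS before reproducing a crash.
# One-time setup, as root (macOS 10.13 may not ship with a /cores directory):
#   mkdir -p /cores && chmod 1777 /cores
ulimit -c unlimited   # allow full-size cores; a truncated core is unusable
ulimit -c             # verify; prints the new soft limit
```

Note that the limit is per-shell, so it has to be raised in the same shell that launches the test process.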
mikem
Since this crash is Mac-specific, the data above suggests it has happened only once in the last 10 days.
No more hits in the last week. Query:
WorkItems
| where Started > now(-10d)
| where FriendlyName == "System.IO.FileSystem.Watcher.Tests"
| where ExitCode == 134
| join kind= inner (
Jobs | where Started > now(-10d) | project QueueName , JobId, Build, Type, Source
) on JobId
| summarize count() by ExitCode, QueueName
| order by count_ desc
| ExitCode | QueueName | count_ |
| --- | --- | --- |
| 134 | Ubuntu.1604.Arm32.Open | 3 |
Seems to be infrastructure related. Moving to 5.0.
Failed in dotnet/corefx#42512:
Starting: System.IO.FileSystem.Watcher.Tests (parallel test collections = on, max threads = 4)
Unhandled exception. System.ObjectDisposedException: Safe handle has been closed.
Object name: 'SafeHandle'.
at System.Runtime.InteropServices.SafeHandle.DangerousAddRef(Boolean& success) in /_/src/System.Private.CoreLib/shared/System/Runtime/InteropServices/SafeHandle.cs:line 156
at System.StubHelpers.StubHelpers.SafeHandleAddRef(SafeHandle pHandle, Boolean& success) in /_/src/System.Private.CoreLib/src/System/StubHelpers.cs:line 1744
at Interop.EventStream.FSEventStreamUnscheduleFromRunLoop(SafeEventStreamHandle streamRef, IntPtr runLoop, SafeCreateHandle runLoopMode)
at System.IO.FileSystemWatcher.RunningInstance.StaticWatcherRunLoopManager.UnscheduleFromRunLoop(SafeEventStreamHandle eventStream) in /_/src/System.IO.FileSystem.Watcher/src/System/IO/FileSystemWatcher.OSX.cs:line 215
at System.IO.FileSystemWatcher.RunningInstance.CancellationCallback() in /_/src/System.IO.FileSystem.Watcher/src/System/IO/FileSystemWatcher.OSX.cs:line 276
at System.IO.FileSystemWatcher.RunningInstance.FileSystemEventCallback(IntPtr streamRef, IntPtr clientCallBackInfo, IntPtr numEvents, Byte** eventPaths, FSEventStreamEventFlags* eventFlags, UInt64* eventIds) in /_/src/System.IO.FileSystem.Watcher/src/System/IO/FileSystemWatcher.OSX.cs:line 370
at Interop.RunLoop.CFRunLoopRun()
at System.IO.FileSystemWatcher.RunningInstance.StaticWatcherRunLoopManager.WatchForFileSystemEventsThreadStart(Object args) in /_/src/System.IO.FileSystem.Watcher/src/System/IO/FileSystemWatcher.OSX.cs:line 249
at System.Threading.ThreadHelper.ThreadStart_Context(Object state) in /_/src/System.Private.CoreLib/src/System/Threading/Thread.CoreCLR.cs:line 50
at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state) in /_/src/System.Private.CoreLib/shared/System/Threading/ExecutionContext.cs:line 201
Fatal error. Internal CLR error. (0x80131506)
at Interop+RunLoop.CFRunLoopRun()
at Interop+RunLoop.CFRunLoopRun()
at System.IO.FileSystemWatcher+RunningInstance+StaticWatcherRunLoopManager.WatchForFileSystemEventsThreadStart(System.Object)
at System.Threading.ThreadHelper.ThreadStart_Context(System.Object)
at System.Threading.ExecutionContext.RunInternal(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object)
at System.Threading.ThreadHelper.ThreadStart(System.Object)
@wfurt
I will take a look. We still have a few flaky watcher tests on OSX, and my plan was to run the whole set in a loop and chase them down.
Failed again in https://github.com/dotnet/runtime/pull/136. Same stack. Here's a dump: https://helix.dot.net/api/2019-06-17/jobs/b40ed2a0-5a9f-4d81-9bb5-62900426af75/workitems/System.IO.FileSystem.Watcher.Tests/files/core.18074 (it's 3.4 GB! Get it while it's hot / still available.)
The core file is incomplete or corrupted:
(lldb) target create --core "/Users/furt/Downloads/core.18074"
warning: (x86_64) /Users/furt/Downloads/core.18074 load command 3055 LC_SEGMENT_64 has a fileoff + filesize (0xda14f000) that extends beyond the end of the file (0xda14e000), the segment will be truncated to match
warning: (x86_64) /Users/furt/Downloads/core.18074 load command 3056 LC_SEGMENT_64 has a fileoff (0xda14f000) that extends beyond the end of the file (0xda14e000), ignoring this section
Core file '/Users/furt/Downloads/core.18074' (x86_64) was loaded.
(lldb) plugin load /Users/furt/github/diagnostics/artifacts/bin/OSX.x64.Debug/libsosplugin.dylib
(lldb) clrstack
Failed to load data access module, 0x80004005
Can not load or initialize libmscordaccore.dylib. The target runtime may not be initialized.
ClrStack failed
We should never limit core size, as that makes them unusable. Also, the file is full of zeros and the compressed size is only ~400 MB; we should not dump them raw to storage if we care about size. Should we open a new issue to make core dumps on OSX usable, @ericstj? If so, I'm not sure what the correct repo and team to work on this would be.
cc: @epananth for core file part.
I'm guessing https://github.com/dotnet/arcade; tagging @mjanecke @alexperovich @ChadNedzlek.
The clrstack error message means that SOS can't find the right version of the DAC. Try enabling download support with "setsymbolserver -ms" and SOS will attempt to download the correct version. There may still be problems with the dump, and there is still the swift-lldb bug with core dump thread ids; see https://github.com/dotnet/diagnostics/blob/master/documentation/debugging-coredump.md#launch-lldb-under-macos for more information.
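Putting that suggestion together with the session posted earlier, the retry would look something like the following. This is a sketch of an interactive lldb session, not a runnable script; the paths and core file name are taken from the example above.

```
(lldb) target create --core "/Users/furt/Downloads/core.18074"
(lldb) plugin load /Users/furt/github/diagnostics/artifacts/bin/OSX.x64.Debug/libsosplugin.dylib
(lldb) setsymbolserver -ms      # let SOS download the matching DAC (libmscordaccore.dylib)
(lldb) clrstack                 # then retry the managed stack walk
```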
mikem
@jkotas, hi, I hit the same issue you posted, on macOS (on Windows it works fine). Is it being fixed? Thanks!
No. If you have a repro, please share it, @xieofxie.
@wfurt It was generated by a validation run in Azure DevOps for the PR above. I don't know how to reproduce it reliably.
Hit by https://github.com/dotnet/runtime/pull/2107. From https://helix.dot.net/api/2019-06-17/jobs/003fe36f-508e-40d9-9a7e-f35e3a3aea65/workitems/System.IO.FileSystem.Watcher.Tests/files/console.ce2942ec.log :
/tmp/helix/working/AC3109A7/p/dotnet exec --runtimeconfig System.IO.FileSystem.Watcher.Tests.runtimeconfig.json --depsfile System.IO.FileSystem.Watcher.Tests.deps.json xunit.console.dll System.IO.FileSystem.Watcher.Tests.dll -xml testResults.xml -nologo -nocolor -notrait category=IgnoreForCI -notrait category=OuterLoop -notrait category=failing -notrait category=nonnetcoreapptests -notrait category=nonosxtests
popd
===========================================================================================================
/private/tmp/helix/working/AC3109A7/w/A3D808CF/e /private/tmp/helix/working/AC3109A7/w/A3D808CF/e
Discovering: System.IO.FileSystem.Watcher.Tests (method display = ClassAndMethod, method display options = None)
Discovered: System.IO.FileSystem.Watcher.Tests (found 151 of 184 test cases)
Starting: System.IO.FileSystem.Watcher.Tests (parallel test collections = on, max threads = 4)
Unhandled exception. System.ObjectDisposedException: Safe handle has been closed.
Object name: 'SafeHandle'.
at System.Runtime.InteropServices.SafeHandle.DangerousAddRef(Boolean& success) in /_/src/libraries/System.Private.CoreLib/src/System/Runtime/InteropServices/SafeHandle.cs:line 149
at System.StubHelpers.StubHelpers.SafeHandleAddRef(SafeHandle pHandle, Boolean& success) in /_/src/coreclr/src/System.Private.CoreLib/src/System/StubHelpers.cs:line 1751
at Interop.EventStream.FSEventStreamUnscheduleFromRunLoop(SafeEventStreamHandle streamRef, IntPtr runLoop, SafeCreateHandle runLoopMode)
at System.IO.FileSystemWatcher.RunningInstance.StaticWatcherRunLoopManager.UnscheduleFromRunLoop(SafeEventStreamHandle eventStream) in /_/src/libraries/System.IO.FileSystem.Watcher/src/System/IO/FileSystemWatcher.OSX.cs:line 220
at System.IO.FileSystemWatcher.RunningInstance.CancellationCallback() in /_/src/libraries/System.IO.FileSystem.Watcher/src/System/IO/FileSystemWatcher.OSX.cs:line 278
at System.IO.FileSystemWatcher.RunningInstance.FileSystemEventCallback(IntPtr streamRef, IntPtr clientCallBackInfo, IntPtr numEvents, Byte** eventPaths, FSEventStreamEventFlags* eventFlags, UInt64* eventIds) in /_/src/libraries/System.IO.FileSystem.Watcher/src/System/IO/FileSystemWatcher.OSX.cs:line 370
at Interop.RunLoop.CFRunLoopRun()
at System.IO.FileSystemWatcher.RunningInstance.StaticWatcherRunLoopManager.WatchForFileSystemEventsThreadStart(ManualResetEventSlim runLoopStarted, SafeEventStreamHandle eventStream) in /_/src/libraries/System.IO.FileSystem.Watcher/src/System/IO/FileSystemWatcher.OSX.cs:line 251
at System.IO.FileSystemWatcher.RunningInstance.StaticWatcherRunLoopManager.<>c.<ScheduleEventStream>b__3_0(Object args) in /_/src/libraries/System.IO.FileSystem.Watcher/src/System/IO/FileSystemWatcher.OSX.cs:line 204
at System.Threading.ThreadHelper.ThreadStart_Context(Object state) in /_/src/coreclr/src/System.Private.CoreLib/src/System/Threading/Thread.CoreCLR.cs:line 47
at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state) in /_/src/libraries/System.Private.CoreLib/src/System/Threading/ExecutionContext.cs:line 201
Fatal error. Internal CLR error. (0x80131506 (COR_E_EXECUTIONENGINE))
@jkotas just curious as an aside, is it expected that the CLR should exit with Internal CLR error. (0x80131506 (COR_E_EXECUTIONENGINE))
when user code has simply thrown an exception on a background thread?
That is not expected either. It may be a side effect of the original problem, or a different bug.
It seems like we did not collect core file despite 'Core dumped' message.
cc: @epananth @MattGal in case I'm mistaken.
From the log for the above work item, no file was found in the /cores folder when the work item finished.
2020-02-24T18:57:27.719Z INFO io(27) copy_tree_to b"Copying '/tmp/helix/working/AC3109A7/w/A3D808CF/u/zh-Hant/Microsoft.VisualStudio.TestPlatform.ObjectModel.resources.dll' to '/tmp/helix/working/AC3109A7/w/A3D808CF/e/zh-Hant/Microsoft.VisualStudio.TestPlatform.ObjectModel.resources.dll'"
2020-02-24T18:57:27.720Z INFO executor(109) run Successfully downloaded work item payload
2020-02-24T18:57:27.720Z INFO executor(631) _execute_command running $HELIX_CORRELATION_PAYLOAD/scripts/27f91e9998934da69568877a9a85d01a/execute.sh in /tmp/helix/working/AC3109A7/w/A3D808CF/e max 900 seconds
2020-02-24T18:57:38.713Z INFO executor(608) _dump_file_upload No dump files to upload
2020-02-24T18:57:38.715Z INFO executor(721) _upload_files File ready to be upload : console.ce2942ec.log
2020-02-24T18:57:38.715Z INFO executor(690) _upload_single_file Uploading the file console.ce2942ec.log
We can investigate the specific machine in question (dci-mac-build-012) but if dumps work on it for unhandled exceptions we have to consider the other possibilities.
I spent some time checking this out with @epananth and it seems the helix-runner user just didn't have permissions to the /cores folder on that machine. We're checking it out now.
@wfurt After updating the permissions, we are able to create core dumps in the /cores folder. macOS 10.13 did not come with a /cores directory, so we had to add it manually and give it the right permissions. Unfortunately this machine did not have the right permissions; I think that is the reason this failed.
Thanks for the update @MattGal & @epananth. I'm positive there will be crashes in the future :)
Also spoke to @ilyas1974; he said this particular machine, dci-mac-build-012, was recently updated, so chances are some setup scripts were missed. But we do make that change for all 10.13 machines.
I think you are confusing me with someone else; sorry, I don't host mac-build machines.
Should be @ilyas1974; sorry, @ilya-bin.
Haven't seen this in the last 100 builds. Removing "blocking-clean-ci". If this comes up again we should consider disabling the test until we get to the bottom of it, to help get PRs / CI off the floor.
Just happened again: https://dev.azure.com/dnceng/public/_build/results?buildId=562826&view=logs
/private/tmp/helix/working/B7480941/w/A8CA0943/e /private/tmp/helix/working/B7480941/w/A8CA0943/e
Discovering: System.IO.FileSystem.Watcher.Tests (method display = ClassAndMethod, method display options = None)
Discovered: System.IO.FileSystem.Watcher.Tests (found 151 of 184 test cases)
Starting: System.IO.FileSystem.Watcher.Tests (parallel test collections = on, max threads = 4)
Unhandled exception. System.ObjectDisposedException: Safe handle has been closed.
Object name: 'SafeHandle'.
at System.Runtime.InteropServices.SafeHandle.DangerousAddRef(Boolean& success) in /_/src/libraries/System.Private.CoreLib/src/System/Runtime/InteropServices/SafeHandle.cs:line 149
at System.StubHelpers.StubHelpers.SafeHandleAddRef(SafeHandle pHandle, Boolean& success) in /_/src/coreclr/src/System.Private.CoreLib/src/System/StubHelpers.cs:line 1747
at Interop.EventStream.FSEventStreamUnscheduleFromRunLoop(SafeEventStreamHandle streamRef, IntPtr runLoop, SafeCreateHandle runLoopMode)
at System.IO.FileSystemWatcher.RunningInstance.StaticWatcherRunLoopManager.UnscheduleFromRunLoop(SafeEventStreamHandle eventStream) in /_/src/libraries/System.IO.FileSystem.Watcher/src/System/IO/FileSystemWatcher.OSX.cs:line 220
at System.IO.FileSystemWatcher.RunningInstance.CancellationCallback() in /_/src/libraries/System.IO.FileSystem.Watcher/src/System/IO/FileSystemWatcher.OSX.cs:line 278
at System.IO.FileSystemWatcher.RunningInstance.FileSystemEventCallback(IntPtr streamRef, IntPtr clientCallBackInfo, IntPtr numEvents, Byte** eventPaths, FSEventStreamEventFlags* eventFlags, UInt64* eventIds) in /_/src/libraries/System.IO.FileSystem.Watcher/src/System/IO/FileSystemWatcher.OSX.cs:line 370
at Interop.RunLoop.CFRunLoopRun()
at System.IO.FileSystemWatcher.RunningInstance.StaticWatcherRunLoopManager.WatchForFileSystemEventsThreadStart(ManualResetEventSlim runLoopStarted, SafeEventStreamHandle eventStream) in /_/src/libraries/System.IO.FileSystem.Watcher/src/System/IO/FileSystemWatcher.OSX.cs:line 251
at System.IO.FileSystemWatcher.RunningInstance.StaticWatcherRunLoopManager.<>c.<ScheduleEventStream>b__3_0(Object args) in /_/src/libraries/System.IO.FileSystem.Watcher/src/System/IO/FileSystemWatcher.OSX.cs:line 204
at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state) in /_/src/libraries/System.Private.CoreLib/src/System/Threading/ExecutionContext.cs:line 201
Fatal error. Internal CLR error. (0x80131506 (COR_E_EXECUTIONENGINE))
Repeat 2 times:
--------------------------------
at Interop+RunLoop.CFRunLoopRun()
--------------------------------
at System.IO.FileSystemWatcher+RunningInstance+StaticWatcherRunLoopManager.WatchForFileSystemEventsThreadStart(System.Threading.ManualResetEventSlim, Microsoft.Win32.SafeHandles.SafeEventStreamHandle)
at System.IO.FileSystemWatcher+RunningInstance+StaticWatcherRunLoopManager+<>c.<ScheduleEventStream>b__3_0(System.Object)
at System.Threading.ExecutionContext.RunInternal(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object)
This time there is a core dump: https://helix.dot.net/api/2019-06-17/jobs/b62f6bde-4acb-47a1-8375-a2905024d639/workitems/System.IO.FileSystem.Watcher.Tests/files/core.34665
@wfurt seems you were looking for a dump here?
I did. I can see the handle was closed, but that does not tell me why. We could put in some band-aid, but it would be nice to figure out what is really going on. I was not able to reproduce it, though I only put in a small effort.
Happened again in https://dev.azure.com/dnceng/public/_build/results?buildId=589913&view=ms.vss-test-web.build-test-results-tab&runId=18593304&resultId=172079&paneView=attachments
Configuration: netcoreapp5.0-OSX-Debug-x64-CoreCLR_checked-OSX.1013.Amd64.Open
Discovering: System.IO.FileSystem.Watcher.Tests (method display = ClassAndMethod, method display options = None)
Discovered: System.IO.FileSystem.Watcher.Tests (found 151 of 184 test cases)
Starting: System.IO.FileSystem.Watcher.Tests (parallel test collections = on, max threads = 4)
Unhandled exception. System.ObjectDisposedException: Safe handle has been closed.
Object name: 'SafeHandle'.
at System.Runtime.InteropServices.SafeHandle.DangerousAddRef(Boolean& success) in /_/src/libraries/System.Private.CoreLib/src/System/Runtime/InteropServices/SafeHandle.cs:line 149
at System.StubHelpers.StubHelpers.SafeHandleAddRef(SafeHandle pHandle, Boolean& success) in /_/src/coreclr/src/System.Private.CoreLib/src/System/StubHelpers.cs:line 1758
at Interop.EventStream.FSEventStreamUnscheduleFromRunLoop(SafeEventStreamHandle streamRef, IntPtr runLoop, SafeCreateHandle runLoopMode)
at System.IO.FileSystemWatcher.RunningInstance.StaticWatcherRunLoopManager.UnscheduleFromRunLoop(SafeEventStreamHandle eventStream) in /_/src/libraries/System.IO.FileSystem.Watcher/src/System/IO/FileSystemWatcher.OSX.cs:line 220
at System.IO.FileSystemWatcher.RunningInstance.CancellationCallback() in /_/src/libraries/System.IO.FileSystem.Watcher/src/System/IO/FileSystemWatcher.OSX.cs:line 278
at System.IO.FileSystemWatcher.RunningInstance.FileSystemEventCallback(IntPtr streamRef, IntPtr clientCallBackInfo, IntPtr numEvents, Byte** eventPaths, FSEventStreamEventFlags* eventFlags, UInt64* eventIds) in /_/src/libraries/System.IO.FileSystem.Watcher/src/System/IO/FileSystemWatcher.OSX.cs:line 370
console.352e39f2.log: https://helix.dot.net/api/2019-06-17/jobs/55d612f0-35cc-408f-b401-7a7e0a381703/workitems/System.IO.FileSystem.Watcher.Tests/files/console.352e39f2.log
FWIW, @ViktorHofer's PR is the only time I see this failure in the last 100 builds. So it does look fairly infrequent, but it also seems we have a fix in the works here.
I got this same error on my PR today - re-opening the issue to track the CI failure.
/private/tmp/helix/working/A517094C/w/B80409AD/e /private/tmp/helix/working/A517094C/w/B80409AD/e
Discovering: System.IO.FileSystem.Watcher.Tests (method display = ClassAndMethod, method display options = None)
Discovered: System.IO.FileSystem.Watcher.Tests (found 150 of 183 test cases)
Starting: System.IO.FileSystem.Watcher.Tests (parallel test collections = on, max threads = 4)
Unhandled exception. System.ObjectDisposedException: Safe handle has been closed.
Object name: 'SafeHandle'.
at System.Runtime.InteropServices.SafeHandle.DangerousAddRef(Boolean& success) in /_/src/libraries/System.Private.CoreLib/src/System/Runtime/InteropServices/SafeHandle.cs:line 149
at System.StubHelpers.StubHelpers.SafeHandleAddRef(SafeHandle pHandle, Boolean& success) in /_/src/coreclr/src/System.Private.CoreLib/src/System/StubHelpers.cs:line 1758
at Interop.EventStream.FSEventStreamUnscheduleFromRunLoop(SafeEventStreamHandle streamRef, IntPtr runLoop, SafeCreateHandle runLoopMode)
at System.IO.FileSystemWatcher.RunningInstance.StaticWatcherRunLoopManager.UnscheduleFromRunLoop(SafeEventStreamHandle eventStream) in /_/src/libraries/System.IO.FileSystem.Watcher/src/System/IO/FileSystemWatcher.OSX.cs:line 226
at System.IO.FileSystemWatcher.RunningInstance.CleanupEventStream() in /_/src/libraries/System.IO.FileSystem.Watcher/src/System/IO/FileSystemWatcher.OSX.cs:line 282
at System.IO.FileSystemWatcher.RunningInstance.FileSystemEventCallback(IntPtr streamRef, IntPtr clientCallBackInfo, IntPtr numEvents, Byte** eventPaths, FSEventStreamEventFlags* eventFlags, UInt64* eventIds) in /_/src/libraries/System.IO.FileSystem.Watcher/src/System/IO/FileSystemWatcher.OSX.cs:line 358
./RunTests.sh: line 161: 69523 Abort trap: 6 (core dumped) "$RUNTIME_PATH/dotnet" exec --runtimeconfig System.IO.FileSystem.Watcher.Tests.runtimeconfig.json --depsfile System.IO.FileSystem.Watcher.Tests.deps.json xunit.console.dll System.IO.FileSystem.Watcher.Tests.dll -xml testResults.xml -nologo -nocolor -notrait category=IgnoreForCI -notrait category=OuterLoop -notrait category=failing $RSP_FILE
/private/tmp/helix/working/A517094C/w/B80409AD/e
----- end Tue May 19 13:26:13 PDT 2020 ----- exit code 134 ----------------------------------------------------------
@jaredpar, do you know if we have seen this at all in recent history other than in @eerhardt's example?
@alnikola may have the answer as well. We can probably just count all unfinished test runs, since I'm not aware of any other issue causing crashes.
From looking at the code, I have a hypothesis for how this is happening. Different reason from the issue I previously fixed, but same symptom.
When the FileSystemWatcher is started on OSX, it creates a new instance of RunningInstance and Starts it: https://github.com/dotnet/runtime/blob/8f929a123fcd9b9c8d7841a26ff87e4843c8ffa5/src/libraries/System.IO.FileSystem.Watcher/src/System/IO/FileSystemWatcher.OSX.cs#L52-L53 The Start method creates this event stream SafeHandle and stores it into a field on the instance: https://github.com/dotnet/runtime/blob/8f929a123fcd9b9c8d7841a26ff87e4843c8ffa5/src/libraries/System.IO.FileSystem.Watcher/src/System/IO/FileSystemWatcher.OSX.cs#L311 But... it appears nothing is rooting this RunningInstance instance. That _eventStream is passed to ScheduleEventStream: https://github.com/dotnet/runtime/blob/8f929a123fcd9b9c8d7841a26ff87e4843c8ffa5/src/libraries/System.IO.FileSystem.Watcher/src/System/IO/FileSystemWatcher.OSX.cs#L324 but if the event loop is already running, servicing another FileSystemWatcher, then nothing will root _eventStream, either: https://github.com/dotnet/runtime/blob/8f929a123fcd9b9c8d7841a26ff87e4843c8ffa5/src/libraries/System.IO.FileSystem.Watcher/src/System/IO/FileSystemWatcher.OSX.cs#L196-L202 So, we could end up in a situation where the FileSystemWatcher gets dropped, and with nothing rooting the SafeHandle, it gets finalized. 
But then when the FileSystemEventCallback associated with this RunningInstance is invoked (which is dubious in itself, given that nothing is keeping alive the delegate that was handed out to the native function), it can't find the watcher via its weak reference, and it proceeds to call CleanupEventStream: https://github.com/dotnet/runtime/blob/8f929a123fcd9b9c8d7841a26ff87e4843c8ffa5/src/libraries/System.IO.FileSystem.Watcher/src/System/IO/FileSystemWatcher.OSX.cs#L356-L360 which will in turn try to use the _eventStream: https://github.com/dotnet/runtime/blob/8f929a123fcd9b9c8d7841a26ff87e4843c8ffa5/src/libraries/System.IO.FileSystem.Watcher/src/System/IO/FileSystemWatcher.OSX.cs#L269-L286 It doesn't matter that my previous fix successfully ensures that Close can be called on the SafeHandle only by the thread that successfully swapped it out of _eventStream, because the race is actually with finalization. Hooray for resurrection :(
That's my working theory, at least. The fix would be to root things appropriately with a GCHandle, and ensure as well that something with a finalizer is able to free that GCHandle (i.e. the GCHandle doesn't end up indirectly rooting itself). I'm slightly skeptical of my theory only because I'd have expected things to go bad in multiple ways, not just with regard to this SafeHandle being used after disposal, but maybe it just takes so many things lining up correctly that we don't see it that often.
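The theory is easier to see in a deliberately simplified sketch. This is not the actual .NET code: it is a Python analogue with hypothetical names, in which a simulated finalization pass closes the handle of the unrooted watcher before the "native" callback fires.

```python
import weakref

class Handle:
    """Stand-in for the event-stream SafeHandle."""
    def __init__(self):
        self.closed = False
    def close(self):
        self.closed = True
    def use(self):
        if self.closed:
            raise ValueError("Safe handle has been closed")

class Watcher:
    """Stand-in for RunningInstance; nothing roots it after start()."""
    def __init__(self):
        self.handle = Handle()

finalizer_queue = []

def start(watcher):
    weak = weakref.ref(watcher)   # the run loop holds only a weak reference
    handle = watcher.handle       # native side keeps a raw, non-rooting ref
    # Simulate .NET finalization: once the watcher becomes unreachable, its
    # handle is finalized (closed) even though native code can still call back.
    finalizer_queue.append(handle.close)
    def native_callback():
        if weak() is None:        # watcher collected -> try to clean up
            handle.use()          # raises: handle was already finalized
    return native_callback

cb = start(Watcher())             # the Watcher is never stored anywhere
for finalize in finalizer_queue:  # simulated GC finalization pass
    finalize()
try:
    cb()
except ValueError as e:
    print(e)                      # Safe handle has been closed
```

The fix direction described above (rooting via a GCHandle that a finalizable owner later frees) amounts to making sure the handle stays alive for as long as the native callback can still fire.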
@jaredpar, do you know if we have seen this at all in recent history other than in @eerhardt's example?
I took a peek:
WorkItems
| where Started > ago(30d)
| where QueueName =~ "osx.1013.amd64.open"
| where FriendlyName == "System.IO.FileSystem.Watcher.Tests"
| where ExitCode == 134
If we assume all SIGABRTs are this failure (which is the worst case), then checking Kusto, this has only happened 15 times out of ~5600 runs of this specific test in the past 30 days, across pretty much every machine in the OSX 10.13 queue, and there are never more than 2 in a single day. Whatever it is, it can repro on nearly every machine and is always around.
Would it make sense to add some temporary state variable so we can track operations and verify the theory, @stephentoub? With two crashes per day we should be able to get a core file and verification within a reasonable time.
@MattGal @wfurt @stephentoub I ran the above query, checked console logs for a few result rows, and found at least 2 types of failures:
1. the `System.ObjectDisposedException: Safe handle has been closed` mentioned above
2. a `Debug.Assert` failure in `System.Text.RegularExpressions.RegexCharClass`
That means there is more than one cause for the given ExitCode on osx.1013.amd64.open.
OSX.1013.Amd64.Open-x64:Debug https://mc.dot.net/#/user/dotnet-bot/pr~2Fdotnet~2Fcorefx~2Frefs~2Fpull~2F38953~2Fmerge/test~2Ffunctional~2Fcli~2Finnerloop~2F/20190626.70/workItem/System.IO.FileSystem.Watcher.Tests/wilogs
cc: @JeremyKuhne, @carlossanlop