Closed by stephentoub 3 years ago
Some Debug.Assert failure in System.Text.RegularExpressions.RegexCharClass
What is the assert? Was it recent? Was it on a PR related to regex or was it somewhere else?
That means there is more than one cause for the given ExitCode on osx.1013.amd64.open.
Exit code 134 just means the process crashed, often because of an unhandled exception or a fail-fast from an assert. It maps to SIGABRT (134 = 128 + 6, where 6 is the signal number for SIGABRT).
System.Diagnostics.DebugProvider+DebugAssertException: RegexCharClass s_propTable is out of order at (IsAlphabeticPresentationForms, IsArabic)
Was that on a PR, and if so, which one?
Yes, it failed on refs/pull/402/merge
on 2020-05-10
I used this query:
WorkItems
| where FriendlyName == "System.IO.FileSystem.Watcher.Tests"
and ExitCode == 134
and Started > ago(30d)
and QueueName =~ "osx.1013.amd64.open"
| join kind=inner Jobs on JobId
| order by Started desc
Thanks. That's a draft PR that's changing the implementation of string comparisons: https://github.com/dotnet/runtime/pull/402 It's not a real issue to be investigating. Tons of tests are broken in that PR.
We've started hitting this ObjectDisposedException in AspNetCore
do you have core files @BrennanConroy ?
This seems like it could also be the cause of https://github.com/dotnet/runtime/issues/37629. @stephentoub what do you think?
@ericstj Can we move this back to 5.0, please?
This seems like it could also be the cause of #37629. @stephentoub what do you think?
It could be. It's hard to say for sure, but this could be evidence of the portion of my comment at https://github.com/dotnet/runtime/issues/30056#issuecomment-631099004: "But then the FileSystemEventCallback associated with this RunningInstance is invoked (which is dubious if nothing is keeping alive the delegate that was handed out to the native function)".
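A minimal sketch of that hazard in isolation, with hypothetical names (RegisterWithNative and somenativelib are illustrative, not the actual FSEvents interop): a delegate marshaled to native code gets collected because nothing managed keeps it alive, while native code still holds the function pointer.

```csharp
using System;
using System.Runtime.InteropServices;

class CallbackLifetimeSketch
{
    private delegate void NativeCallback(IntPtr context);

    [DllImport("somenativelib")] // hypothetical native library
    private static extern void RegisterWithNative(IntPtr functionPointer);

    public static void Register()
    {
        // BUG: 'callback' is only referenced by this local. Once Register
        // returns, the GC is free to collect the delegate, yet native code
        // may still invoke the function pointer later.
        NativeCallback callback = ctx => Console.WriteLine("event fired");
        IntPtr fp = Marshal.GetFunctionPointerForDelegate(callback);
        RegisterWithNative(fp);

        // Fix: keep a strong reference (an instance field, a static field,
        // or a GCHandle) alive for as long as native code can call back.
    }
}
```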
@stephentoub I agree with your assessment, I think this should be fixed for reliability. @carlossanlop we should try to fit this in 5.0 if possible.
One trick we can use to repro this is to force the scenario @stephentoub describes and manually invoke GC.Collect() + GC.WaitForPendingFinalizers(). This can force the case where we are executing an instance method of a collected object, as sketched below.
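A rough sketch of that repro idea, assuming a watcher whose only reference is dropped before a forced collection (the race is timing-dependent, so it may take several runs to trigger):

```csharp
using System;
using System.IO;

string path = Path.Combine(Path.GetTempPath(), Path.GetRandomFileName());
Directory.CreateDirectory(path);

FileSystemWatcher? watcher = new FileSystemWatcher(path) { EnableRaisingEvents = true };
watcher = null; // drop the only reference; in Release builds nothing roots it now

GC.Collect();                  // collect the unrooted watcher graph
GC.WaitForPendingFinalizers(); // run finalizers, which dispose the SafeHandles
GC.Collect();                  // reclaim the finalized objects

// If the native FSEvents callback still targets the collected RunningInstance,
// the use-after-dispose can now surface.
File.WriteAllText(Path.Combine(path, "touch.txt"), "x");
```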
Failed again in https://github.com/dotnet/runtime/pull/44548
/private/tmp/helix/working/9DDF08A4/w/BF320A67/e /private/tmp/helix/working/9DDF08A4/w/BF320A67/e
Discovering: System.IO.FileSystem.Watcher.Tests (method display = ClassAndMethod, method display options = None)
Discovered: System.IO.FileSystem.Watcher.Tests (found 150 of 183 test cases)
Starting: System.IO.FileSystem.Watcher.Tests (parallel test collections = on, max threads = 4)
Unhandled exception. System.ObjectDisposedException: Safe handle has been closed.
Object name: 'SafeHandle'.
at System.Runtime.InteropServices.SafeHandle.DangerousAddRef(Boolean& success) in /_/src/libraries/System.Private.CoreLib/src/System/Runtime/InteropServices/SafeHandle.cs:line 148
at System.StubHelpers.StubHelpers.SafeHandleAddRef(SafeHandle pHandle, Boolean& success) in /_/src/coreclr/src/System.Private.CoreLib/src/System/StubHelpers.cs:line 1255
at Interop.EventStream.FSEventStreamUnscheduleFromRunLoop(SafeEventStreamHandle streamRef, IntPtr runLoop, SafeCreateHandle runLoopMode)
at System.IO.FileSystemWatcher.RunningInstance.StaticWatcherRunLoopManager.UnscheduleFromRunLoop(SafeEventStreamHandle eventStream) in /_/src/libraries/System.IO.FileSystem.Watcher/src/System/IO/FileSystemWatcher.OSX.cs:line 221
at System.IO.FileSystemWatcher.RunningInstance.CleanupEventStream() in /_/src/libraries/System.IO.FileSystem.Watcher/src/System/IO/FileSystemWatcher.OSX.cs:line 277
at System.IO.FileSystemWatcher.RunningInstance.FileSystemEventCallback(IntPtr streamRef, IntPtr clientCallBackInfo, IntPtr numEvents, Byte** eventPaths, FSEventStreamEventFlags* eventFlags, UInt64* eventIds) in /_/src/libraries/System.IO.FileSystem.Watcher/src/System/IO/FileSystemWatcher.OSX.cs:line 353
./RunTests.sh: line 161: 22239 Abort trap: 6 (core dumped) "$RUNTIME_PATH/dotnet" exec --runtimeconfig System.IO.FileSystem.Watcher.Tests.runtimeconfig.json --depsfile System.IO.FileSystem.Watcher.Tests.deps.json xunit.console.dll
Plus instructions on how to debug it: https://helixre107v0xdeko0k025g8.blob.core.windows.net/dotnet-runtime-refs-pull-44548-merge-776b26492e0f484d90/System.IO.FileSystem.Watcher.Tests/how-to-debug-dump.md?sv=2019-07-07&se=2020-12-01T20%3A53%3A20Z&sr=c&sp=rl&sig=%2F9zqfSrsTS2iGkGZYGFeCnbLFQn6fIF3Phbu0sxP3rg%3D
Is someone following up on the SafeHandle ObjectDisposedException? Happened again in https://helixre107v0xdeko0k025g8.blob.core.windows.net/dotnet-runtime-refs-pull-44835-merge-c7c5501b02fc4bc9b9/System.IO.FileSystem.Watcher.Tests/console.16d3d036.log?sv=2019-07-07&se=2020-12-07T23%3A20%3A58Z&sr=c&sp=rl&sig=MI6jI0MFjRG2yXR5cnPvbQflr6ldFxG5gB2o425JXYo%3D.
I'm planning to take a look, but it is not at the top of my priority list right now. I downloaded the core dump @safern posted and briefly looked at it to verify that I can.
Happened again with dump and all:
Looking at runfo and only into rolling builds, this failed in 3 builds in the last 14 days: https://runfo.azurewebsites.net/search/tests/?bq=definition%3Aruntime+started%3A%7E14+kind%3Arolling&tq=name%3A%22System.IO.FileSystem.Watcher.Tests%22
It is happening ~8 times a month, across all macOS versions...
WorkItems
| where Finished > now(-90d)
| where FriendlyName startswith "System.IO.FileSystem.Watcher.Tests"
| where ExitCode == 134
| join kind=inner Jobs on JobId
| where Branch == "refs/heads/master"
| sort by Finished desc
| summarize count(), any(ConsoleUri) by bin(Finished, 1d), ExitCode//, QueueName
| render columnchart with (ycolumns = count_, xcolumn = Finished, series=ExitCode, yaxis=log) //, QueueName
Happened again on https://dev.azure.com/dnceng/public/_build/results?buildId=998815&view=ms.vss-test-web.build-test-results-tab&runId=31247430&resultId=175805&paneView=attachments with dump and all.
@wfurt have you had a chance to look into the dump?
Another occurrence in https://dev.azure.com/dnceng/public/_build/results?buildId=1014122 with the dump and log and instructions.
Converted into a live tracking issue. It seems to still be happening quite often (twice per day, if the data is right). As this has been happening for months, I hope someone can take a look soon.
Including PRs, this is now failing 2 days out of 3, using my query above with the branch filter removed. That's too high to postpone indefinitely, even if there aren't any customer reports.
@wfurt you were looking at it, do you plan to look at a dump? I know you're heavily booked.
@stephentoub from the hypothesis:

The fix would be to root things appropriately with a GCHandle, and ensure as well that something with a finalizer is able to free that GCHandle (i.e. the GCHandle doesn't end up indirectly rooting itself).

I can give this a try. Could you please confirm the following assumptions I'm making?

- Root the RunningInstance using GCHandle.Alloc with GCHandleType.Normal, not GCHandleType.Pinned, because only one object (the FileSystemWatcher instance itself) needs to know about the rooted RunningInstance.
- The GCHandle field would live in FileSystemWatcher, but would only exist in the OSX version.
- Free it in FinalizeDispose and in StopRaisingEvents (if IsAllocated == true).
- Allocate it in StartRaisingEvents, but make sure it's not allocated yet (otherwise, throw) because that would mean the user called StartRaisingEvents more than once.

I'm slightly skeptical of my theory only because I'd have expected things to go bad in multiple ways, and not just with regard to this SafeHandle being used after disposal, but maybe it just takes so many things to be lined up correctly that we just don't see it that often.

Can you please share any other concerns that you have about this theory? (A sketch of the plan follows below.)
Edit: I have a draft PR to test the above: https://github.com/dotnet/runtime/pull/51538
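A minimal, self-contained sketch of the rooting plan above, assuming hypothetical type and member names (the real code lives in FileSystemWatcher.OSX.cs and differs in detail):

```csharp
using System;
using System.Runtime.InteropServices;

// Stand-in for the real nested RunningInstance class in FileSystemWatcher.OSX.cs.
sealed class RunningInstance { }

sealed class WatcherSketch : IDisposable
{
    private GCHandle _instanceHandle; // would be an OSX-only field on FileSystemWatcher

    public void StartRaisingEvents()
    {
        // Already allocated means the user started the watcher twice: throw.
        if (_instanceHandle.IsAllocated)
            throw new InvalidOperationException("Watcher is already running.");

        // Normal, not Pinned: we only need to keep the instance alive;
        // native code never holds a raw pointer into the object's memory.
        _instanceHandle = GCHandle.Alloc(new RunningInstance(), GCHandleType.Normal);
    }

    public void StopRaisingEvents()
    {
        if (_instanceHandle.IsAllocated)
            _instanceHandle.Free();
    }

    // The Dispose/finalizer path (FinalizeDispose in the real code) must also
    // free the handle, so the GCHandle cannot end up rooting itself forever.
    public void Dispose()
    {
        StopRaisingEvents();
        GC.SuppressFinalize(this);
    }

    ~WatcherSketch() => StopRaisingEvents();
}
```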
While this is on my radar, I have not had a chance to put any real effort into this, @danmoseley. I made a few attempts in the past to reproduce it locally, but I was never able to trigger it.
@wfurt thanks, no problem. :)
Happened again in https://dev.azure.com/dnceng/public/_build/results?buildId=1101792&view=ms.vss-test-web.build-test-results-tab&runId=33675932&resultId=185078&paneView=attachments
core.19320:
https://helixre107v0xdeko0k025g8.blob.core.windows.net/dotnet-runtime-refs-pull-51617-merge-d28ed1ecfe424ac9b6/System.IO.FileSystem.Watcher.Tests/core.19320?sv=2019-07-07&se=2021-05-12T12%3A26%3A15Z&sr=c&sp=rl&sig=QGBGMcJGRSVGl7mLjeQ0%2FO6G9BvH3KHvLGh8mZ3gZhI%3D
how-to-debug-dump.md:
https://helixre107v0xdeko0k025g8.blob.core.windows.net/dotnet-runtime-refs-pull-51617-merge-d28ed1ecfe424ac9b6/System.IO.FileSystem.Watcher.Tests/how-to-debug-dump.md?sv=2019-07-07&se=2021-05-12T12%3A26%3A15Z&sr=c&sp=rl&sig=QGBGMcJGRSVGl7mLjeQ0%2FO6G9BvH3KHvLGh8mZ3gZhI%3D
console.23cebcee.log:
https://helixre107v0xdeko0k025g8.blob.core.windows.net/dotnet-runtime-refs-pull-51617-merge-d28ed1ecfe424ac9b6/System.IO.FileSystem.Watcher.Tests/console.23cebcee.log?sv=2019-07-07&se=2021-05-12T12%3A26%3A15Z&sr=c&sp=rl&sig=QGBGMcJGRSVGl7mLjeQ0%2FO6G9BvH3KHvLGh8mZ3gZhI%3D
@carlossanlop how'd your test go?
The CI passed in my test, but I'm not 100% confident that this is the correct fix. Can I please get a review?
Unsure if this is the same issue. I see a NullReferenceException in FileSystemWatcher.RunningInstance.FileSystemEventCallback:
Unhandled Exception:
System.NullReferenceException: Object reference not set to an instance of an object
at System.IO.FileSystemWatcher.RunningInstance.FileSystemEventCallback(IntPtr streamRef, IntPtr clientCallBackInfo, IntPtr numEvents, Byte** eventPaths, FSEventStreamEventFlags* eventFlags, UInt64* eventIds) in /_/src/libraries/System.IO.FileSystem.Watcher/src/System/IO/FileSystemWatcher.OSX.cs:line 354
at System.IO.FileSystemWatcher.RunningInstance.StaticWatcherRunLoopManager.WatchForFileSystemEventsThreadStart(ManualResetEventSlim runLoopStarted, SafeEventStreamHandle eventStream) in /_/src/libraries/System.IO.FileSystem.Watcher/src/System/IO/FileSystemWatcher.OSX.cs:line 255
at System.IO.FileSystemWatcher.RunningInstance.StaticWatcherRunLoopManager.<>c.<ScheduleEventStream>b__3_0(Object args) in /_/src/libraries/System.IO.FileSystem.Watcher/src/System/IO/FileSystemWatcher.OSX.cs:line 205
at System.Threading.Thread.StartCallback() in /_/src/mono/System.Private.CoreLib/src/System/Threading/Thread.Mono.cs:line 236
Configuration: net6.0-OSX-Debug-x64-Mono_release-OSX.1015.Amd64.Open
@ViktorHofer not sure that's the same thing, but I don't know why it could be either -- there is no obvious way that code can ever NRE: https://github.com/dotnet/runtime/blob/e7b743cae325d4720fceaf7f84f68aaaf7bb7dbe/src/libraries/System.IO.FileSystem.Watcher/src/System/IO/FileSystemWatcher.OSX.cs#L354
I see it segfaulted, though -- is that what Mono test runs do when there's an unhandled exception that brings down the process (like in this case), or is it a separate segfault?
Oh, I see @jkotas you believe this is a dupe?
I think so.
@jkotas - I see the FileWatcher tests failing on OSX in this PR:
https://github.com/dotnet/runtime/pull/52252/checks?check_run_id=2502547808
runtime (Libraries Test Run release coreclr OSX x64 Debug) .packages/microsoft.dotnet.helix.sdk/6.0.0-beta.21222.1/tools/Microsoft.DotNet.Helix.Sdk.MultiQueue.targets#L78 .packages/microsoft.dotnet.helix.sdk/6.0.0-beta.21222.1/tools/Microsoft.DotNet.Helix.Sdk.MultiQueue.targets(78,5): error : (NETCORE_ENGINEERING_TELEMETRY=Test) Work item System.IO.FileSystem.Watcher.Tests in job d239ae94-91a2-4a9c-b9e1-2191863d7385 has failed. Failure log: https://helix.dot.net/api/2019-06-17/jobs/d239ae94-91a2-4a9c-b9e1-2191863d7385/workitems/System.IO.FileSystem.Watcher.Tests/console
/private/tmp/helix/working/AB6B08F2/w/A87108BA/e /private/tmp/helix/working/AB6B08F2/w/A87108BA/e
Discovering: System.IO.FileSystem.Watcher.Tests (method display = ClassAndMethod, method display options = None)
Discovered: System.IO.FileSystem.Watcher.Tests (found 149 of 183 test cases)
Starting: System.IO.FileSystem.Watcher.Tests (parallel test collections = on, max threads = 4)
Process terminated. Assertion failed.
at System.IO.FileSystemWatcher.RunningInstance.FileSystemEventCallback(IntPtr streamRef, IntPtr clientCallBackInfo, IntPtr numEvents, Byte** eventPaths, FSEventStreamEventFlags* eventFlags, UInt64* eventIds) in /_/src/libraries/System.IO.FileSystem.Watcher/src/System/IO/FileSystemWatcher.OSX.cs:line 386
at Interop.RunLoop.CFRunLoopRun()
at Interop.RunLoop.CFRunLoopRun()
at System.IO.FileSystemWatcher.RunningInstance.StaticWatcherRunLoopManager.WatchForFileSystemEventsThreadStart(ManualResetEventSlim runLoopStarted, SafeEventStreamHandle eventStream) in /_/src/libraries/System.IO.FileSystem.Watcher/src/System/IO/FileSystemWatcher.OSX.cs:line 254
at System.IO.FileSystemWatcher.RunningInstance.StaticWatcherRunLoopManager.<>c.<ScheduleEventStream>b__3_0(Object args) in /_/src/libraries/System.IO.FileSystem.Watcher/src/System/IO/FileSystemWatcher.OSX.cs:line 204
at System.Threading.Thread.StartCallback() in /_/src/coreclr/System.Private.CoreLib/src/System/Threading/Thread.CoreCLR.cs:line 105
./RunTests.sh: line 162: 4690 Abort trap: 6 (core dumped) "$RUNTIME_PATH/dotnet" exec --runtimeconfig System.IO.FileSystem.Watcher.Tests.runtimeconfig.json --depsfile System.IO.FileSystem.Watcher.Tests.deps.json xunit.console.dll System.IO.FileSystem.Watcher.Tests.dll -xml testResults.xml -nologo -nocolor -notrait category=IgnoreForCI -notrait category=OuterLoop -notrait category=failing $RSP_FILE
/private/tmp/helix/working/AB6B08F2/w/A87108BA/e
----- end Tue May 4 09:55:46 PDT 2021 ----- exit code 134 ----------------------------------------------------------
exit code 134 means SIGABRT Abort. Managed or native assert, or runtime check such as heap corruption, caused call to abort(). Core dumped.
ulimit -c value: unlimited
Is that this same issue? I hope we got a dump of the process so we can figure the issue out.
Yes, it is the same issue.
This assert says that the file watcher callback can be called even after the EventStream was stopped using FSEventStreamStop. So we have to be robust against that situation and use a different scheme to perform the cleanup.
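One possible robust scheme, shown here as a sketch rather than the actual fix: hand the native stream a GCHandle as its context and have the static callback tolerate the managed state already being gone. All names below (WatcherState, CreateContext, FreeContext) are illustrative:

```csharp
using System;
using System.Runtime.InteropServices;

static class RobustCallbackSketch
{
    internal sealed class WatcherState
    {
        public void ProcessEvents() { /* consume the event batch */ }
    }

    // Registration side: hand native code a stable token, not the object itself.
    // Weak, so the callback observes null once the state has been collected.
    internal static IntPtr CreateContext(WatcherState state) =>
        GCHandle.ToIntPtr(GCHandle.Alloc(state, GCHandleType.Weak));

    // Callback side: recover the state, tolerating "already gone".
    internal static void FileSystemEventCallback(IntPtr clientCallBackInfo)
    {
        GCHandle handle = GCHandle.FromIntPtr(clientCallBackInfo);
        if (handle.Target is WatcherState state)
        {
            state.ProcessEvents();
        }
        // else: a late callback arrived after the stream was stopped and the
        // state released; silently ignore it.
    }

    // Cleanup side: free the handle only after the native stream can no
    // longer call back (stop + invalidate + release have all completed).
    internal static void FreeContext(IntPtr clientCallBackInfo) =>
        GCHandle.FromIntPtr(clientCallBackInfo).Free();
}
```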
I hope we got a dump of the process so we can figure the issue out.
@jaredpar - for my education, how do you find that link?
Clicked on the checks tab and went to the AzDO build
The key part there is the build id, 1120772. From there, just ask runfo to dump the data:
https://runfo.azurewebsites.net/view/build/?number=1120772
It will list all of the failed tests, and for the FileWatcher tests there was a heap dump noted and linked.
The link I posted is that one.
Truthfully, I was a bit lazier than that. I didn't click on the checks tab. Really, I just went here:
https://runfo.azurewebsites.net/view/pr-builds/?Repository=runtime&Number=52252
Basically that dumps all of the builds that happened for PR #52252. I happened to know that test ran on the runtime definition, so I clicked here.
That is the link I posted above.
Even easier than that, in the dotnet/runtime repo we generate a file containing commands that you can copy/paste to get from zero to the debugger:
Try it out :)
Haha, can't debug it though, because Windows does not like the core dump here.
Even easier than that, in the dotnet/runtime repo we generate a file containing commands that you can copy/paste to get from zero to the debugger:
From PR to that display is seven clicks; runfo is two. And one of those clicks is a spinning-wheel click, waiting for that display to render :)
Joking aside, I'm glad that display is getting better, but it does take a lot of clicks to get to. It also takes time to render and is not 100% reliable. I've found several times over the years where it won't display anything and errors out.
Another attempt to fix this: #52275
OSX.1013.Amd64.Open-x64:Debug https://mc.dot.net/#/user/dotnet-bot/pr~2Fdotnet~2Fcorefx~2Frefs~2Fpull~2F38953~2Fmerge/test~2Ffunctional~2Fcli~2Finnerloop~2F/20190626.70/workItem/System.IO.FileSystem.Watcher.Tests/wilogs
cc: @JeremyKuhne, @carlossanlop