dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License
15.22k stars 4.72k forks source link

SafeHandle use-after-dispose in FileSystemWatcher on OSX #30056

Closed stephentoub closed 3 years ago

stephentoub commented 5 years ago

OSX.1013.Amd64.Open-x64:Debug https://mc.dot.net/#/user/dotnet-bot/pr~2Fdotnet~2Fcorefx~2Frefs~2Fpull~2F38953~2Fmerge/test~2Ffunctional~2Fcli~2Finnerloop~2F/20190626.70/workItem/System.IO.FileSystem.Watcher.Tests/wilogs

Unhandled exception. System.NullReferenceException: Object reference not set to an instance of an object.
   at Interop.RunLoop.CFRunLoopRun()
   at System.IO.FileSystemWatcher.RunningInstance.StaticWatcherRunLoopManager.WatchForFileSystemEventsThreadStart(Object args) in /_/src/System.IO.FileSystem.Watcher/src/System/IO/FileSystemWatcher.OSX.cs:line 247
   at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state) in /_/src/System.Private.CoreLib/shared/System/Threading/ExecutionContext.cs:line 201
Fatal error. Internal CLR error. (0x80131506)
   at Interop+RunLoop.CFRunLoopRun()
   at Interop+RunLoop.CFRunLoopRun()
   at System.IO.FileSystemWatcher+RunningInstance+StaticWatcherRunLoopManager.WatchForFileSystemEventsThreadStart(System.Object)
   at System.Threading.ExecutionContext.RunInternal(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object)

cc: @JeremyKuhne, @carlossanlop

Runfo Tracking Issue: system.io.filesystem.watcher.tests work item Build Definition Kind Run Name Console Core Dump Test Results Run Client
1082967 runtime PR 51078 net6.0-OSX-Debug-x64-CoreCLR_release-OSX.1014.Amd64.Open console.log core dump runclient.py
1082899 runtime PR 51099 net6.0-Linux-Release-arm-CoreCLR_checked-(Alpine.312.Arm32.Open)Ubuntu.1804.ArmArch.Open@mcr.microsoft.com/dotnet-buildtools/prereqs:alpine-3.12-helix-arm32v7-20200908125213-5bece88 console.log core dump
1082624 runtime PR 51090 net6.0-OSX-Debug-x64-CoreCLR_checked-OSX.1013.Amd64.Open console.log core dump runclient.py
1080937 runtime Rolling net5.0-Linux-Release-arm64-Mono_release-(Ubuntu.1804.ArmArch.Open)Ubuntu.1804.ArmArch.Open@mcr.microsoft.com/dotnet-buildtools/prereqs:ubuntu-16.04-helix-arm64v8-bfcd90a-20200127194925 console.log runclient.py
1072066 runtime PR 50364 net6.0-Linux-Release-arm-CoreCLR_checked-(Alpine.312.Arm32.Open)Ubuntu.1804.ArmArch.Open@mcr.microsoft.com/dotnet-buildtools/prereqs:alpine-3.12-helix-arm32v7-20200908125213-5bece88 console.log core dump runclient.py
1072066 runtime PR 50364 net6.0-Linux-Release-arm-CoreCLR_checked-(Ubuntu.1804.Arm32.Open)Ubuntu.1804.Armarch.Open@mcr.microsoft.com/dotnet-buildtools/prereqs:ubuntu-18.04-helix-arm32v7-bfcd90a-20200121150440 console.log core dump
1071508 runtime PR 50364 net6.0-Linux-Release-arm-CoreCLR_checked-(Alpine.312.Arm32.Open)Ubuntu.1804.ArmArch.Open@mcr.microsoft.com/dotnet-buildtools/prereqs:alpine-3.12-helix-arm32v7-20200908125213-5bece88 console.log core dump
1071508 runtime PR 50364 net6.0-Linux-Release-arm-CoreCLR_checked-(Ubuntu.1804.Arm32.Open)Ubuntu.1804.Armarch.Open@mcr.microsoft.com/dotnet-buildtools/prereqs:ubuntu-18.04-helix-arm32v7-bfcd90a-20200121150440 console.log core dump
1069555 runtime Rolling net6.0-windows-Release-x64-CoreCLR_checked-Windows.10.Amd64.Open console.log runclient.py
1069328 runtime Rolling net6.0-windows-Release-x86-CoreCLR_release-Windows.10.Amd64.ServerRS5.Open console.log runclient.py
1068131 runtime PR 50257 net6.0-OSX-Debug-x64-CoreCLR_checked-OSX.1013.Amd64.Open console.log core dump runclient.py
1066426 runtime PR 50364 net6.0-Linux-Release-arm-CoreCLR_checked-(Alpine.312.Arm32.Open)Ubuntu.1804.ArmArch.Open@mcr.microsoft.com/dotnet-buildtools/prereqs:alpine-3.12-helix-arm32v7-20200908125213-5bece88 console.log core dump
1066426 runtime PR 50364 net6.0-Linux-Release-arm-CoreCLR_checked-(Ubuntu.1804.Arm32.Open)Ubuntu.1804.Armarch.Open@mcr.microsoft.com/dotnet-buildtools/prereqs:ubuntu-18.04-helix-arm32v7-bfcd90a-20200121150440 console.log core dump
1066285 runtime PR 50507 net6.0-OSX-Debug-x64-CoreCLR_checked-OSX.1013.Amd64.Open console.log runclient.py
1065570 runtime PR 50483 net6.0-OSX-Debug-x64-CoreCLR_release-OSX.1014.Amd64.Open console.log core dump runclient.py
1065167 runtime Rolling net6.0-OSX-Release-x64-CoreCLR_release-OSX.1015.Amd64.Open console.log core dump runclient.py
1063683 runtime PR 48527 net6.0-OSX-Debug-x64-CoreCLR_release-OSX.1014.Amd64.Open console.log core dump runclient.py
1059736 runtime PR 50232 net6.0-OSX-Debug-x64-CoreCLR_checked-OSX.1013.Amd64.Open console.log core dump runclient.py
1059736 runtime PR 50232 net6.0-windows-Debug-x64-CoreCLR_checked-Windows.10.Amd64.Open console.log runclient.py
1059736 runtime PR 50232 net6.0-Linux-Debug-x64-CoreCLR_checked-Ubuntu.1804.Amd64.Open console.log core dump runclient.py
1059736 runtime PR 50232 net6.0-Linux-Debug-x64-CoreCLR_checked-(Alpine.312.Amd64.Open)ubuntu.1604.amd64.open@mcr.microsoft.com/dotnet-buildtools/prereqs:alpine-3.12-helix-20200602002622-e06dc59 console.log core dump runclient.py
1058168 runtime PR 50280 net6.0-OSX-Debug-x64-CoreCLR_release-OSX.1014.Amd64.Open console.log core dump runclient.py
1057844 runtime PR 50275 net6.0-OSX-Debug-x64-CoreCLR_release-OSX.1014.Amd64.Open console.log core dump runclient.py
1055598 runtime Rolling net6.0-OSX-Release-x64-CoreCLR_release-OSX.1014.Amd64.Open console.log core dump runclient.py
1050603 runtime PR 49990 net6.0-windows-Debug-x64-CoreCLR_checked-Windows.10.Amd64.Open console.log
1050314 runtime PR 48601 net6.0-windows-Debug-x64-CoreCLR_checked-Windows.10.Amd64.Open console.log runclient.py
1049560 runtime Rolling net6.0-OSX-Release-x64-CoreCLR_release-OSX.1014.Amd64.Open console.log core dump runclient.py
1047468 runtime PR 49545 net6.0-OSX-Debug-x64-CoreCLR_checked-OSX.1013.Amd64.Open console.log core dump runclient.py
1047249 runtime Rolling net6.0-OSX-Release-x64-CoreCLR_release-OSX.1013.Amd64.Open console.log core dump runclient.py
1046026 runtime PR 49740 net6.0-Linux-Debug-x64-mono_interpreter_release-Debian.9.Amd64.Open console.log core dump
1046026 runtime PR 49740 net6.0-Linux-Debug-x64-Mono_release-(Centos.8.Amd64.Open)Ubuntu.1604.Amd64.Open@mcr.microsoft.com/dotnet-buildtools/prereqs:centos-8-helix-20201229003624-c1bf759 console.log core dump
1046026 runtime PR 49740 net6.0-Linux-Debug-x64-Mono_release-RedHat.7.Amd64.Open console.log core dump runclient.py
1046026 runtime PR 49740 net6.0-Linux-Debug-x64-Mono_release-(Debian.10.Amd64.Open)ubuntu.1604.amd64.open@mcr.microsoft.com/dotnet-buildtools/prereqs:debian-10-helix-amd64-bfcd90a-20200121150006 console.log core dump
1046026 runtime PR 49740 net6.0-Linux-Debug-x64-Mono_release-Ubuntu.1604.Amd64.Open console.log core dump
1046026 runtime PR 49740 net6.0-Linux-Debug-x64-Mono_release-Ubuntu.1804.Amd64.Open console.log core dump
1046026 runtime PR 49740 net6.0-Linux-Debug-x64-Mono_release-SLES.15.Amd64.Open console.log core dump
1046026 runtime PR 49740 net6.0-Linux-Debug-x64-Mono_release-(Fedora.30.Amd64.Open)ubuntu.1604.amd64.open@mcr.microsoft.com/dotnet-buildtools/prereqs:fedora-30-helix-20200512010621-4f8cef7 console.log core dump
1046026 runtime PR 49740 net6.0-Linux-Debug-arm64-Mono_release-(Ubuntu.1804.ArmArch.Open)Ubuntu.1804.ArmArch.Open@mcr.microsoft.com/dotnet-buildtools/prereqs:ubuntu-16.04-helix-arm64v8-20210106155927-56c6673 console.log core dump
1046026 runtime PR 49740 net6.0-OSX-Debug-x64-Mono_release-OSX.1014.Amd64.Open console.log core dump
1046026 runtime PR 49740 net6.0-OSX-Debug-x64-Mono_release-OSX.1015.Amd64.Open console.log core dump
1045010 runtime PR 48369 net6.0-OSX-Debug-x64-CoreCLR_checked-OSX.1013.Amd64.Open console.log runclient.py
1044737 runtime PR 49373 net6.0-OSX-Debug-x64-CoreCLR_release-OSX.1015.Amd64.Open console.log core dump runclient.py
1044726 runtime PR 49532 net6.0-OSX-Debug-x64-CoreCLR_release-OSX.1014.Amd64.Open console.log core dump runclient.py
1043361 runtime PR 49747 net6.0-OSX-Debug-x64-CoreCLR_checked-OSX.1013.Amd64.Open console.log core dump runclient.py
1042998 runtime PR 49741 net6.0-OSX-Debug-x64-CoreCLR_release-OSX.1014.Amd64.Open console.log core dump runclient.py
1041823 runtime PR 49251 net6.0-windows-Release-x86-CoreCLR_checked-Windows.10.Amd64.Open console.log runclient.py
1039858 runtime PR 43706 net6.0-Linux-Release-arm-CoreCLR_checked-(Alpine.312.Arm32.Open)Ubuntu.1804.ArmArch.Open@mcr.microsoft.com/dotnet-buildtools/prereqs:alpine-3.12-helix-arm32v7-20200908125213-5bece88 console.log core dump
1039858 runtime PR 43706 net6.0-Linux-Release-arm-CoreCLR_checked-(Ubuntu.1804.Arm32.Open)Ubuntu.1804.Armarch.Open@mcr.microsoft.com/dotnet-buildtools/prereqs:ubuntu-18.04-helix-arm32v7-bfcd90a-20200121150440 console.log core dump
1039185 runtime Rolling net6.0-OSX-Release-x64-CoreCLR_release-OSX.1014.Amd64.Open console.log core dump runclient.py
1038808 runtime PR 49613 net6.0-OSX-Debug-x64-CoreCLR_release-OSX.1014.Amd64.Open console.log core dump runclient.py
1037652 runtime PR 49529 net6.0-OSX-Debug-x64-CoreCLR_checked-OSX.1013.Amd64.Open console.log core dump runclient.py
1037058 runtime PR 41097 net6.0-OSX-Debug-x64-CoreCLR_release-OSX.1015.Amd64.Open console.log core dump runclient.py
Build Result Summary Day Hit Count Week Hit Count Month Hit Count
3 4 33
stephentoub commented 4 years ago

Some Debug.Assert failure in System.Text.RegularExpressions.RegexCharClass

What is the assert? Was it recent? Was it on a PR related to regex or was it somewhere else?

stephentoub commented 4 years ago

That means there is more than 1 cause for the given ExitCode on osx.1013.amd64.open

Exit code 134 just means the process crashed, often because of an unhandled exception or fail fast from an assert. It maps to SIGABRT.

alnikola commented 4 years ago

System.Diagnostics.DebugProvider+DebugAssertException: RegexCharClass s_propTable is out of order at (IsAlphabeticPresentationForms, IsArabic)

stephentoub commented 4 years ago

Was that on a PR, and if so, which one?

alnikola commented 4 years ago

Yes, it failed on refs/pull/402/merge on 2020-05-10

alnikola commented 4 years ago

I used this query


WorkItems
| where FriendlyName == "System.IO.FileSystem.Watcher.Tests"
    and ExitCode == 134
    and Started > ago(30d)
    and QueueName =~ "osx.1013.amd64.open"
| join kind=inner Jobs on JobId
| order by Started desc
stephentoub commented 4 years ago

Thanks. That's a draft PR that's changing the implementation of string comparisons: https://github.com/dotnet/runtime/pull/402 It's not a real issue to be investigating. Tons of tests are broken in that PR.

BrennanConroy commented 4 years ago

We've started hitting this ObjectDisposedException in AspNetCore

wfurt commented 4 years ago

do you have core files @BrennanConroy ?

BrennanConroy commented 4 years ago

Looks like it, 3.5GB 😲 https://helix.dot.net/api/2019-06-17/jobs/0f38261c-d6ac-4ef7-a1cc-a520aaf5b030/workitems/Microsoft.AspNetCore.SignalR.Client.FunctionalTests--net5.0/files/core.87757

ericstj commented 4 years ago

This seems like it could also be cause of https://github.com/dotnet/runtime/issues/37629. @stephentoub what do you think?

BrennanConroy commented 4 years ago

@ericstj Can we move this back to 5.0 please

stephentoub commented 4 years ago

This seems like it could also be cause of #37629. @stephentoub what do you think?

It could be. It's hard to say for sure, but this could be evidence of the portion of my comment at https://github.com/dotnet/runtime/issues/30056#issuecomment-631099004: "But then the FileSystemEventCallback associated with this RunningInstance is invoked (which is dubious if nothing is keeping alive the delegate that was handed out to the native function)".

ericstj commented 4 years ago

@stephentoub I agree with your assessment, I think this should be fixed for reliability. @carlossanlop we should try to fit this in 5.0 if possible.

ericstj commented 4 years ago

One trick we can use to repro this is force the scenario @stephentoub describes and manually invoke the GC + WaitForPendingFinalizers. This can force the case where we are executing instance method of collected object.

eerhardt commented 3 years ago

Failed again in https://github.com/dotnet/runtime/pull/44548

https://dev.azure.com/dnceng/public/_build/results?buildId=883086&view=logs&j=e36ba317-696d-52ae-17e4-c1550b11ed0e&t=32c53736-4bfa-5de8-19e9-1c9b8f9d4787&l=58

https://helix.dot.net/api/2019-06-17/jobs/776b2649-2e0f-484d-9099-c6670159c3d9/workitems/System.IO.FileSystem.Watcher.Tests/console

/private/tmp/helix/working/9DDF08A4/w/BF320A67/e /private/tmp/helix/working/9DDF08A4/w/BF320A67/e
  Discovering: System.IO.FileSystem.Watcher.Tests (method display = ClassAndMethod, method display options = None)
  Discovered:  System.IO.FileSystem.Watcher.Tests (found 150 of 183 test cases)
  Starting:    System.IO.FileSystem.Watcher.Tests (parallel test collections = on, max threads = 4)
Unhandled exception. System.ObjectDisposedException: Safe handle has been closed.
Object name: 'SafeHandle'.
   at System.Runtime.InteropServices.SafeHandle.DangerousAddRef(Boolean& success) in /_/src/libraries/System.Private.CoreLib/src/System/Runtime/InteropServices/SafeHandle.cs:line 148
   at System.StubHelpers.StubHelpers.SafeHandleAddRef(SafeHandle pHandle, Boolean& success) in /_/src/coreclr/src/System.Private.CoreLib/src/System/StubHelpers.cs:line 1255
   at Interop.EventStream.FSEventStreamUnscheduleFromRunLoop(SafeEventStreamHandle streamRef, IntPtr runLoop, SafeCreateHandle runLoopMode)
   at System.IO.FileSystemWatcher.RunningInstance.StaticWatcherRunLoopManager.UnscheduleFromRunLoop(SafeEventStreamHandle eventStream) in /_/src/libraries/System.IO.FileSystem.Watcher/src/System/IO/FileSystemWatcher.OSX.cs:line 221
   at System.IO.FileSystemWatcher.RunningInstance.CleanupEventStream() in /_/src/libraries/System.IO.FileSystem.Watcher/src/System/IO/FileSystemWatcher.OSX.cs:line 277
   at System.IO.FileSystemWatcher.RunningInstance.FileSystemEventCallback(IntPtr streamRef, IntPtr clientCallBackInfo, IntPtr numEvents, Byte** eventPaths, FSEventStreamEventFlags* eventFlags, UInt64* eventIds) in /_/src/libraries/System.IO.FileSystem.Watcher/src/System/IO/FileSystemWatcher.OSX.cs:line 353
./RunTests.sh: line 161: 22239 Abort trap: 6           (core dumped) "$RUNTIME_PATH/dotnet" exec --runtimeconfig System.IO.FileSystem.Watcher.Tests.runtimeconfig.json --depsfile System.IO.FileSystem.Watcher.Tests.deps.json xunit.console.dll 
safern commented 3 years ago

And a dump was produced: https://helixre107v0xdeko0k025g8.blob.core.windows.net/dotnet-runtime-refs-pull-44548-merge-776b26492e0f484d90/System.IO.FileSystem.Watcher.Tests/core.22239?sv=2019-07-07&se=2020-12-01T20%3A53%3A20Z&sr=c&sp=rl&sig=%2F9zqfSrsTS2iGkGZYGFeCnbLFQn6fIF3Phbu0sxP3rg%3D

Plus instructions on how to debug it: https://helixre107v0xdeko0k025g8.blob.core.windows.net/dotnet-runtime-refs-pull-44548-merge-776b26492e0f484d90/System.IO.FileSystem.Watcher.Tests/how-to-debug-dump.md?sv=2019-07-07&se=2020-12-01T20%3A53%3A20Z&sr=c&sp=rl&sig=%2F9zqfSrsTS2iGkGZYGFeCnbLFQn6fIF3Phbu0sxP3rg%3D

ViktorHofer commented 3 years ago

Is someone following up on the SafeHandle ObjectDisposedException? Happened again in https://helixre107v0xdeko0k025g8.blob.core.windows.net/dotnet-runtime-refs-pull-44835-merge-c7c5501b02fc4bc9b9/System.IO.FileSystem.Watcher.Tests/console.16d3d036.log?sv=2019-07-07&se=2020-12-07T23%3A20%3A58Z&sr=c&sp=rl&sig=MI6jI0MFjRG2yXR5cnPvbQflr6ldFxG5gB2o425JXYo%3D.

wfurt commented 3 years ago

I'm planning to take a look but to is not not at the top priority right now. I downloaded the core dup @safern posted and briefly look at it - to verify I can.

GrabYourPitchforks commented 3 years ago

Happened again in https://helixre107v0xdeko0k025g8.blob.core.windows.net/dotnet-runtime-refs-pull-32520-merge-d75f106d61bc4c65b3/System.IO.FileSystem.Watcher.Tests/console.49f02a61.log?sv=2019-07-07&se=2020-12-15T05%3A56%3A26Z&sr=c&sp=rl&sig=CFgkw83J3V1Sj6YKA%2Fo6SZ1SQACKdExFXW%2FCdwfZ92M%3D

ViktorHofer commented 3 years ago

Happened again with dump and all:

core.13145: https://helixre107v0xdeko0k025g8.blob.core.windows.net/dotnet-runtime-refs-pull-45480-merge-dde19b458b43432eb3/System.IO.FileSystem.Watcher.Tests/core.13145?sv=2019-07-07&se=2020-12-22T19%3A07%3A45Z&sr=c&sp=rl&sig=tDpCnBJFS37RIHP04exheAsDCJ9fH5vsdBUWccpJg3c%3D

how-to-debug-dump.md: https://helixre107v0xdeko0k025g8.blob.core.windows.net/dotnet-runtime-refs-pull-45480-merge-dde19b458b43432eb3/System.IO.FileSystem.Watcher.Tests/how-to-debug-dump.md?sv=2019-07-07&se=2020-12-22T19%3A07%3A45Z&sr=c&sp=rl&sig=tDpCnBJFS37RIHP04exheAsDCJ9fH5vsdBUWccpJg3c%3D

console.0a20419c.log: https://helixre107v0xdeko0k025g8.blob.core.windows.net/dotnet-runtime-refs-pull-45480-merge-dde19b458b43432eb3/System.IO.FileSystem.Watcher.Tests/console.0a20419c.log?sv=2019-07-07&se=2020-12-22T19%3A07%3A45Z&sr=c&sp=rl&sig=tDpCnBJFS37RIHP04exheAsDCJ9fH5vsdBUWccpJg3c%3D

https://dev.azure.com/dnceng/public/_build/results?buildId=906108&view=ms.vss-test-web.build-test-results-tab&runId=28910536&resultId=175456&paneView=attachments

ViktorHofer commented 3 years ago

Looking at runfo and only into rolling builds, this failed in 3 builds in the last 14 days: https://runfo.azurewebsites.net/search/tests/?bq=definition%3Aruntime+started%3A%7E14+kind%3Arolling&tq=name%3A%22System.IO.FileSystem.Watcher.Tests%22

danmoseley commented 3 years ago

It is happening ~8 times a month, across all macOS versions...

WorkItems
| where Finished > now(-90d)
| where FriendlyName startswith "System.IO.FileSystem.Watcher.Tests"
| where ExitCode == 134
| join kind=inner Jobs on JobId
| where Branch == "refs/heads/master"
| sort by Finished desc
| summarize count(), any(ConsoleUri) by bin(Finished, 1d), ExitCode//, QueueName
| render columnchart  with (ycolumns = count_, xcolumn = Finished, series=ExitCode, yaxis=log) //, QueueName
ViktorHofer commented 3 years ago

Happened again on https://dev.azure.com/dnceng/public/_build/results?buildId=998815&view=ms.vss-test-web.build-test-results-tab&runId=31247430&resultId=175805&paneView=attachments with dump and all again.

core.15975: https://helixre107v0xdeko0k025g8.blob.core.windows.net/dotnet-runtime-refs-pull-48318-merge-06a5edf750084416b3/System.IO.FileSystem.Watcher.Tests/core.15975?sv=2019-07-07&se=2021-03-09T19%3A40%3A21Z&sr=c&sp=rl&sig=zexnOfVzrc%2FsAV2pPOAaGBX8eg4tGRcaVdlKhm5wH3g%3D

how-to-debug-dump.md: https://helixre107v0xdeko0k025g8.blob.core.windows.net/dotnet-runtime-refs-pull-48318-merge-06a5edf750084416b3/System.IO.FileSystem.Watcher.Tests/how-to-debug-dump.md?sv=2019-07-07&se=2021-03-09T19%3A40%3A21Z&sr=c&sp=rl&sig=zexnOfVzrc%2FsAV2pPOAaGBX8eg4tGRcaVdlKhm5wH3g%3D

console.70bebe27.log: https://helixre107v0xdeko0k025g8.blob.core.windows.net/dotnet-runtime-refs-pull-48318-merge-06a5edf750084416b3/System.IO.FileSystem.Watcher.Tests/console.70bebe27.log?sv=2019-07-07&se=2021-03-09T19%3A40%3A21Z&sr=c&sp=rl&sig=zexnOfVzrc%2FsAV2pPOAaGBX8eg4tGRcaVdlKhm5wH3g%3D

@wfurt have you had a chance to look into the dump?

jkoritzinsky commented 3 years ago

Another occurrence in https://dev.azure.com/dnceng/public/_build/results?buildId=1014122 with the dump and log and instructions.

ViktorHofer commented 3 years ago

Converted into a live tracking issue. Seems to still be happening quite often (twice per day if the data is right). As this has been happening for months I hope someone can take a look soon?

danmoseley commented 3 years ago

Including PR's this is now failing 2 days out of 3, using my query above with the branch removed. That's too high to postpone indefinitely, even if there aren't any customer reports.

@wfurt you were looking at it, do you plan to look at a dump? I know you're heavily booked.

carlossanlop commented 3 years ago

@stephentoub from the hypothesis:

The fix would be to root things appropriately with a GCHandle, and ensure as well that something with a finalizer is able to free that GCHandle (i.e. the GCHandle doesn't end up indirectly rooting itself).

I can give this a try.

Could you please confirm the following assumptions I'm making?:

I'm slightly skeptical of my theory only because I'd have expected things to go bad in multiple ways, and not just with regards to this SafeHandle be used after disposal, but maybe it just takes so many things to be lined up correctly that we just don't see it that often.

Can you please share any other concerns that you have about this theory?

Edit: I have a draft PR to test the above: https://github.com/dotnet/runtime/pull/51538

wfurt commented 3 years ago

While this is on my interest radar, I did not have chance to put any real effort into this @danmoseley. I did few attempt in the past to reproduce it locally but I was never able to trigger it.

danmoseley commented 3 years ago

@wfurt thanks, no problem. :)

ViktorHofer commented 3 years ago

Happened again in https://dev.azure.com/dnceng/public/_build/results?buildId=1101792&view=ms.vss-test-web.build-test-results-tab&runId=33675932&resultId=185078&paneView=attachments

core.19320:
  https://helixre107v0xdeko0k025g8.blob.core.windows.net/dotnet-runtime-refs-pull-51617-merge-d28ed1ecfe424ac9b6/System.IO.FileSystem.Watcher.Tests/core.19320?sv=2019-07-07&se=2021-05-12T12%3A26%3A15Z&sr=c&sp=rl&sig=QGBGMcJGRSVGl7mLjeQ0%2FO6G9BvH3KHvLGh8mZ3gZhI%3D

how-to-debug-dump.md:
  https://helixre107v0xdeko0k025g8.blob.core.windows.net/dotnet-runtime-refs-pull-51617-merge-d28ed1ecfe424ac9b6/System.IO.FileSystem.Watcher.Tests/how-to-debug-dump.md?sv=2019-07-07&se=2021-05-12T12%3A26%3A15Z&sr=c&sp=rl&sig=QGBGMcJGRSVGl7mLjeQ0%2FO6G9BvH3KHvLGh8mZ3gZhI%3D

console.23cebcee.log:
  https://helixre107v0xdeko0k025g8.blob.core.windows.net/dotnet-runtime-refs-pull-51617-merge-d28ed1ecfe424ac9b6/System.IO.FileSystem.Watcher.Tests/console.23cebcee.log?sv=2019-07-07&se=2021-05-12T12%3A26%3A15Z&sr=c&sp=rl&sig=QGBGMcJGRSVGl7mLjeQ0%2FO6G9BvH3KHvLGh8mZ3gZhI%3D
danmoseley commented 3 years ago

@carlossanlop how's your test go?

carlossanlop commented 3 years ago

@carlossanlop Carlos Sanchez Lopez FTE how's your test go?

The CI passed in my test, but I'm not 100% confident that this is the correct fix. Can I please get a review?

ViktorHofer commented 3 years ago

Unsure if this is the same issue. I see a NullReferenceException in FileSystemWatcher.RunningInstance.FileSystemEventCallback:

Unhandled Exception:
System.NullReferenceException: Object reference not set to an instance of an object
   at System.IO.FileSystemWatcher.RunningInstance.FileSystemEventCallback(IntPtr streamRef, IntPtr clientCallBackInfo, IntPtr numEvents, Byte** eventPaths, FSEventStreamEventFlags* eventFlags, UInt64* eventIds) in /_/src/libraries/System.IO.FileSystem.Watcher/src/System/IO/FileSystemWatcher.OSX.cs:line 354
   at System.IO.FileSystemWatcher.RunningInstance.StaticWatcherRunLoopManager.WatchForFileSystemEventsThreadStart(ManualResetEventSlim runLoopStarted, SafeEventStreamHandle eventStream) in /_/src/libraries/System.IO.FileSystem.Watcher/src/System/IO/FileSystemWatcher.OSX.cs:line 255
   at System.IO.FileSystemWatcher.RunningInstance.StaticWatcherRunLoopManager.<>c.<ScheduleEventStream>b__3_0(Object args) in /_/src/libraries/System.IO.FileSystem.Watcher/src/System/IO/FileSystemWatcher.OSX.cs:line 205
   at System.Threading.Thread.StartCallback() in /_/src/mono/System.Private.CoreLib/src/System/Threading/Thread.Mono.cs:line 236

core.28810: https://helixre107v0xdeko0k025g8.blob.core.windows.net/dotnet-runtime-refs-pull-51891-merge-4df95a776f8845c08e/Microsoft.Extensions.Hosting.Unit.Tests/core.28810?sv=2019-07-07&se=2021-05-16T21%3A33%3A56Z&sr=c&sp=rl&sig=H2ZL3fnJrWeUiUgDH0X%2F%2Fph9du9TWqcIHWXCt0ta%2B2c%3D

console.fe66311c.log: https://helixre107v0xdeko0k025g8.blob.core.windows.net/dotnet-runtime-refs-pull-51891-merge-4df95a776f8845c08e/Microsoft.Extensions.Hosting.Unit.Tests/console.fe66311c.log?sv=2019-07-07&se=2021-05-16T21%3A33%3A56Z&sr=c&sp=rl&sig=H2ZL3fnJrWeUiUgDH0X%2F%2Fph9du9TWqcIHWXCt0ta%2B2c%3D

Configuration: net6.0-OSX-Debug-x64-Mono_release-OSX.1015.Amd64.Open

danmoseley commented 3 years ago

@ViktorHofer not sure that's the same thing, but I don't know why it could be either -- there is no obvious way that code can ever NRE: https://github.com/dotnet/runtime/blob/e7b743cae325d4720fceaf7f84f68aaaf7bb7dbe/src/libraries/System.IO.FileSystem.Watcher/src/System/IO/FileSystemWatcher.OSX.cs#L354

I see it segfaulted though -- is that what Mono test runs do when there's an unhandled exceptoin that brings down the process (like in this case) or a separate segfault?

danmoseley commented 3 years ago

Oh, I see @jkotas you believe this is a dupe?

jkotas commented 3 years ago

I think so.

eerhardt commented 3 years ago

@jkotas - I see the FileWatcher tests failing on OSX in this PR:

https://github.com/dotnet/runtime/pull/52252/checks?check_run_id=2502547808

runtime (Libraries Test Run release coreclr OSX x64 Debug) .packages/microsoft.dotnet.helix.sdk/6.0.0-beta.21222.1/tools/Microsoft.DotNet.Helix.Sdk.MultiQueue.targets#L78 .packages/microsoft.dotnet.helix.sdk/6.0.0-beta.21222.1/tools/Microsoft.DotNet.Helix.Sdk.MultiQueue.targets(78,5): error : (NETCORE_ENGINEERING_TELEMETRY=Test) Work item System.IO.FileSystem.Watcher.Tests in job d239ae94-91a2-4a9c-b9e1-2191863d7385 has failed. Failure log: https://helix.dot.net/api/2019-06-17/jobs/d239ae94-91a2-4a9c-b9e1-2191863d7385/workitems/System.IO.FileSystem.Watcher.Tests/console

/private/tmp/helix/working/AB6B08F2/w/A87108BA/e /private/tmp/helix/working/AB6B08F2/w/A87108BA/e
  Discovering: System.IO.FileSystem.Watcher.Tests (method display = ClassAndMethod, method display options = None)
  Discovered:  System.IO.FileSystem.Watcher.Tests (found 149 of 183 test cases)
  Starting:    System.IO.FileSystem.Watcher.Tests (parallel test collections = on, max threads = 4)
Process terminated. Assertion failed.
   at System.IO.FileSystemWatcher.RunningInstance.FileSystemEventCallback(IntPtr streamRef, IntPtr clientCallBackInfo, IntPtr numEvents, Byte** eventPaths, FSEventStreamEventFlags* eventFlags, UInt64* eventIds) in /_/src/libraries/System.IO.FileSystem.Watcher/src/System/IO/FileSystemWatcher.OSX.cs:line 386
   at Interop.RunLoop.CFRunLoopRun()
   at Interop.RunLoop.CFRunLoopRun()
   at System.IO.FileSystemWatcher.RunningInstance.StaticWatcherRunLoopManager.WatchForFileSystemEventsThreadStart(ManualResetEventSlim runLoopStarted, SafeEventStreamHandle eventStream) in /_/src/libraries/System.IO.FileSystem.Watcher/src/System/IO/FileSystemWatcher.OSX.cs:line 254
   at System.IO.FileSystemWatcher.RunningInstance.StaticWatcherRunLoopManager.<>c.<ScheduleEventStream>b__3_0(Object args) in /_/src/libraries/System.IO.FileSystem.Watcher/src/System/IO/FileSystemWatcher.OSX.cs:line 204
   at System.Threading.Thread.StartCallback() in /_/src/coreclr/System.Private.CoreLib/src/System/Threading/Thread.CoreCLR.cs:line 105
./RunTests.sh: line 162:  4690 Abort trap: 6           (core dumped) "$RUNTIME_PATH/dotnet" exec --runtimeconfig System.IO.FileSystem.Watcher.Tests.runtimeconfig.json --depsfile System.IO.FileSystem.Watcher.Tests.deps.json xunit.console.dll System.IO.FileSystem.Watcher.Tests.dll -xml testResults.xml -nologo -nocolor -notrait category=IgnoreForCI -notrait category=OuterLoop -notrait category=failing $RSP_FILE
/private/tmp/helix/working/AB6B08F2/w/A87108BA/e
----- end Tue May 4 09:55:46 PDT 2021 ----- exit code 134 ----------------------------------------------------------
exit code 134 means SIGABRT Abort. Managed or native assert, or runtime check such as heap corruption, caused call to abort(). Core dumped.
ulimit -c value: unlimited

Is that this same issue? I hope we got a dump of the process so we can figure the issue out.

jkotas commented 3 years ago

Yes, it is the same issue.

This assert says that the file watcher callback can be called even after the EventStream was stopped using FSEventStreamStop. So we have to be robust against that situation and use a different scheme to perform the cleanup.

jaredpar commented 3 years ago

I hope we got a dump of the process so we can figure the issue out.

https://helixre107v0xdeko0k025g8.blob.core.windows.net/dotnet-runtime-refs-pull-52252-merge-d239ae9491a24a9cb9/System.IO.FileSystem.Watcher.Tests/core.4690?%3F%253Fsv%253D2019-07-07%2526se%253D2021-05-24T16%25253A53%25253A45Z%2526sr%253Dc%2526sp%253Drl%2526sig%253DsqT6WyXqnXJByjBQGYc3E1mXKv75bsK2y6ARQR3HD8Q%25253D

eerhardt commented 3 years ago

@jaredpar - for my education, how do you find that link?

jaredpar commented 3 years ago

Clicked on the checks tab and went to the AzDO build

https://dev.azure.com/dnceng/public/_build/results?buildId=1120772&view=logs&jobId=4724cd84-0a26-56bc-9b1d-ca1983f340b0

They key part there is the build id is 1120772. From there just ask runfo to dump the data

https://runfo.azurewebsites.net/view/build/?number=1120772

It will list all of the failed tests and for the FileWatcherTests there was a heap dump noted and linked to

image

The link I posted is that one.

jaredpar commented 3 years ago

Truthfully I was a bit lazier than that. I didn't click on the checks tab. Really I just went here

https://runfo.azurewebsites.net/view/pr-builds/?Repository=runtime&Number=52252

Basically that dumps all of the builds that happened for the PR #52252. I happened to know that test ran on the runtime definition so I click here

image

That is the link I posted above .

https://runfo.azurewebsites.net/view/build/?number=1120772

danmoseley commented 3 years ago

Even easier than that, in the dotnet/runtime repo we generate a file containing commands that you can copy/paste to get from zero to the debugger:

image

Try it out :)

jaredpar commented 3 years ago

Haha, can't debug though cause Windows does not like the core dump here

image

jaredpar commented 3 years ago

Even easier than that, in the dotnet/runtime repo we generate a file containing commands that you can copy/paste to get from zero to the debugger:

From PR to that display is seven clicks, runfo is two. And one of those clicks is a spinning wheel click waiting for that display to render :)

Joking aside I'm glad that display is getting better but it does take a lot of clicks to get to. It also takes time to render and is not 100% reliable. Found several times over the years where it won't display anything and errors out.

jkotas commented 3 years ago

Another attempt to fix this #52275