dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License
15.48k stars 4.76k forks source link

BinderTracingTest.ResolutionFlow times out #104670

Open jakobbotsch opened 4 months ago

jakobbotsch commented 4 months ago

Build Information

Build: https://dev.azure.com/dnceng-public/cbb18261-c48f-4abb-8651-8cdcb5474649/_build/results?buildId=735589 Build error leg or test failing: Loader/binding/tracing/BinderTracingTest.ResolutionFlow/BinderTracingTest.ResolutionFlow.cmd Pull request: https://github.com/dotnet/runtime/pull/104603

Error Message

Fill the error message using step by step known issues guidance.

{
  "ErrorMessage": "",
  "ErrorPattern": "BinderTracingTest.ResolutionFlow.* Timed Out",
  "BuildRetry": false,
  "ExcludeConsoleLog": false
}

Known issue validation

Build: :mag_right: https://dev.azure.com/dnceng-public/public/_build/results?buildId=735589 Error message validated: [BinderTracingTest.ResolutionFlow.* Timed Out] Result validation: :white_check_mark: Known issue matched with the provided build. Validation performed at: 7/10/2024 11:13:33 AM UTC

Report

Build Definition Test Pull Request
879584 dotnet/runtime Loader/binding/tracing/BinderTracingTest.ResolutionFlow/BinderTracingTest.ResolutionFlow.cmd dotnet/runtime#110117
875008 dotnet/runtime Loader/binding/tracing/BinderTracingTest.ResolutionFlow/BinderTracingTest.ResolutionFlow.cmd dotnet/runtime#109998
869615 dotnet/runtime Loader/binding/tracing/BinderTracingTest.ResolutionFlow/BinderTracingTest.ResolutionFlow.cmd

Summary

24-Hour Hit Count 7-Day Hit Count 1-Month Count
0 0 3
dotnet-policy-service[bot] commented 4 months ago

Tagging subscribers to this area: @dotnet/area-infrastructure-libraries See info in area-owners.md if you want to be subscribed.

jakobbotsch commented 4 months ago

cc @davmason

mdh1418 commented 4 months ago

This seems to be a continuation of https://github.com/dotnet/runtime/issues/97735 and https://github.com/dotnet/runtime/issues/94390. Given that the BinderTracingTests are already skipped for Jitstress and GCStress, there is probably another culprit for these tests timing out.

It seems flakey given the low hit count, and that previous counterpart #97735 at one point had 0 hit counts in 30 days.

I haven't been able to repro the timeout locally, and given how flakey it is on CI, I'm not sure if I'd be able to reliably repro in CI. From this build instance it seems like this is causing the hang https://github.com/dotnet/runtime/blob/ad25cd0126a8d4a060f2039f987bd58e7ca56a1d/src/tests/Loader/binding/tracing/BinderTracingTest.cs#L206 based off of

0:000> !clrstack -f -all
OS Thread Id: 0x23ec
Child SP       IP Call Site
0097E0EC 77B3F3EC ntdll!NtWaitForMultipleObjects + 12
0097E284 72246A85 coreclr!Thread::DoAppropriateAptStateWait + 199
0097E308 72246FB3 coreclr!Thread::DoAppropriateWaitWorker + 998
0097E464 7224D4AD coreclr!`Thread::DoAppropriateWait'::`9'::__Body::Run + 90
0097E4B8 72246B3C coreclr!Thread::DoAppropriateWait + 149
0097E51C 722BEC16 coreclr!WaitHandleNative::CorWaitOneNative + 294
0097E560          [HelperMethodFrame: 0097e560] System.Private.CoreLib.dll!System.Threading.WaitHandle.WaitOneCore(IntPtr, Int32, Boolean)
0097E5F4 71022A5C System_Private_CoreLib!System.Boolean System.Threading.WaitHandle::WaitOneNoCheck(System.Int32, System.Boolean, System.Object, System.Diagnostics.Tracing.NativeRuntimeEventSource+WaitHandleWaitSourceMap)$##6003CB2 + 188
0097E5F8 71022A5C System.Private.CoreLib.dll!System.Threading.WaitHandle.WaitOneNoCheck(Int32, Boolean, System.Object, WaitHandleWaitSourceMap) + 188
0097E630 71022984 System_Private_CoreLib!System.Boolean System.Threading.WaitHandle::WaitOne(System.Int32)$##6003CB1 + 20
0097E63C 71022984 System.Private.CoreLib.dll!System.Threading.WaitHandle.WaitOne(Int32) + 20
0097E644 094C5B7B system.diagnostics.process.dll!System.Diagnostics.Process.WaitForExitCore(Int32) + 123
0097E67C 094C0800 BinderTracingTest.ResolutionFlow.dll!BinderTracingTests.BinderTracingTest.RunTestInSeparateProcess(System.Reflection.MethodInfo) + 816
0097E6DC 0817237E BinderTracingTest.ResolutionFlow.dll!BinderTracingTests.BinderTracingTest.RunAllTests() + 446
0097E704 0817204F BinderTracingTest.ResolutionFlow.dll!BinderTracingTests.BinderTracingTest.Main(System.String[]) + 39

From looking at what BinderTracingTests does, I'm not quite sure what is causing the separate subprocess to hang. Given how this test is still hanging even without GCStress/Jitstress, I would've expected for the test to hit the BinderEventListener's 30s timeout at https://github.com/dotnet/runtime/blob/ad25cd0126a8d4a060f2039f987bd58e7ca56a1d/src/tests/Loader/binding/tracing/BinderEventListener.cs#L177-L178 if the test made it to https://github.com/dotnet/runtime/blob/ad25cd0126a8d4a060f2039f987bd58e7ca56a1d/src/tests/Loader/binding/tracing/BinderTracingTest.cs#L176

@elinor-fung / @davmason any other ideas on what might be causing the hang?

mdh1418 commented 4 months ago

762118 was from a PR that caused a deadlock, so it isn't the same cause as the first singluar hit in 735589

jakobbotsch commented 4 months ago

Hit in https://dev.azure.com/dnceng-public/public/_build/results?buildId=765128&view=results

mdh1418 commented 4 months ago

Hit in https://dev.azure.com/dnceng-public/public/_build/results?buildId=765128&view=results

Isn't that a jitstress pipeline? Why did the test get run there if it was marked JitOptimizationSensitive=true in https://github.com/dotnet/runtime/pull/102842?

In any case, the latest failure console log shows a hang presumable when waiting for this test https://github.com/dotnet/runtime/blob/ab03e0ffc0281c55cbe4ee02f025591f2fe4bf39/src/tests/Loader/binding/tracing/BinderTracingTest.ResolutionFlow.cs#L131 to finish running as a separate process (since the tests seem to be ran sequentially + FindInLoadContext_DefaultALC_IncompatibleVersion finished + the stack trace has system.diagnostics.process.dll!System.Diagnostics.Process.WaitForExitCore).

I guess the console.writelines from a subprocess don't actually get written immediately, given that we don't see either of these https://github.com/dotnet/runtime/blob/ab03e0ffc0281c55cbe4ee02f025591f2fe4bf39/src/tests/Loader/binding/tracing/BinderTracingTest.cs#L199-L202 given that the test hangs afterwards.

Also the subprocess dump doesn't get generated even though we are expecting to create a dump for all processes related to corerun.

@hoyosjs any ideas on what to tweak to be able to capture the dump for the hanging subprocess in this case?

jakobbotsch commented 4 months ago

Hit in https://dev.azure.com/dnceng-public/public/_build/results?buildId=765128&view=results

Isn't that a jitstress pipeline? Why did the test get run there if it was marked JitOptimizationSensitive=true in #102842?

The jitstress pipelines runs tests under many different configurations. This particular configuration does not set any of the "jitstress" environment variables, it only sets the following:

set DOTNET_TieredCompilation=0
set DOTNET_EnableAVX=0
mdh1418 commented 2 weeks ago

From the latest failure's console log, the test spins up a subprocess with PID 920

09:26:42.243 Running test: Loader/binding/tracing/BinderTracingTest.ResolutionFlow/BinderTracingTest.ResolutionFlow.cmd
Attempting to collect crash dump: C:\cores\crashdump_3068.dmp
Collected crash dump: C:\cores\crashdump_3068.dmp

cmdLine:C:\h\w\A46C08CB\w\BB700A39\e\Loader\Loader\../binding/tracing/BinderTracingTest.ResolutionFlow/BinderTracingTest.ResolutionFlow.cmd Timed Out (timeout in milliseconds: 600000 from variable __TestTimeout, start: 11/14/2024 9:26:42 AM, end: 11/14/2024 9:36:42 AM)

Return code:      -100
Raw output file:      C:\h\w\A46C08CB\w\BB700A39\uploads\binding\tracing\BinderTracingTest.ResolutionFlow\output.txt
Raw output:
BEGIN EXECUTION
 "C:\h\w\A46C08CB\p\corerun.exe" -p "System.Reflection.Metadata.MetadataUpdater.IsSupported=false" -p "System.Runtime.Serialization.EnableUnsafeBinaryFormatterSerialization=true"  BinderTracingTest.ResolutionFlow.dll 
[9:26:43 AM] Running AssemblyLoadContextResolving_ReturnNull...
[9:26:44 AM] Invoking AssemblyLoadContextResolving_ReturnNull...
[9:26:44 AM] Validating bind operation for AssemblyToLoad_Subdirectory...
Waiting for bind events for AssemblyToLoad_Subdirectory (0ms waited) Sleeping for 50ms
Waiting for bind events for AssemblyToLoad_Subdirectory (50ms waited) Sleeping for 50ms
ValidateBindOperation Started
ValidateBindOperation Finished
Test AssemblyLoadContextResolving_ReturnNull finished.
[9:26:45 AM] Running AssemblyLoadContextResolving_LoadAssembly...
[9:26:45 AM] Invoking AssemblyLoadContextResolving_LoadAssembly...
[9:26:45 AM] Validating bind operation for AssemblyToLoad_Subdirectory...
Waiting for bind events for AssemblyToLoad_Subdirectory (0ms waited) Sleeping for 50ms
ValidateBindOperation Started
ValidateBindOperation Started
ValidateBindOperation Finished
ValidateBindOperation Finished
Test AssemblyLoadContextResolving_LoadAssembly finished.
[9:26:45 AM] Running AssemblyLoadContextResolving_NameMismatch...
[9:26:45 AM] Invoking AssemblyLoadContextResolving_NameMismatch...
[9:26:45 AM] Validating bind operation for AssemblyToLoad_Subdirectory...
Waiting for bind events for AssemblyToLoad_Subdirectory (0ms waited) Sleeping for 50ms
ValidateBindOperation Started
ValidateBindOperation Started
ValidateBindOperation Finished
ValidateBindOperation Finished
Test AssemblyLoadContextResolving_NameMismatch finished.
[9:26:45 AM] Running AssemblyLoadContextResolving_MultipleHandlers...
[9:26:45 AM] Invoking AssemblyLoadContextResolving_MultipleHandlers...
[9:26:45 AM] Validating bind operation for AssemblyToLoad_Subdirectory...
ValidateBindOperation Started
ValidateBindOperation Started
ValidateBindOperation Finished
ValidateBindOperation Finished
Test AssemblyLoadContextResolving_MultipleHandlers finished.
[9:26:45 AM] Running AppDomainAssemblyResolve_ReturnNull...
[9:26:45 AM] Invoking AppDomainAssemblyResolve_ReturnNull...
[9:26:45 AM] Validating bind operation for AssemblyToLoad_Subdirectory...
Waiting for bind events for AssemblyToLoad_Subdirectory (0ms waited) Sleeping for 50ms
ValidateBindOperation Started
ValidateBindOperation Finished
Test AppDomainAssemblyResolve_ReturnNull finished.
[9:26:45 AM] Running AppDomainAssemblyResolve_LoadAssembly...
[9:26:45 AM] Invoking AppDomainAssemblyResolve_LoadAssembly...
[9:26:45 AM] Validating bind operation for AssemblyToLoad_Subdirectory...
Waiting for bind events for AssemblyToLoad_Subdirectory (0ms waited) Sleeping for 50ms
ValidateBindOperation Started
ValidateBindOperation Started
ValidateBindOperation Finished
ValidateBindOperation Finished
Test AppDomainAssemblyResolve_LoadAssembly finished.
[9:26:45 AM] Running AppDomainAssemblyResolve_NameMismatch...
[9:26:45 AM] Invoking AppDomainAssemblyResolve_NameMismatch...
[9:26:45 AM] Validating bind operation for AssemblyToLoad_Subdirectory...
Waiting for bind events for AssemblyToLoad_Subdirectory (0ms waited) Sleeping for 50ms
ValidateBindOperation Started
ValidateBindOperation Started
ValidateBindOperation Finished
ValidateBindOperation Finished
Test AppDomainAssemblyResolve_NameMismatch finished.
[9:26:45 AM] Running AppDomainAssemblyResolve_MultipleHandlers...
[9:26:45 AM] Invoking AppDomainAssemblyResolve_MultipleHandlers...
[9:26:45 AM] Validating bind operation for AssemblyToLoad_Subdirectory...
Waiting for bind events for AssemblyToLoad_Subdirectory (0ms waited) Sleeping for 50ms
ValidateBindOperation Started
ValidateBindOperation Started
ValidateBindOperation Finished
ValidateBindOperation Finished
Test AppDomainAssemblyResolve_MultipleHandlers finished.
[9:26:45 AM] Launching process for AssemblyLoadFromResolveHandler_LoadDependency...
Started subprocess 4652 for AssemblyLoadFromResolveHandler_LoadDependency...
[9:26:47 AM] Running AssemblyLoadFromResolveHandler_LoadDependency...
[9:26:47 AM] Invoking AssemblyLoadFromResolveHandler_LoadDependency...
[9:26:48 AM] Validating bind operation for AssemblyToLoadDependency, Version=1.0.0.0, Culture=neutral, PublicKeyToken=null...
Waiting for bind events for AssemblyToLoadDependency (0ms waited) Sleeping for 50ms
ValidateBindOperation Started
ValidateBindOperation Finished
Test AssemblyLoadFromResolveHandler_LoadDependency finished.

[9:26:48 AM] Launching process for AssemblyLoadFromResolveHandler_NotTracked...
Started subprocess 3304 for AssemblyLoadFromResolveHandler_NotTracked...
[9:26:49 AM] Running AssemblyLoadFromResolveHandler_NotTracked...
[9:26:50 AM] Invoking AssemblyLoadFromResolveHandler_NotTracked...
[9:26:50 AM] Validating bind operation for AssemblyToLoad_Subdirectory...
Waiting for bind events for AssemblyToLoad_Subdirectory (0ms waited) Sleeping for 50ms
Waiting for bind events for AssemblyToLoad_Subdirectory (50ms waited) Sleeping for 50ms
ValidateBindOperation Started
ValidateBindOperation Finished
Test AssemblyLoadFromResolveHandler_NotTracked finished.

[9:26:50 AM] Launching process for FindInLoadContext_CustomALC...
Started subprocess 1776 for FindInLoadContext_CustomALC...
[9:26:51 AM] Running FindInLoadContext_CustomALC...
[9:26:52 AM] Invoking FindInLoadContext_CustomALC...
[9:26:52 AM] Validating bind operation for AssemblyToLoad_Subdirectory...
Waiting for bind events for AssemblyToLoad_Subdirectory (0ms waited) Sleeping for 50ms
ValidateBindOperation Started
ValidateBindOperation Finished
Test FindInLoadContext_CustomALC finished.

[9:26:53 AM] Launching process for FindInLoadContext_DefaultALC...
Started subprocess 5352 for FindInLoadContext_DefaultALC...
[9:26:54 AM] Running FindInLoadContext_DefaultALC...
[9:26:55 AM] Invoking FindInLoadContext_DefaultALC...
[9:26:55 AM] Validating bind operation for AssemblyToLoad...
Waiting for bind events for AssemblyToLoad (0ms waited) Sleeping for 50ms
Waiting for bind events for AssemblyToLoad (50ms waited) Sleeping for 50ms
ValidateBindOperation Started
ValidateBindOperation Finished
Test FindInLoadContext_DefaultALC finished.

[9:26:56 AM] Launching process for FindInLoadContext_DefaultALC_IncompatibleVersion...
Started subprocess 920 for FindInLoadContext_DefaultALC_IncompatibleVersion...
[9:26:57 AM] Running FindInLoadContext_DefaultALC_IncompatibleVersion...
[9:26:58 AM] Invoking FindInLoadContext_DefaultALC_IncompatibleVersion...
[9:26:58 AM] Validating bind operation for AssemblyToLoad, Version=4.3.2.1...
Waiting for bind events for AssemblyToLoad (0ms waited) Sleeping for 50ms
ValidateBindOperation Started
ValidateBindOperation Finished
Test FindInLoadContext_DefaultALC_IncompatibleVersion finished.

cmdLine:C:\h\w\A46C08CB\w\BB700A39\e\Loader\Loader\../binding/tracing/BinderTracingTest.ResolutionFlow/BinderTracingTest.ResolutionFlow.cmd Timed Out (timeout in milliseconds: 600000 from variable __TestTimeout, start: 11/14/2024 9:26:42 AM, end: 11/14/2024 9:36:42 AM)
Invoking: C:\Program Files (x86)\Windows Kits\10\Debuggers\x64\cdb.exe -c "$<C:\h\w\A46C08CB\t\tmppytb12.tmp" -z "C:\cores\crashdump_3068.dmp"

and that subprocess appears to complete RunSingleTest based on Test FindInLoadContext_DefaultALC_IncompatibleVersion finished. https://github.com/dotnet/runtime/blob/e3b3aaaf21b88b7992cd8a42321e01cf1aa704a2/src/tests/Loader/binding/tracing/BinderTracingTest.cs#L186

Moreover, at the end of the console log, there is

[XUnitLogChecker]: 09:39:44.85: The full run will be done.
[XUnitLogChecker]: 09:39:44.85: Item 'Loader' did complete successfully!
ERROR: The process "corerun.exe" with PID 920 could not be terminated.
Reason: There is no running instance of the task.

Could this be a bug with Process.WaitForExit() where the subprocess fails to signal to the parent process that it finished? In any case, its not apparent to me that this is a tracing issue. Its odd that our logic to collect dumps from all childprocesses isn't generating a dump for the PID 920 that the test is supposedly stuck on. Until we collect a crashdump from the child process that the test is stuck on, it'll be hard to figure out what is causing the test to hang.

dotnet-policy-service[bot] commented 2 weeks ago

Tagging subscribers to this area: @dotnet/area-system-diagnostics-process See info in area-owners.md if you want to be subscribed.