dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License

BinderTracingTest.ResolutionFlow times out #104670

Open jakobbotsch opened 1 month ago

jakobbotsch commented 1 month ago

Build Information

Build: https://dev.azure.com/dnceng-public/cbb18261-c48f-4abb-8651-8cdcb5474649/_build/results?buildId=735589
Build error leg or test failing: Loader/binding/tracing/BinderTracingTest.ResolutionFlow/BinderTracingTest.ResolutionFlow.cmd
Pull request: https://github.com/dotnet/runtime/pull/104603

Error Message

Fill the error message using step by step known issues guidance.

{
  "ErrorMessage": "",
  "ErrorPattern": "BinderTracingTest.ResolutionFlow.* Timed Out",
  "BuildRetry": false,
  "ExcludeConsoleLog": false
}
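
For context, `ErrorPattern` is matched as a regular expression against the failing leg's console log. A minimal sketch of that match, using the pattern and test name from this issue (the exact matching semantics in the known-issues infrastructure are an assumption here):

```csharp
using System;
using System.Text.RegularExpressions;

class PatternCheck
{
    static void Main()
    {
        // Pattern from the known-issue JSON above; ".*" spans the ".cmd" suffix.
        string pattern = @"BinderTracingTest.ResolutionFlow.* Timed Out";
        string logLine = "Loader/binding/tracing/BinderTracingTest.ResolutionFlow/BinderTracingTest.ResolutionFlow.cmd Timed Out";
        Console.WriteLine(Regex.IsMatch(logLine, pattern)); // True
    }
}
```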

Known issue validation

Build: :mag_right: https://dev.azure.com/dnceng-public/public/_build/results?buildId=735589
Error message validated: [BinderTracingTest.ResolutionFlow.* Timed Out]
Result validation: :white_check_mark: Known issue matched with the provided build.
Validation performed at: 7/10/2024 11:13:33 AM UTC

Report

| Build | Definition | Test | Pull Request |
| --- | --- | --- | --- |
| 782240 | dotnet/runtime | Loader/binding/tracing/BinderTracingTest.ResolutionFlow/BinderTracingTest.ResolutionFlow.cmd | dotnet/runtime#106711 |

Summary

| 24-Hour Hit Count | 7-Day Hit Count | 1-Month Count |
| --- | --- | --- |
| 0 | 0 | 1 |
dotnet-policy-service[bot] commented 1 month ago

Tagging subscribers to this area: @dotnet/area-infrastructure-libraries See info in area-owners.md if you want to be subscribed.

jakobbotsch commented 1 month ago

cc @davmason

mdh1418 commented 1 month ago

This seems to be a continuation of https://github.com/dotnet/runtime/issues/97735 and https://github.com/dotnet/runtime/issues/94390. Given that the BinderTracingTests are already skipped for Jitstress and GCStress, there is probably another culprit for these tests timing out.

It seems flaky given the low hit count, and the earlier counterpart #97735 at one point had 0 hits in 30 days.

I haven't been able to repro the timeout locally, and given how flaky it is on CI, I'm not sure I'd be able to repro it reliably there either. From this build instance, the hang appears to be at https://github.com/dotnet/runtime/blob/ad25cd0126a8d4a060f2039f987bd58e7ca56a1d/src/tests/Loader/binding/tracing/BinderTracingTest.cs#L206, based on the following stack:

0:000> !clrstack -f -all
OS Thread Id: 0x23ec
Child SP       IP Call Site
0097E0EC 77B3F3EC ntdll!NtWaitForMultipleObjects + 12
0097E284 72246A85 coreclr!Thread::DoAppropriateAptStateWait + 199
0097E308 72246FB3 coreclr!Thread::DoAppropriateWaitWorker + 998
0097E464 7224D4AD coreclr!`Thread::DoAppropriateWait'::`9'::__Body::Run + 90
0097E4B8 72246B3C coreclr!Thread::DoAppropriateWait + 149
0097E51C 722BEC16 coreclr!WaitHandleNative::CorWaitOneNative + 294
0097E560          [HelperMethodFrame: 0097e560] System.Private.CoreLib.dll!System.Threading.WaitHandle.WaitOneCore(IntPtr, Int32, Boolean)
0097E5F4 71022A5C System_Private_CoreLib!System.Boolean System.Threading.WaitHandle::WaitOneNoCheck(System.Int32, System.Boolean, System.Object, System.Diagnostics.Tracing.NativeRuntimeEventSource+WaitHandleWaitSourceMap)$##6003CB2 + 188
0097E5F8 71022A5C System.Private.CoreLib.dll!System.Threading.WaitHandle.WaitOneNoCheck(Int32, Boolean, System.Object, WaitHandleWaitSourceMap) + 188
0097E630 71022984 System_Private_CoreLib!System.Boolean System.Threading.WaitHandle::WaitOne(System.Int32)$##6003CB1 + 20
0097E63C 71022984 System.Private.CoreLib.dll!System.Threading.WaitHandle.WaitOne(Int32) + 20
0097E644 094C5B7B system.diagnostics.process.dll!System.Diagnostics.Process.WaitForExitCore(Int32) + 123
0097E67C 094C0800 BinderTracingTest.ResolutionFlow.dll!BinderTracingTests.BinderTracingTest.RunTestInSeparateProcess(System.Reflection.MethodInfo) + 816
0097E6DC 0817237E BinderTracingTest.ResolutionFlow.dll!BinderTracingTests.BinderTracingTest.RunAllTests() + 446
0097E704 0817204F BinderTracingTest.ResolutionFlow.dll!BinderTracingTests.BinderTracingTest.Main(System.String[]) + 39

From looking at what BinderTracingTests does, I'm not quite sure what is causing the separate subprocess to hang. Given that this test still hangs even without GCStress/Jitstress, I would have expected it to hit the BinderEventListener's 30s timeout at https://github.com/dotnet/runtime/blob/ad25cd0126a8d4a060f2039f987bd58e7ca56a1d/src/tests/Loader/binding/tracing/BinderEventListener.cs#L177-L178 if it made it to https://github.com/dotnet/runtime/blob/ad25cd0126a8d4a060f2039f987bd58e7ca56a1d/src/tests/Loader/binding/tracing/BinderTracingTest.cs#L176
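
For illustration: the stack above shows the parent blocked in an untimed `Process.WaitForExit`. A hedged sketch of a timed wait that fails loudly instead of hanging the CI leg (the method name and 60s budget are hypothetical, not the actual harness code):

```csharp
using System;
using System.Diagnostics;

static class TimedWait
{
    // Hypothetical hardening of the parent-side wait.
    public static void WaitOrFail(Process child, int timeoutMs = 60_000)
    {
        // WaitForExit(int) returns false on timeout instead of blocking forever.
        if (!child.WaitForExit(timeoutMs))
        {
            // Kill the child so the run can proceed, and surface the hang as a failure.
            child.Kill(entireProcessTree: true);
            throw new TimeoutException($"Child {child.Id} did not exit within {timeoutMs} ms");
        }
    }
}
```

This would turn a silent Helix-level timeout into a test failure with the child's PID, though it still wouldn't explain why the child hangs.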

@elinor-fung / @davmason any other ideas on what might be causing the hang?

mdh1418 commented 1 month ago

Build 762118 was from a PR that caused a deadlock, so it does not share a cause with the first, singular hit in 735589.

jakobbotsch commented 1 month ago

Hit in https://dev.azure.com/dnceng-public/public/_build/results?buildId=765128&view=results

mdh1418 commented 1 month ago

> Hit in https://dev.azure.com/dnceng-public/public/_build/results?buildId=765128&view=results

Isn't that a jitstress pipeline? Why did the test get run there if it was marked JitOptimizationSensitive=true in https://github.com/dotnet/runtime/pull/102842?

In any case, the latest failure console log shows a hang, presumably while waiting for this test https://github.com/dotnet/runtime/blob/ab03e0ffc0281c55cbe4ee02f025591f2fe4bf39/src/tests/Loader/binding/tracing/BinderTracingTest.ResolutionFlow.cs#L131 to finish running as a separate process (the tests appear to run sequentially, FindInLoadContext_DefaultALC_IncompatibleVersion finished, and the stack trace contains system.diagnostics.process.dll!System.Diagnostics.Process.WaitForExitCore).

I guess the Console.WriteLine output from the subprocess doesn't actually get written immediately: since the test hangs afterwards, we never see either of these https://github.com/dotnet/runtime/blob/ab03e0ffc0281c55cbe4ee02f025591f2fe4bf39/src/tests/Loader/binding/tracing/BinderTracingTest.cs#L199-L202
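
As background for the missing output, a hedged sketch of the classic pitfall with redirected child stdout (the class name is hypothetical, and whether the harness redirects output at all is an assumption): the child's writes land in a pipe, and if the parent waits without draining it, the lines never appear and a chatty child can even block on a full pipe buffer.

```csharp
using System;
using System.Diagnostics;

class DrainThenWait
{
    // Hypothetical pattern: read asynchronously *before* WaitForExit, so the
    // child's output is streamed as it arrives and the pipe never fills.
    static void Run(ProcessStartInfo psi)
    {
        psi.RedirectStandardOutput = true;
        psi.UseShellExecute = false;
        using var child = Process.Start(psi)!;
        child.OutputDataReceived += (_, e) => { if (e.Data != null) Console.WriteLine(e.Data); };
        child.BeginOutputReadLine();
        child.WaitForExit();
    }
}
```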

Also, the subprocess dump doesn't get generated, even though we expect to create a dump for every process related to corerun.

@hoyosjs any ideas on what to tweak to be able to capture the dump for the hanging subprocess in this case?
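
Not an answer, but for reference: the runtime's dump environment variables only trigger on a crash, so a pure hang typically needs an external collector attached to the live child. A sketch in the same `set` style as the configuration shown later in this thread (the paths are placeholders):

```
:: Crash-time dumps only; these will NOT fire for a hang:
set DOTNET_DbgEnableMiniDump=1
set DOTNET_DbgMiniDumpType=4
set DOTNET_DbgMiniDumpName=C:\dumps\coredump.dmp

:: For a hang, attach to the live child process instead, e.g.:
:: dotnet-dump collect -p <child-pid>
```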

jakobbotsch commented 1 month ago

> Hit in https://dev.azure.com/dnceng-public/public/_build/results?buildId=765128&view=results

> Isn't that a jitstress pipeline? Why did the test get run there if it was marked JitOptimizationSensitive=true in #102842?

The jitstress pipeline runs tests under many different configurations. This particular configuration does not set any of the "jitstress" environment variables; it only sets the following:

set DOTNET_TieredCompilation=0
set DOTNET_EnableAVX=0