dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License
15.38k stars 4.75k forks source link

tailcall_v4/hijacking/hijacking.cmd Timed Out #106134

Open lewing opened 3 months ago

lewing commented 3 months ago

Build Information

Build: https://dev.azure.com/dnceng-public/cbb18261-c48f-4abb-8651-8cdcb5474649/_build/results?buildId=765968 Build error leg or test failing: JIT/Methodical/tailcall_v4/hijacking/hijacking.cmd Pull request: https://github.com/dotnet/runtime/pull/105868

Error Message

Fill the error message using step by step known issues guidance.

{
  "ErrorMessage": "tailcall_v4/hijacking/hijacking.cmd Timed Out",
  "ErrorPattern": "",
  "BuildRetry": false,
  "ExcludeConsoleLog": false
}

Known issue validation

Build: :mag_right: https://dev.azure.com/dnceng-public/public/_build/results?buildId=765968 Error message validated: [tailcall_v4/hijacking/hijacking.cmd Timed Out] Result validation: :white_check_mark: Known issue matched with the provided build. Validation performed at: 8/8/2024 3:07:27 PM UTC

Report

Build Definition Test Pull Request
863269 dotnet/runtime JIT/Methodical/tailcall_v4/hijacking/hijacking.cmd dotnet/runtime#109605

Summary

24-Hour Hit Count 7-Day Hit Count 1-Month Count
0 0 1
jakobbotsch commented 3 months ago

Removing blocking from this since we haven't seen other failures.

jakobbotsch commented 3 months ago

This looks like some failure to suspend the EE. The test runs two threads for 30 seconds: one that continuously performs helper-based tailcalls, and another that does GCs. The stack trace seems to indicate that most threads are captured and waiting on the one thread doing portable tailcalls to be suspended. Interestingly it seems like windbg is not able to unwind fully out of it, but unsure if that is related:

0:005> k
 # Child-SP          RetAddr               Call Site
00 000000c5`36f7ef98 00007ffc`a26e70c6     coreclr!memset+0x15b [D:\a\_work\1\s\src\vctools\crt\vcruntime\src\string\amd64\memset.asm @ 277] 
01 000000c5`36f7efa0 00007ffc`a29a2f5f     coreclr!Thread::ObjectRefFlush+0x96 [D:\a\_work\1\s\src\coreclr\vm\threads.cpp @ 5785] 
02 000000c5`36f7efd0 00007ffc`a1367f20     coreclr!TailCallHelp::GetTailCallInfo+0x7f [D:\a\_work\1\s\src\coreclr\vm\tailcallhelp.cpp @ 34] 
03 000000c5`36f7f040 00007ffc`433622e7     System_Private_CoreLib+0x507f20
04 000000c5`36f7f0b0 00000000`00000004     0x00007ffc`433622e7
05 000000c5`36f7f0b8 00007ffc`a2f0bd50     0x4
06 000000c5`36f7f0c0 0000023c`edad66a0     coreclr!EEContract::`vftable'
07 000000c5`36f7f0c8 000000c5`36f7f4f0     0x0000023c`edad66a0
08 000000c5`36f7f0d0 000000c5`36f7f130     0x000000c5`36f7f4f0
09 000000c5`36f7f0d8 000000c5`36f7f128     0x000000c5`36f7f130
0a 000000c5`36f7f0e0 00000000`47525440     0x000000c5`36f7f128
0b 000000c5`36f7f0e8 00007ffc`43362269     0x47525440
0c 000000c5`36f7f0f0 000000c5`36f7f130     0x00007ffc`43362269
0d 000000c5`36f7f0f8 000000c5`36f7f128     0x000000c5`36f7f130
0e 000000c5`36f7f100 00000000`4752543f     0x000000c5`36f7f128
0f 000000c5`36f7f108 00007ffc`433623e8     0x4752543f
10 000000c5`36f7f110 000000c5`36f7f130     0x00007ffc`433623e8
11 000000c5`36f7f118 000000c5`36f7f128     0x000000c5`36f7f130
12 000000c5`36f7f120 00000000`4752543f     0x000000c5`36f7f128
13 000000c5`36f7f128 000000c5`36f7f130     0x4752543f
14 000000c5`36f7f130 000000c5`36f7f128     0x000000c5`36f7f130
15 000000c5`36f7f138 00000000`47525440     0x000000c5`36f7f128
16 000000c5`36f7f140 00007ffc`43750468     0x47525440
17 000000c5`36f7f148 00007ffc`a1367f5e     0x00007ffc`43750468
18 000000c5`36f7f150 00007ffc`43361fb7     System_Private_CoreLib+0x507f5e
19 000000c5`36f7f1c0 000000c5`36f7f1f8     0x00007ffc`43361fb7
1a 000000c5`36f7f1c8 00007ffc`43750360     0x000000c5`36f7f1f8
1b 000000c5`36f7f1d0 00000000`00000000     0x00007ffc`43750360

Another thing that strikes me as odd after looking at the dump for a bit is that the thread doing portable tailcalls has Thread::TS_GCSuspendRedirected thread state set even though it is in cooperative mode. If I'm reading the code right, it means that all the hijacking attempts on it are not doing anything. @VSadov can you take a look?

VSadov commented 3 months ago

Was this on win-x86?

jakobbotsch commented 3 months ago

Was this on win-x86?

No, the failures are from win-x64. I haven't been able to repro this locally, so I've only looked at a dump.

VSadov commented 3 months ago

TS_GCSuspendRedirected means we changed the thread context to jump to suspension routine and released. Somehow the thread ignored that and continues running app code?

This style of suspension is used when no signal-like API is available. So likely win-x86 specific.

VSadov commented 3 months ago

The other possibility is the state was set by accident, maybe some uninitialized data was involved.

I’ll take a look.

dotnet-policy-service[bot] commented 3 months ago

Tagging subscribers to this area: @mangod9 See info in area-owners.md if you want to be subscribed.

jakobbotsch commented 3 months ago

This style of suspension is used when no signal-like API is available. So likely win-x86 specific.

In the dump I can see that Thread::s_pfnQueueUserAPC2Proc == 0, so presumably that's why it is using this suspension method? Perhaps the problem can be reproduced by forcing this mode.

v-wenyuxu commented 2 months ago

Failed in: runtime-coreclr outerloop 20240913.3

Failed tests:

R2R-CG2 windows x64 Checked no_tiered_compilation @ Windows.10.Amd64.Open
    - JIT/Methodical/tailcall_v4/hijacking/hijacking.cmd

Error message:


cmdLine:C:hw909107FEwAD530939eJITMethodicalMethodical_others../tailcall_v4/hijacking/hijacking.cmd Timed Out (timeout in milliseconds: 1800000 from variable __TestTimeout, start: 9/13/2024 8:10:56 PM, end: 9/13/2024 8:40:56 PM)

Return code:      -100
Raw output file:      C:hw909107FEwAD530939uploads  ailcall_v4hijackingoutput.txt
Raw output:
BEGIN EXECUTION
hijacking.dll
        1 file(s) copied.
20:10:56.30
Response file: C:hw909107FEwAD530939eJITMethodical  ailcall_v4hijackinghijacking.dll.rsp
C:hw909107FEwAD530939eJITMethodical ailcall_v4hijackingIL-CG2hijacking.dll
-o:C:hw909107FEwAD530939eJITMethodical  ailcall_v4hijackinghijacking.dll
--targetarch:x64
--targetos:windows
--verify-type-and-field-layout
--method-layout:random
-r:C:hw909107FEpSystem.*.dll
-r:C:hw909107FEpMicrosoft.*.dll
-r:C:hw909107FEp\xunit.*.dll
-r:C:hw909107FEpmscorlib.dll
-r:C:hw909107FEp
etstandard.dll
" "C:hw909107FEpcrossgen2crossgen2.exe" @"C:hw909107FEwAD530939eJITMethodical   ailcall_v4hijackinghijacking.dll.rsp"   -r:C:hw909107FEwAD530939eJITMethodicalMethodical_othersIL-CG2*.dll  -r:C:hw909107FEwAD530939eJITMethodical  ailcall_v4hijackingIL-CG2*.dll"
Emitting R2R PE file: C:hw909107FEwAD530939eJITMethodical   ailcall_v4hijackinghijacking.dll
" "dotnet" "C:hw909107FEp
2rdump
2rdump.dll" --header --sc --in C:hw909107FEwAD530939eJITMethodical  ailcall_v4hijackinghijacking.dll --out C:hw909107FEwAD530939eJITMethodical  ailcall_v4hijackinghijacking.dll.r2rdump --val"
20:10:56.80
 "C:hw909107FEpcorerun.exe" -p "System.Reflection.Metadata.MetadataUpdater.IsSupported=false" -p "System.Runtime.Serialization.EnableUnsafeBinaryFormatterSerialization=true"  hijacking.dll 

cmdLine:C:hw909107FEwAD530939eJITMethodicalMethodical_others../tailcall_v4/hijacking/hijacking.cmd Timed Out (timeout in milliseconds: 1800000 from variable __TestTimeout, start: 9/13/2024 8:10:56 PM, end: 9/13/2024 8:40:56 PM)
Invoking: C:Program Files (x86)Windows Kits10Debuggersdcdb.exe -c "$<C:hw909107FE       mpkrpkmd.tmp" -z "C:corescrashdump_2080.dmp"
stdout: 
Microsoft (R) Windows Debugger Version 10.0.18362.1 AMD64
Copyright (c) Microsoft Corporation. All rights reserved.

Loading Dump File [C:corescrashdump_2080.dmp]
User Mini Dump File with Full Memory: Only application data is available

************* Path validation summary **************
Response                         Time (ms)     Location
OK                                             C:hw909107FEpPDB
Symbol search path is: C:hw909107FEpPDB
Executable search path is: 
Windows 10 Version 14393 MP (4 procs) Free x64
Product: Server, suite: TerminalServer DataCenter SingleUserTS
10.0.14393.6343 (rs1_release.230913-1727)
Machine Name:
Debug session time: Fri Sep 13 20:40:56.000 2024 (UTC + 0:00)
System Uptime: 0 days 0:50:23.232
Process Uptime: 0 days 0:30:00.000
............................

************* Symbol Loading Error Summary **************
Module name            Error
ntdll                  The system cannot find the file specified

You can troubleshoot most symbol related issues by turning on symbol loading diagnostics (!sym noisy) and repeating the command that caused symbols to be loaded.
You should also verify that your symbol search path (.sympath) is correct.
For analysis of this file, run !analyze -v
ntdll!ZwWaitForSingleObject+0x14:
00007ffe`14196124 c3              ret
0:000> cdb: Reading initial command '$<C:hw909107FE     mpkrpkmd.tmp'
0:000> .load C:Users
unner.dotnetsossos.dll
0:000> ~*k

.  0  Id: 820.13c0 Suspend: 0 Teb: 00000018`1924b000 Unfrozen
Child-SP          RetAddr           Call Site
00000018`1957dd08 00007ffe`106d6d1f ntdll!ZwWaitForSingleObject+0x14
*** WARNING: Unable to verify checksum for coreclr.dll
00000018`1957dd10 00007ffd`e58063ef KERNELBASE!WaitForSingleObjectEx+0x8f
00000018`1957ddb0 00007ffd`e56d764b coreclr!Thread::RareDisablePreemptiveGC+0x27f
*** WARNING: Unable to verify checksum for System.Private.CoreLib.dll
00000018`1957de80 00007ffd`e4134e3f coreclr!JIT_PInvokeEndRarePath+0x10b
*** W

Stack trace:

   at Xunit.Assert.True(Nullable`1 condition, String userMessage) in /_/src/Microsoft.DotNet.XUnitAssert/src/BooleanAsserts.cs:line 146
   at Xunit.Assert.True(Boolean condition, String userMessage) in /_/src/Microsoft.DotNet.XUnitAssert/src/BooleanAsserts.cs:line 128
   at TestLibrary.OutOfProcessTest.RunOutOfProcessTest(String assemblyPath, String testPathPrefix)
   at Program.<<Main>$>g__TestExecutor55|2_56(StreamWriter tempLogSw, StreamWriter statsCsvSw, <>c__DisplayClass2_0&)
VSadov commented 2 months ago

This did not reproduce in more than a month. It is possible it was a sideeffect of some other issue, which is now fixed.

VSadov commented 2 months ago

I think this we can keep this open - in case it reproduces, but put it into 10.0 as it does not seem blocking.