Open lewing opened 3 months ago
Removing blocking from this since we haven't seen other failures.
This looks like some failure to suspend the EE. The test runs two threads for 30 seconds: one that continuously performs helper-based tailcalls, and another that does GCs. The stack trace seems to indicate that most threads are captured and waiting on the one thread doing portable tailcalls to be suspended. Interestingly it seems like windbg is not able to unwind fully out of it, but unsure if that is related:
0:005> k
# Child-SP RetAddr Call Site
00 000000c5`36f7ef98 00007ffc`a26e70c6 coreclr!memset+0x15b [D:\a\_work\1\s\src\vctools\crt\vcruntime\src\string\amd64\memset.asm @ 277]
01 000000c5`36f7efa0 00007ffc`a29a2f5f coreclr!Thread::ObjectRefFlush+0x96 [D:\a\_work\1\s\src\coreclr\vm\threads.cpp @ 5785]
02 000000c5`36f7efd0 00007ffc`a1367f20 coreclr!TailCallHelp::GetTailCallInfo+0x7f [D:\a\_work\1\s\src\coreclr\vm\tailcallhelp.cpp @ 34]
03 000000c5`36f7f040 00007ffc`433622e7 System_Private_CoreLib+0x507f20
04 000000c5`36f7f0b0 00000000`00000004 0x00007ffc`433622e7
05 000000c5`36f7f0b8 00007ffc`a2f0bd50 0x4
06 000000c5`36f7f0c0 0000023c`edad66a0 coreclr!EEContract::`vftable'
07 000000c5`36f7f0c8 000000c5`36f7f4f0 0x0000023c`edad66a0
08 000000c5`36f7f0d0 000000c5`36f7f130 0x000000c5`36f7f4f0
09 000000c5`36f7f0d8 000000c5`36f7f128 0x000000c5`36f7f130
0a 000000c5`36f7f0e0 00000000`47525440 0x000000c5`36f7f128
0b 000000c5`36f7f0e8 00007ffc`43362269 0x47525440
0c 000000c5`36f7f0f0 000000c5`36f7f130 0x00007ffc`43362269
0d 000000c5`36f7f0f8 000000c5`36f7f128 0x000000c5`36f7f130
0e 000000c5`36f7f100 00000000`4752543f 0x000000c5`36f7f128
0f 000000c5`36f7f108 00007ffc`433623e8 0x4752543f
10 000000c5`36f7f110 000000c5`36f7f130 0x00007ffc`433623e8
11 000000c5`36f7f118 000000c5`36f7f128 0x000000c5`36f7f130
12 000000c5`36f7f120 00000000`4752543f 0x000000c5`36f7f128
13 000000c5`36f7f128 000000c5`36f7f130 0x4752543f
14 000000c5`36f7f130 000000c5`36f7f128 0x000000c5`36f7f130
15 000000c5`36f7f138 00000000`47525440 0x000000c5`36f7f128
16 000000c5`36f7f140 00007ffc`43750468 0x47525440
17 000000c5`36f7f148 00007ffc`a1367f5e 0x00007ffc`43750468
18 000000c5`36f7f150 00007ffc`43361fb7 System_Private_CoreLib+0x507f5e
19 000000c5`36f7f1c0 000000c5`36f7f1f8 0x00007ffc`43361fb7
1a 000000c5`36f7f1c8 00007ffc`43750360 0x000000c5`36f7f1f8
1b 000000c5`36f7f1d0 00000000`00000000 0x00007ffc`43750360
Another thing that strikes me as odd after looking at the dump for a bit is that the thread doing portable tailcalls has Thread::TS_GCSuspendRedirected
thread state set even though it is in cooperative mode. If I'm reading the code right, it means that all the hijacking attempts on it are not doing anything.
@VSadov can you take a look?
Was this on win-x86?
Was this on win-x86?
No, the failures are from win-x64. I haven't been able to repro this locally, so I've only looked at a dump.
TS_GCSuspendRedirected means we changed the thread context to jump to suspension routine and released. Somehow the thread ignored that and continues running app code?
This style of suspension is used when no signal-like API is available. So likely win-x86 specific.
The other possibility is the state was set by accident, maybe some uninitialized data was involved.
I’ll take a look.
Tagging subscribers to this area: @mangod9 See info in area-owners.md if you want to be subscribed.
This style of suspension is used when no signal-like API is available. So likely win-x86 specific.
In the dump I can see that Thread::s_pfnQueueUserAPC2Proc == 0
, so presumably that's why it is using this suspension method? Perhaps the problem can be reproduced by forcing this mode.
Failed in: runtime-coreclr outerloop 20240913.3
Failed tests:
R2R-CG2 windows x64 Checked no_tiered_compilation @ Windows.10.Amd64.Open
- JIT/Methodical/tailcall_v4/hijacking/hijacking.cmd
Error message:
cmdLine:C:hw909107FEwAD530939eJITMethodicalMethodical_others../tailcall_v4/hijacking/hijacking.cmd Timed Out (timeout in milliseconds: 1800000 from variable __TestTimeout, start: 9/13/2024 8:10:56 PM, end: 9/13/2024 8:40:56 PM)
Return code: -100
Raw output file: C:hw909107FEwAD530939uploads ailcall_v4hijackingoutput.txt
Raw output:
BEGIN EXECUTION
hijacking.dll
1 file(s) copied.
20:10:56.30
Response file: C:hw909107FEwAD530939eJITMethodical ailcall_v4hijackinghijacking.dll.rsp
C:hw909107FEwAD530939eJITMethodical ailcall_v4hijackingIL-CG2hijacking.dll
-o:C:hw909107FEwAD530939eJITMethodical ailcall_v4hijackinghijacking.dll
--targetarch:x64
--targetos:windows
--verify-type-and-field-layout
--method-layout:random
-r:C:hw909107FEpSystem.*.dll
-r:C:hw909107FEpMicrosoft.*.dll
-r:C:hw909107FEp\xunit.*.dll
-r:C:hw909107FEpmscorlib.dll
-r:C:hw909107FEp
etstandard.dll
" "C:hw909107FEpcrossgen2crossgen2.exe" @"C:hw909107FEwAD530939eJITMethodical ailcall_v4hijackinghijacking.dll.rsp" -r:C:hw909107FEwAD530939eJITMethodicalMethodical_othersIL-CG2*.dll -r:C:hw909107FEwAD530939eJITMethodical ailcall_v4hijackingIL-CG2*.dll"
Emitting R2R PE file: C:hw909107FEwAD530939eJITMethodical ailcall_v4hijackinghijacking.dll
" "dotnet" "C:hw909107FEp
2rdump
2rdump.dll" --header --sc --in C:hw909107FEwAD530939eJITMethodical ailcall_v4hijackinghijacking.dll --out C:hw909107FEwAD530939eJITMethodical ailcall_v4hijackinghijacking.dll.r2rdump --val"
20:10:56.80
"C:hw909107FEpcorerun.exe" -p "System.Reflection.Metadata.MetadataUpdater.IsSupported=false" -p "System.Runtime.Serialization.EnableUnsafeBinaryFormatterSerialization=true" hijacking.dll
cmdLine:C:hw909107FEwAD530939eJITMethodicalMethodical_others../tailcall_v4/hijacking/hijacking.cmd Timed Out (timeout in milliseconds: 1800000 from variable __TestTimeout, start: 9/13/2024 8:10:56 PM, end: 9/13/2024 8:40:56 PM)
Invoking: C:Program Files (x86)Windows Kits10Debuggersdcdb.exe -c "$<C:hw909107FE mpkrpkmd.tmp" -z "C:corescrashdump_2080.dmp"
stdout:
Microsoft (R) Windows Debugger Version 10.0.18362.1 AMD64
Copyright (c) Microsoft Corporation. All rights reserved.
Loading Dump File [C:corescrashdump_2080.dmp]
User Mini Dump File with Full Memory: Only application data is available
************* Path validation summary **************
Response Time (ms) Location
OK C:hw909107FEpPDB
Symbol search path is: C:hw909107FEpPDB
Executable search path is:
Windows 10 Version 14393 MP (4 procs) Free x64
Product: Server, suite: TerminalServer DataCenter SingleUserTS
10.0.14393.6343 (rs1_release.230913-1727)
Machine Name:
Debug session time: Fri Sep 13 20:40:56.000 2024 (UTC + 0:00)
System Uptime: 0 days 0:50:23.232
Process Uptime: 0 days 0:30:00.000
............................
************* Symbol Loading Error Summary **************
Module name Error
ntdll The system cannot find the file specified
You can troubleshoot most symbol related issues by turning on symbol loading diagnostics (!sym noisy) and repeating the command that caused symbols to be loaded.
You should also verify that your symbol search path (.sympath) is correct.
For analysis of this file, run !analyze -v
ntdll!ZwWaitForSingleObject+0x14:
00007ffe`14196124 c3 ret
0:000> cdb: Reading initial command '$<C:hw909107FE mpkrpkmd.tmp'
0:000> .load C:Users
unner.dotnetsossos.dll
0:000> ~*k
. 0 Id: 820.13c0 Suspend: 0 Teb: 00000018`1924b000 Unfrozen
Child-SP RetAddr Call Site
00000018`1957dd08 00007ffe`106d6d1f ntdll!ZwWaitForSingleObject+0x14
*** WARNING: Unable to verify checksum for coreclr.dll
00000018`1957dd10 00007ffd`e58063ef KERNELBASE!WaitForSingleObjectEx+0x8f
00000018`1957ddb0 00007ffd`e56d764b coreclr!Thread::RareDisablePreemptiveGC+0x27f
*** WARNING: Unable to verify checksum for System.Private.CoreLib.dll
00000018`1957de80 00007ffd`e4134e3f coreclr!JIT_PInvokeEndRarePath+0x10b
*** W
Stack trace:
at Xunit.Assert.True(Nullable`1 condition, String userMessage) in /_/src/Microsoft.DotNet.XUnitAssert/src/BooleanAsserts.cs:line 146
at Xunit.Assert.True(Boolean condition, String userMessage) in /_/src/Microsoft.DotNet.XUnitAssert/src/BooleanAsserts.cs:line 128
at TestLibrary.OutOfProcessTest.RunOutOfProcessTest(String assemblyPath, String testPathPrefix)
at Program.<<Main>$>g__TestExecutor55|2_56(StreamWriter tempLogSw, StreamWriter statsCsvSw, <>c__DisplayClass2_0&)
This did not reproduce in more than a month. It is possible it was a sideeffect of some other issue, which is now fixed.
I think this we can keep this open - in case it reproduces, but put it into 10.0 as it does not seem blocking.
Build Information
Build: https://dev.azure.com/dnceng-public/cbb18261-c48f-4abb-8651-8cdcb5474649/_build/results?buildId=765968 Build error leg or test failing: JIT/Methodical/tailcall_v4/hijacking/hijacking.cmd Pull request: https://github.com/dotnet/runtime/pull/105868
Error Message
Fill the error message using step by step known issues guidance.
Known issue validation
Build: :mag_right: https://dev.azure.com/dnceng-public/public/_build/results?buildId=765968 Error message validated:
[tailcall_v4/hijacking/hijacking.cmd Timed Out
] Result validation: :white_check_mark: Known issue matched with the provided build. Validation performed at: 8/8/2024 3:07:27 PM UTCReport
Summary