Author: | VincentBu |
---|---|
Assignees: | - |
Labels: | `arch-arm64`, `os-mac-os-x`, `GCStress`, `JitStress`, `area-CodeGen-coreclr`, `untriaged`, `blocking-clean-ci-optional` |
Milestone: | - |
Closed VincentBu closed 1 year ago
Tagging subscribers to this area: @JulieLeeMSFT See info in area-owners.md if you want to be subscribed.
Author: | VincentBu |
---|---|
Assignees: | - |
Labels: | `arch-arm64`, `os-mac-os-x`, `GCStress`, `JitStress`, `area-CodeGen-coreclr`, `untriaged`, `blocking-clean-ci-optional` |
Milestone: | - |
@jakobbotsch - assigning to you since you have osx.
This looks like another occurrence of #68884. The test is disabled in GC stress, gets skipped, but the OutOfProcess execution mechanism seems to hang after that. cc @trylek, it seems like there is some form of intermittent deadlock in the OutOfProcess execution mechanism, do you have any idea how we can get to the bottom of it?
Tagging subscribers to this area: @hoyosjs See info in area-owners.md if you want to be subscribed.
Author: | VincentBu |
---|---|
Assignees: | jakobbotsch |
Labels: | `arch-arm64`, `os-mac-os-x`, `GCStress`, `JitStress`, `area-Infrastructure-coreclr`, `blocking-clean-ci-optional` |
Milestone: | 7.0.0 |
I managed to get a repro by running CoreclrTestWrapperLib.RunTest
in a loop with COMPlus_GCStress=c
and COMPlus_JitStress=2
on macOS arm64. Trying to track down now what the problem is.
@dotnet/area-system-diagnostics-process Can you please confirm that the following code looks correct: https://github.com/dotnet/runtime/blob/8d4a724883715cc5ae71e020b22762136f7ce50b/src/tests/Common/Coreclr.TestWrapper/CoreclrTestWrapperLib.cs#L306-L334
On macOS arm64 I am frequently hitting a hang on the WaitForExit
call when I run this code in a loop under COMPlus_GCStress=c
. The child process is bash that runs a test wrapper script that prints a message then immediately exits.
Note that GCStress introduces a lot of execution time variability. I suspect that is part of the problem.
Stack traces from all threads during the hang are available here.
Here is a test project and script to reproduce the problem. repro.zip To use:
repro.sh
is in current dirCOMPlus_GCStress=c corerun <output dll>
If the last step does not repro the hang within the first 100 iterations then ctrl-C and try step 4 again.
If I change the copy tasks to:
Task copyOutput = Task.Run(async () =>
{
try
{
Console.WriteLine($" [{index}] Starting CopyTo output");
await process.StandardOutput.BaseStream.CopyToAsync(outputStream, 4096, cts.Token);
Console.WriteLine($" [{index}] Done with CopyTo output");
}
catch (Exception ex) when (ex is not OperationCanceledException)
{
Console.WriteLine(ex);
}
});
Task copyError = Task.Run(async () =>
{
try
{
Console.WriteLine($" [{index}] Starting CopyTo err");
await process.StandardError.BaseStream.CopyToAsync(errorStream, 4096, cts.Token);
Console.WriteLine($" [{index}] Done with CopyTo err");
}
catch (Exception ex) when (ex is not OperationCanceledException)
{
Console.WriteLine(ex);
}
});
Then on hang I see:
[2] Running
[2] WaitForExit(600000) on PID 73072
[2] Starting CopyTo err
[2] Starting CopyTo output
[2] Done with CopyTo err
[2] Done with CopyTo output
The bash process is still alive at this point.
Honestly I don't think that marking these tests as out-of-proc actually fixed the hangs we were seeing, it just made them much less frequent. My working theory based on Andrew's initial analysis of the problem is that the GC stress logic can sometimes cause some form of starvation w.r.t. draining the finalization queue. Apparently the problem was much more pronounced when running the tests in-proc, according to my working theory this is likely because the one bigger process comprising hundreds of test cases is naturally much heavier with regard to GC allocation compared to running just a single test case as a separate process. It is great that you managed to shed more light on what's happening as marking the tests out-of-proc was just a mitigation measure, not the real solution for the underlying problem.
What I said above about the bash process still being alive is not totally right. The bash process shows up as (bash)
in ps
. From my googling, the parentheses seem to indicate that something might have gone wrong with an execve
syscall, but I am not totally sure. I am also not able to attach lldb to the (bash)
process, so it might be that Process.Start actually failed somewhere.
@dotnet/area-system-diagnostics-process Can one of you please investigate this hang further? It might potentially be a GCStress issue, although it seems to happen on macOS arm64 only. It would be good to pin down the events that happen in Process.Start and Process.WaitForExit.
Tagging subscribers to this area: @dotnet/area-system-diagnostics-process See info in area-owners.md if you want to be subscribed.
Author: | VincentBu |
---|---|
Assignees: | - |
Labels: | `arch-arm64`, `area-System.Diagnostics.Process`, `os-mac-os-x`, `GCStress`, `blocking-clean-ci-optional` |
Milestone: | 7.0.0 |
Failed again in: runtime-coreclr gcstress-extra 20220611.1
Failed test:
coreclr Linux arm Checked gcstress0xc_zapdisable_heapverify1 @ (Ubuntu.1804.Arm32.Open)Ubuntu.1804.Armarch.Open@mcr.microsoft.com/dotnet-buildtools/prereqs:ubuntu-18.04-helix-arm32v7-bfcd90a-20200121150440
- JIT/Methodical/Arrays/lcs/lcs2_r/lcs2_r.cmd
- JIT/Methodical/Arrays/lcs/lcs2_do/lcs2_do.cmd
- JIT/Methodical/Arrays/lcs/lcs2_ro/lcs2_ro.cmd
- JIT/Methodical/Boxing/boxunbox/KeepAliveBoxOpt/KeepAliveBoxOpt.cmd
coreclr Linux arm64 Checked gcstress0xc_zapdisable_heapverify1 @ (Ubuntu.1804.Arm64.Open)Ubuntu.1804.Armarch.Open@mcr.microsoft.com/dotnet-buildtools/prereqs:ubuntu-18.04-helix-arm64v8-20210531091519-97d8652
- JIT/Methodical/Arrays/lcs/lcs2_ro/lcs2_ro.cmd
- JIT/Methodical/Arrays/lcs/lcs2_r/lcs2_r.cmd
- JIT/Methodical/Arrays/lcs/lcs2_do/lcs2_do.cmd
- JIT/Methodical/Arrays/lcs/lcs2_d/lcs2_d.cmd
- JIT/Methodical/fp/exgen/10w5d_cs_r/10w5d_cs_r.cmd
- JIT/Methodical/fp/exgen/10w5d_cs_d/10w5d_cs_d.cmd
- JIT/Methodical/Boxing/boxunbox/KeepAliveBoxOpt/KeepAliveBoxOpt.cmd
coreclr Linux arm64 Checked gcstress0xc_zapdisable_jitstress2 @ (Ubuntu.1804.Arm64.Open)Ubuntu.1804.Armarch.Open@mcr.microsoft.com/dotnet-buildtools/prereqs:ubuntu-18.04-helix-arm64v8-20210531091519-97d8652
- JIT/Methodical/Arrays/lcs/lcs2_r/lcs2_r.cmd
- JIT/Methodical/Arrays/lcs/lcs2_d/lcs2_d.cmd
- JIT/Methodical/Arrays/lcs/lcs2_ro/lcs2_ro.cmd
- JIT/Methodical/Arrays/lcs/lcs2_do/lcs2_do.cmd
- JIT/Methodical/fp/exgen/10w5d_cs_r/10w5d_cs_r.cmd
- JIT/Methodical/fp/exgen/10w5d_cs_d/10w5d_cs_d.cmd
- JIT/Methodical/Boxing/boxunbox/KeepAliveBoxOpt/KeepAliveBoxOpt.cmd
coreclr OSX arm64 Checked gcstress0xc_zapdisable @ OSX.1200.ARM64.Open
- JIT/Methodical/Arrays/lcs/lcs2_d/lcs2_d.cmd
coreclr OSX arm64 Checked gcstress0xc_jitstress2 @ OSX.1200.ARM64.Open
- JIT/Methodical/Arrays/lcs/lcs2_r/lcs2_r.cmd
Error message:
Test Infrastructure Failure: System.IO.IOException: Unknown error 42424 : '/root/helix/work/workitem/uploads/Arrays/lcs/lcs2_r'
at System.IO.FileSystem.CreateDirectory(String fullPath)
at System.IO.Directory.CreateDirectory(String path)
at TestLibrary.OutOfProcessTest.RunOutOfProcessTest(String basePath, String assemblyPath)
Expected: True
Actual: False
Stack trace
at Xunit.Assert.True(Nullable`1 condition, String userMessage) in /_/src/xunit.assert/Asserts/BooleanAsserts.cs:line 132
at TestLibrary.OutOfProcessTest.RunOutOfProcessTest(String basePath, String assemblyPath)
at Program.<Main>$(String[] args)
Queued | OS | Arch | Column1 |
---|---|---|---|
2022-06-12T08:37:37.57Z | osx.1200.arm64.open | arm64 | runtime-coreclr crossgen2-composite gcstress Checked-gcstress0xc_zapdisable_jitstress2 |
2022-05-16T08:50:14.127Z | osx.1200.arm64.open | arm64 | runtime-coreclr crossgen2-composite gcstress Checked-gcstress0xc_zapdisable |
2022-05-09T00:03:15.936Z | osx.1200.arm64.open | arm64 | runtime-coreclr gcstress-extra Checked-gcstress0xc_jitstress2 |
Failed again in: runtime-coreclr gcstress-extra 20220703.1
Failed test:
coreclr Linux arm64 Checked gcstress0xc_zapdisable_heapverify1 @ (Ubuntu.1804.Arm64.Open)Ubuntu.1804.Armarch.Open@mcr.microsoft.com/dotnet-buildtools/prereqs:ubuntu-18.04-helix-arm64v8-20210531091519-97d8652
- JIT/Methodical/Arrays/lcs/lcs2_r/lcs2_r.cmd
- JIT/Methodical/Arrays/lcs/lcs2_d/lcs2_d.cmd
- JIT/Methodical/Arrays/lcs/lcs2_ro/lcs2_ro.cmd
- JIT/Methodical/Arrays/lcs/lcs2_do/lcs2_do.cmd
- JIT/Methodical/fp/exgen/10w5d_cs_r/10w5d_cs_r.cmd
- JIT/Methodical/fp/exgen/10w5d_cs_d/10w5d_cs_d.cmd
- JIT/Methodical/Boxing/boxunbox/KeepAliveBoxOpt/KeepAliveBoxOpt.cmd
coreclr Linux arm64 Checked gcstress0xc_zapdisable_jitstress2 @ (Ubuntu.1804.Arm64.Open)Ubuntu.1804.Armarch.Open@mcr.microsoft.com/dotnet-buildtools/prereqs:ubuntu-18.04-helix-arm64v8-20210531091519-97d8652
- JIT/Methodical/Arrays/lcs/lcs2_d/lcs2_d.cmd
- JIT/Methodical/Arrays/lcs/lcs2_r/lcs2_r.cmd
- JIT/Methodical/Arrays/lcs/lcs2_ro/lcs2_ro.cmd
- JIT/Methodical/Arrays/lcs/lcs2_do/lcs2_do.cmd
- JIT/Methodical/fp/exgen/10w5d_cs_r/10w5d_cs_r.cmd
- JIT/Methodical/Boxing/boxunbox/KeepAliveBoxOpt/KeepAliveBoxOpt.cmd
- JIT/Methodical/fp/exgen/10w5d_cs_d/10w5d_cs_d.cmd
- JIT/Methodical/doublearray/dblarray1_cs_r/dblarray1_cs_r.cmd
coreclr OSX arm64 Checked gcstress0xc_zapdisable_heapverify1 @ OSX.1200.ARM64.Open
- JIT/Methodical/Arrays/lcs/lcsvalbox_r/lcsvalbox_r.cmd
coreclr OSX arm64 Checked gcstress0xc_tailcallstress @ OSX.1200.ARM64.Open
- JIT/Methodical/Arrays/lcs/lcs2_d/lcs2_d.cmd
Error message:
Test Infrastructure Failure: System.IO.IOException: Unknown error 42424 : '/root/helix/work/workitem/uploads/Arrays/lcs/lcs2_r'
at System.IO.FileSystem.CreateDirectory(String fullPath, UnixFileMode unixCreateMode)
at System.IO.Directory.CreateDirectory(String path)
at TestLibrary.OutOfProcessTest.RunOutOfProcessTest(String basePath, String assemblyPath)
Expected: True
Actual: False
Stack trace
at Xunit.Assert.True(Nullable`1 condition, String userMessage) in /_/src/xunit.assert/Asserts/BooleanAsserts.cs:line 132
at TestLibrary.OutOfProcessTest.RunOutOfProcessTest(String basePath, String assemblyPath)
at Program.<Main>$(String[] args)
Failed again in: runtime-coreclr gcstress0x3-gcstress0xc 20220717.1
Failed test:
coreclr OSX arm64 Checked gcstress0xc @ OSX.1200.ARM64.Open
- JIT/Methodical/Arrays/lcs/lcs2_ro/lcs2_ro.cmd
Error message:
cmdLine:/private/tmp/helix/working/B61D0A0F/w/B27509B2/e/JIT/Methodical/Methodical_ro/../Arrays/lcs/lcs2_ro/lcs2_ro.sh Timed Out (timeout in milliseconds: 3600000 from variable __TestTimeout, start: 7/17/2022 8:25:33 AM, end: 7/17/2022 9:25:34 AM)
Return code: -100
Raw output file: /tmp/helix/working/B61D0A0F/w/B27509B2/uploads/Arrays/lcs/lcs2_ro/output.txt
Raw output:
SKIPPING EXECUTION BECAUSE COMPlus_GCStress IS SET
cmdLine:/private/tmp/helix/working/B61D0A0F/w/B27509B2/e/JIT/Methodical/Methodical_ro/../Arrays/lcs/lcs2_ro/lcs2_ro.sh Timed Out (timeout in milliseconds: 3600000 from variable __TestTimeout, start: 7/17/2022 8:25:33 AM, end: 7/17/2022 9:25:34 AM)
Test Harness Exitcode is : -100
To run the test:
Set up CORE_ROOT and run.
/private/tmp/helix/working/B61D0A0F/w/B27509B2/e/JIT/Methodical/Methodical_ro/../Arrays/lcs/lcs2_ro/lcs2_ro.sh
Expected: True
Actual: False
Stack trace
at Xunit.Assert.True(Nullable`1 condition, String userMessage) in /_/src/xunit.assert/Asserts/BooleanAsserts.cs:line 132
at TestLibrary.OutOfProcessTest.RunOutOfProcessTest(String basePath, String assemblyPath)
at Program.<Main>$(String[] args)
@dotnet/area-system-diagnostics-process, @jeffhandley this fails frequently and blocks gcstress test. Please take a look.
@dotnet/area-system-diagnostics-process Can one of you please investigate this hang further? It might potentially be a GCStress issue, although it seems to happen on macOS arm64 only. It would be good to pin down the events that happen in Process.Start and Process.WaitForExit.
Failed again in : runtime-coreclr jitstress 20220724.3
Failed test:
coreclr OSX arm64 Checked gcstress0xc_zapdisable_jitstress2 @ OSX.1200.ARM64.Open
- JIT/Methodical/Arrays/lcs/lcsvalbox_d/lcsvalbox_d.cmd
- JIT/Methodical/Arrays/lcs/lcsvalbox_r/lcsvalbox_r.cmd
coreclr OSX arm64 Checked gcstress0xc_zapdisable_heapverify1 @ OSX.1200.ARM64.Open
- JIT/Methodical/Coverage/copy_prop_byref_to_native_int/copy_prop_byref_to_native_int.cmd
Error message:
cmdLine:/private/tmp/helix/working/9B7B093A/w/B1150A05/e/JIT/Methodical/Methodical_d1/../Arrays/lcs/lcsvalbox_d/lcsvalbox_d.sh Timed Out (timeout in milliseconds: 5400000 from variable __TestTimeout, start: 7/24/2022 4:29:54 PM, end: 7/24/2022 5:59:54 PM)
Return code: -100
Raw output file: /tmp/helix/working/9B7B093A/w/B1150A05/uploads/Arrays/lcs/lcsvalbox_d/output.txt
Raw output:
SKIPPING EXECUTION BECAUSE COMPlus_GCStress IS SET
cmdLine:/private/tmp/helix/working/9B7B093A/w/B1150A05/e/JIT/Methodical/Methodical_d1/../Arrays/lcs/lcsvalbox_d/lcsvalbox_d.sh Timed Out (timeout in milliseconds: 5400000 from variable __TestTimeout, start: 7/24/2022 4:29:54 PM, end: 7/24/2022 5:59:54 PM)
Test Harness Exitcode is : -100
To run the test:
Set up CORE_ROOT and run.
/private/tmp/helix/working/9B7B093A/w/B1150A05/e/JIT/Methodical/Methodical_d1/../Arrays/lcs/lcsvalbox_d/lcsvalbox_d.sh
Expected: True
Actual: False
Stack trace
at TestLibrary.OutOfProcessTest.RunOutOfProcessTest(String basePath, String assemblyPath)
at Program.<Main>$(String[] args)
Queued | OS | Arch | Column1 |
---|---|---|---|
2022-07-03T23:51:16.641Z | osx.1200.arm64.open | arm64 | runtime-coreclr gcstress-extra Checked-gcstress0xc_zapdisable |
Failed again in: runtime-coreclr gcstress0x3-gcstress0xc 20220807.1
Failed test:
coreclr Linux arm Checked gcstress0x3 @ (Ubuntu.1804.Arm32.Open)Ubuntu.1804.Armarch.Open@mcr.microsoft.com/dotnet-buildtools/prereqs:ubuntu-18.04-helix-arm32v7-bfcd90a-20200121150440
- ilverify/ILVerificationTests.sh
coreclr Linux arm64 Checked gcstress0x3 @ (Ubuntu.1804.Arm64.Open)Ubuntu.1804.Armarch.Open@mcr.microsoft.com/dotnet-buildtools/prereqs:ubuntu-18.04-helix-arm64v8-20210531091519-97d8652
- JIT/Regression/JitBlue/Runtime_72845/Runtime_72845/Runtime_72845.sh
Error message:
cmdLine:/root/helix/work/workitem/e/ilverify/ILVerificationTests.sh Timed Out (timeout in milliseconds: 3600000 from variable __TestTimeout, start: 8/7/2022 4:11:04 PM, end: 8/7/2022 5:11:05 PM)
Return code: -100
Raw output file: /root/helix/work/workitem/uploads/Reports/ilverify/ILVerificationTests.output.txt
Raw output:
BEGIN EXECUTION
/root/helix/work/correlation/corerun -p System.Reflection.Metadata.MetadataUpdater.IsSupported=false ILVerificationTests.dll ''
cmdLine:/root/helix/work/workitem/e/ilverify/ILVerificationTests.sh Timed Out (timeout in milliseconds: 3600000 from variable __TestTimeout, start: 8/7/2022 4:11:04 PM, end: 8/7/2022 5:11:05 PM)
Test Harness Exitcode is : -100
To run the test:
set CORE_ROOT=/root/helix/work/correlation
/root/helix/work/workitem/e/ilverify/ILVerificationTests.sh
Expected: True
Actual: False
Stack trace
at ilverify._ILVerificationTests_._ILVerificationTests_sh()
at System.RuntimeMethodHandle.InvokeMethod(Object target, Void** arguments, Signature sig, Boolean isConstructor)
at System.Reflection.MethodInvoker.Invoke(Object obj, IntPtr* args, BindingFlags invokeAttr)
Queued | OS | Arch | Column1 |
---|---|---|---|
2022-08-07T14:40:21.014Z | ubuntu.1804.armarch.open | arm | runtime-coreclr gcstress0x3-gcstress0xc Checked-gcstress0x3 |
2022-08-07T09:14:58.202Z | osx.1200.amd64.open | x64 | runtime-coreclr crossgen2-composite gcstress Checked-gcstress0xc_jitminopts_heapverify1 |
2022-08-07T08:59:42.467Z | osx.1200.amd64.open | x64 | runtime-coreclr crossgen2-composite gcstress Checked-gcstress0xc_tailcallstress |
2022-08-07T08:58:31.884Z | osx.1200.amd64.open | x64 | runtime-coreclr crossgen2-composite gcstress Checked-gcstress0xc_jitstress2 |
2022-08-07T08:43:20.985Z | osx.1200.amd64.open | x64 | runtime-coreclr crossgen2-composite gcstress Checked-gcstress0xc_jitstress1 |
I'm moving this to the 8.0.0 milestone. After chatting with @adamsitnik we don't think this is a product issue because we don't have any architecture-specific code in play here. But we'll take a look from the product side after the 7.0.0 RC1 snap.
Failed again in: runtime-coreclr gcstress-extra 20220814.1
Failed test:
coreclr OSX arm64 Checked gcstress0xc_zapdisable @ OSX.1200.ARM64.Open
- JIT/Methodical/Arrays/lcs/lcsvalbox_d/lcsvalbox_d.cmd
coreclr OSX arm64 Checked gcstress0xc_zapdisable_heapverify1 @ OSX.1200.ARM64.Open
- JIT/Methodical/tailcall_v4/hijacking/hijacking.cmd
Error message:
cmdLine:/private/tmp/helix/working/AD430949/w/AE270949/e/JIT/Methodical/Methodical_d1/../Arrays/lcs/lcsvalbox_d/lcsvalbox_d.sh Timed Out (timeout in milliseconds: 5400000 from variable __TestTimeout, start: 8/14/2022 4:39:33 PM, end: 8/14/2022 6:09:33 PM)
Return code: -100
Raw output file: /tmp/helix/working/AD430949/w/AE270949/uploads/Arrays/lcs/lcsvalbox_d/output.txt
Raw output:
SKIPPING EXECUTION BECAUSE COMPlus_GCStress IS SET
cmdLine:/private/tmp/helix/working/AD430949/w/AE270949/e/JIT/Methodical/Methodical_d1/../Arrays/lcs/lcsvalbox_d/lcsvalbox_d.sh Timed Out (timeout in milliseconds: 5400000 from variable __TestTimeout, start: 8/14/2022 4:39:33 PM, end: 8/14/2022 6:09:33 PM)
Test Harness Exitcode is : -100
To run the test:
Set up CORE_ROOT and run.
/private/tmp/helix/working/AD430949/w/AE270949/e/JIT/Methodical/Methodical_d1/../Arrays/lcs/lcsvalbox_d/lcsvalbox_d.sh
Expected: True
Actual: False
Stack trace
at Xunit.Assert.True(Nullable`1 condition, String userMessage) in /_/src/xunit.assert/Asserts/BooleanAsserts.cs:line 132
at TestLibrary.OutOfProcessTest.RunOutOfProcessTest(String basePath, String assemblyPath)
at Program.<Main>$(String[] args)
Failed again in: runtime-coreclr gcstress0x3-gcstress0xc 20220828.1
Failed test:
coreclr Linux arm Checked gcstress0x3 @ (Ubuntu.1804.Arm32.Open)Ubuntu.1804.Armarch.Open@mcr.microsoft.com/dotnet-buildtools/prereqs:ubuntu-18.04-helix-arm32v7-bfcd90a-20200121150440
- JIT/Regression/JitBlue/Runtime_72845/Runtime_72845/Runtime_72845.sh
coreclr Linux arm64 Checked gcstress0x3 @ (Ubuntu.1804.Arm64.Open)Ubuntu.1804.Armarch.Open@mcr.microsoft.com/dotnet-buildtools/prereqs:ubuntu-18.04-helix-arm64v8-20220824230426-06f234f
- JIT/Regression/JitBlue/Runtime_72845/Runtime_72845/Runtime_72845.sh
coreclr OSX arm64 Checked gcstress0xc @ OSX.1200.ARM64.Open
- JIT/Methodical/Arrays/lcs/lcsvalbox_do/lcsvalbox_do.cmd
Error message:
cmdLine:/root/helix/work/workitem/e/JIT/Regression/JitBlue/Runtime_72845/Runtime_72845/Runtime_72845.sh Timed Out (timeout in milliseconds: 3600000 from variable __TestTimeout, start: 8/28/2022 3:11:57 PM, end: 8/28/2022 4:11:57 PM)
Return code: -100
Raw output file: /root/helix/work/workitem/uploads/Reports/JIT.Regression/JitBlue/Runtime_72845/Runtime_72845/Runtime_72845.output.txt
Raw output:
BEGIN EXECUTION
/root/helix/work/correlation/corerun -p System.Reflection.Metadata.MetadataUpdater.IsSupported=false Runtime_72845.dll ''
cmdLine:/root/helix/work/workitem/e/JIT/Regression/JitBlue/Runtime_72845/Runtime_72845/Runtime_72845.sh Timed Out (timeout in milliseconds: 3600000 from variable __TestTimeout, start: 8/28/2022 3:11:57 PM, end: 8/28/2022 4:11:57 PM)
Test Harness Exitcode is : -100
To run the test:
set CORE_ROOT=/root/helix/work/correlation
/root/helix/work/workitem/e/JIT/Regression/JitBlue/Runtime_72845/Runtime_72845/Runtime_72845.sh
Expected: True
Actual: False
Stack trace
at JIT_Regression._JitBlue_Runtime_72845_Runtime_72845_Runtime_72845_._JitBlue_Runtime_72845_Runtime_72845_Runtime_72845_sh()
at System.RuntimeMethodHandle.InvokeMethod(Object target, Void** arguments, Signature sig, Boolean isConstructor)
at System.Reflection.MethodInvoker.Invoke(Object obj, IntPtr* args, BindingFlags invokeAttr)
Queued | OS | Arch | Column1 |
---|---|---|---|
2022-08-28T15:08:37.572Z | ubuntu.1804.armarch.open | arm64 | runtime-coreclr gcstress0x3-gcstress0xc Checked-gcstress0x3 |
2022-08-28T15:05:04.526Z | ubuntu.1804.armarch.open | arm | runtime-coreclr gcstress0x3-gcstress0xc Checked-gcstress0x3 |
2022-08-27T14:54:10.863Z | ubuntu.1804.armarch.open | arm64 | runtime-coreclr gcstress0x3-gcstress0xc Checked-gcstress0x3 |
2022-08-27T14:50:41.211Z | ubuntu.1804.armarch.open | arm | runtime-coreclr gcstress0x3-gcstress0xc Checked-gcstress0x3 |
2022-08-21T14:49:30.007Z | ubuntu.1804.armarch.open | arm | runtime-coreclr gcstress0x3-gcstress0xc Checked-gcstress0x3 |
Failed again in: runtime-coreclr gcstress-extra 20220904.1
Failed test:
coreclr OSX arm64 Checked gcstress0xc_zapdisable_heapverify1 @ OSX.1200.ARM64.Open
- JIT/Methodical/Coverage/copy_prop_byref_to_native_int/copy_prop_byref_to_native_int.cmd
coreclr OSX arm64 Checked gcstress0xc_jitstress2 @ OSX.1200.ARM64.Open
- JIT/Methodical/Boxing/boxunbox/KeepAliveBoxOpt/KeepAliveBoxOpt.cmd
coreclr OSX arm64 Checked gcstress0xc_jitstress1 @ OSX.1200.ARM64.Open
- JIT/Methodical/Arrays/lcs/lcs2_r/lcs2_r.cmd
coreclr OSX arm64 Checked gcstress0xc_zapdisable_jitstress2 @ OSX.1200.ARM64.Open
- JIT/Methodical/refany/stress1_r/stress1_r.cmd
coreclr windows x86 Checked gcstress0xc_jitstress1 @ Windows.10.Amd64.Open
- baseservices\\TieredCompilation\\BasicTestWithMcj\\BasicTestWithMcj.cmd
coreclr windows x64 Checked gcstress0xc_jitstress1 @ Windows.10.Amd64.Open
- baseservices\\TieredCompilation\\BasicTestWithMcj\\BasicTestWithMcj.cmd
Error message:
cmdLine:/private/tmp/helix/working/B5640950/w/A9200938/e/JIT/Methodical/Methodical_others/../Coverage/copy_prop_byref_to_native_int/copy_prop_byref_to_native_int.sh Timed Out (timeout in milliseconds: 5400000 from variable __TestTimeout, start: 9/4/2022 4:54:52 PM, end: 9/4/2022 6:24:53 PM)
Return code: -100
Raw output file: /tmp/helix/working/B5640950/w/A9200938/uploads/Coverage/copy_prop_byref_to_native_int/output.txt
Raw output:
BEGIN EXECUTION
/tmp/helix/working/B5640950/p/corerun -p System.Reflection.Metadata.MetadataUpdater.IsSupported=false copy_prop_byref_to_native_int.dll ''
Expected: 100
Actual: 100
END EXECUTION - PASSED
cmdLine:/private/tmp/helix/working/B5640950/w/A9200938/e/JIT/Methodical/Methodical_others/../Coverage/copy_prop_byref_to_native_int/copy_prop_byref_to_native_int.sh Timed Out (timeout in milliseconds: 5400000 from variable __TestTimeout, start: 9/4/2022 4:54:52 PM, end: 9/4/2022 6:24:53 PM)
Test Harness Exitcode is : -100
To run the test:
Set up CORE_ROOT and run.
/private/tmp/helix/working/B5640950/w/A9200938/e/JIT/Methodical/Methodical_others/../Coverage/copy_prop_byref_to_native_int/copy_prop_byref_to_native_int.sh
Expected: True
Actual: False
Stack trace
at Xunit.Assert.True(Nullable`1 condition, String userMessage) in /_/src/xunit.assert/Asserts/BooleanAsserts.cs:line 132
at TestLibrary.OutOfProcessTest.RunOutOfProcessTest(String basePath, String assemblyPath)
at Program.<Main>$(String[] args)
Queued | OS | Arch | Column1 |
---|---|---|---|
2022-08-13T23:14:59.543Z | osx.1200.arm64.open | arm64 | runtime-coreclr gcstress-extra Checked-gcstress0xc_zapdisable |
Failed again in: runtime-coreclr gcstress-extra 20220917.1
Failed test:
coreclr windows x86 Checked gcstress0xc_zapdisable_heapverify1 @ Windows.10.Amd64.Open
- JIT\\Regression\\CLR-x86-JIT\\V1.2-M01\\b16473\\b16473\\b16473.cmd
coreclr Linux arm64 Checked gcstress0xc_jitstress2 @ (Ubuntu.1804.Arm64.Open)Ubuntu.1804.Armarch.Open@mcr.microsoft.com/dotnet-buildtools/prereqs:ubuntu-18.04-helix-arm64v8-20220824230426-06f234f
- baseservices/exceptions/simple/ParallelCrashMainThread/ParallelCrashMainThread.sh
coreclr windows x86 Checked gcstress0xc_jitminopts_heapverify1 @ Windows.10.Amd64.Open
- JIT\\Regression\\CLR-x86-JIT\\V1.2-M01\\b16473\\b16473\\b16473.cmd
coreclr OSX arm64 Checked gcstress0xc_zapdisable @ OSX.1200.ARM64.Open
- JIT/Methodical/Arrays/lcs/lcsvalbox_r/lcsvalbox_r.cmd
coreclr OSX arm64 Checked gcstress0xc_tailcallstress @ OSX.1200.ARM64.Open
- JIT/Methodical/Arrays/misc/arrres_il_r/arrres.cmd
- JIT/Methodical/Coverage/copy_prop_byref_to_native_int/copy_prop_byref_to_native_int.cmd
coreclr OSX arm64 Checked gcstress0xc_zapdisable_heapverify1 @ OSX.1200.ARM64.Open
- JIT/Methodical/VT/port/lcs_gcref_d/lcs_gcref_d.cmd
coreclr OSX arm64 Checked gcstress0xc_zapdisable_jitstress2 @ OSX.1200.ARM64.Open
- JIT/Methodical/Arrays/lcs/lcsvalbox_do/lcsvalbox_do.cmd
coreclr OSX arm64 Checked gcstress0xc_jitstress1 @ OSX.1200.ARM64.Open
- JIT/Methodical/Arrays/lcs/lcs2_ro/lcs2_ro.cmd
coreclr OSX arm64 Checked gcstress0xc_jitstress2 @ OSX.1200.ARM64.Open
- JIT/Methodical/Coverage/copy_prop_byref_to_native_int/copy_prop_byref_to_native_int.cmd
Error message:
cmdLine:C:\h\w\A8D9094A\w\ABDB09E4\e\JIT\Regression\CLR-x86-JIT\V1.2-M01\b16473\b16473\b16473.cmd Timed Out (timeout in milliseconds: 5400000 from variable __TestTimeout, start: 9/17/2022 11:36:17 PM, end: 9/18/2022 1:06:17 AM)
Return code: -100
Raw output file: C:\h\w\A8D9094A\w\ABDB09E4\uploads\Reports\JIT.Regression\CLR-x86-JIT\V1.2-M01\b16473\b16473\b16473.output.txt
Raw output:
BEGIN EXECUTION
"C:\h\w\A8D9094A\p\corerun.exe" -p "System.Reflection.Metadata.MetadataUpdater.IsSupported=false" b16473.dll
cmdLine:C:\h\w\A8D9094A\w\ABDB09E4\e\JIT\Regression\CLR-x86-JIT\V1.2-M01\b16473\b16473\b16473.cmd Timed Out (timeout in milliseconds: 5400000 from variable __TestTimeout, start: 9/17/2022 11:36:17 PM, end: 9/18/2022 1:06:17 AM)
Test Harness Exitcode is : -100
To run the test:
set CORE_ROOT=C:\h\w\A8D9094A\p
C:\h\w\A8D9094A\w\ABDB09E4\e\JIT\Regression\CLR-x86-JIT\V1.2-M01\b16473\b16473\b16473.cmd
Expected: True
Actual: False
Stack trace
at JIT_Regression._CLR_x86_JIT_V1_2_M01_b16473_b16473_b16473_._CLR_x86_JIT_V1_2_M01_b16473_b16473_b16473_cmd()
at System.RuntimeMethodHandle.InvokeMethod(Object target, Void** arguments, Signature sig, Boolean isConstructor)
at System.Reflection.MethodInvoker.Invoke(Object obj, IntPtr* args, BindingFlags invokeAttr)
Queued | OS | Arch | Column1 |
---|---|---|---|
2022-09-18T00:03:12.997Z | ubuntu.1804.armarch.open | arm64 | runtime-coreclr gcstress-extra Checked-gcstress0xc_jitstress2 |
2022-08-29T08:02:18.821Z | osx.1200.arm64.open | arm64 | runtime-coreclr crossgen2-composite gcstress Checked-gcstress0xc_jitstress1 |
2022-08-22T10:16:31.694Z | osx.1200.arm64.open | arm64 | runtime-coreclr crossgen2-composite gcstress Checked-gcstress0xc_jitminopts_heapverify1 |
2022-08-08T09:54:22.672Z | osx.1200.arm64.open | arm64 | runtime-coreclr crossgen2-composite gcstress Checked-gcstress0xc_tailcallstress |
2022-08-07T00:28:12.12Z | ubuntu.1804.armarch.open | arm64 | runtime-coreclr gcstress-extra Checked-gcstress0xc_tailcallstress |
2022-07-24T09:26:14.44Z | osx.1200.arm64.open | arm64 | runtime-coreclr crossgen2-composite gcstress Checked-gcstress0xc_jitminopts_heapverify1 |
2022-08-13T23:14:59.543Z | osx.1200.arm64.open | arm64 | runtime-coreclr gcstress-extra Checked-gcstress0xc_zapdisable |
Failed again in: runtime-coreclr gcstress-extra 20220925.1
Failed test:
coreclr windows x86 Checked gcstress0xc_jitminopts_heapverify1 @ Windows.10.Amd64.Open
- JIT\\Regression\\CLR-x86-JIT\\V1.2-M01\\b16473\\b16473\\b16473.cmd
coreclr windows x86 Checked gcstress0xc_zapdisable_heapverify1 @ Windows.10.Amd64.Open
- JIT\\Regression\\CLR-x86-JIT\\V1.2-M01\\b16473\\b16473\\b16473.cmd
coreclr OSX arm64 Checked gcstress0xc_zapdisable_jitstress2 @ OSX.1200.ARM64.Open
- JIT/Methodical/Arrays/lcs/lcs2_r/lcs2_r.cmd
coreclr OSX arm64 Checked gcstress0xc_jitstress1 @ OSX.1200.ARM64.Open
- JIT/Methodical/Arrays/lcs/lcs2_ro/lcs2_ro.cmd
- JIT/Methodical/Arrays/lcs/lcs2_r/lcs2_r.cmd
coreclr OSX arm64 Checked gcstress0xc_zapdisable @ OSX.1200.ARM64.Open
- JIT/Methodical/Arrays/lcs/lcs2_ro/lcs2_ro.cmd
coreclr OSX arm64 Checked gcstress0xc_zapdisable_heapverify1 @ OSX.1200.ARM64.Open
- JIT/Methodical/tailcall_v4/hijacking/hijacking.cmd
Error message:
cmdLine:C:\h\w\C4220A65\w\A4C708DB\e\JIT\Regression\CLR-x86-JIT\V1.2-M01\b16473\b16473\b16473.cmd Timed Out (timeout in milliseconds: 5400000 from variable __TestTimeout, start: 9/26/2022 1:45:54 AM, end: 9/26/2022 3:15:54 AM)
Return code: -100
Raw output file: C:\h\w\C4220A65\w\A4C708DB\uploads\Reports\JIT.Regression\CLR-x86-JIT\V1.2-M01\b16473\b16473\b16473.output.txt
Raw output:
BEGIN EXECUTION
"C:\h\w\C4220A65\p\corerun.exe" -p "System.Reflection.Metadata.MetadataUpdater.IsSupported=false" b16473.dll
cmdLine:C:\h\w\C4220A65\w\A4C708DB\e\JIT\Regression\CLR-x86-JIT\V1.2-M01\b16473\b16473\b16473.cmd Timed Out (timeout in milliseconds: 5400000 from variable __TestTimeout, start: 9/26/2022 1:45:54 AM, end: 9/26/2022 3:15:54 AM)
Test Harness Exitcode is : -100
To run the test:
set CORE_ROOT=C:\h\w\C4220A65\p
C:\h\w\C4220A65\w\A4C708DB\e\JIT\Regression\CLR-x86-JIT\V1.2-M01\b16473\b16473\b16473.cmd
Expected: True
Actual: False
Stack trace
at JIT_Regression._CLR_x86_JIT_V1_2_M01_b16473_b16473_b16473_._CLR_x86_JIT_V1_2_M01_b16473_b16473_b16473_cmd()
at System.RuntimeMethodHandle.InvokeMethod(Object target, Void** arguments, Signature sig, Boolean isConstructor)
at System.Reflection.MethodInvoker.Invoke(Object obj, IntPtr* args, BindingFlags invokeAttr)
Queued | OS | Arch | Column1 |
---|---|---|---|
2022-08-13T23:14:59.543Z | osx.1200.arm64.open | arm64 | runtime-coreclr gcstress-extra Checked-gcstress0xc_zapdisable |
Failed again in: runtime-coreclr gcstress0x3-gcstress0xc 20221016.1
Failed test:
coreclr OSX arm64 Checked gcstress0xc @ OSX.1200.ARM64.Open
- JIT/Methodical/Coverage/copy_prop_byref_to_native_int/copy_prop_byref_to_native_int.cmd
Error message:
cmdLine:/private/tmp/helix/working/C1540A31/w/9FC608C6/e/JIT/Methodical/Methodical_others/../Coverage/copy_prop_byref_to_native_int/copy_prop_byref_to_native_int.sh Timed Out (timeout in milliseconds: 3600000 from variable __TestTimeout, start: 10/16/2022 8:30:35 AM, end: 10/16/2022 9:30:35 AM)
Return code: -100
Raw output file: /tmp/helix/working/C1540A31/w/9FC608C6/uploads/Coverage/copy_prop_byref_to_native_int/output.txt
Raw output:
BEGIN EXECUTION
/tmp/helix/working/C1540A31/p/corerun -p System.Reflection.Metadata.MetadataUpdater.IsSupported=false copy_prop_byref_to_native_int.dll ''
Expected: 100
Actual: 100
END EXECUTION - PASSED
cmdLine:/private/tmp/helix/working/C1540A31/w/9FC608C6/e/JIT/Methodical/Methodical_others/../Coverage/copy_prop_byref_to_native_int/copy_prop_byref_to_native_int.sh Timed Out (timeout in milliseconds: 3600000 from variable __TestTimeout, start: 10/16/2022 8:30:35 AM, end: 10/16/2022 9:30:35 AM)
Test Harness Exitcode is : -100
To run the test:
Set up CORE_ROOT and run.
/private/tmp/helix/working/C1540A31/w/9FC608C6/e/JIT/Methodical/Methodical_others/../Coverage/copy_prop_byref_to_native_int/copy_prop_byref_to_native_int.sh
Expected: True
Actual: False
Stack trace
at Xunit.Assert.True(Nullable`1 condition, String userMessage) in /_/src/xunit.assert/Asserts/BooleanAsserts.cs:line 132
at TestLibrary.OutOfProcessTest.RunOutOfProcessTest(String basePath, String assemblyPath)
at Program.<Main>$(String[] args)
@jakobbotsch the times that the process hangs OnSigChild
is not being called, SetExited()
is also not called somewhere else (that is called on OnSigChild->CheckChildren->TryReapChild->SetExited).
This means that _exitedEvent
is never Set()
and hence the hang on: https://github.com/dotnet/runtime/blob/58a1180ad6421c6f69cf307b57b2496c888585f2/src/libraries/System.Diagnostics.Process/src/System/Diagnostics/ProcessWaitState.Unix.cs#L411
But I also noticed that _isChild
is always true for all processes created in CoreclrTestWrapperLib.RunTest()
, I'm not sure if that's correct.
So either one of these reasons is true:
OnSigChild should never miss execution for a process to successfully exit. Right now is missing on the hang iteration of your repro, for some reason. I suspect is because the native code that is meant to call it fails silently.
The time OnSigChild is not being called is by design and _isChild should not be true, code should go through the !_isChild
code path and be able to exit.
We should be calling SetExited()
somewhere else.
@dotnet/area-system-diagnostics-process @tmds @stephentoub @jkotas is OnSigChild
a "must execute" method for a process to successfully exit?
As @jakobbotsch mentioned, the process is in a bad state after execve
is called although execve
does not return error.
https://github.com/dotnet/runtime/blob/cc5ba0994d6e8a6f5e4a63d1c921a68eda4350e8/src/native/libs/System.Native/pal_process.c#L423-L424
Now, that causes that the OnSigChild
callback never triggers because native code stops running, for some reason.
https://github.com/dotnet/runtime/blob/cc5ba0994d6e8a6f5e4a63d1c921a68eda4350e8/src/libraries/System.Diagnostics.Process/src/System/Diagnostics/Process.Unix.cs#L1090-L1091
I confirmed that by adding File.WriteAllText("/dev/tty", "Some debug message")
to OnSigChild
and observing that it doesn't print anything to the terminal when the hang occurs.
I also went to pal_signal.c
and added printf
s to the code that is meant to call OnSigChild
and that also doesn't print on hangs https://github.com/dotnet/runtime/blob/cc5ba0994d6e8a6f5e4a63d1c921a68eda4350e8/src/native/libs/System.Native/pal_signal.c#L385-L390
I also attempted to log rather than printing i.e: write to a file from native c, and that confirmed that that code is indeed not executing as nothing is written when the test loop hangs.
I repeated printing and logging for SignalHandlerLoop
and SystemNative_RegisterForSigChld
, those also don't execute on hangs.
It appears to me that the last thing that executes on native before the hang is SystemNative_ForkAndExecProcess
, which is the function calling execve
.
This relates with my previous comment as process.WaitForExit(timeout)
depends on OnSigChild
to be called.
I assume this is some weird interaction between GCStress and execve
on macOS arm64. I will stop looking into it and hope that someone more familiar with GCStress can take a look.
cc @janvorli
Failed again in: runtime-coreclr gcstress0x3-gcstress0xc 20221030.1
Failed test:
coreclr Linux arm Checked gcstress0x3 @ (Ubuntu.1804.Arm32.Open)Ubuntu.1804.Armarch.Open@mcr.microsoft.com/dotnet-buildtools/prereqs:ubuntu-18.04-helix-arm32v7-bfcd90a-20200121150440
- ilverify/ILVerificationTests.sh
Error message:
cmdLine:/root/helix/work/workitem/e/ilverify/ILVerificationTests.sh Timed Out (timeout in milliseconds: 3600000 from variable __TestTimeout, start: 10/30/2022 3:23:30 PM, end: 10/30/2022 4:23:30 PM)
Return code: -100
Raw output file: /root/helix/work/workitem/uploads/Reports/ilverify/ILVerificationTests.output.txt
Raw output:
BEGIN EXECUTION
/root/helix/work/correlation/corerun -p System.Reflection.Metadata.MetadataUpdater.IsSupported=false ILVerificationTests.dll ''
cmdLine:/root/helix/work/workitem/e/ilverify/ILVerificationTests.sh Timed Out (timeout in milliseconds: 3600000 from variable __TestTimeout, start: 10/30/2022 3:23:30 PM, end: 10/30/2022 4:23:30 PM)
Test Harness Exitcode is : -100
To run the test:
set CORE_ROOT=/root/helix/work/correlation
/root/helix/work/workitem/e/ilverify/ILVerificationTests.sh
Expected: True
Actual: False
Stack trace
at ilverify._ILVerificationTests_._ILVerificationTests_sh()
at System.RuntimeMethodHandle.InvokeMethod(Object target, Void** arguments, Signature sig, Boolean isConstructor)
at System.Reflection.MethodInvoker.Invoke(Object obj, IntPtr* args, BindingFlags invokeAttr)
Queued | OS | Arch | Column1 |
---|---|---|---|
2022-10-30T15:19:14.782Z | ubuntu.1804.armarch.open | arm | runtime-coreclr gcstress0x3-gcstress0xc Checked-gcstress0x3 |
2022-10-29T15:04:12.955Z | ubuntu.1804.armarch.open | arm | runtime-coreclr gcstress0x3-gcstress0xc Checked-gcstress0x3 |
2022-10-26T06:46:07.151Z | ubuntu.1804.armarch.open | arm | runtime-coreclr jitstress Checked-jitstress2_tiered |
2022-10-26T06:01:02.055Z | ubuntu.1804.armarch.open | arm | runtime-coreclr jitstress Checked-jitstress2 |
2022-10-26T06:01:01.03Z | ubuntu.1804.armarch.open | arm | runtime-coreclr jitstress Checked-jitstress1_tiered |
2022-10-26T06:00:58.629Z | ubuntu.1804.armarch.open | arm | runtime-coreclr jitstress Checked-jitstress1 |
Failed in Run: runtime-coreclr gcstress-extra 20221106.1
Failed tests:
coreclr OSX arm64 Checked gcstress0xc_zapdisable_jitstress2 @ OSX.1200.ARM64.Open
- JIT/Methodical/Arrays/lcs/lcsvalbox_ro/lcsvalbox_ro.cmd
- JIT/Methodical/Arrays/lcs/lcsvalbox_r/lcsvalbox_r.cmd
coreclr OSX arm64 Checked gcstress0xc_jitstress1 @ OSX.1200.ARM64.Open
- JIT/Methodical/Coverage/copy_prop_byref_to_native_int/copy_prop_byref_to_native_int.cmd
- JIT/Methodical/Arrays/lcs/lcsvalbox_do/lcsvalbox_do.cmd
coreclr OSX arm64 Checked gcstress0xc_tailcallstress @ OSX.1200.ARM64.Open
- JIT/Methodical/Arrays/lcs/lcsvalbox_r/lcsvalbox_r.cmd
Error message:
cmdLine:/private/tmp/helix/working/B1300992/w/B0CD099D/e/JIT/Methodical/Methodical_ro/../Arrays/lcs/lcsvalbox_ro/lcsvalbox_ro.sh Timed Out (timeout in milliseconds: 5400000 from variable __TestTimeout, start: 11/6/2022 3:18:57 PM, end: 11/6/2022 4:48:58 PM)
Return code: -100
Raw output file: /tmp/helix/working/B1300992/w/B0CD099D/uploads/Arrays/lcs/lcsvalbox_ro/output.txt
Raw output:
SKIPPING EXECUTION BECAUSE DOTNET_GCStress IS SET
cmdLine:/private/tmp/helix/working/B1300992/w/B0CD099D/e/JIT/Methodical/Methodical_ro/../Arrays/lcs/lcsvalbox_ro/lcsvalbox_ro.sh Timed Out (timeout in milliseconds: 5400000 from variable __TestTimeout, start: 11/6/2022 3:18:57 PM, end: 11/6/2022 4:48:58 PM)
Test Harness Exitcode is : -100
To run the test:
Set up CORE_ROOT and run.
> /private/tmp/helix/working/B1300992/w/B0CD099D/e/JIT/Methodical/Methodical_ro/../Arrays/lcs/lcsvalbox_ro/lcsvalbox_ro.sh
Expected: True
Actual: False
Stack trace:
at Xunit.Assert.True(Nullable`1 condition, String userMessage) in /_/src/xunit.assert/Asserts/BooleanAsserts.cs:line 132
at TestLibrary.OutOfProcessTest.RunOutOfProcessTest(String basePath, String assemblyPath)
at Program.<Main>$(String[] args)
@janvorli, could you take a look at this issue? It is open since May, so it would be great if you can look into it.
@JulieLeeMSFT of course!
@JulieLeeMSFT of course!
Thanks @janvorli. Any update on this?
I have spent some time investigating this but so far I wasn't successful in figuring out the culprit. From logging that I have added, it is clear that the target process was forked, the execve was called, but then the target process becomes zombie. I keep trying to figure out means to find out more.
Failed again in: runtime-coreclr gcstress-extra 20221120.1
Failed test:
coreclr OSX arm64 Checked gcstress0xc_jitminopts_heapverify1 @ OSX.1200.ARM64.Open
- JIT/Methodical/Arrays/lcs/lcsvalbox_r/lcsvalbox_r.cmd
coreclr OSX arm64 Checked gcstress0xc_zapdisable_jitstress2 @ OSX.1200.ARM64.Open
- JIT/Methodical/Arrays/lcs/lcsvalbox_r/lcsvalbox_r.cmd
coreclr OSX arm64 Checked gcstress0xc_zapdisable @ OSX.1200.ARM64.Open
- JIT/Methodical/Arrays/lcs/lcs2_do/lcs2_do.cmd
- JIT/Methodical/Arrays/lcs/lcsvalbox_d/lcsvalbox_d.cmd
Error message:
cmdLine:/private/tmp/helix/working/AD5B09B6/w/9C8B091F/e/JIT/Methodical/Methodical_r1/../Arrays/lcs/lcsvalbox_r/lcsvalbox_r.sh Timed Out (timeout in milliseconds: 5400000 from variable __TestTimeout, start: 11/20/2022 6:12:25 PM, end: 11/20/2022 7:42:27 PM)
Return code: -100
Raw output file: /tmp/helix/working/AD5B09B6/w/9C8B091F/uploads/Arrays/lcs/lcsvalbox_r/output.txt
Raw output:
SKIPPING EXECUTION BECAUSE DOTNET_GCStress IS SET
cmdLine:/private/tmp/helix/working/AD5B09B6/w/9C8B091F/e/JIT/Methodical/Methodical_r1/../Arrays/lcs/lcsvalbox_r/lcsvalbox_r.sh Timed Out (timeout in milliseconds: 5400000 from variable __TestTimeout, start: 11/20/2022 6:12:25 PM, end: 11/20/2022 7:42:27 PM)
Test Harness Exitcode is : -100
To run the test:
Set up CORE_ROOT and run.
/private/tmp/helix/working/AD5B09B6/w/9C8B091F/e/JIT/Methodical/Methodical_r1/../Arrays/lcs/lcsvalbox_r/lcsvalbox_r.sh
Expected: True
Actual: False
Stack trace
at Xunit.Assert.True(Nullable`1 condition, String userMessage) in /_/src/xunit.assert/Asserts/BooleanAsserts.cs:line 132
at Xunit.Assert.True(Boolean condition, String userMessage) in /_/src/xunit.assert/Asserts/BooleanAsserts.cs:line 116
at TestLibrary.OutOfProcessTest.RunOutOfProcessTest(String basePath, String assemblyPath)
at Program.<<Main>$>g__TestExecutor8|0_7(<>c__DisplayClass0_0&)
More update. Monitoring the child process in the repro test (bash) using the dtruss
tool, I have found that the child is actually running to completion even in the case of a hang. So, the problem is in the delivery of the SIGCHILD. I keep investigating.
I've forgotten to share more details I've found. I've used dtrace to monitor the delivery of the SIGCHILD. It reports that the signal was delivered to the target thread, but yet the handler was never called. I have thought that we might be doing something wrong in the signal handler or the thread that we use to invoke the handling code, but then I've found that when it hangs, the ctrl-c is delivered through the same signal handler and code path just fine. I am starting to suspect some bad interaction between signal delivery and very frequent hardware exceptions on macOS. I keep investigating.
I think I have a fix. I am testing it now to make sure, but it looks promising.
The problem is effectively the same as the one we had with activation injection via thread suspend / modify context / thread resume way. The thing is that kernel can update the context of a suspended thread any time, one of the reasons are async signals. So what happens is that kernel updates the context with the SIGCHLD signal handler address, but we then overwrite it by our code and thus the signal delivery never occurs.
Awesome! Looking forward to gcstress runs not being plagued by this issue 🙂 Why does the race only happen on macOS and not on Linux as well?
Why does the race only happen on macOS and not on Linux as well?
Because Linux uses signals for hardware exception handling. For signals, there is no race, since we just return from the signal handler with an updated context.
Run:runtime-coreclr gcstress-extra 20220508.1
Failed test:
Error message: