Closed v-wenyuxu closed 2 months ago
Doesn't look to be jitstress specific, fails on:
using System.Runtime.CompilerServices;
using Xunit;
class KeepAliveNullTest
{
private static void Main()
{
for (int i = 0; i < 200000; i++)
{
TestObject.Finalized = false;
MakeAndNull();
GC.Collect();
GC.WaitForPendingFinalizers();
Assert.True(TestObject.Finalized);
}
}
[MethodImpl(MethodImplOptions.NoInlining)]
private static void MakeAndNull()
{
var deadObj = new TestObject();
// it's dead here
}
public class TestObject
{
public static bool Finalized { get; set; }
~TestObject() => Finalized = true;
}
}
with DOTNET_TieredCompilation=0
. @dotnet/gc @VSadov any recent changes in GC could break it? It used to pass..
Tagging subscribers to this area: @dotnet/gc See info in area-owners.md if you want to be subscribed.
Unless [MethodImpl(MethodImplOptions.NoInlining)]
is not respected or the allocation of the object is somehow optimized away, I do not see how JIT or root reporting could cause any changes in behavior here.
This is a very simple case. The object gets unreachable and GC.Collect() should see that.
Could it be that NoInlining is not working?
Unless
[MethodImpl(MethodImplOptions.NoInlining)]
is not respected or the allocation of the object is somehow optimized away, I do not see how JIT or root reporting could cause any changes in behavior here.This is a very simple case. The object gets unreachable and GC.Collect() should see that.
Could it be that NoInlining is not working?
I've just checked - both NoInlining
is correctly respected and the allocation is not optimized away (we normally do remove such allocations unless their classes have finalizers). So sounds like some change in GC's behavior
Seems like the DATAS is the culprit, the test doesn't fail with DOTNET_GCDynamicAdaptationMode=0
My initial bisection suggests https://github.com/dotnet/runtime/compare/f69972fb744524a7d5345600e1b9b51c3fac736c...d2dbdd0bbbcb16cc53708a089e1b0388a57a5768, though the repro is intermittent so it's possible I just haven't seen it repro with f69972fb744524a7d5345600e1b9b51c3fac736c yet. In any case if that's the range #103501 looks most interesting, cc @jkotas
I do not think that the problem is with GC not collecting the object. I think the problem is in the WaitForPendingFinalizers() synchronization. If I change the loop to be:
for (int i = 0; i < 20000000; i++)
{
TestObject.Finalized = false;
MakeAndNull();
GC.Collect();
GC.WaitForPendingFinalizers();
if (!TestObject.Finalized) Console.WriteLine("FAIL");
if (!TestObject.Finalized) Console.WriteLine("FAIL2");
}
I never see FAIL
+ FAIL2
printed. I always only see FAIL
printed. It suggests that the problem is in GC.WaitForPendingFinalizers
returning too early before the finalizer got a chance to run.
The code has a comment about such race condition and considers it acceptable: https://github.com/dotnet/runtime/blob/104e88df9494dbb87b38da50fcdf0b4f09733743/src/coreclr/vm/finalizerthread.cpp#L409-L412
103501
This change made finalization loop more efficient. It makes any race conditions related to finalization more likely to show up.
The code has a comment about such race condition and considers it acceptable:
While it's understandable, I bet we have a ton of test cases here and there relying on WaitForPendingFinalizers
calling the finalizers, effectively, synchronously (all of them are guarded by "is gc precise" flag to ignore mono). What I am saying is that we might see more flaky test failures like this then
I think that race goes like:
But can this happen in a completely single-threaded test? (and as I understand happens fairly reliably now) Who causes "some unrelated GC happens"?
Who causes "some unrelated GC happens"?
Background event source activity. There is a ton of it on typical machine.
I have seen code where sequence Collect(); WaitForPendingFinalizers();
is always called twice (or three times) in a row.
My thought was always - "why not 4 or 42 times? was N enough to pass tests?" :-)
Perhaps the WFPF should check if the queue is empty and if not, wait on the event one more time before returning?
This API does not need to be very performant, but being less racey, if possible, could be good.
Can we add some flag or counter to guarantee that GC.WaitForPendingFinalizers
call always waits for full turn of the crank?
Failed in: runtime-coreclr libraries-jitstress 20240701.1
Failed tests:
net9.0-linux-Release-x64-jitstress2-Ubuntu.2204.Amd64.Open
- System.Runtime.Intrinsics.Tests.Vectors.Vector512Tests.Vector512DoubleEqualsNonCanonicalNaNTest
- System.Runtime.Intrinsics.Tests.Vectors.Vector512Tests.Vector512DoubleEqualsNaNTest
net9.0-linux-Release-x64-tailcallstress-Ubuntu.2204.Amd64.Open
- System.Runtime.Intrinsics.Tests.Vectors.Vector512Tests.Vector512DoubleEqualsNonCanonicalNaNTest
- System.Runtime.Intrinsics.Tests.Vectors.Vector512Tests.Vector512DoubleEqualsNaNTest
net9.0-linux-Release-x64-jitstress1-Ubuntu.2204.Amd64.Open
- System.Runtime.Intrinsics.Tests.Vectors.Vector512Tests.Vector512DoubleEqualsNonCanonicalNaNTest
- System.Runtime.Intrinsics.Tests.Vectors.Vector512Tests.Vector512DoubleEqualsNaNTest
Error message:
Assert.True() Failure
Expected: True
Actual: False
Stack trace:
at System.Runtime.Intrinsics.Tests.Vectors.Vector512Tests.Vector512DoubleEqualsNonCanonicalNaNTest() in /_/src/libraries/System.Runtime.Intrinsics/tests/Vectors/Vector512Tests.cs:line 5013
at System.RuntimeMethodHandle.InvokeMethod(Object target, Void** arguments, Signature sig, Boolean isConstructor)
at System.Reflection.MethodBaseInvoker.InvokeWithNoArgs(Object obj, BindingFlags invokeAttr) in /_/src/libraries/System.Private.CoreLib/src/System/Reflection/MethodBaseInvoker.cs:line 57
@v-wenyuxu The Vector512DoubleEqualsNaNTest is a completely different failure from the one tracked by this issue. Could you please open a new issue on it?
@VSadov Do you plan to work on this as part of your effort to improve reliability of tests that use GC.WaitForPendingFinalizers
under stress conditions?
I can take a look, but I think this should be a separate change. The reliability of stress PR changes only test-specific code and would not have effects on actual behavior of apps. The bar to the changes is somewhat lower. We can iterate on that further if we do not like the effects. We would not need specific testcases for that, since it is effectively just test infra changes.
What we have here is a problem with actual behavior of the feature. Ideally the fix is just removing the race, so it is unobservable to the user code, but the change would need a bit more careful. It may use a testcase as well. Perhaps something like https://github.com/dotnet/runtime/issues/104218#issuecomment-2201440026
I can take a look, but I think this should be a separate change.
Yes, I agree it should be a separate change.
Failed in: runtime-coreclr libraries-jitstress 20240702.1
Failed tests:
net9.0-windows-Release-x86-tailcallstress-Windows.10.Amd64.Open
- System.Collections.Concurrent.Tests.ConcurrentQueueTests.ReferenceTypes_NulledAfterDequeue
net9.0-windows-Release-x64-no_tiered_compilation-Windows.10.Amd64.Open
- System.Runtime.Intrinsics.Tests.Vectors.Vector512Tests.Vector512DoubleEqualsNonCanonicalNaNTest
- System.Runtime.Intrinsics.Tests.Vectors.Vector512Tests.Vector512DoubleEqualsNaNTest
Error message:
Assert.True() Failure
Expected: True
Actual: False
Stack trace:
at System.Collections.Concurrent.Tests.ConcurrentQueueTests.ReferenceTypes_NulledAfterDequeue()
at System.RuntimeMethodHandle.InvokeMethod(Object target, Void** arguments, Signature sig, Boolean isConstructor)
at System.Reflection.MethodBaseInvoker.InterpretedInvoke_Method(Object obj, IntPtr* args) in /_/src/coreclr/System.Private.CoreLib/src/System/Reflection/MethodBaseInvoker.CoreCLR.cs:line 36
at System.Reflection.MethodBaseInvoker.InvokeWithNoArgs(Object obj, BindingFlags invokeAttr) in /_/src/libraries/System.Private.CoreLib/src/System/Reflection/MethodBaseInvoker.cs:line 57
Here is a complete repro:
using System.Runtime.CompilerServices;
unsafe class KeepAliveNullTest
{
private static void Main()
{
Task.Run(Test);
Task.Run(Test);
Task.Run(Test);
Task.Run(Test);
Task.Run(Test);
Task.Run(Test);
Test();
}
private static void Test()
{
for (int i = 0; i < 20000000; i++)
{
bool finalized = false;
MakeAndNull(&finalized);
GC.Collect();
GC.WaitForPendingFinalizers();
if (!finalized) Console.WriteLine("FAIL1");
GC.Collect();
GC.WaitForPendingFinalizers();
if (!finalized) Console.WriteLine("FAIL2");
GC.Collect();
GC.WaitForPendingFinalizers();
if (!finalized) Console.WriteLine("FAIL3");
}
}
[MethodImpl(MethodImplOptions.NoInlining)]
private static void MakeAndNull(bool* flag)
{
var deadObj = new TestObject(flag);
// it's dead here
}
public class TestObject
{
bool* _flag;
public TestObject(bool* flag)
{
_flag = flag;
}
~TestObject() => *_flag = true;
}
}
Prints:
FAIL1
FAIL1
FAIL1
FAIL1
FAIL1
FAIL1
FAIL1
FAIL1
FAIL1
FAIL1
FAIL1
FAIL1
FAIL1
FAIL1
FAIL1
FAIL1
FAIL1
FAIL1
FAIL1
FAIL1
FAIL1
FAIL1
Interesting: this does not repro on 8.0 Core and does not repro on 9.0 NativeAOT I only see this happening on CoreCLR 9.0
It is possible that it is just different timing.
Saw some FAIL1
failures on NativeAOT. A lot more rare than on CoreCLR, but it does happen. So it is just timing.
The fix in https://github.com/dotnet/runtime/pull/105289 works. No failures seen in either CoreCLR or NativeAOT compiled test.
Failed in: runtime-coreclr libraries-jitstressregs 20240630.1
Failed tests:
Error message:
Stack trace: