dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License
15.05k stars 4.69k forks source link

Test failure: System.Tests.GCTests.KeepAlive_Null #104218

Closed v-wenyuxu closed 2 months ago

v-wenyuxu commented 3 months ago

Failed in: runtime-coreclr libraries-jitstressregs 20240630.1

Failed tests:

net9.0-linux-Release-x64-jitstressregs0x80-Ubuntu.2204.Amd64.Open
    - System.Tests.GCTests.KeepAlive_Null

Error message:

 Assert.True() Failure
Expected: True
Actual:   False

Stack trace:

   at System.Tests.GCTests.KeepAliveNullTest.Run() in /_/src/libraries/System.Runtime/tests/System.Runtime.Tests/System/GCTests.cs:line 211
   at System.RuntimeMethodHandle.InvokeMethod(Object target, Void** arguments, Signature sig, Boolean isConstructor)
   at System.Reflection.MethodBaseInvoker.InvokeWithNoArgs(Object obj, BindingFlags invokeAttr) in /_/src/libraries/System.Private.CoreLib/src/System/Reflection/MethodBaseInvoker.cs:line 57
EgorBo commented 3 months ago

Doesn't look to be jitstress specific, fails on:

using System.Runtime.CompilerServices;
using Xunit;

class KeepAliveNullTest
{
    private static void Main()
    {
        for (int i = 0; i < 200000; i++)
        {
            TestObject.Finalized = false;
            MakeAndNull();
            GC.Collect();
            GC.WaitForPendingFinalizers();
            Assert.True(TestObject.Finalized);
        }
    }

    [MethodImpl(MethodImplOptions.NoInlining)]
    private static void MakeAndNull()
    {
        var deadObj = new TestObject();
        // it's dead here
    }

    public class TestObject
    {
        public static bool Finalized { get; set; }

        ~TestObject() => Finalized = true;
    }
}

with DOTNET_TieredCompilation=0. @dotnet/gc @VSadov any recent changes in GC could break it? It used to pass..

dotnet-policy-service[bot] commented 3 months ago

Tagging subscribers to this area: @dotnet/gc See info in area-owners.md if you want to be subscribed.

VSadov commented 3 months ago

Unless [MethodImpl(MethodImplOptions.NoInlining)] is not respected or the allocation of the object is somehow optimized away, I do not see how JIT or root reporting could cause any changes in behavior here.

This is a very simple case. The object gets unreachable and GC.Collect() should see that.

Could it be that NoInlining is not working?

EgorBo commented 3 months ago

Unless [MethodImpl(MethodImplOptions.NoInlining)] is not respected or the allocation of the object is somehow optimized away, I do not see how JIT or root reporting could cause any changes in behavior here.

This is a very simple case. The object gets unreachable and GC.Collect() should see that.

Could it be that NoInlining is not working?

I've just checked - both NoInlining is correctly respected and the allocation is not optimized away (we normally do remove such allocations unless their classes have finalizers). So sounds like some change in GC's behavior

EgorBo commented 3 months ago

Seems like the DATAS is the culprit, the test doesn't fail with DOTNET_GCDynamicAdaptationMode=0

jakobbotsch commented 3 months ago

My initial bisection suggests https://github.com/dotnet/runtime/compare/f69972fb744524a7d5345600e1b9b51c3fac736c...d2dbdd0bbbcb16cc53708a089e1b0388a57a5768, though the repro is intermittent so it's possible I just haven't seen it repro with f69972fb744524a7d5345600e1b9b51c3fac736c yet. In any case if that's the range #103501 looks most interesting, cc @jkotas

jkotas commented 3 months ago

I do not think that the problem is with GC not collecting the object. I think the problem is in the WaitForPendingFinalizers() synchronization. If I change the loop to be:

        for (int i = 0; i < 20000000; i++)
        {
            TestObject.Finalized = false;
            MakeAndNull();
            GC.Collect();
            GC.WaitForPendingFinalizers();
            if (!TestObject.Finalized) Console.WriteLine("FAIL");
            if (!TestObject.Finalized) Console.WriteLine("FAIL2");
         }

I never see FAIL + FAIL2 printed. I always only see FAIL printed. It suggests that the problem is in GC.WaitForPendingFinalizers returning too early before the finalizer got a chance to run.

The code has a comment about such race condition and considers it acceptable: https://github.com/dotnet/runtime/blob/104e88df9494dbb87b38da50fcdf0b4f09733743/src/coreclr/vm/finalizerthread.cpp#L409-L412

103501

This change made finalization loop more efficient. It makes any race conditions related to finalization more likely to show up.

EgorBo commented 3 months ago

The code has a comment about such race condition and considers it acceptable:

While it's understandable, I bet we have a ton of test cases here and there relying on WaitForPendingFinalizers calling the finalizers, effectively, synchronously (all of them are guarded by "is gc precise" flag to ignore mono). What I am saying is that we might see more flaky test failures like this then

VSadov commented 3 months ago

I think that race goes like:

But can this happen in a completely single-threaded test? (and as I understand happens fairly reliably now) Who causes "some unrelated GC happens"?

jkotas commented 3 months ago

Who causes "some unrelated GC happens"?

Background event source activity. There is a ton of it on typical machine.

VSadov commented 3 months ago

I have seen code where sequence Collect(); WaitForPendingFinalizers(); is always called twice (or three times) in a row. My thought was always - "why not 4 or 42 times? was N enough to pass tests?" :-)

VSadov commented 3 months ago

Perhaps the WFPF should check if the queue is empty and if not, wait on the event one more time before returning?

This API does not need to be very performant, but being less racey, if possible, could be good.

jkotas commented 3 months ago

Can we add some flag or counter to guarantee that GC.WaitForPendingFinalizers call always waits for full turn of the crank?

v-wenyuxu commented 3 months ago

Failed in: runtime-coreclr libraries-jitstress 20240701.1

Failed tests:

net9.0-linux-Release-x64-jitstress2-Ubuntu.2204.Amd64.Open
    - System.Runtime.Intrinsics.Tests.Vectors.Vector512Tests.Vector512DoubleEqualsNonCanonicalNaNTest
    - System.Runtime.Intrinsics.Tests.Vectors.Vector512Tests.Vector512DoubleEqualsNaNTest
net9.0-linux-Release-x64-tailcallstress-Ubuntu.2204.Amd64.Open
    - System.Runtime.Intrinsics.Tests.Vectors.Vector512Tests.Vector512DoubleEqualsNonCanonicalNaNTest
    - System.Runtime.Intrinsics.Tests.Vectors.Vector512Tests.Vector512DoubleEqualsNaNTest
net9.0-linux-Release-x64-jitstress1-Ubuntu.2204.Amd64.Open
    - System.Runtime.Intrinsics.Tests.Vectors.Vector512Tests.Vector512DoubleEqualsNonCanonicalNaNTest
    - System.Runtime.Intrinsics.Tests.Vectors.Vector512Tests.Vector512DoubleEqualsNaNTest

Error message:

 Assert.True() Failure
Expected: True
Actual:   False

Stack trace:

   at System.Runtime.Intrinsics.Tests.Vectors.Vector512Tests.Vector512DoubleEqualsNonCanonicalNaNTest() in /_/src/libraries/System.Runtime.Intrinsics/tests/Vectors/Vector512Tests.cs:line 5013
   at System.RuntimeMethodHandle.InvokeMethod(Object target, Void** arguments, Signature sig, Boolean isConstructor)
   at System.Reflection.MethodBaseInvoker.InvokeWithNoArgs(Object obj, BindingFlags invokeAttr) in /_/src/libraries/System.Private.CoreLib/src/System/Reflection/MethodBaseInvoker.cs:line 57
jkotas commented 3 months ago

@v-wenyuxu The Vector512DoubleEqualsNaNTest is a completely different failure from the one tracked by this issue. Could you please open a new issue on it?

jkotas commented 3 months ago

@VSadov Do you plan to work on this as part of your effort to improve reliability of tests that use GC.WaitForPendingFinalizers under stress conditions?

VSadov commented 3 months ago

I can take a look, but I think this should be a separate change. The reliability of stress PR changes only test-specific code and would not have effects on actual behavior of apps. The bar to the changes is somewhat lower. We can iterate on that further if we do not like the effects. We would not need specific testcases for that, since it is effectively just test infra changes.

What we have here is a problem with actual behavior of the feature. Ideally the fix is just removing the race, so it is unobservable to the user code, but the change would need a bit more careful. It may use a testcase as well. Perhaps something like https://github.com/dotnet/runtime/issues/104218#issuecomment-2201440026

jkotas commented 3 months ago

I can take a look, but I think this should be a separate change.

Yes, I agree it should be a separate change.

v-wenyuxu commented 3 months ago

Failed in: runtime-coreclr libraries-jitstress 20240702.1

Failed tests:

net9.0-windows-Release-x86-tailcallstress-Windows.10.Amd64.Open
    - System.Collections.Concurrent.Tests.ConcurrentQueueTests.ReferenceTypes_NulledAfterDequeue
net9.0-windows-Release-x64-no_tiered_compilation-Windows.10.Amd64.Open
    - System.Runtime.Intrinsics.Tests.Vectors.Vector512Tests.Vector512DoubleEqualsNonCanonicalNaNTest
    - System.Runtime.Intrinsics.Tests.Vectors.Vector512Tests.Vector512DoubleEqualsNaNTest

Error message:

 Assert.True() Failure
Expected: True
Actual:   False

Stack trace:

   at System.Collections.Concurrent.Tests.ConcurrentQueueTests.ReferenceTypes_NulledAfterDequeue()
   at System.RuntimeMethodHandle.InvokeMethod(Object target, Void** arguments, Signature sig, Boolean isConstructor)
   at System.Reflection.MethodBaseInvoker.InterpretedInvoke_Method(Object obj, IntPtr* args) in /_/src/coreclr/System.Private.CoreLib/src/System/Reflection/MethodBaseInvoker.CoreCLR.cs:line 36
   at System.Reflection.MethodBaseInvoker.InvokeWithNoArgs(Object obj, BindingFlags invokeAttr) in /_/src/libraries/System.Private.CoreLib/src/System/Reflection/MethodBaseInvoker.cs:line 57
VSadov commented 2 months ago

Here is a complete repro:

using System.Runtime.CompilerServices;

unsafe class KeepAliveNullTest
{
    private static void Main()
    {
        Task.Run(Test);
        Task.Run(Test);
        Task.Run(Test);
        Task.Run(Test);
        Task.Run(Test);
        Task.Run(Test);
        Test();
    }

    private static void Test()
    {
        for (int i = 0; i < 20000000; i++)
        {
            bool finalized = false;
            MakeAndNull(&finalized);
            GC.Collect();
            GC.WaitForPendingFinalizers();
            if (!finalized) Console.WriteLine("FAIL1");
            GC.Collect();
            GC.WaitForPendingFinalizers();
            if (!finalized) Console.WriteLine("FAIL2");
            GC.Collect();
            GC.WaitForPendingFinalizers();
            if (!finalized) Console.WriteLine("FAIL3");
        }
    }

    [MethodImpl(MethodImplOptions.NoInlining)]
    private static void MakeAndNull(bool* flag)
    {
        var deadObj = new TestObject(flag);
        // it's dead here
    }

    public class TestObject
    {
        bool* _flag;

        public TestObject(bool* flag)
        {
            _flag = flag;
        }

        ~TestObject() => *_flag = true;
    }
}

Prints:

FAIL1
FAIL1
FAIL1
FAIL1
FAIL1
FAIL1
FAIL1
FAIL1
FAIL1
FAIL1
FAIL1
FAIL1
FAIL1
FAIL1
FAIL1
FAIL1
FAIL1
FAIL1
FAIL1
FAIL1
FAIL1
FAIL1
VSadov commented 2 months ago

Interesting: this does not repro on 8.0 Core and does not repro on 9.0 NativeAOT I only see this happening on CoreCLR 9.0

It is possible that it is just different timing.

VSadov commented 2 months ago

Saw some FAIL1 failures on NativeAOT. A lot more rare than on CoreCLR, but it does happen. So it is just timing.

VSadov commented 2 months ago

The fix in https://github.com/dotnet/runtime/pull/105289 works. No failures seen in either CoreCLR or NativeAOT compiled test.