dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License
15.26k stars 4.73k forks source link

mono gc not set static field back after object moved when enable interp mode #100065

Closed srxqds closed 5 months ago

srxqds commented 7 months ago

Description

We are embeding mono in UnrealEngine, when we upgrade .ne8.0(v8.0.2), we found sgen-gc will not set static field back after object move when enable hotreloadcomponent.

Reproduction Steps

We have not separate code to reproduct it, but I will describe it as detailed as possible.

we have restore object in static field, when next pass it to internal call the object address after the gc have move it, the value not set back. I show the code first:

public partial class FTicker : ScriptingWrapperOfType
{
    public static FTicker Internal_GetCoreTicker_Cache = null;
    [MethodImpl(MethodImplOptions.InternalCall)]
    private static extern FTicker Internal_GetCoreTicker();

    public static FTicker GetCoreTicker() 
    {
        if(Internal_GetCoreTicker_Cache == null)
        {
            Internal_GetCoreTicker_Cache = Internal_GetCoreTicker();
        }

        return Internal_GetCoreTicker_Cache;
    }

    [MethodImpl(MethodImplOptions.InternalCall)]
    public static extern void TearDownCoreTicker();

    [MethodImpl(MethodImplOptions.InternalCall)]
    public extern FDelegateHandle AddTicker(TDelegate__float_Ret_bool InDelegate, float InDelay = 0.0f);
}

the internal call

static FDelegateHandle_Proxy Internal_FTicker_AddTicker(MonoObject* self, MonoDelegate* InDelegate, float InDelay)
{
        // these code is use for debug this problem
    MonoImage* EngineImage = GetScriptingAssemblyManager().FindImage("UnrealEngine");
    if (EngineImage)
    {
        MonoClass* TickerClass = mono_class_from_name(EngineImage, "UnrealEngine", "FTicker");
        if (TickerClass)
        {
            MonoClassField* Internal_GetCoreTicker_Cache = mono_class_get_field_from_name(TickerClass, "Internal_GetCoreTicker_Cache");
            if (Internal_GetCoreTicker_Cache)
            {
                MonoObject* data = nullptr;
                MONO_ENTER_GC_UNSAFE
                mono_field_static_get_value(mono_class_vtable(mono_domain_get(), TickerClass), Internal_GetCoreTicker_Cache, &data);
                MONO_EXIT_GC_UNSAFE
            }
        }
    }
        // below is the real logic
    TRACE_INTERALCALL_PROFILER_EVENT_SCOPE
    TDelegate<bool(float)> InDelegate__Temp;
    BindScriptIfException<bool, float>(InDelegate__Temp, InDelegate, true);
    return FDelegateHandle_Proxy::Proxy(ReadonlyScriptingObjectConverter<FTicker>(self)->AddTicker(InDelegate__Temp, InDelay));
}

the code logic execute as below :

tickHandle = FTicker.GetCoreTicker().AddTicker(OnTick);
// gc occured
tickHandle1 = FTicker.GetCoreTicker().AddTicker(OnTick1);

// but the second call `Internal_FTicker_AddTicker` the `self` object the vtable have set null and then can't get the object class. 

I show the behaviors with hotreload component enable or not.

if we not enable hotreload component, just use jit execute mode, it work correctly:

the first call AddTicker, we can checkout the self object:

image

and we check the Internal_GetCoreTicker_Cache static field:

image

the second call the self object address have changed:

image

if we enable hotreload component, the static field will not write back: the first call:

image

the second call: the self object address have not been changed but the vtable is null

image

Expected behavior

work correctly

Actual behavior

crash

Regression?

.net7.0 work correctly

Known Workarounds

no

Configuration

windows build v8.0.2 tag set excute mode mono_jit_set_aot_mode(MONO_AOT_MODE_INTERP_ONLY);

Other information

No response

srxqds commented 7 months ago

@lateralusX @BrzVlad can you help to check these problem?

BrzVlad commented 7 months ago

This looks like when hot reload is enabled, the roots from the static fields of that class are not being registered at all to the GC. @lambdageek Do you understand what might be going on here ?

srxqds commented 7 months ago

yeah, can you tell me where register static field to roots or scan static field

BrzVlad commented 7 months ago

https://github.com/dotnet/runtime/blob/e12e2fa6cbdd1f4b0c8ad1b1e2d960a480c21703/src/mono/mono/metadata/object.c#L2231 is where we allocate memory for the static fields. If some of those fields are refs, we allocate the memory with mono_gc_alloc_fixed which registers the memory as root, together with a gc descriptor for which slots are to be scanned. During gc, they are then scanned here https://github.com/dotnet/runtime/blob/e12e2fa6cbdd1f4b0c8ad1b1e2d960a480c21703/src/mono/mono/sgen/sgen-gc.c#L1288

I'm not familiar with the implications of hot reload over these static fields.

srxqds commented 7 months ago

thank you @BrzVlad waiting for @lambdageek for help, or anyone can help me to reproduce it?

lambdageek commented 7 months ago

All you do is just enable or disable the hot reload component? You're not actually applying any changes?

What happens if you disable hot reload, but set the execution mode to mono_jit_set_aot_mode(MONO_AOT_MODE_INTERP_ONLY); ?

srxqds commented 7 months ago

You're not actually applying any changes?

yes, I have not apply any change

What happens if you disable hot reload, but set the execution mode to mono_jit_set_aot_mode(MONO_AOT_MODE_INTERP_ONLY); ?

the problem also exists.

@lambdageek

srxqds commented 7 months ago

anyone can take a look at this question

srxqds commented 7 months ago

we have comment out this line https://github.com/dotnet/runtime/blob/0f6df43ff9bfb98f0fcac781125ce9b4e5b5ba66/src/mono/mono/metadata/assembly-load-context.c#L43

but the alc is not unloading.

and the second clue is the FTicker Internal_GetCoreTicker_Cache object have been gc, I have found call the destructor function it should introduce this problem. but why it work on jit, but crash when interp enable.

@vargaz can you help me?

srxqds commented 7 months ago

we found that call mono_gc_walk_heap can get the same FTicker object after the destructor has been called.

the destructor call FTicker destructor 000001711BD677B8 //call mono_gc_walk_heap ObjectPtr:000001711BD677B8 Type:UnrealEngine.FTicker (target)

srxqds commented 7 months ago

how can we fixed it, our project is blocking by this problem. anyone can help me, because the gc is diffcult to debug.

srxqds commented 7 months ago

I have set gc_level=9, because the log file is too big for upload, I paste the log about the FTicker object:

in jit mode:

2024-04-01 22:13:13 Allocated object 0000019369C01030, vtable: 000001931E396508 (FTicker), size: 40d
2024-04-01 22:13:16 Wbarrier store at 0000001DFAD79048 to 0000019369C01030 (FTicker)
2024-04-01 22:13:16 Adding remset at 0000001DFAD79048

2024-04-01 22:14:16 Added finalizer for object: 000001936B061D00 (Gen2GcCallback) (63) to old table
2024-04-01 22:14:16 Added finalizer for object: 0000019369C01030 (FTicker) (4) to nursery table
2024-04-01 22:14:16 Added finalizer for object: 0000019369C010D8 (DemoInputProcessor) (5) to nursery table

2024-04-01 22:14:16 Adding global remset for 0000019365C53498
2024-04-01 22:14:16 Precise copy of 0000019369C01030 from 000001936B0946A0
2024-04-01 22:14:16 Overwrote field at 000001936B0946A0 with 0000019302403600 (was: 0000019369C01030)
2024-04-01 22:14:16 Precise copy of 0000019369C14ED0 from 000001936B094BC0

2024-04-01 22:14:16 Precise copy of 0000019369C01030 from 0000001DFAD75B78
2024-04-01 22:14:16  (already forwarded to 0000019302403600)
2024-04-01 22:14:16 Promoting finalization of object 0000019302403600 (FTicker) (was at 0000019369C01030) to major table
2024-04-01 22:14:16 Precise scan of gray area post fin

in interp mode:

2024-04-01 22:18:08 Allocated object 000002D52D801030, vtable: 000002D56694DC88 (FTicker), size: 40d
2024-04-01 22:18:08 Allocated object 000002D52D801058, vtable: 000002D56694A320 (TDelegate__float_Ret_bool), size: 128d
2024-04-01 22:18:08 Wbarrier store at 000000E425B776A8 to 000002D52D801030 (FTicker)
2024-04-01 22:18:08 Adding remset at 000000E425B776A8

2024-04-01 22:18:26 Added finalizer for object: 000002D52ECDDD00 (Gen2GcCallback) (63) to old table
2024-04-01 22:18:26 Added finalizer for object: 000002D52D801030 (FTicker) (4) to nursery table
2024-04-01 22:18:26 Added finalizer for object: 000002D52D8010D8 (DemoInputProcessor) (5) to nursery table

2024-04-01 22:18:26 Precise copy of 000002D52D801030 from 000000E425B74838
2024-04-01 22:18:26 Queueing object for finalization: 000002D51E6CB790 (FTicker) (was at 000002D52D801030) (14)
2024-04-01 22:18:26 Precise copy of 000002D52D881360 from 000000E425B74838

// the new FTicker object has been Finalizing 
2024-04-01 22:18:26 Wbarrier store at 000000E43C28F780 to 0000000000000000 (null)
2024-04-01 22:18:26 Finalizing object 000002D51E6CB790 (FTicker)
2024-04-01 22:18:26 Wbarrier store at 000002D52ED70418 to 0000000000000000 (null)

// below use old object address
2024-04-01 22:19:04 Wbarrier store at 000000E425B75AA8 to 000002D52D801030 (�`ls�)
2024-04-01 22:19:04 Adding remset at 000000E425B75AA8
2024-04-01 22:19:04 Wbarrier store at 000000E425B75AB0 to 000002D52DA1D5A0 (TDelegate__float_Ret_bool)
2024-04-01 22:19:04 Adding remset at 000000E425B75AB0
2024-04-01 22:19:32 Wbarrier store at 000000E425B75718 to 000002D52D801030 (�`ls�)
srxqds commented 7 months ago

I have modified MINT_STSFLD_O opcode with mono_gc_wbarrier_generic_store_internal seems solve the problem: image

but I have hit another problem: the generic method out parameter pass also have gc object error.

srxqds commented 7 months ago

the second crash show below: image

and the source code is : image

BrzVlad commented 7 months ago

Interesting. So it turns out we actually can allocate static field storage as part of another object when using hot reload. (https://github.com/dotnet/runtime/blob/main/src/mono/mono/component/hot_reload.c#L3092). The interpreter was not aware of this so it doesn't do write barriers when storing to static fields. My understanding is that you still encounter this crash when having hot reload enabled (not just by using mono interpreter) ?

If static field storage is part of another object then you are correct that we need write barriers when we store into it. MINT_STSFLD_O and also MINT_STSFLD_VT would need write barriers. For the latter you would need to either pass the field_klass to the opcode so you can use mono_value_copy_internal or hack a call to sgen_card_table_mark_range for the area that is being written.

Also, from the debugger screenshot, the bad object 0x1f4f1dee470 has its vtable 0x1f4f309df01. Because the vtable has its least significant bit 1, it is a sign that the value of the vtable is actually the location of the new object. This means that the object from the crash was promoted to location 0x1f4f309df00. You could investigate the type of object if it is something out of the ordinary.

In this type of crashes, it would be useful to run with MONO_GC_DEBUG=check-remset-consistency. It should point to locations where there are missing cards.

srxqds commented 7 months ago

thank you, I will continue to try tomorrow. yeah, we have enable hot reload component or apply any hot reload delta.

srxqds commented 7 months ago

Interesting. So it turns out we actually can allocate static field storage as part of another object when using hot reload. (https://github.com/dotnet/runtime/blob/main/src/mono/mono/component/hot_reload.c#L3092). The interpreter was not aware of this so it doesn't do write barriers when storing to static fields. My understanding is that you still encounter this crash when having hot reload enabled (not just by using mono interpreter) ?

If static field storage is part of another object then you are correct that we need write barriers when we store into it. MINT_STSFLD_O and also MINT_STSFLD_VT would need write barriers. For the latter you would need to either pass the field_klass to the opcode so you can use mono_value_copy_internal or hack a call to sgen_card_table_mark_range for the area that is being written.

Also, from the debugger screenshot, the bad object 0x1f4f1dee470 has its vtable 0x1f4f309df01. Because the vtable has its least significant bit 1, it is a sign that the value of the vtable is actually the location of the new object. This means that the object from the crash was promoted to location 0x1f4f309df00. You could investigate the type of object if it is something out of the ordinary.

In this type of crashes, it would be useful to run with MONO_GC_DEBUG=check-remset-consistency. It should point to locations where there are missing cards.

when I set MONO_GC_DEBUG=check-remset-consistency, there are many object not set in ImmutableArray

2024-04-03 09:37:38 Oldspace->newspace reference 0000016795D06778 at offset 16 in object 000001679F4AC370 (System.Collections.Immutable.ImmutableArray`1) not found in remsets.
2024-04-03 09:37:38 Scanning object 000001679F4AC388, vtable: 00000167AD7176F0 (ImmutableArray`1)
2024-04-03 09:37:38 Oldspace->newspace reference 0000016795D25860 at offset 16 in object 000001679F4AC388 (System.Collections.Immutable.ImmutableArray`1) not found in remsets, but object is pinned.
2024-04-03 09:37:38 Scanning object 000001679F4AC3A0, vtable: 00000167AD8388F0 (ImmutableArray`1)
2024-04-03 09:37:38 Oldspace->newspace reference 0000016795D31200 at offset 16 in object 000001679F4AC3A0 (System.Collections.Immutable.ImmutableArray`1) not found in remsets, but object is pinned.
2024-04-03 09:37:38 Scanning object 000001679F4AC3B8, vtable: 00000167AD834CB8 (ImmutableArray`1)
2024-04-03 09:37:38 Oldspace->newspace reference 0000016795D30A50 at offset 16 in object 000001679F4AC3B8 (System.Collections.Immutable.ImmutableArray`1) not found in remsets.
2024-04-03 09:37:38 Scanning object 000001679F4AC3D0, vtable: 00000167AD835E30 (ImmutableArray`1)
2024-04-03 09:37:38 Oldspace->newspace reference 0000016795D25E50 at offset 16 in object 000001679F4AC3D0 (System.Collections.Immutable.ImmutableArray`1) not found in remsets.
2024-04-03 09:37:38 Scanning object 000001679F4AC3E8, vtable: 00000167AD8A0230 (ImmutableArray`1)
2024-04-03 09:37:38 Oldspace->newspace reference 0000016795D2F3D0 at offset 16 in object 000001679F4AC3E8 (System.Collections.Immutable.ImmutableArray`1) not found in remsets.
2024-04-03 09:37:38 Scanning object 000001679F4AC400, vtable: 00000167AD9564F0 (ImmutableSegmentedHashSet`1)
2024-04-03 09:37:38 Oldspace->newspace reference 0000016795D2F698 at offset 16 in object 000001679F4AC400 (Microsoft.CodeAnalysis.Collections.ImmutableSegmentedHashSet`1) not found in remsets.
2024-04-03 09:37:38 Scanning object 000001679F4AC418, vtable: 00000167AD9D0830 (ImmutableArray`1)
2024-04-03 09:37:38 Oldspace->newspace reference 0000016795D30618 at offset 16 in object 000001679F4AC418 (System.Collections.Immutable.ImmutableArray`1) not found in remsets.
2024-04-03 09:37:38 Scanning object 000001679F4AC430, vtable: 00000167AD9DF6A0 (ImmutableArray`1)
2024-04-03 09:37:38 Oldspace->newspace reference 0000016795D30758 at offset 16 in object 000001679F4AC430 (System.Collections.Immutable.ImmutableArray`1) not found in remsets.
2024-04-03 09:37:38 Scanning object 000001679F4AC448, vtable: 00000167C566C978 (ImmutableArray`1)
2024-04-03 09:37:38 Oldspace->newspace reference 0000016795D33BE0 at offset 16 in object 000001679F4AC448 (System.Collections.Immutable.ImmutableArray`1) not found in remsets.
2024-04-03 09:37:38 Scanning object 000001679F4AC460, vtable: 00000167ADB89E50 (ImmutableArray`1)
2024-04-03 09:37:38 Oldspace->newspace reference 0000016795D33C08 at offset 16 in object 000001679F4AC460 (System.Collections.Immutable.ImmutableArray`1) not found in remsets.
2024-04-03 09:37:38 Scanning object 000001679F4AC478, vtable: 00000167D0DA51C0 (ImmutableArray`1)
2024-04-03 09:37:38 Oldspace->newspace reference 0000016795D33D78 at offset 16 in object 000001679F4AC478 (System.Collections.Immutable.ImmutableArray`1) not found in remsets.
2024-04-03 09:37:38 Scanning object 000001679F4AC490, vtable: 00000167ADC18240 (ImmutableArray`1)
2024-04-03 09:37:38 Oldspace->newspace reference 0000016795D48B48 at offset 16 in object 000001679F4AC490 (System.Collections.Immutable.ImmutableArray`1) not found in remsets.
2024-04-03 09:37:38 Scanning object 000001679F4AC4A8, vtable: 00000167ADC2B050 (ImmutableArray`1)
2024-04-03 09:37:38 Oldspace->newspace reference 0000016795D48D70 at offset 16 in object 000001679F4AC4A8 (System.Collections.Immutable.ImmutableArray`1) not found in remsets.
2024-04-03 09:37:38 Scanning object 000001679F4A8130, vtable: 00000167ACEF6E18 (SyntaxTrivia)
2024-04-03 09:37:38 Oldspace->newspace reference 0000016795CBA0A0 at offset 40 in object 000001679F4A8130 (Microsoft.CodeAnalysis.SyntaxTrivia) not found in remsets.
2024-04-03 09:37:38 Scanning object 000001679F4A8168, vtable: 00000167ACEF6E18 (SyntaxTrivia)
2024-04-03 09:37:38 Oldspace->newspace reference 0000016795CBA628 at offset 40 in object 000001679F4A8168 (Microsoft.CodeAnalysis.SyntaxTrivia) not found in remsets.
2024-04-03 09:37:38 Scanning object 000001679F4A81A0, vtable: 00000167ACEF6E18 (SyntaxTrivia)
2024-04-03 09:37:38 Oldspace->newspace reference 0000016795CBA648 at offset 40 in object 000001679F4A81A0 (Microsoft.CodeAnalysis.SyntaxTrivia) not found in remsets.
2024-04-03 09:37:38 Scanning object 000001679F4A81D8, vtable: 00000167ACEF6E18 (SyntaxTrivia)
2024-04-03 09:37:38 Oldspace->newspace reference 0000016795CBA668 at offset 40 in object 000001679F4A81D8 (Microsoft.CodeAnalysis.SyntaxTrivia) not found in remsets.
2024-04-03 09:37:38 Scanning object 000001679F4A8210, vtable: 00000167ACEF6E18 (SyntaxTrivia)
2024-04-03 09:37:38 Oldspace->newspace reference 0000016795CBA688 at offset 40 in object 000001679F4A8210 (Microsoft.CodeAnalysis.SyntaxTrivia) not found in remsets.
2024-04-03 09:37:38 Scanning object 000001679F4A8248, vtable: 00000167ACEF6E18 (SyntaxTrivia)
2024-04-03 09:37:38 Oldspace->newspace reference 0000016795CBB0B8 at offset 40 in object 000001679F4A8248 (Microsoft.CodeAnalysis.SyntaxTrivia) not found in remsets.
2024-04-03 09:37:38 Scanning object 000001679F4A8280, vtable: 00000167ACEF6E18 (SyntaxTrivia)
2024-04-03 09:37:38 Oldspace->newspace reference 0000016795CBB198 at offset 40 in object 000001679F4A8280 (Microsoft.CodeAnalysis.SyntaxTrivia) not found in remsets.
2024-04-03 09:37:38 Scanning object 000001679F4A82B8, vtable: 00000167ACEF6E18 (SyntaxTrivia)
2024-04-03 09:37:38 Oldspace->newspace reference 0000016795CBB278 at offset 40 in object 000001679F4A82B8 (Microsoft.CodeAnalysis.SyntaxTrivia) not found in remsets.
2024-04-03 09:37:38 Scanning object 000001679F4A82F0, vtable: 00000167ACEF6E18 (SyntaxTrivia)
2024-04-03 09:37:38 Oldspace->newspace reference 0000016795CBB358 at offset 40 in object 000001679F4A82F0 (Microsoft.CodeAnalysis.SyntaxTrivia) not found in remsets.
2024-04-03 09:37:38 Scanning object 000001679F4A8328, vtable: 00000167ACEF6E18 (SyntaxTrivia)
2024-04-03 09:37:38 Oldspace->newspace reference 0000016795CBB438 at offset 40 in object 000001679F4A8328 (Microsoft.CodeAnalysis.SyntaxTrivia) not found in remsets.
2024-04-03 09:37:38 Scanning object 000001679F4A8360, vtable: 00000167ACEF6E18 (SyntaxTrivia)
2024-04-03 09:37:38 Oldspace->newspace reference 0000016795CBB520 at offset 40 in object 000001679F4A8360 (Microsoft.CodeAnalysis.SyntaxTrivia) not found in remsets.

so this set may not useful, and the 0x1f4f309df00 class is System.Object so it doesn't provide much useful information for me.

srxqds commented 7 months ago

the ImmutableArray<T> is struct so

public static class ImmutableArray
    {
        /// <summary>
        /// A two element array useful for throwing exceptions the way LINQ does.
        /// </summary>
        internal static readonly byte[] TwoElementArray = new byte[2];

        /// <summary>
        /// Creates an empty <see cref="ImmutableArray{T}"/>.
        /// </summary>
        /// <typeparam name="T">The type of element stored in the array.</typeparam>
        /// <returns>An empty array.</returns>
        public static ImmutableArray<T> Create<T>()
        {
            return ImmutableArray<T>.Empty;
        }
BrzVlad commented 7 months ago

The logs support the existing theory that the issue is only with storing into static fields. Am I assuming correctly that you didn't tweak MINT_STSFLD_VT to emit write barriers ? It could look something like this:

diff --git a/src/mono/mono/mini/interp/interp.c b/src/mono/mono/mini/interp/interp.c
index f8e2ad02ad9..7b46b27fbcd 100644
--- a/src/mono/mono/mini/interp/interp.c
+++ b/src/mono/mono/mini/interp/interp.c
@@ -6325,8 +6325,9 @@ MINT_IN_CASE(MINT_BRTRUE_I8_SP) ZEROP_SP(gint64, !=); MINT_IN_BREAK;
                MINT_IN_CASE(MINT_STSFLD_VT) {
                        MonoVTable *vtable = (MonoVTable*) frame->imethod->data_items [ip [2]];
                        INIT_VTABLE (vtable);
+                       MonoClass *klass = (MonoClass*) frame->imethod->data_items [ip [4]];
                        gpointer addr = frame->imethod->data_items [ip [3]];
-                       memcpy (addr, locals + ip [1], ip [4]);
+                       mono_value_copy_internal (addr, locals + ip [1], klass);
                        ip += 5;
                        MINT_IN_BREAK;
                }
diff --git a/src/mono/mono/mini/interp/transform.c b/src/mono/mono/mini/interp/transform.c
index bdf48b59f03..68e6bad19e1 100644
--- a/src/mono/mono/mini/interp/transform.c
+++ b/src/mono/mono/mini/interp/transform.c
@@ -4635,8 +4635,12 @@ interp_emit_sfld_access (TransformData *td, MonoClassField *field, MonoClass *fi
                if (G_LIKELY (!wide_data)) {
                        td->last_ins->data [0] = (guint16) vtable_index;
                        td->last_ins->data [1] = (guint16) addr_index;
-                       if (mt == MINT_TYPE_VT)
-                               td->last_ins->data [2] = GINT_TO_UINT16 (size);
+                       if (mt == MINT_TYPE_VT) {
+                               if (is_load)
+                                       td->last_ins->data [2] = GINT_TO_UINT16 (size);
+                               else
+                                       td->last_ins->data [2] = (guint16) get_data_item_index (td, field_class);
+                       }
                } else {
                        WRITE32_INS (td->last_ins, 0, &vtable_index);
                        WRITE32_INS (td->last_ins, 2, &addr_index);
srxqds commented 7 months ago

ot found in remsets

this change works, thank you @BrzVlad

but but why won't happen when not enable collectiable alc?