dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License
15.46k stars 4.76k forks source link

[mono] bug about sgen `check-remset-consistency` check #109215

Open srxqds opened 1 month ago

srxqds commented 1 month ago

As we discuss in this issues at https://github.com/dotnet/runtime/issues/85318#issuecomment-1533920196. we use check-remset-consistency debug option for gc problem, we always hit the problem which seems not caused by our source code.

2024-10-25 15:50:41 Begin heap consistency check...
2024-10-25 15:50:41 Oldspace->newspace reference 0000024F03CDE4D0 at offset 1744 in object 0000024F68424130 (.Entry[]) not found in remsets.
2024-10-25 15:50:41 Oldspace->newspace reference 0000024F03CDED68 at offset 1752 in object 0000024F68424130 (.Entry[]) not found in remsets.
2024-10-25 15:50:41 Heap consistency check done.

I have debug view the address 0000024F68424130 which is the dictionary field entry[], the value type is our type SectionTable, and the address 0000024F03CDE4D0 and 0000024F03CDED68 object type is int32[].

our source code show below:

public class PlainBufferManager<TID>
    where TID : IComparable
{
    public struct SectionTable
    {
        public int itemCount;
        public int maxSection;
        public int headerSize;
        public Func<PlainBuffer, int, TID> idGetter;
        public string tableClassName;
        public PlainBuffer[] loadedSectionBuffers;
        public int[] sectionItemCount;
        public SectionTable(int maxSection, int headerSize, Func<PlainBuffer, int, TID> idGetter, string tableClassName, ref PlainBuffer[] rootBufs)
        {
            this.maxSection = maxSection;
            this.headerSize = headerSize;
            this.idGetter = idGetter;
            this.tableClassName = tableClassName;
            loadedSectionBuffers = rootBufs;
            sectionItemCount = new int[maxSection];
            int sectionIndex = 0;
            foreach(PlainBuffer buffer in rootBufs)
            {
                int tempItemCount = buffer.GetInt(0);
                this.itemCount += tempItemCount;
                sectionItemCount[sectionIndex++] = tempItemCount;
            }
        }
    }

    static PlainBufferManager()
    {
        BufferUtils.OnUnload += Unload;
        MultiSectionTableManager.OnSwitchSection += OnSwitchSection;
    }

    public static Dictionary<int, SectionTable> SectionTables { get; private set; } = new Dictionary<int, SectionTable>();

}

two problems:

  1. why the dictionary SectionTables cause the problem remsets.
  2. how can we debug the gc prolem, it is blocked by this solution?
srxqds commented 1 month ago

@BrzVlad @lateralusX can you help me?

any solution can work around?

srxqds commented 1 month ago

and the crash stacktrace:

 0  libmonosgen-2.0.so!copy_object_no_checks [sgen-copy-object.h : 69 + 0x0]
     x0 = 0x000000763c3774b0    x1 = 0x0000007682016920
     x2 = 0x0000007682016920    x3 = 0x0000000000000377
     x4 = 0x0000000000000800    x5 = 0x00000075a9151260
     x6 = 0x0000007682016720    x7 = 0x0000007631354000
     x8 = 0x000000763c000000    x9 = 0x000000000000c007
    x10 = 0x0000000000000000   x11 = 0x0000000000000000
    x12 = 0x0000000000000138   x13 = 0x0000000000000019
    x14 = 0x000000002cb03dee   x15 = 0x00002f82bd2e6a06
    x16 = 0x000000767c150448   x17 = 0x00000077c073ddc0
    x18 = 0xffffffffffffffda   x19 = 0x0000007682016920
    x20 = 0x000000763c3774b0   x21 = 0x0000000000000000
    x22 = 0x000000763c3774b0   x23 = 0x000000000000003a
    x24 = 0xffffffffffffffff   x25 = 0x000000767c16c010
    x26 = 0x0000000000000000   x27 = 0x0000000000000001
    x28 = 0x000000767c16b000    fp = 0x0000007682016600
     lr = 0x000000767c120c10    sp = 0x0000007682016600
     pc = 0x000000767c1362ac
    Found by: given as instruction pointer in context
 1  libmonosgen-2.0.so!major_scan_object_with_evacuation [sgen-scan-object.h : 66 + 0x4]
     fp = 0x0000007682016690    lr = 0x000000767c120c10
     sp = 0x0000007682016610    pc = 0x000000767c120c10
    Found by: previous frame's frame pointer
 2  libmonosgen-2.0.so!major_scan_object_with_evacuation [sgen-scan-object.h : 66 + 0x4]
     fp = 0x0000007682016740    lr = 0x000000767c122d54
     sp = 0x00000076820166a0    pc = 0x000000767c120c10
    Found by: previous frame's frame pointer
 3  libmonosgen-2.0.so!drain_gray_stack [sgen-marksweep.c : 1285 + 0x0]
     fp = 0x0000007682016840    lr = 0x000000767c110b30
     sp = 0x0000007682016750    pc = 0x000000767c122d54
    Found by: previous frame's frame pointer
 4  libmonosgen-2.0.so!finish_gray_stack [sgen-gc.c : 1140 + 0x4]
     fp = 0x00000076820168c0    lr = 0x000000767c111604
     sp = 0x0000007682016850    pc = 0x000000767c110b30
    Found by: previous frame's frame pointer
 5  libmonosgen-2.0.so!major_finish_collection [sgen-gc.c : 2323 + 0xc]
     fp = 0x0000007682016970    lr = 0x000000767c11039c
     sp = 0x00000076820168d0    pc = 0x000000767c111604
    Found by: previous frame's frame pointer
 6  libmonosgen-2.0.so!major_do_collection [sgen-gc.c : 2465 + 0x14]
     fp = 0x0000007682016a50    lr = 0x000000767c10c204
     sp = 0x0000007682016980    pc = 0x000000767c11039c
    Found by: previous frame's frame pointer
 7  libmonosgen-2.0.so!sgen_perform_collection [sgen-gc.c : 2762 + 0xc]
     fp = 0x0000007682016ab0    lr = 0x000000767c10d198
     sp = 0x0000007682016a60    pc = 0x000000767c10c204
    Found by: previous frame's frame pointer
 8  libmonosgen-2.0.so!sgen_gc_collect [sgen-gc.c : 3214 + 0x18]
     fp = 0x0000007682016ae0    lr = 0x000000767c0f2d38
     sp = 0x0000007682016ac0    pc = 0x000000767c10d198
    Found by: previous frame's frame pointer
 9  libmonosgen-2.0.so!mono_gc_collect [sgen-mono.c : 2359 + 0x4]
dotnet-policy-service[bot] commented 1 month ago

Tagging subscribers to this area: @brzvlad See info in area-owners.md if you want to be subscribed.

BrzVlad commented 1 month ago

In the provided sample there isn't any code modifying the dictionary. Where does this happen ? Is the dictionary modified from unmanaged code ? Is SectionTable modified from unmanaged code ?

srxqds commented 1 month ago

In the provided sample there isn't any code modifying the dictionary. Where does this happen ? Is the dictionary modified from unmanaged code ? Is SectionTable modified from unmanaged code ?

No. All modified is from managed code

BrzVlad commented 1 month ago

Unfortunately, in this scenario, it appears that check-remset-consistency reports a false positive, which is awkward to fix. The reason is that, in a valuetype array where an element spans over multiple cards, it is enough to mark a single card for this element. This marked card will guarantee that the entire valuetype will be scanned. The remset consistency check doesn't take this into account so it can report false complaints.

srxqds commented 1 month ago

Unfortunately, in this scenario, it appears that check-remset-consistency reports a false positive, which is awkward to fix. The reason is that, in a valuetype array where an element spans over multiple cards, it is enough to mark a single card for this element. This marked card will guarantee that the entire valuetype will be scanned. The remset consistency check doesn't take this into account so it can report false complaints.

so, hope you can fix this problem, This issue has a great impact on us!

srxqds commented 1 month ago

we want to use check-remset-consistency check the wrong place, but now it can't be used

BrzVlad commented 1 month ago

Why would you want to use this option ? It is just a debug option to potentially help investigate issues, it is not meant to be used in production.

srxqds commented 1 month ago

Why would you want to use this option ? It is just a debug option to potentially help investigate issues, it is not meant to be used in production.

Yeah,we hit the crash on shipping app,we need use this debug option in development for check.

srxqds commented 1 month ago

Otherwise we can't figure out the issue

BrzVlad commented 1 month ago

There is another gc debug option: MONO_GC_DEBUG=binary-protocol=output-file. If you have this enabled in addition to check-remset-consistency, it will not crash but rather just report the found issues. You could then check the console output to see if there are any other missing remsets aside from these false positives. (The false positives should be easy to identify because they happen in a array of valuetypes where the missing card is for an address at a card boundary. In your example, the missing card is for an address at 0x24F68424130 + 1744 which is 0x24F68424800, right at a card boundary; a card is for a region of 0x200 bytes)

srxqds commented 1 month ago

ok,another quesion what the binary-protocol use?

srxqds commented 1 month ago

There is another gc debug option: MONO_GC_DEBUG=binary-protocol=output-file. If you have this enabled in addition to check-remset-consistency, it will not crash but rather just report the found issues. You could then check the console output to see if there are any other missing remsets aside from these false positives. (The false positives should be easy to identify because they happen in a array of valuetypes where the missing card is for an address at a card boundary. In your example, the missing card is for an address at 0x24F68424130 + 1744 which is 0x24F68424800, right at a card boundary; a card is for a region of 0x200 bytes)

yes, this setting can avoid crash ,but we can't found the check-remset-consistency wrong, only output

srxqds commented 1 month ago

we also found another false case:

2024-10-30 17:49:49 Oldspace->newspace reference 00000157842FC390 at offset 96 in object 00000157BD2697B0 (.StateMachineBox`1) not found in remsets.
2024-10-30 17:49:49 Oldspace->newspace reference 00000157842ED788 at offset 120 in object 00000157BD2697B0 (.StateMachineBox`1) not found in remsets, but object is pinned.

so hope you can fix it.

BrzVlad commented 4 weeks ago

we also found another false case:

2024-10-30 17:49:49 Oldspace->newspace reference 00000157842FC390 at offset 96 in object 00000157BD2697B0 (.StateMachineBox`1) not found in remsets.
2024-10-30 17:49:49 Oldspace->newspace reference 00000157842ED788 at offset 120 in object 00000157BD2697B0 (.StateMachineBox`1) not found in remsets, but object is pinned.

so hope you can fix it.

This looks like a real bug, but I'm not able to tell what is going on just from this log. Would you be able to run your application and obtain such reports with a custom build runtime that has additional logging included ?

srxqds commented 4 weeks ago

gc-debug.log

this is gc-debug log

yes, the runtime we used building by ourself.

the StateMachineBox is the class defined in bcl not our code

BrzVlad commented 4 weeks ago

Is this crash happening also on android ? Some of the other reports could also be due to gc issues. I think it is best to follow up on this check remset consistency report.

Could you attempt to obtain the same remset consistency failure with the following diff:

diff --git a/src/mono/mono/sgen/sgen-debug.c b/src/mono/mono/sgen/sgen-debug.c
index ea2caeba2b6..c9f1f27a7d8 100644
--- a/src/mono/mono/sgen/sgen-debug.c
+++ b/src/mono/mono/sgen/sgen-debug.c
@@ -163,8 +163,11 @@ static gboolean missing_remsets;
                                gboolean is_pinned = object_is_pinned (*(ptr)); \
                                SGEN_LOG (0, "Oldspace->newspace reference %p at offset %ld in object %p (%s.%s) not found in remsets%s.", *(ptr), (long)((char*)(ptr) - (char*)(obj)), (obj), sgen_client_vtable_get_namespace (__vt), sgen_client_vtable_get_name (__vt), is_pinned ? ", but object is pinned" : ""); \
                                sgen_binary_protocol_missing_remset ((obj), __vt, (int) ((char*)(ptr) - (char*)(obj)), *(ptr), (gpointer)LOAD_VTABLE(*(ptr)), is_pinned); \
-                               if (!is_pinned)                         \
+                               if (!is_pinned) {                       \
+                                       mono_object_describe_fields (obj); \
+                                       mono_object_describe ((MonoObject*)*ptr); \
                                        missing_remsets = TRUE;         \
+                               }                                       \
                        }                                               \
                }                                                       \
        } while (0)

It might help explain why is there a missing wbarrier when storing into an StateMachineBox

srxqds commented 4 weeks ago

Is this crash happening also on android ? Some of the other reports could also be due to gc issues. I think it is best to follow up on this check remset consistency report.

Could you attempt to obtain the same remset consistency failure with the following diff:

diff --git a/src/mono/mono/sgen/sgen-debug.c b/src/mono/mono/sgen/sgen-debug.c
index ea2caeba2b6..c9f1f27a7d8 100644
--- a/src/mono/mono/sgen/sgen-debug.c
+++ b/src/mono/mono/sgen/sgen-debug.c
@@ -163,8 +163,11 @@ static gboolean missing_remsets;
                                gboolean is_pinned = object_is_pinned (*(ptr)); \
                                SGEN_LOG (0, "Oldspace->newspace reference %p at offset %ld in object %p (%s.%s) not found in remsets%s.", *(ptr), (long)((char*)(ptr) - (char*)(obj)), (obj), sgen_client_vtable_get_namespace (__vt), sgen_client_vtable_get_name (__vt), is_pinned ? ", but object is pinned" : ""); \
                                sgen_binary_protocol_missing_remset ((obj), __vt, (int) ((char*)(ptr) - (char*)(obj)), *(ptr), (gpointer)LOAD_VTABLE(*(ptr)), is_pinned); \
-                               if (!is_pinned)                         \
+                               if (!is_pinned) {                       \
+                                       mono_object_describe_fields (obj); \
+                                       mono_object_describe ((MonoObject*)*ptr); \
                                        missing_remsets = TRUE;         \
+                               }                                       \
                        }                                               \
                }                                                       \
        } while (0)

It might help explain why is there a missing wbarrier when storing into an StateMachineBox

yes,but I open debug on windows. thank you i will try later

BrzVlad commented 2 weeks ago

@srxqds Have you tried obtaining additional information about these crashes with the above diff ?

srxqds commented 2 weeks ago

@srxqds Have you tried obtaining additional information about these crashes with the above diff ?

i am sorry, The project was canceled before I continue to look this problem, the project has developed for 5 years.

so these crash is so big impact for our project.

srxqds commented 4 days ago

Because crashes are often difficult to reproduce and unpredictable, I want to find out the details of internal call pass object argument.

as the Array.Mono.cs show use ObjectHandleOnStack pass the object at https://github.com/dotnet/runtime/blob/354ec46a63440608bda18e2203bb5538e2f8eae6/src/mono/System.Private.CoreLib/src/System/Array.Mono.cs#L426

we found this note below the code:

// Do not change this to call GetGenericValue_icall directly, due to special casing in the runtime.

but we use the mono_add_internal_call method add internal call pass the object directly. Will it be a hidden danger?

our code show below:

class FGenericArray
{
[MethodImpl(MethodImplOptions.InternalCall)]
private extern int Internal_Add<T>(ref T Item);

protected int Add<T>(ref T Item) 
{
    return Internal_Add(ref Item);
}
}

// we use `mono_value_copy` or `mono_gc_wbarrier_generic_store` set the item value when the element type is class or struct has reference type.
static int32 Internal_FGenericArray_Add(MonoObject* self, void* Item)
{
    return FGenericArray::Add(self, Item);
}

mono_add_internal_call("UnrealEngine.FGenericArray::Internal_Add", &Internal_FGenericArray_Add);