Open srxqds opened 1 week ago
@BrzVlad @lateralusX can you help me?
Is there any workaround?
Here is the crash stack trace:
0 libmonosgen-2.0.so!copy_object_no_checks [sgen-copy-object.h : 69 + 0x0]
x0 = 0x000000763c3774b0 x1 = 0x0000007682016920
x2 = 0x0000007682016920 x3 = 0x0000000000000377
x4 = 0x0000000000000800 x5 = 0x00000075a9151260
x6 = 0x0000007682016720 x7 = 0x0000007631354000
x8 = 0x000000763c000000 x9 = 0x000000000000c007
x10 = 0x0000000000000000 x11 = 0x0000000000000000
x12 = 0x0000000000000138 x13 = 0x0000000000000019
x14 = 0x000000002cb03dee x15 = 0x00002f82bd2e6a06
x16 = 0x000000767c150448 x17 = 0x00000077c073ddc0
x18 = 0xffffffffffffffda x19 = 0x0000007682016920
x20 = 0x000000763c3774b0 x21 = 0x0000000000000000
x22 = 0x000000763c3774b0 x23 = 0x000000000000003a
x24 = 0xffffffffffffffff x25 = 0x000000767c16c010
x26 = 0x0000000000000000 x27 = 0x0000000000000001
x28 = 0x000000767c16b000 fp = 0x0000007682016600
lr = 0x000000767c120c10 sp = 0x0000007682016600
pc = 0x000000767c1362ac
Found by: given as instruction pointer in context
1 libmonosgen-2.0.so!major_scan_object_with_evacuation [sgen-scan-object.h : 66 + 0x4]
fp = 0x0000007682016690 lr = 0x000000767c120c10
sp = 0x0000007682016610 pc = 0x000000767c120c10
Found by: previous frame's frame pointer
2 libmonosgen-2.0.so!major_scan_object_with_evacuation [sgen-scan-object.h : 66 + 0x4]
fp = 0x0000007682016740 lr = 0x000000767c122d54
sp = 0x00000076820166a0 pc = 0x000000767c120c10
Found by: previous frame's frame pointer
3 libmonosgen-2.0.so!drain_gray_stack [sgen-marksweep.c : 1285 + 0x0]
fp = 0x0000007682016840 lr = 0x000000767c110b30
sp = 0x0000007682016750 pc = 0x000000767c122d54
Found by: previous frame's frame pointer
4 libmonosgen-2.0.so!finish_gray_stack [sgen-gc.c : 1140 + 0x4]
fp = 0x00000076820168c0 lr = 0x000000767c111604
sp = 0x0000007682016850 pc = 0x000000767c110b30
Found by: previous frame's frame pointer
5 libmonosgen-2.0.so!major_finish_collection [sgen-gc.c : 2323 + 0xc]
fp = 0x0000007682016970 lr = 0x000000767c11039c
sp = 0x00000076820168d0 pc = 0x000000767c111604
Found by: previous frame's frame pointer
6 libmonosgen-2.0.so!major_do_collection [sgen-gc.c : 2465 + 0x14]
fp = 0x0000007682016a50 lr = 0x000000767c10c204
sp = 0x0000007682016980 pc = 0x000000767c11039c
Found by: previous frame's frame pointer
7 libmonosgen-2.0.so!sgen_perform_collection [sgen-gc.c : 2762 + 0xc]
fp = 0x0000007682016ab0 lr = 0x000000767c10d198
sp = 0x0000007682016a60 pc = 0x000000767c10c204
Found by: previous frame's frame pointer
8 libmonosgen-2.0.so!sgen_gc_collect [sgen-gc.c : 3214 + 0x18]
fp = 0x0000007682016ae0 lr = 0x000000767c0f2d38
sp = 0x0000007682016ac0 pc = 0x000000767c10d198
Found by: previous frame's frame pointer
9 libmonosgen-2.0.so!mono_gc_collect [sgen-mono.c : 2359 + 0x4]
Tagging subscribers to this area: @brzvlad See info in area-owners.md if you want to be subscribed.
In the provided sample there isn't any code modifying the dictionary. Where does this happen? Is the dictionary modified from unmanaged code? Is SectionTable modified from unmanaged code?
No. All modifications are made from managed code.
Unfortunately, in this scenario, it appears that check-remset-consistency reports a false positive, which is awkward to fix. The reason is that, in a valuetype array where an element spans multiple cards, it is enough to mark a single card for this element. This marked card guarantees that the entire valuetype will be scanned. The remset consistency check doesn't take this into account, so it can report false complaints.
We hope you can fix this problem; this issue has a great impact on us!
We want to use check-remset-consistency to find where things go wrong, but right now it can't be used.
Why would you want to use this option? It is just a debug option to potentially help investigate issues; it is not meant to be used in production.
Yeah, we hit the crash in our shipping app, so we need to use this debug option during development to check for it.
Otherwise we can't figure out the issue.
There is another gc debug option: MONO_GC_DEBUG=binary-protocol=output-file. If you have this enabled in addition to check-remset-consistency, it will not crash but rather just report the found issues. You could then check the console output to see if there are any other missing remsets aside from these false positives. (The false positives should be easy to identify because they happen in an array of valuetypes where the missing card is for an address at a card boundary. In your example, the missing card is for an address at 0x24F68424130 + 1744, which is 0x24F68424800, right at a card boundary; a card covers a region of 0x200 bytes.)
OK, another question: what is binary-protocol used for?
Yes, this setting avoids the crash, but we still can't find what check-remset-consistency considers wrong; we only get the output.
We also found another false case:
2024-10-30 17:49:49 Oldspace->newspace reference 00000157842FC390 at offset 96 in object 00000157BD2697B0 (.StateMachineBox`1) not found in remsets.
2024-10-30 17:49:49 Oldspace->newspace reference 00000157842ED788 at offset 120 in object 00000157BD2697B0 (.StateMachineBox`1) not found in remsets, but object is pinned.
So we hope you can fix it.
This looks like a real bug, but I'm not able to tell what is going on just from this log. Would you be able to run your application and obtain such reports with a custom-built runtime that has additional logging included?
This is the gc-debug log.
Yes, we build the runtime we use ourselves.
StateMachineBox is a class defined in the BCL, not in our code.
Is this crash also happening on Android? Some of the other reports could also be due to gc issues. I think it is best to follow up on this check-remset-consistency report.
Could you attempt to obtain the same remset consistency failure with the following diff:
diff --git a/src/mono/mono/sgen/sgen-debug.c b/src/mono/mono/sgen/sgen-debug.c
index ea2caeba2b6..c9f1f27a7d8 100644
--- a/src/mono/mono/sgen/sgen-debug.c
+++ b/src/mono/mono/sgen/sgen-debug.c
@@ -163,8 +163,11 @@ static gboolean missing_remsets;
gboolean is_pinned = object_is_pinned (*(ptr)); \
SGEN_LOG (0, "Oldspace->newspace reference %p at offset %ld in object %p (%s.%s) not found in remsets%s.", *(ptr), (long)((char*)(ptr) - (char*)(obj)), (obj), sgen_client_vtable_get_namespace (__vt), sgen_client_vtable_get_name (__vt), is_pinned ? ", but object is pinned" : ""); \
sgen_binary_protocol_missing_remset ((obj), __vt, (int) ((char*)(ptr) - (char*)(obj)), *(ptr), (gpointer)LOAD_VTABLE(*(ptr)), is_pinned); \
- if (!is_pinned) \
+ if (!is_pinned) { \
+ mono_object_describe_fields (obj); \
+ mono_object_describe ((MonoObject*)*ptr); \
missing_remsets = TRUE; \
+ } \
} \
} \
} while (0)
It might help explain why there is a missing wbarrier when storing into a StateMachineBox.
Yes, but I enabled the debug option on Windows. Thank you, I will try it later.
As discussed in https://github.com/dotnet/runtime/issues/85318#issuecomment-1533920196, we use the check-remset-consistency debug option to investigate a gc problem, and we always hit a problem that does not seem to be caused by our source code. In the debugger, the address 0000024F68424130 is the dictionary's entry[] field, whose value type is our type SectionTable, and the objects at addresses 0000024F03CDE4D0 and 0000024F03CDED68 are of type int32[]. Our source code is shown below:
Two problems:
SectionTables causes the problematic remsets.