dotnet / runtime

GC root enumerating crashing with BULK_WRITEBARRIER helper on the stack #101890

Closed by jkotas 5 days ago

jkotas commented 2 months ago

Crash dumps:

https://dev.azure.com/dnceng-public/public/_build/results?buildId=666172&view=ms.vss-test-web.build-test-results-tab&runId=16523620&paneView=dotnet-dnceng.dnceng-build-release-tasks.helix-test-information-tab&resultId=141696

https://dev.azure.com/dnceng-public/public/_build/results?buildId=666172&view=ms.vss-test-web.build-test-results-tab&runId=16523620&paneView=dotnet-dnceng.dnceng-build-release-tasks.helix-test-information-tab&resultId=141697

Both of these are crashes while enumerating GC roots:

* thread #1, name = 'System.Memory.T', stop reason = signal SIGSEGV
  * frame #0: 0x0169219c System.Memory.Tests`WKS::gc_heap::mark_object_simple(unsigned char**) [inlined] MethodTable::HasComponentSize(this=0x00000004) at MethodTable.h:226:25 [opt]
    frame #1: 0x0169219c System.Memory.Tests`WKS::gc_heap::mark_object_simple(unsigned char**) [inlined] WKS::my_get_size(ob=0xeeb38cf8) at gc.cpp:11491 [opt]
    frame #2: 0x01692196 System.Memory.Tests`WKS::gc_heap::mark_object_simple(po=<unavailable>) at gc.cpp:27782 [opt]
    frame #3: 0x01693a28 System.Memory.Tests`WKS::GCHeap::Promote(ppObject=0xf14fec70, sc=<unavailable>, flags=<unavailable>) at gc.cpp:49248:5 [opt]
    frame #4: 0x016b0064 System.Memory.Tests`GcInfoDecoder::ReportUntrackedSlots(GcSlotDecoder&, REGDISPLAY*, unsigned int, void (*)(void*, void**, unsigned int), void*) [inlined] GcInfoDecoder::ReportSlotToGC(this=0xf0afd838, slotDecoder=0xf0afd4e0, slotIndex=10, pRD=0xf0afd948, reportScratchSlots=true, pCallBack=(System.Memory.Tests`EnumGcRefsCallback(void*, void**, unsigned int) + 1 at GcEnum.cpp:119), hCallBack=0xf0afd8c0)(void*, void**, unsigned int), void*) at gcinfodecoder.cpp:0 [opt]
    frame #5: 0x016b001e System.Memory.Tests`GcInfoDecoder::ReportUntrackedSlots(this=0xf0afd838, slotDecoder=0xf0afd4e0, pRD=0xf0afd948, inputFlags=<unavailable>, pCallBack=(System.Memory.Tests`EnumGcRefsCallback(void*, void**, unsigned int) + 1 at GcEnum.cpp:119), hCallBack=0xf0afd8c0)(void*, void**, unsigned int), void*) at gcinfodecoder.cpp:1100 [opt]
    frame #6: 0x016af0d8 System.Memory.Tests`GcInfoDecoder::EnumerateLiveSlots(this=<unavailable>, pRD=0xf0afd948, reportScratchSlots=false, inputFlags=<unavailable>, pCallBack=(System.Memory.Tests`EnumGcRefsCallback(void*, void**, unsigned int) + 1 at GcEnum.cpp:119), hCallBack=0xf0afd8c0)(void*, void**, unsigned int), void*) at gcinfodecoder.cpp:1049:9 [opt]
    frame #7: 0x016b0700 System.Memory.Tests`UnixNativeCodeManager::EnumGcRefs(this=<unavailable>, pMethodInfo=0xf0afd9cc, safePointAddress=<unavailable>, pRegisterSet=<unavailable>, hCallback=0xf0afd8c0, isActiveStackFrame=<unavailable>) at UnixNativeCodeManager.cpp:239:18 [opt]
    frame #8: 0x01679cb4 System.Memory.Tests`EnumGcRefs(pCodeManager=<unavailable>, pMethodInfo=<unavailable>, safePointAddress=<unavailable>, pRegisterSet=<unavailable>, pfnEnumCallback=(System.Memory.Tests`WKS::GCHeap::Promote(Object**, ScanContext*, unsigned int) + 1 at gc.cpp:49182), pvCallbackData=0xf0afdaf0, isActiveStackFrame=<unavailable>)(Object**, ScanContext*, unsigned int), ScanContext*, bool) at GcEnum.cpp:139:19 [opt]
...

The stack trace of the target thread:

  * frame #0: 0xf7caa674 libpthread.so.0`__libc_do_syscall at libc-do-syscall.S:46
    frame #1: 0xf7ca5124 libpthread.so.0`__pthread_cond_wait at futex-internal.h:186:13
    frame #2: 0xf7ca510c libpthread.so.0`__pthread_cond_wait at pthread_cond_wait.c:508
    frame #3: 0xf7ca4fba libpthread.so.0`__pthread_cond_wait(cond=0x049c0910, mutex=0x049c0940) at pthread_cond_wait.c:638
    frame #4: 0x016ab9c6 System.Memory.Tests`GCEvent::Impl::Wait(this=0x049c0910, milliseconds=<unavailable>, alertable=<unavailable>) at events.cpp:149:22 [opt]
    frame #5: 0x0167d768 System.Memory.Tests`Thread::InlineSuspend(UNIX_CONTEXT*) [inlined] Thread::WaitForGC(this=0xf14ff8a0, pTransitionFrame=<unavailable>) at thread.cpp:80:39 [opt]
    frame #6: 0x0167d73a System.Memory.Tests`Thread::InlineSuspend(this=0xf14ff8a0, interruptedContext=<unavailable>) at thread.cpp:884 [opt]
    frame #7: 0x016aa07e System.Memory.Tests`ActivationHandler(code=34, siginfo=0xf14fe898, context=0xf14fe918) at PalRedhawkUnix.cpp:1004:9 [opt]
    frame #8: 0xf7bc2840 libc.so.6 at sigrestorer.S:77
    frame #9: 0x019ece3a System.Memory.Tests`System.Buffer__BulkMoveWithWriteBarrier(destination=0xf14fec1c, source=0xeeb19a5c, byteCount=100) at Buffer.cs:185
    frame #10: 0x01afb91a System.Memory.Tests`System.Reflection.Runtime.TypeInfos.NativeFormat.NativeFormatRuntimeNamedTypeInfo__get_Name(this=0xeeb19a3c) at NativeFormatRuntimeNamedTypeInfo.cs:189
    frame #11: 0x01afafca System.Memory.Tests`System.Reflection.Runtime.TypeInfos.RuntimeNamedTypeInfo__get_FullName(this=0xeeb19a3c) at RuntimeNamedTypeInfo.cs:96
    frame #12: 0x0268774e System.Memory.Tests`System_Linq_System_Linq_Enumerable_ArraySelectIterator_2<System___Canon__System___Canon>__MoveNext(this=0xeeb389b0) at Select.cs:179

Target method:

System.Memory.Tests`System.Reflection.Runtime.TypeInfos.NativeFormat.NativeFormatRuntimeNamedTypeInfo__get_Name:
    0x1afb8f0 <+0>:  push.w {r4, r11, lr}
    0x1afb8f4 <+3>:  sub    sp, #0x74
    0x1afb8f6 <+5>:  add.w  r11, sp, #0x78
    0x1afb8fa <+9>:  movs   r1, #0x0
    0x1afb8fc <+11>: str    r1, [sp]
    0x1afb8fe <+13>: str    r1, [sp, #0x4]
    0x1afb900 <+15>: mov    r4, r0
    0x1afb902 <+17>: add.w  r1, r4, #0x20
    0x1afb906 <+21>: ldrsb.w r0, [r1]
    0x1afb90a <+25>: movw   r3, #0x151b
    0x1afb90e <+29>: movt   r3, #0xffef
    0x1afb912 <+33>: add    r3, pc
    0x1afb914 <+35>: add    r0, sp, #0xc
    0x1afb916 <+37>: movs   r2, #0x64
    0x1afb918 <+39>: blx    r3 <- CORINFO_HELP_BULK_WRITEBARRIER
    0x1afb91a <+41>: ldr    r0, [r4, #0x14] <---- we crash enumerating GC roots here
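
As a sanity check on the annotation above, the `movw`/`movt`/`add r3, pc` sequence can be resolved by hand. The following is only a sketch: the function name is hypothetical, and it assumes the usual Thumb rule that reading `pc` yields the address of the current instruction plus 4, and that bit 0 of a `blx` target selects Thumb state.

```python
MASK32 = 0xFFFFFFFF

def thumb_pc_relative_target(movw_imm, movt_imm, add_insn_addr):
    """Resolve a movw/movt/add-pc sequence executed in Thumb state."""
    # movw sets the low 16 bits, movt sets the high 16 bits of r3.
    offset = ((movt_imm << 16) | movw_imm) & MASK32
    # Assumption: in Thumb state, pc reads as the current instruction address + 4.
    pc_value = add_insn_addr + 4
    # 32-bit wrap-around addition, as the hardware would do it.
    return (offset + pc_value) & MASK32

# Values taken from the disassembly: movw #0x151b, movt #0xffef, add r3, pc at 0x1afb912.
target = thumb_pc_relative_target(0x151B, 0xFFEF, 0x1AFB912)
entry = target & ~1  # clear the Thumb bit to get the code address

print(hex(target), hex(entry))
```

Under those assumptions the call target is 0x019ece30, which is 10 bytes below the 0x019ece3a PC reported for `System.Buffer__BulkMoveWithWriteBarrier` in frame #9 of the target thread. That is consistent with `blx r3` calling the bulk write barrier helper, and with the thread having been interrupted inside it.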

jkotas commented 2 months ago

@EgorBo It looks like the GC reporting is messed up around the new bulk write barrier helper. Could you please take a look?

So far, I have seen it on native AOT linux-arm only. We currently have a higher number of intermittent crashes than usual, with multiple different root causes, so it is not easy to tell whether this specific crash hits linux-arm only.

EgorBo commented 2 months ago

> @EgorBo It looks like the GC reporting is messed up around the new bulk write barrier helper. Could you please take a look?
>
> So far, I have seen it on native AOT linux-arm only. We currently have a higher number of intermittent crashes than usual, with multiple different root causes, so it is not easy to tell whether this specific crash hits linux-arm only.

@SingleAccretion made an interesting guess that it might be related to https://github.com/dotnet/runtime/issues/99410#issuecomment-2034385058 (it is hard to tell from the asm you attached whether this is a tail call arg setup region or not)

EgorBo commented 2 months ago

Ah, that seems very unlikely here. I don't have an arm32 device to test on, but on 64-bit we don't emit any tail calls in that function.

EgorBo commented 5 days ago

It seems to no longer be failing. Very likely fixed by https://github.com/dotnet/runtime/pull/103301, which moved such helpers out of nogc blocks, and potentially also by https://github.com/dotnet/runtime/pull/102580