Segmentation Fault in libcoreclr Since .Net 8 Upgrade

adamrodger commented 8 months ago

Description

I have a C# application that calls into a native C++ library via interop. Since upgrading to .Net 8, the application has started to crash intermittently with a segmentation fault. The code is unchanged from before the upgrade and has run in production on .Net 5, 6 and 7 without a segfault.

Reproduction Steps

I have written a few smaller apps to reproduce the issue, but it's difficult to share them given they rely on calling the proprietary C++ library.

Essentially the apps:

1. Load a large amount of data from some source - e.g. a REST call or from a JSON file
2. Use various parts of the loaded data to make interop calls to the C++ library
3. Repeat in a loop

The segfault always happens during the data load step (i.e. when a large amount of managed memory is being allocated), but only once the interop has happened at least once. I've verified that when the interop calls are stubbed out, the app never segfaults.

Expected behavior

The app works as per .Net 7 version.

Actual behavior

The app crashes with a segfault intermittently.

Regression?

Yes, the app worked on .Net 7 and I can change the version back to .Net 7 and it continues to work fine.

Known Workarounds

No known workarounds other than not using .Net 8, although the problem does get worse the less memory the container has. Our theory is that this triggers more frequent garbage collections and this is where the segfault occurs.

Configuration

.Net Version: 8.0.2
OS: Linux (Debian bookworm)
Container image: mcr.microsoft.com/dotnet/sdk:8.0.2 (used for analysing core dumps) and also mcr.microsoft.com/dotnet/runtime:8.0.2 in production
Architecture: x86-64

Other information

LLDB backtrace:

* thread #1, name = 'dotnet', stop reason = signal SIGSEGV
  * frame #0: 0x00007f356d0a298a libcoreclr.so`IsIPInEpilog(_CONTEXT*, EECodeInfo*, int*) [inlined] IsIPInProlog(pCodeInfo=0x00007f356db38328) at excep.cpp:7169:49
    frame #1: 0x00007f356d0a2987 libcoreclr.so`IsIPInEpilog(pContextToCheck=0x00007f356db39e60, pCodeInfo=0x00007f356db38328, pSafeToInjectThreadAbort=YES) at excep.cpp:7236:9
    frame #2: 0x00007f356d24cca4 libcoreclr.so`HandleSuspensionForInterruptedThread(interruptedContext=0x00007f356db39e60) at threadsuspend.cpp:5914:13
    frame #3: 0x00007f356d48208c libcoreclr.so`inject_activation_handler(code=<unavailable>, siginfo=<unavailable>, context=0x00007f356db3aac0) at signal.cpp:840:13
    frame #4: 0x00007f356d66c050 libc.so.6`__restore_rt
    frame #5: 0x00007f34f0b200b0
    frame #6: 0x00007f34eed0fc79
    frame #7: 0x00007f34eecb326e
    frame #8: 0x00007f34eecb2f30
    frame #9: 0x00007f34eecb1997
    frame #10: 0x00007f356d33ac27 libcoreclr.so`CallDescrWorkerInternal at calldescrworkeramd64.S:97
    frame #11: 0x00007f356d17320e libcoreclr.so`MethodDescCallSite::CallTargetWorker(unsigned long const*, unsigned long*, int) at callhelpers.cpp:67:5
    frame #12: 0x00007f356d1731b4 libcoreclr.so`MethodDescCallSite::CallTargetWorker(this=<unavailable>, pArguments=0x00007ffe459a6ce8, pReturnValue=0x0000000000000000, cbReturnValue=0) at callhelpers.cpp:562:9
    frame #13: 0x00007f356d059bb4 libcoreclr.so`RunMain(MethodDesc*, short, int*, PtrArray**) [inlined] MethodDescCallSite::Call(this=0x00007ffe459a6d48, pArguments=0x00007ffe459a6ce8) at callhelpers.h:458:9
    frame #14: 0x00007f356d059bab libcoreclr.so`RunMain(MethodDesc*, short, int*, PtrArray**) at assembly.cpp:1303:21
    frame #15: 0x00007f356d059a5d libcoreclr.so`RunMain(MethodDesc*, short, int*, PtrArray**) [inlined] RunMain(this=<unavailable>, pParam=0x00007ffe459a6cb0)::$_0::operator()(Param*) const::'lambda'(Param*)::operator()(Param*) const at assembly.cpp:1375:9
    frame #16: 0x00007f356d059a5d libcoreclr.so`RunMain(MethodDesc*, short, int*, PtrArray**) at assembly.cpp:1377:5
    frame #17: 0x00007f356d059a50 libcoreclr.so`RunMain(pFD=0x00007f34eed3f348, numSkipArgs=1, piRetVal=0x00007ffe459a6e9c, stringArgs=0x00007ffe459a7170) at assembly.cpp:1377:5
    frame #18: 0x00007f356d05a028 libcoreclr.so`Assembly::ExecuteMainMethod(this=0x0000558604418360, stringArgs=0x00007ffe459a7170, waitForOtherThreads=YES) at assembly.cpp:1504:18
    frame #19: 0x00007f356d08707c libcoreclr.so`CorHost2::ExecuteAssembly(this=<unavailable>, dwAppDomainId=<unavailable>, pwzAssemblyPath=<unavailable>, argc=0, argv=0x0000000000000000, pReturnValue=0x00007ffe459a72a0) at corhost.cpp:349:39
    frame #20: 0x00007f356d0443f0 libcoreclr.so`::coreclr_execute_assembly(hostHandle=0x0000558604417fa0, domainId=1, argc=0, argv=<unavailable>, managedAssemblyPath=<unavailable>, exitCode=0x00007ffe459a72a0) at exports.cpp:504:24
    frame #21: 0x00007f356d5c2aee libhostpolicy.so`run_app_for_context(context=0x00005586043eed60, argc=0, argv=0x00007ffe459a7a08) at hostpolicy.cpp:250:32
    frame #22: 0x00007f356d5c3c09 libhostpolicy.so`::corehost_main(const int, const pal::char_t **) [inlined] run_app(argc=0, argv=0x00007ffe459a7a08) at hostpolicy.cpp:285:12
    frame #23: 0x00007f356d5c3be9 libhostpolicy.so`::corehost_main(argc=2, argv=<unavailable>) at hostpolicy.cpp:426:12
    frame #24: 0x00007f356d601333 libhostfxr.so`fx_muxer_t::handle_exec_host_command(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, host_startup_info_t const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::unordered_map<known_options, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >, known_options_hash, std::equal_to<known_options>, std::allocator<std::pair<known_options const, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > > > > const&, int, char const**, int, host_mode_t, bool, char*, int, int*) at fx_muxer.cpp:145:20
    frame #25: 0x00007f356d60109b libhostfxr.so`fx_muxer_t::handle_exec_host_command(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, host_startup_info_t const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::unordered_map<known_options, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >, known_options_hash, std::equal_to<known_options>, std::allocator<std::pair<known_options const, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > > > > const&, int, char const**, int, host_mode_t, bool, char*, int, int*) [inlined] (anonymous namespace)::read_config_and_execute(host_command=<unavailable>, host_info=<unavailable>, app_candidate=""..., opts=size=0, new_argc=2, new_argv=0x00007ffe459a79f8, mode=<unavailable>, is_sdk_command=<unavailable>, out_buffer=<unavailable>, buffer_size=<unavailable>, required_buffer_size=<unavailable>) at fx_muxer.cpp:532:18
    frame #26: 0x00007f356d600f96 libhostfxr.so`fx_muxer_t::handle_exec_host_command(host_command="", host_info=0x00007ffe459a7690, app_candidate=""..., opts=size=0, argc=2, argv=0x00007ffe459a79f8, argoff=1, mode=muxer, is_sdk_command=<unavailable>, result_buffer=0x0000000000000000, buffer_size=0, required_buffer_size=0x0000000000000000) at fx_muxer.cpp:1007:12
    frame #27: 0x00007f356d60030d libhostfxr.so`fx_muxer_t::execute(host_command="", argc=2, argv=0x00007ffe459a79f8, host_info=0x00007ffe459a7690, result_buffer=0x0000000000000000, buffer_size=0, required_buffer_size=0x0000000000000000) at fx_muxer.cpp:578:18
    frame #28: 0x00007f356d5fc5a2 libhostfxr.so`::hostfxr_main_startupinfo(argc=2, argv=0x00007ffe459a79f8, host_path="/usr/share/dotnet/dotnet", dotnet_root="/usr/share/dotnet/", app_path="/usr/share/dotnet/dotnet.dll") at hostfxr.cpp:62:12
    frame #29: 0x0000558603bddf80 dotnet`exe_start(argc=2, argv=0x00007ffe459a79f8) at corehost.cpp:240:18
    frame #30: 0x0000558603bde26f dotnet`main(argc=2, argv=0x00007ffe459a79f8) at corehost.cpp:308:21
    frame #31: 0x00007f356d65724a libc.so.6`__libc_start_call_main(main=(dotnet`main at corehost.cpp:290), argc=2, argv=0x00007ffe459a79f8) at libc_start_call_main.h:58:16
    frame #32: 0x00007f356d657305 libc.so.6`__libc_start_main_impl(main=(dotnet`main at corehost.cpp:290), argc=2, argv=0x00007ffe459a79f8, init=(_rtld_global), fini=<unavailable>, rtld_fini=<unavailable>, stack_end=0x00007ffe459a79e8) at libc-start.c:360:3
    frame #33: 0x0000558603bd4c49 dotnet`_start + 41

LLDB frame variable:

(EECodeInfo *) pCodeInfo = 0x00007f356db38328
(bool) fInsideProlog = true
(PTR_RUNTIME_FUNCTION) funcEntry = <variable not available>

(DWORD) prologLen = <variable not available>

I've tried to configure createdump to capture core dumps when the application crashes, but it never seems to trigger (it does on other apps). I assume the segfault prevents createdump from triggering, so I only have the standard core dump written by Linux. I've tried to get the CLR stack output, but the core dump doesn't seem to work with SOS.

I've also run the test apps on Windows and I can never get them to segfault. They only seem to segfault on Linux.

adamrodger commented 8 months ago

Just to confirm our theory that the segfault happens during GC, I added the following at the end of the loop after the interop code has been disposed/freed:

GC.Collect(GC.MaxGeneration, GCCollectionMode.Forced, blocking: true, compacting: true);
GC.Collect(GC.MaxGeneration, GCCollectionMode.Aggressive, blocking: true, compacting: true);

The segfault is still intermittent, but now it happens during the Aggressive GC (the second one). The core dump name has also changed: the dump is now named core..NET Finalizer.8, and the thread backtrace is:

* thread #1, name = 'dotnet', stop reason = signal SIGSEGV
  * frame #0: 0x00007fec3fa2f98a libcoreclr.so`IsIPInEpilog(_CONTEXT*, EECodeInfo*, int*) [inlined] IsIPInProlog(pCodeInfo=0x00007fec3cb51328) at excep.cpp:7169:49
    frame #1: 0x00007fec3fa2f987 libcoreclr.so`IsIPInEpilog(pContextToCheck=0x00007fec3cb52e60, pCodeInfo=0x00007fec3cb51328, pSafeToInjectThreadAbort=YES) at excep.cpp:7236:9
    frame #2: 0x00007fec3fbd9ca4 libcoreclr.so`HandleSuspensionForInterruptedThread(interruptedContext=0x00007fec3cb52e60) at threadsuspend.cpp:5914:13
    frame #3: 0x00007fec3fe0f08c libcoreclr.so`inject_activation_handler(code=<unavailable>, siginfo=<unavailable>, context=0x00007fec3cb53ac0) at signal.cpp:840:13
    frame #4: 0x00007fec3fff9050 libc.so.6`___lldb_unnamed_symbol3252 + 1
    frame #5: 0x00007febc0bd7aea
    frame #6: 0x00007fec3fcc7ba6 libcoreclr.so`FastCallFinalizeWorker at calldescrworkeramd64.S:30
    frame #7: 0x00007fec3fa84213 libcoreclr.so`MethodTable::CallFinalizer(Object*) at methodtable.cpp:4770:5
    frame #8: 0x00007fec3fa841c9 libcoreclr.so`MethodTable::CallFinalizer(obj=0x00007feb92834630) at methodtable.cpp:4888:5
    frame #9: 0x00007fec3fb3d955 libcoreclr.so`FinalizerThread::FinalizeAllObjects() [inlined] CallFinalizer(obj=0x00007feb92834630) at finalizerthread.cpp:75:9
    frame #10: 0x00007fec3fb3d8fd libcoreclr.so`FinalizerThread::FinalizeAllObjects() at finalizerthread.cpp:104:9
    frame #11: 0x00007fec3fb3dba5 libcoreclr.so`FinalizerThread::FinalizerThreadWorker(args=<unavailable>) at finalizerthread.cpp:348:9
    frame #12: 0x00007fec3face7c5 libcoreclr.so`ManagedThreadBase_DispatchOuter(ManagedThreadCallState*) [inlined] ManagedThreadBase_DispatchInner(pCallState=<unavailable>) at threads.cpp:7222:5
    frame #13: 0x00007fec3face7c3 libcoreclr.so`ManagedThreadBase_DispatchOuter(ManagedThreadCallState*) at threads.cpp:7266:9
    frame #14: 0x00007fec3face788 libcoreclr.so`ManagedThreadBase_DispatchOuter(ManagedThreadCallState*) [inlined] ManagedThreadBase_DispatchOuter(this=<unavailable>, pParam=<unavailable>)::$_0::operator()(ManagedThreadBase_DispatchOuter(ManagedThreadCallState*)::TryArgs*) const::'lambda'(Param*)::operator()(Param*) const at threads.cpp:7424:13
    frame #15: 0x00007fec3face788 libcoreclr.so`ManagedThreadBase_DispatchOuter(ManagedThreadCallState*) at threads.cpp:7426:9
    frame #16: 0x00007fec3face765 libcoreclr.so`ManagedThreadBase_DispatchOuter(pCallState=0x00007fec3d353de0) at threads.cpp:7450:5
    frame #17: 0x00007fec3facedcd libcoreclr.so`ManagedThreadBase::FinalizerBase(void (*)(void*)) [inlined] ManagedThreadBase_NoADTransition(pTarget=<unavailable>, filterType=FinalizerThread)(void*), UnhandledExceptionLocation) at threads.cpp:7494:5
    frame #18: 0x00007fec3facedb5 libcoreclr.so`ManagedThreadBase::FinalizerBase(pTarget=<unavailable>)(void*)) at threads.cpp:7513:5
    frame #19: 0x00007fec3fb3de48 libcoreclr.so`FinalizerThread::FinalizerThreadStart(args=<unavailable>) at finalizerthread.cpp:398:17
    frame #20: 0x00007fec3fe3effe libcoreclr.so`CorUnix::CPalThread::ThreadEntry(pvParam=0x000055accab9f510) at thread.cpp:1760:16
    frame #21: 0x00007fec40046134 libc.so.6`___lldb_unnamed_symbol3514 + 708
    frame #22: 0x00007fec400c67dc libc.so.6`___lldb_unnamed_symbol3939 + 11

jkotas commented 8 months ago

Could you please disassemble the code around the place that it is crashing at and find the exact instruction where the crash occurs?

jkotas commented 8 months ago

cc @VSadov

adamrodger commented 8 months ago

Disassembling at the first frame gives:

(lldb) disassemble -a 0x00007fec3fa2f98a
libcoreclr.so`IsIPInEpilog:
    0x7fec3fa2f950 <+0>:   pushq  %rbp
    0x7fec3fa2f951 <+1>:   movq   %rsp, %rbp
    0x7fec3fa2f954 <+4>:   pushq  %r15
    0x7fec3fa2f956 <+6>:   pushq  %r14
    0x7fec3fa2f958 <+8>:   pushq  %r13
    0x7fec3fa2f95a <+10>:  pushq  %r12
    0x7fec3fa2f95c <+12>:  pushq  %rbx
    0x7fec3fa2f95d <+13>:  subq   $0xd38, %rsp              ; imm = 0xD38
    0x7fec3fa2f964 <+20>:  movq   %rdx, %rbx
    0x7fec3fa2f967 <+23>:  movq   %rsi, %r12
    0x7fec3fa2f96a <+26>:  movq   %rdi, %r13
    0x7fec3fa2f96d <+29>:  movq   %fs:0x28, %rax
    0x7fec3fa2f976 <+38>:  movq   %rax, -0x30(%rbp)
    0x7fec3fa2f97a <+42>:  movq   0xf8(%rdi), %r14
    0x7fec3fa2f981 <+49>:  movl   $0x1, (%rdx)
    0x7fec3fa2f987 <+55>:  movq   %rsi, %rdi
->  0x7fec3fa2f98a <+58>:  callq  0x243610                  ; EECodeInfo::GetFunctionEntry at jitinterface.cpp:14478
    0x7fec3fa2f98f <+63>:  movq   0x8(%r12), %rcx
    0x7fec3fa2f994 <+68>:  movq   (%rcx), %rcx
    0x7fec3fa2f997 <+71>:  movl   0x8(%rax), %eax
    0x7fec3fa2f99a <+74>:  movzbl 0x1(%rcx,%rax), %eax
    0x7fec3fa2f99f <+79>:  cmpl   %eax, 0x28(%r12)
    0x7fec3fa2f9a4 <+84>:  jae    0x2079ad                  ; <+93> at excep.cpp:7259:15
    0x7fec3fa2f9a6 <+86>:  xorl   %eax, %eax
    0x7fec3fa2f9a8 <+88>:  jmp    0x207ab3                  ; <+355> at excep.cpp:7306:1
    0x7fec3fa2f9ad <+93>:  movq   $0x0, -0xd60(%rbp)
    0x7fec3fa2f9b8 <+104>: movq   %r12, %rdi
    0x7fec3fa2f9bb <+107>: callq  0x243610                  ; EECodeInfo::GetFunctionEntry at jitinterface.cpp:14478
    0x7fec3fa2f9c0 <+112>: movq   %rax, %r15
    0x7fec3fa2f9c3 <+115>: movq   0x8(%r12), %rax
    0x7fec3fa2f9c8 <+120>: movq   (%rax), %r12
    0x7fec3fa2f9cb <+123>: leaq   -0xb50(%rbp), %rdi
    0x7fec3fa2f9d2 <+130>: movl   $0xb20, %edx              ; imm = 0xB20
    0x7fec3fa2f9d7 <+135>: xorl   %esi, %esi
    0x7fec3fa2f9d9 <+137>: callq  0x65e600                  ; symbol stub for: memset
    0x7fec3fa2f9de <+142>: leaq   -0xc50(%rbp), %rdi
    0x7fec3fa2f9e5 <+149>: movl   $0x100, %edx              ; imm = 0x100
    0x7fec3fa2f9ea <+154>: movq   %r13, %rsi
    0x7fec3fa2f9ed <+157>: callq  0x65e660                  ; symbol stub for: memcpy
    0x7fec3fa2f9f2 <+162>: xorps  %xmm0, %xmm0
    0x7fec3fa2f9f5 <+165>: movaps %xmm0, -0xc60(%rbp)
    0x7fec3fa2f9fc <+172>: movaps %xmm0, -0xc70(%rbp)
    0x7fec3fa2fa03 <+179>: movaps %xmm0, -0xc80(%rbp)
    0x7fec3fa2fa0a <+186>: movaps %xmm0, -0xc90(%rbp)
    0x7fec3fa2fa11 <+193>: movaps %xmm0, -0xca0(%rbp)
    0x7fec3fa2fa18 <+200>: movaps %xmm0, -0xcb0(%rbp)
    0x7fec3fa2fa1f <+207>: movaps %xmm0, -0xcc0(%rbp)
    0x7fec3fa2fa26 <+214>: movaps %xmm0, -0xcd0(%rbp)
    0x7fec3fa2fa2d <+221>: movaps %xmm0, -0xce0(%rbp)
    0x7fec3fa2fa34 <+228>: movaps %xmm0, -0xcf0(%rbp)
    0x7fec3fa2fa3b <+235>: movaps %xmm0, -0xd00(%rbp)
    0x7fec3fa2fa42 <+242>: movaps %xmm0, -0xd10(%rbp)
    0x7fec3fa2fa49 <+249>: movaps %xmm0, -0xd20(%rbp)
    0x7fec3fa2fa50 <+256>: movaps %xmm0, -0xd30(%rbp)
    0x7fec3fa2fa57 <+263>: movaps %xmm0, -0xd40(%rbp)
    0x7fec3fa2fa5e <+270>: movaps %xmm0, -0xd50(%rbp)
    0x7fec3fa2fa65 <+277>: leaq   -0xd50(%rbp), %rax
    0x7fec3fa2fa6c <+284>: leaq   -0xd60(%rbp), %r10
    0x7fec3fa2fa73 <+291>: leaq   -0xd58(%rbp), %r9
    0x7fec3fa2fa7a <+298>: movl   $0x1, %edi
    0x7fec3fa2fa7f <+303>: movq   %r12, %rsi
    0x7fec3fa2fa82 <+306>: movq   %r14, %rdx
    0x7fec3fa2fa85 <+309>: movq   %r15, %rcx
    0x7fec3fa2fa88 <+312>: leaq   -0xc50(%rbp), %r8
    0x7fec3fa2fa8f <+319>: pushq  %rax
    0x7fec3fa2fa90 <+320>: pushq  %r10
    0x7fec3fa2fa92 <+322>: callq  0x3e2f70                  ; RtlVirtualUnwind_Wrapper at excepamd64.cpp:151
    0x7fec3fa2fa97 <+327>: addq   $0x10, %rsp
    0x7fec3fa2fa9b <+331>: movq   %rax, %rcx
    0x7fec3fa2fa9e <+334>: testq  %rax, %rax
    0x7fec3fa2faa1 <+337>: sete   %al
    0x7fec3fa2faa4 <+340>: orq    -0xca8(%rbp), %rcx
    0x7fec3fa2faab <+347>: jne    0x207ab3                  ; <+355> at excep.cpp:7306:1
    0x7fec3fa2faad <+349>: movl   $0x0, (%rbx)
    0x7fec3fa2fab3 <+355>: movq   %fs:0x28, %rcx
    0x7fec3fa2fabc <+364>: cmpq   -0x30(%rbp), %rcx
    0x7fec3fa2fac0 <+368>: jne    0x207ad4                  ; <+388> at excep.cpp
    0x7fec3fa2fac2 <+370>: addq   $0xd38, %rsp              ; imm = 0xD38
    0x7fec3fa2fac9 <+377>: popq   %rbx
    0x7fec3fa2faca <+378>: popq   %r12
    0x7fec3fa2facc <+380>: popq   %r13
    0x7fec3fa2face <+382>: popq   %r14
    0x7fec3fa2fad0 <+384>: popq   %r15
    0x7fec3fa2fad2 <+386>: popq   %rbp
    0x7fec3fa2fad3 <+387>: retq
    0x7fec3fa2fad4 <+388>: callq  0x65e5f0                  ; symbol stub for: __stack_chk_fail
(lldb)

jkotas commented 8 months ago

Could you please dump the register values? (info registers)

rmsimpson commented 8 months ago

> but only once the interop has happened at least once. I've verified stubbing out the interop calls and then the app never segfaults.

Can we see the definitions of the C++ function and the .NET import definition?

Since this happens after an interop has occurred, usually it means something in the interop corrupted something, and the difference in behavior between .net 7 and 8 could just be due to timing, or memory being arranged, rearranged, allocated or collected differently.

adamrodger commented 8 months ago

Hopefully this is the correct registers command:

(lldb) register read
General Purpose Registers:
       rax = 0x4530139429087c00
       rbx = 0x00007fec3cb51324
       rcx = 0x00007fec3d3539a8
       rdx = 0x00007fec3cb51324
       rdi = 0x00007fec3cb51328
       rsi = 0x00007fec3cb51328
       rbp = 0x00007fec3cb51310
       rsp = 0x00007fec3cb505b0
        r8 = 0x0000000000000000
        r9 = 0x00c9c9ce88340394
       r10 = 0x0000000000000014
       r11 = 0x0000000000000002
       r12 = 0x00007fec3cb51328
       r13 = 0x00007fec3cb52e60
       r14 = 0x00007febc0bd7aea
       r15 = 0x00007fec3d354618
       rip = 0x00007fec3fa2f98a  libcoreclr.so`IsIPInEpilog(_CONTEXT*, EECodeInfo*, int*) + 58 [inlined] IsIPInProlog(EECodeInfo*) + 3 at excep.cpp:7169:49
  libcoreclr.so`IsIPInEpilog(_CONTEXT*, EECodeInfo*, int*) + 55 at excep.cpp:7236:9
    rflags = 0x0000000000010202
        cs = 0x0000000000000033
        fs = 0x0000000000000000
        gs = 0x0000000000000000
        ss = 0x000000000000002b
        ds = 0x0000000000000000
        es = 0x0000000000000000

jkotas commented 8 months ago

The only ways for the callq 0x243610 instruction to crash with a segmentation fault are a stack overflow, (stack or code) memory getting unmapped, or (stack or code) memory protection being changed. It is not a stack overflow, since the rsp register is not at a page boundary. So the only possible explanation that I can think of is memory getting unmapped or memory protection being changed.
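
The page-boundary argument can be checked against the rsp value in the register dump above. The sketch below assumes 4 KiB pages (typical for x86-64 Linux, but not stated in this thread), and the helper names are made up for illustration. A `callq` pushes an 8-byte return address just below rsp, so the push can only touch a different page when rsp sits at the very start of its page.

```cpp
#include <cstdint>

// Assumption: 4 KiB pages, the usual size on x86-64 Linux.
constexpr std::uint64_t kPageSize = 0x1000;

// Offset of an address within its page.
constexpr std::uint64_t page_offset(std::uint64_t addr) {
    return addr & (kPageSize - 1);
}

// A call writes the 8-byte return address at rsp - 8. That write reaches
// into the page below only if rsp is within the first 8 bytes of its page.
constexpr bool call_push_crosses_page(std::uint64_t rsp) {
    return page_offset(rsp) < 8;
}

// rsp as reported by `register read` in this thread.
constexpr std::uint64_t kReportedRsp = 0x00007fec3cb505b0ULL;

static_assert(page_offset(kReportedRsp) == 0x5b0,
              "rsp is 0x5b0 bytes into its page");
static_assert(!call_push_crosses_page(kReportedRsp),
              "the return address lands on the page rsp already points into");
```

With an offset of 0x5b0 into the page, the pushed return address stays on the same page as rsp, which is consistent with ruling out a stack overflow at this instruction.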

adamrodger commented 8 months ago

I can share the interop code, but obviously I've had to obfuscate the names slightly. I can't share the C++ library itself.

The C++ headers expose a C-like interop:

extern "C" {
    typedef void *handle_t;

    handle_t CreateInstance();
    void ReleaseInstance(handle_t instance);

    int32_t AddOne(handle_t instance, int32_t arg1, double arg2, int32_t arg3, int32_t arg4);
    void AddTwo(handle_t instance, int32_t arg1, int32_t arg2, int32_t arg3, double arg4);
    void AddThree(handle_t instance, uint32_t arg1, char *arg2);
    void AddFour(handle_t instance, uint32_t arg1, const char *arg2, uint32_t arg3);
    void AddFive(handle_t instance, uint32_t arg1, double arg2, const char *arg3);
    void AddSix(handle_t instance, uint32_t arg1, double arg2, double arg3, double arg4, double arg5, double arg6, bool arg7, char *arg8, double arg9, bool arg10);
    void AddSeven(handle_t instance, uint32_t arg1, uint32_t arg2);

    void Calculate(handle_t instance);
    double GetResult(handle_t instance, uint32_t arg1);
}

And the .Net side interops to it using the new library import source generator:

internal static partial class Interop
{
    private const string LibraryName = "mylibrary";

    [LibraryImport(LibraryName)]
    public static partial InteropSafeHandle CreateInstance();

    [LibraryImport(LibraryName)]
    public static partial void ReleaseInstance(IntPtr handle);

    [LibraryImport(LibraryName, StringMarshalling = StringMarshalling.Utf8)]
    public static partial int AddOne(InteropSafeHandle handle, int arg1, double arg2, int arg3, int arg4);

    [LibraryImport(LibraryName, StringMarshalling = StringMarshalling.Utf8)]
    public static partial void AddTwo(InteropSafeHandle handle, int arg1, int arg2, int arg3, double arg4);

    [LibraryImport(LibraryName, StringMarshalling = StringMarshalling.Utf8)]
    public static partial void AddThree(InteropSafeHandle handle, uint arg1, string arg2);

    [LibraryImport(LibraryName, StringMarshalling = StringMarshalling.Utf8)]
    public static partial void AddFour(InteropSafeHandle handle, uint arg1, string arg2, uint arg3);

    [LibraryImport(LibraryName, StringMarshalling = StringMarshalling.Utf8)]
    public static partial void AddFive(InteropSafeHandle handle, uint arg1, double arg2, string arg3);

    [LibraryImport(LibraryName, StringMarshalling = StringMarshalling.Utf8)]
    public static partial void AddSix(InteropSafeHandle handle, uint arg1, double arg2, double arg3, double arg4, double arg5, double arg6, [MarshalAs(UnmanagedType.I1)] bool arg7, string arg8, double arg9, [MarshalAs(UnmanagedType.I1)] bool arg10);

    [LibraryImport(LibraryName)]
    public static partial void AddSeven(InteropSafeHandle handle, uint arg1, uint arg2);

    [LibraryImport(LibraryName)]
    public static partial void Calculate(InteropSafeHandle handle);

    [LibraryImport(LibraryName)]
    public static partial double GetResult(InteropSafeHandle handle, uint arg1);
}

internal class InteropSafeHandle : SafeHandleZeroOrMinusOneIsInvalid
{
    public InteropSafeHandle() : base(true)
    {
    }

    protected override bool ReleaseHandle()
    {
        Interop.ReleaseInstance(this.handle);
        return true;
    }
}

adamrodger commented 8 months ago

It's worth noting that, despite some strings going across the interop boundary, the C++ side never keeps a reference to any of those strings. All the other args should be copied by value (and blittable) anyway because they're just simple ints/bools/doubles.

The C# code to call it looks roughly like:

while (true)
{
    Calculate();

    // Force a full GC. Added for debugging only
    GC.Collect(GC.MaxGeneration, GCCollectionMode.Forced, blocking: true, compacting: true);

    // The segfault happens during this call
    GC.Collect(GC.MaxGeneration, GCCollectionMode.Aggressive, blocking: true, compacting: true);
}

private void Calculate()
{
    // this loads a big JSON file from disk or calls a REST API or something
    var data = LoadData();

    InteropSafeHandle handle = Interop.CreateInstance();

    foreach (var foo in data.Foo)
    {
        Interop.AddOne(handle, foo.One, foo.Two, foo.Three, foo.Four);
        // and so on with all the other Add methods...
    }

    Interop.Calculate(handle);
    Console.WriteLine($"{Interop.GetResult(handle)}");

    handle.Dispose();
}

There is no async or anything in this minimal test app, it's an entirely single-threaded console app.

Edit: Updated the example code to make it clear everything has definitely gone out of scope by the time the GC is initiated.

adamrodger commented 8 months ago

I've also left it running a few times and managed to get a segfault on the LoadData call also (i.e. when we're doing lots of managed memory allocation). The backtrace and everything looks exactly the same as above though so I won't post it again, but that's just extra info to hopefully show why we think it's something to do with when GC is triggered.

I also tried using the mcr.microsoft.com/dotnet/sdk:8.0-jammy image just in case it was a Debian-only problem, but that also segfaults on Ubuntu in the same way with the same backtrace. I'd try on an Alpine image as well but the interop won't work on there.

alexrp commented 8 months ago

@adamrodger what kind of CPU are you seeing these crashes on? Are you able to reproduce the issue on other machines? (This may not matter; it's just that some of the symptoms here are vaguely similar to a GC crash I experienced, which turned out to actually be a CPU defect.)

adamrodger commented 8 months ago

@alexrp It crashes both on my local machine in Docker running a recent Intel i7 and also in Kubernetes in production. Those are also Intel but not sure exactly which type.

Looks like they run on Cascade Lake processors in Kubernetes.

alexrp commented 8 months ago

Ok, the issue I had was with a 13900K, so if you're seeing these crashes on Cascade Lake, it seems quite unlikely that it would be the same issue. :thinking:

rmsimpson commented 8 months ago

> It's worth noting that, despite some strings going across the interop boundary, the C++ side never keeps a reference to any of those strings. All the other args should be copied by value (and blittable) anyway because they're just simple ints/bools/doubles.

Have you tried commenting out the AddXXX interop calls one at a time to see if any one particular call might be the culprit?

I've had very similar interop issues to this one, where I accidentally wrote past the end of a bounded object. It didn't cause an immediate crash, but it did trigger a crash during a GC operation.

adamrodger commented 8 months ago

@rmsimpson That's a good suggestion, I'll give that a go.

adamrodger commented 8 months ago

I've managed to get it to segfault with literally just the CreateInstance and ReleaseInstance interop calls if I manually constrain the memory on the container to quite small (only 28MB, any smaller and I get OOM when loading the input data).

The backtrace etc. are the same as the previously reported ones.

Those create/release calls really don't do very much, so that's really odd.
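
Since only these two entry points are needed to hit the crash, a shareable stand-in for the proprietary library could be as small as the sketch below. This is purely hypothetical: the `Instance` struct and both function bodies are invented here, only the two signatures come from the header posted earlier, and nothing in this thread shows that a stub like this reproduces the segfault.

```cpp
#include <cstdint>
#include <vector>

namespace {
// Hypothetical per-instance state; the real library's state is unknown.
struct Instance {
    std::vector<std::int32_t> values;
};
}  // namespace

extern "C" {
typedef void* handle_t;

// Allocates a new instance and hands it out as an opaque handle.
handle_t CreateInstance() {
    return new Instance();
}

// Frees an instance previously returned by CreateInstance.
// Passing a null handle is a no-op, because deleting nullptr is allowed.
void ReleaseInstance(handle_t instance) {
    delete static_cast<Instance*>(instance);
}
}
```

Built as a shared library, a stub along these lines would let the same .Net test app run without the proprietary binary, which would help show whether the crash depends on anything the real library does.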

rmsimpson commented 8 months ago

Are you running x86 or x64 on Linux? Or does it fail on both?
Are the functions declared in C++ with any explicit calling convention such as stdcall or cdecl? I know you obfuscated the functions to post here, but I'm wondering if the original code has explicit calling conventions that you may have neglected to post here. Compilations can sometimes be either cdecl or stdcall depending on CPU architecture and platform OS, unless the code explicitly declares one or the other.

adamrodger commented 8 months ago

It's running on x64 on Linux. I've not tried x86 because the C++ binary I have is built for x64 only. There are no calling conventions defined as far as I can see.

I think if the calling conventions were wrong wouldn't it fail immediately? What I actually experience is quite intermittent - the loop can get through multiple iterations before it fails sometimes, and other times it can fail on the first one or two.

rmsimpson commented 8 months ago

You'd think. I don't suppose you recently switched from DllImport to LibraryImport?

adamrodger commented 8 months ago

I've tried it with DllImport, LibraryImport and raw function pointers, all with the same result. I suppose they're all just pretty much different ways to do the same thing though.

huoyaoyuan commented 8 months ago

> it's just that some of the symptoms here are vaguely similar to a GC crash I experienced, which turned out to actually be a CPU defect.

@alexrp My 13900K also started hitting a defect this year, just after I returned from the New Year holidays. Compiling the CLR fails at a random position and can't get through a whole pass successfully. Turning off the aggressive turbo solves the problem, but I can't reproduce the problem with y-cruncher or other stress tests.

Anyway, if it's a CPU defect, the failure should be totally random, and only under really heavy load.

adamrodger commented 8 months ago

I've written equivalent apps in both Rust and C++ which call the same library via interop and run the same way, and neither of those apps ever gets a segfault.

I can make those apps pause when I run them with heavily constrained memory, and they unpause as soon as I increase the memory (using docker run -m and docker update -m) which is perhaps a clue? The segfault seems to happen in the .Net GC so it makes sense to me that anything potentially causing weird behaviour around memory could trigger this problem in a managed runtime.

The segfault only happens in .Net and only since the upgrade to .Net 8. For now we've had to downgrade to .Net 7 and it's been stable since, but of course that version will go out of support very soon.

I've also run under Valgrind just to make sure there are no obvious memory problems with the library and nothing shows up there either.

adamrodger commented 7 months ago

I've done a bit of revisiting on this now that .Net Runtime 8.0.3 is out and I can still get it to fail. Every stack trace always has this at the top, to do with handling suspended threads:

(lldb) thread backtrace
* thread #1, name = 'dotnet', stop reason = signal SIGSEGV
  * frame #0: 0x00007f32654e594a libcoreclr.so`IsIPInEpilog(_CONTEXT*, EECodeInfo*, int*) [inlined] IsIPInProlog(pCodeInfo=0x00007f3265f7b328) at excep.cpp:7169:49
    frame #1: 0x00007f32654e5947 libcoreclr.so`IsIPInEpilog(pContextToCheck=0x00007f3265f7ce60, pCodeInfo=0x00007f3265f7b328, pSafeToInjectThreadAbort=YES) at excep.cpp:7236:9
    frame #2: 0x00007f326568fce4 libcoreclr.so`HandleSuspensionForInterruptedThread(interruptedContext=0x00007f3265f7ce60) at threadsuspend.cpp:5914:13
    frame #3: 0x00007f32658c503c libcoreclr.so`inject_activation_handler(code=<unavailable>, siginfo=<unavailable>, context=0x00007f3265f7dac0) at signal.cpp:840:13
    frame #4: 0x00007f3265aaf050 libc.so.6`___lldb_unnamed_symbol3252 + 1

I've also gone back and tried .Net runtimes 8.0.0, 8.0.1 and 8.0.2 but they all have the same problem with an intermittent segmentation fault with the same backtrace, always to do with suspended threads during garbage collection.

I've also added extra logging just to 100% confirm that the unmanaged code isn't somehow being released twice or something (like if somehow the finalizer was still running even after Dispose was called on the SafeHandle) and can confirm it definitely only releases the memory once.
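
The same check could also be made on the native side, for example in a thin shim around the library. The sketch below is hypothetical: `HandleTracker` and its methods are not part of the real library or of the logging described above. It records every handle that is handed out and reports any release of a handle that is not currently live. A mutex is used because a SafeHandle can be released from the finalizer thread as well as from the thread that calls Dispose.

```cpp
#include <cstddef>
#include <mutex>
#include <unordered_set>

// Hypothetical helper for detecting double releases of native handles.
class HandleTracker {
public:
    // Record a handle that was just returned by CreateInstance.
    void on_create(void* handle) {
        std::lock_guard<std::mutex> lock(mutex_);
        live_.insert(handle);
    }

    // Record a ReleaseInstance call. Returns false if the handle is not
    // currently live, i.e. it was never created or was already released.
    bool on_release(void* handle) {
        std::lock_guard<std::mutex> lock(mutex_);
        return live_.erase(handle) == 1;
    }

    // Number of handles created but not yet released.
    std::size_t live_count() const {
        std::lock_guard<std::mutex> lock(mutex_);
        return live_.size();
    }

private:
    mutable std::mutex mutex_;
    std::unordered_set<void*> live_;
};
```

Erasing the handle on release means an address that the allocator later reuses for a new instance is tracked correctly as a fresh handle.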

jkotas commented 7 months ago

if I manually constrain the memory on the container to quite small (only 28MB

It is likely that the OOM killer is crashing your app under the 28MB memory constraint. You can check the OOM killer logs to see whether the OOM killer is responsible for the crash.

The OOM killer refuses to give the app a new page of memory at random points. The .NET runtime is unable to reliably report Out Of Memory errors when the process is crashed by the OOM killer.
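One way to check for OOM kills from inside the container, sketched in C. This assumes the container uses cgroup v2, where the kernel exposes an `oom_kill` counter in `/sys/fs/cgroup/memory.events`. The path and the function names `parse_oom_kill_count` and `read_oom_kill_count` are illustrative assumptions, not something taken from this thread.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Scan "key value" lines (the memory.events layout) for the oom_kill key.
 * Returns the counter value, or -1 if the key is not present. */
long parse_oom_kill_count(const char *text)
{
    assert(text != NULL);
    const char *line = text;
    while (*line != '\0') {
        long value;
        /* Matches only the exact key "oom_kill", not "oom" or "oom_group_kill". */
        if (sscanf(line, "oom_kill %ld", &value) == 1)
            return value;
        const char *newline = strchr(line, '\n');
        if (newline == NULL)
            break;
        line = newline + 1;
    }
    return -1;
}

/* Read the file at `path` (assumed: /sys/fs/cgroup/memory.events) and
 * return its oom_kill counter, or -1 if it cannot be read or parsed. */
long read_oom_kill_count(const char *path)
{
    assert(path != NULL);
    FILE *f = fopen(path, "r");
    if (f == NULL)
        return -1;
    char buf[1024];
    size_t n = fread(buf, 1, sizeof buf - 1, f);
    fclose(f);
    buf[n] = '\0';
    return parse_oom_kill_count(buf);
}
```

A counter that stays at 0 across a crash would suggest the OOM killer was not involved. A value that increases points the other way.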

adamrodger commented 7 months ago

The output is very different when the process is killed: the process has a different return code and it prints "Killed" to the console instead of reporting a segfault and core dump.
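The difference in return codes can be shown with a short C sketch. The OOM killer terminates a process with SIGKILL, which a shell reports as exit code 137 (128 + 9), while a segmentation fault is reported as 139 (128 + 11). The helper names below are hypothetical, and the child processes only simulate the two kinds of termination.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Fork a child that terminates itself with `sig` and return the raw
 * wait status observed by the parent. */
int status_of_child_killed_by(int sig)
{
    pid_t pid = fork();
    assert(pid >= 0);
    if (pid == 0) {
        sigset_t set;
        sigemptyset(&set);
        sigaddset(&set, sig);
        sigprocmask(SIG_UNBLOCK, &set, NULL); /* make sure it is deliverable */
        signal(sig, SIG_DFL);                 /* harmless failure for SIGKILL */
        raise(sig);
        _exit(0);                             /* not reached for fatal signals */
    }
    int status = 0;
    pid_t waited = waitpid(pid, &status, 0);
    assert(waited == pid);
    return status;
}

/* Translate a wait status into the exit code a POSIX shell would show:
 * 128 + signal number for a signal death, otherwise the exit status. */
int shell_exit_code(int status)
{
    if (WIFSIGNALED(status))
        return 128 + WTERMSIG(status);
    if (WIFEXITED(status))
        return WEXITSTATUS(status);
    return -1;
}
```

Seeing 139 rather than 137 for the crashed process is consistent with a genuine SIGSEGV rather than an OOM kill.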

Also it always happens at exactly the same line of code in the .Net Runtime. That would be remarkably coincidental if that was due to OOM.

This segfault happens even when there's a lot of memory, it just becomes more likely to happen as memory is limited. For the purposes of recreating it I run with constrained memory just to make it happen more quickly.

Edit: Also the very small memory was in the equivalent Rust app, not the .Net one. I run the .Net app that can reproduce this reliably with 128MB.

me-viper commented 6 months ago

I've got a similar issue. The odd thing is that the segfault happens out of nowhere. The app is a simple web service with some interop involved, but at the time of the crash it is doing nothing (no requests, no background work) and it's not even near its memory limits. This keeps happening roughly once every day or two. I've got a bunch of memory dumps but the picture is always the same:

(7.be): Signal SIGSEGV (Segmentation fault) code 128 at 0x0
*** WARNING: Unable to verify timestamp for libcoreclr.so
libcoreclr!GetThread+0x8 [inlined in libcoreclr!HandleSuspensionForInterruptedThread+0x2a]:
00007fd2`16502ada 666648e82ecb2a00 call    libcoreclr+0x65e610 (00007fd2`167af610)

0:000> !dumpstack
*** WARNING: Unable to verify timestamp for doublemapper (deleted)
OS Thread Id: 0xbe (0)
TEB information is not available so a stack size of 0xFFFF is assumed
Current frame: libcoreclr!HandleSuspensionForInterruptedThread + 0x2a [/__w/1/s/src/coreclr/vm/threadsuspend.h:5852]
Child-SP         RetAddr          Caller, Callee
00007FD2108C16A0 00007fd216720491 libcoreclr!ExecutionManager::IsManagedCodeWorker + 0x101 [/__w/1/s/src/coreclr/vm/codeman.cpp:4608], calling libcoreclr!EEJitManager::FindMethodCode [/__w/1/s/src/coreclr/vm/codeman.cpp:3995]
00007FD2108C16C0 00007fd21672023a libcoreclr!ExecutionManager::IsManagedCode + 0x9a [/__w/1/s/src/coreclr/vm/codeman.cpp:4531], calling libcoreclr!ExecutionManager::IsManagedCodeWorker [/__w/1/s/src/coreclr/vm/codeman.cpp:4588]
00007FD2108C1710 00007fd21673808c libcoreclr!inject_activation_handler + 0x9c [/__w/1/s/src/coreclr/pal/src/exception/signal.cpp:841], calling libcoreclr!InvokeActivationHandler [/__w/1/s/src/coreclr/pal/src/exception/signal.cpp:788]

!threads
ThreadCount:      34
UnstartedThread:  0
BackgroundThread: 8
PendingThread:    0
DeadThread:       25
Hosted Runtime:   no
                                                                                                            Lock  
 DBG   ID     OSID ThreadOBJ           State GC Mode     GC Alloc Context                  Domain           Count Apt Exception
   7    1        7 0000558152DE87C0  2020020 Preemptive  0000000000000000:0000000000000000 0000558152e36ef0 -00001 Ukn 
   1    2        c 0000558152E2B320    21220 Preemptive  00007FD10A8036D0:00007FD10A804298 0000558152e36ef0 -00001 Ukn (Finalizer) 
  12    4        e 0000558152FEAFC0  2021220 Preemptive  0000000000000000:0000000000000000 0000558152e36ef0 -00001 Ukn 
  13    6       10 00007FD104006610  3021220 Preemptive  0000000000000000:0000000000000000 0000558152e36ef0 -00001 Ukn (Threadpool Worker) 
   2    8       12 00005581530949E0  2021220 Preemptive  0000000000000000:0000000000000000 0000558152e36ef0 -00001 Ukn 
   5    9       1a 00005581530A0500  2021220 Preemptive  00007FD10D875758:00007FD10D876770 0000558152e36ef0 -00001 Ukn 
  19   10       1b 00005581531B2DC0    21220 Preemptive  0000000000000000:0000000000000000 0000558152e36ef0 -00001 Ukn 
XXXX   14        0 00007FD100251550  1031820 Preemptive  0000000000000000:0000000000000000 0000558152e36ef0 -00001 Ukn (Threadpool Worker) 
XXXX    3        0 00007FD1000F2510  1031820 Preemptive  0000000000000000:0000000000000000 0000558152e36ef0 -00001 Ukn (Threadpool Worker) 
XXXX   13        0 00007FD0F8006660  1031820 Preemptive  0000000000000000:0000000000000000 0000558152e36ef0 -00001 Ukn (Threadpool Worker) 
XXXX   16        0 00007FD0F800A810  1031820 Preemptive  0000000000000000:0000000000000000 0000558152e36ef0 -00001 Ukn (Threadpool Worker) 
XXXX   15        0 00007FD0F800B680  1031820 Preemptive  0000000000000000:0000000000000000 0000558152e36ef0 -00001 Ukn (Threadpool Worker) 
XXXX   17        0 00007FD0F0053AA0  1031820 Preemptive  0000000000000000:0000000000000000 0000558152e36ef0 -00001 Ukn (Threadpool Worker) 
XXXX   19        0 00007FD1000A9160  1031820 Preemptive  0000000000000000:0000000000000000 0000558152e36ef0 -00001 Ukn (Threadpool Worker) 
XXXX   20        0 00007FD0FC0050C0  1031820 Preemptive  0000000000000000:0000000000000000 0000558152e36ef0 -00001 Ukn (Threadpool Worker) 
XXXX   21        0 00007FD1001BB420  1031820 Preemptive  0000000000000000:0000000000000000 0000558152e36ef0 -00001 Ukn (Threadpool Worker) 
XXXX   18        0 00007FD0F80BC670  1031820 Preemptive  0000000000000000:0000000000000000 0000558152e36ef0 -00001 Ukn (Threadpool Worker) 
XXXX   22        0 00007FD0F80B99F0  1031820 Preemptive  0000000000000000:0000000000000000 0000558152e36ef0 -00001 Ukn (Threadpool Worker) 
XXXX   23        0 00007FD0F80BAE70  1031820 Preemptive  0000000000000000:0000000000000000 0000558152e36ef0 -00001 Ukn (Threadpool Worker) 
XXXX   25        0 00007FD0F800E3F0  1031820 Preemptive  0000000000000000:0000000000000000 0000558152e36ef0 -00001 Ukn (Threadpool Worker) 
XXXX   26        0 00007FD1000CA940  1031820 Preemptive  0000000000000000:0000000000000000 0000558152e36ef0 -00001 Ukn (Threadpool Worker) 
XXXX   24        0 00007FD1000CC0B0  1031820 Preemptive  0000000000000000:0000000000000000 0000558152e36ef0 -00001 Ukn (Threadpool Worker) 
XXXX   27        0 00007FD0F0012700  1031820 Preemptive  0000000000000000:0000000000000000 0000558152e36ef0 -00001 Ukn (Threadpool Worker) 
XXXX   28        0 00007FD0F006E380  1031820 Preemptive  0000000000000000:0000000000000000 0000558152e36ef0 -00001 Ukn (Threadpool Worker) 
XXXX   29        0 00007FD0F0017800  1031820 Preemptive  0000000000000000:0000000000000000 0000558152e36ef0 -00001 Ukn (Threadpool Worker) 
XXXX   30        0 00007FD0F002F510  1031820 Preemptive  0000000000000000:0000000000000000 0000558152e36ef0 -00001 Ukn (Threadpool Worker) 
XXXX   31        0 00007FD0F0030C80  1031820 Preemptive  0000000000000000:0000000000000000 0000558152e36ef0 -00001 Ukn (Threadpool Worker) 
XXXX   32        0 00007FD10024D9A0  1031820 Preemptive  0000000000000000:0000000000000000 0000558152e36ef0 -00001 Ukn (Threadpool Worker) 
XXXX   33        0 00007FD1000CD820  1031820 Preemptive  0000000000000000:0000000000000000 0000558152e36ef0 -00001 Ukn (Threadpool Worker) 
XXXX   34        0 00007FD1000CEF90  1031820 Preemptive  0000000000000000:0000000000000000 0000558152e36ef0 -00001 Ukn (Threadpool Worker) 
XXXX   35        0 00007FD0F8009040  1031820 Preemptive  0000000000000000:0000000000000000 0000558152e36ef0 -00001 Ukn (Threadpool Worker) 
XXXX   36        0 00007FD0F00323F0  1031820 Preemptive  0000000000000000:0000000000000000 0000558152e36ef0 -00001 Ukn (Threadpool Worker) 
   6   37       bd 00007FD0F0033B60  1021220 Preemptive  00007FD10D8804D8:00007FD10D8804D8 0000558152e36ef0 -00001 Ukn (GC) (Threadpool Worker) 
   0   38       be 00007FD0F80AE700  1021222 Cooperative 00007FD10D87F2E0:00007FD10D880178 0000558152e36ef0 -00001 Ukn (Threadpool Worker)

dotnet --info

Host:
  Version:      8.0.2
  Architecture: x64
  Commit:       1381d5ebd2
  RID:          linux-x64

.NET SDKs installed:
  No SDKs were found.

.NET runtimes installed:
  Microsoft.AspNetCore.App 8.0.2 [/usr/share/dotnet/shared/Microsoft.AspNetCore.App]
  Microsoft.NETCore.App 8.0.2 [/usr/share/dotnet/shared/Microsoft.NETCore.App]

teh13th commented 6 months ago

I've got a similar issue too: the container crashes with a segmentation fault if the app makes interop calls, since migrating from net6 to net8. Docker image used: aspnet:8.0-bookworm-slim.

adamrodger commented 6 months ago

@teh13th Are you able to capture a stack trace or anything to see if it fails at the same point as my code?

teh13th commented 6 months ago

@adamrodger, I only have a Linux core dump. How can I get the stack from it using WinDbg?

adamrodger commented 6 months ago

Ah I've only done that on Linux using lldb with the dotnet-symbol tool to download the symbols that make the backtrace meaningful.

agocke commented 4 months ago

@AaronRobinsonMSFT Any ideas/suggestions?

AaronRobinsonMSFT commented 3 months ago

This doesn't look interop specific. Rather, this is about thread suspension management during a GC. The generated interop transitions here are idempotent - the same no matter the configuration or settings. The fact that this is inconsistent would imply the generated interop code is either subtly wrong - perhaps missing some barrier or something - or, more likely, there is something amiss in the thread suspension logic path - signal handling down to the thread manager.

jkotas commented 3 months ago

@adamrodger Would you be able to check whether the C++ library registers any signal handlers with SA_ONSTACK flag?

The .NET runtime uses the SIGRTMIN signal for GC thread suspension. If some other component registers a signal handler for this signal and forwards the signal to the .NET runtime while running on a small alternate stack, it can lead to the crash that you are seeing.

adamrodger commented 3 months ago

I think something does happen with signals, but not that one. I've seen a SIGURG but no others as far as I know.

Thanks for continuing to look into this by the way 👍

dotnet-policy-service[bot] commented 3 months ago

Tagging subscribers to this area: @mangod9 See info in area-owners.md if you want to be subscribed.

agocke commented 3 months ago

Also @VSadov for threading

AaronRobinsonMSFT commented 3 days ago

@adamrodger We've updated our documentation about golang support. Can you confirm there is no golang in this process?

dotnet-policy-service[bot] commented 3 days ago

This issue has been marked needs-author-action and may be missing some important information.

me-viper commented 3 days ago

@adamrodger We've updated our documentation about golang support. Can you confirm there is no golang in this process?

There is in my case. I suspected interop with golang is the problem. Thanks for confirmation.

adamrodger commented 3 days ago

There is Golang in the library with which we're doing the interop 👍
