Open adamrodger opened 8 months ago
Just to confirm our theory that the segfault happens during GC, I added the following at the end of the loop after the interop code has been disposed/freed:
GC.Collect(GC.MaxGeneration, GCCollectionMode.Forced, blocking: true, compacting: true)
GC.Collect(GC.MaxGeneration, GCCollectionMode.Aggressive, blocking: true, compacting: true)
The segfault is still intermittent, but now it happens during the Aggressive GC (the second one). The core dump name has also changed: the dump is now named core..NET Finalizer.8 and the thread backtrace is:
* thread #1, name = 'dotnet', stop reason = signal SIGSEGV
* frame #0: 0x00007fec3fa2f98a libcoreclr.so`IsIPInEpilog(_CONTEXT*, EECodeInfo*, int*) [inlined] IsIPInProlog(pCodeInfo=0x00007fec3cb51328) at excep.cpp:7169:49
frame #1: 0x00007fec3fa2f987 libcoreclr.so`IsIPInEpilog(pContextToCheck=0x00007fec3cb52e60, pCodeInfo=0x00007fec3cb51328, pSafeToInjectThreadAbort=YES) at excep.cpp:7236:9
frame #2: 0x00007fec3fbd9ca4 libcoreclr.so`HandleSuspensionForInterruptedThread(interruptedContext=0x00007fec3cb52e60) at threadsuspend.cpp:5914:13
frame #3: 0x00007fec3fe0f08c libcoreclr.so`inject_activation_handler(code=<unavailable>, siginfo=<unavailable>, context=0x00007fec3cb53ac0) at signal.cpp:840:13
frame #4: 0x00007fec3fff9050 libc.so.6`___lldb_unnamed_symbol3252 + 1
frame #5: 0x00007febc0bd7aea
frame #6: 0x00007fec3fcc7ba6 libcoreclr.so`FastCallFinalizeWorker at calldescrworkeramd64.S:30
frame #7: 0x00007fec3fa84213 libcoreclr.so`MethodTable::CallFinalizer(Object*) at methodtable.cpp:4770:5
frame #8: 0x00007fec3fa841c9 libcoreclr.so`MethodTable::CallFinalizer(obj=0x00007feb92834630) at methodtable.cpp:4888:5
frame #9: 0x00007fec3fb3d955 libcoreclr.so`FinalizerThread::FinalizeAllObjects() [inlined] CallFinalizer(obj=0x00007feb92834630) at finalizerthread.cpp:75:9
frame #10: 0x00007fec3fb3d8fd libcoreclr.so`FinalizerThread::FinalizeAllObjects() at finalizerthread.cpp:104:9
frame #11: 0x00007fec3fb3dba5 libcoreclr.so`FinalizerThread::FinalizerThreadWorker(args=<unavailable>) at finalizerthread.cpp:348:9
frame #12: 0x00007fec3face7c5 libcoreclr.so`ManagedThreadBase_DispatchOuter(ManagedThreadCallState*) [inlined] ManagedThreadBase_DispatchInner(pCallState=<unavailable>) at threads.cpp:7222:5
frame #13: 0x00007fec3face7c3 libcoreclr.so`ManagedThreadBase_DispatchOuter(ManagedThreadCallState*) at threads.cpp:7266:9
frame #14: 0x00007fec3face788 libcoreclr.so`ManagedThreadBase_DispatchOuter(ManagedThreadCallState*) [inlined] ManagedThreadBase_DispatchOuter(this=<unavailable>, pParam=<unavailable>)::$_0::operator()(ManagedThreadBase_DispatchOuter(ManagedThreadCallState*)::TryArgs*) const::'lambda'(Param*)::operator()(Param*) const at threads.cpp:7424:13
frame #15: 0x00007fec3face788 libcoreclr.so`ManagedThreadBase_DispatchOuter(ManagedThreadCallState*) at threads.cpp:7426:9
frame #16: 0x00007fec3face765 libcoreclr.so`ManagedThreadBase_DispatchOuter(pCallState=0x00007fec3d353de0) at threads.cpp:7450:5
frame #17: 0x00007fec3facedcd libcoreclr.so`ManagedThreadBase::FinalizerBase(void (*)(void*)) [inlined] ManagedThreadBase_NoADTransition(pTarget=<unavailable>, filterType=FinalizerThread)(void*), UnhandledExceptionLocation) at threads.cpp:7494:5
frame #18: 0x00007fec3facedb5 libcoreclr.so`ManagedThreadBase::FinalizerBase(pTarget=<unavailable>)(void*)) at threads.cpp:7513:5
frame #19: 0x00007fec3fb3de48 libcoreclr.so`FinalizerThread::FinalizerThreadStart(args=<unavailable>) at finalizerthread.cpp:398:17
frame #20: 0x00007fec3fe3effe libcoreclr.so`CorUnix::CPalThread::ThreadEntry(pvParam=0x000055accab9f510) at thread.cpp:1760:16
frame #21: 0x00007fec40046134 libc.so.6`___lldb_unnamed_symbol3514 + 708
frame #22: 0x00007fec400c67dc libc.so.6`___lldb_unnamed_symbol3939 + 11
Could you please disassemble the code around the crash location and find the exact instruction where the crash occurs?
cc @VSadov
Disassembling at the first frame gives:
(lldb) disassemble -a 0x00007fec3fa2f98a
libcoreclr.so`IsIPInEpilog:
0x7fec3fa2f950 <+0>: pushq %rbp
0x7fec3fa2f951 <+1>: movq %rsp, %rbp
0x7fec3fa2f954 <+4>: pushq %r15
0x7fec3fa2f956 <+6>: pushq %r14
0x7fec3fa2f958 <+8>: pushq %r13
0x7fec3fa2f95a <+10>: pushq %r12
0x7fec3fa2f95c <+12>: pushq %rbx
0x7fec3fa2f95d <+13>: subq $0xd38, %rsp ; imm = 0xD38
0x7fec3fa2f964 <+20>: movq %rdx, %rbx
0x7fec3fa2f967 <+23>: movq %rsi, %r12
0x7fec3fa2f96a <+26>: movq %rdi, %r13
0x7fec3fa2f96d <+29>: movq %fs:0x28, %rax
0x7fec3fa2f976 <+38>: movq %rax, -0x30(%rbp)
0x7fec3fa2f97a <+42>: movq 0xf8(%rdi), %r14
0x7fec3fa2f981 <+49>: movl $0x1, (%rdx)
0x7fec3fa2f987 <+55>: movq %rsi, %rdi
-> 0x7fec3fa2f98a <+58>: callq 0x243610 ; EECodeInfo::GetFunctionEntry at jitinterface.cpp:14478
0x7fec3fa2f98f <+63>: movq 0x8(%r12), %rcx
0x7fec3fa2f994 <+68>: movq (%rcx), %rcx
0x7fec3fa2f997 <+71>: movl 0x8(%rax), %eax
0x7fec3fa2f99a <+74>: movzbl 0x1(%rcx,%rax), %eax
0x7fec3fa2f99f <+79>: cmpl %eax, 0x28(%r12)
0x7fec3fa2f9a4 <+84>: jae 0x2079ad ; <+93> at excep.cpp:7259:15
0x7fec3fa2f9a6 <+86>: xorl %eax, %eax
0x7fec3fa2f9a8 <+88>: jmp 0x207ab3 ; <+355> at excep.cpp:7306:1
0x7fec3fa2f9ad <+93>: movq $0x0, -0xd60(%rbp)
0x7fec3fa2f9b8 <+104>: movq %r12, %rdi
0x7fec3fa2f9bb <+107>: callq 0x243610 ; EECodeInfo::GetFunctionEntry at jitinterface.cpp:14478
0x7fec3fa2f9c0 <+112>: movq %rax, %r15
0x7fec3fa2f9c3 <+115>: movq 0x8(%r12), %rax
0x7fec3fa2f9c8 <+120>: movq (%rax), %r12
0x7fec3fa2f9cb <+123>: leaq -0xb50(%rbp), %rdi
0x7fec3fa2f9d2 <+130>: movl $0xb20, %edx ; imm = 0xB20
0x7fec3fa2f9d7 <+135>: xorl %esi, %esi
0x7fec3fa2f9d9 <+137>: callq 0x65e600 ; symbol stub for: memset
0x7fec3fa2f9de <+142>: leaq -0xc50(%rbp), %rdi
0x7fec3fa2f9e5 <+149>: movl $0x100, %edx ; imm = 0x100
0x7fec3fa2f9ea <+154>: movq %r13, %rsi
0x7fec3fa2f9ed <+157>: callq 0x65e660 ; symbol stub for: memcpy
0x7fec3fa2f9f2 <+162>: xorps %xmm0, %xmm0
0x7fec3fa2f9f5 <+165>: movaps %xmm0, -0xc60(%rbp)
0x7fec3fa2f9fc <+172>: movaps %xmm0, -0xc70(%rbp)
0x7fec3fa2fa03 <+179>: movaps %xmm0, -0xc80(%rbp)
0x7fec3fa2fa0a <+186>: movaps %xmm0, -0xc90(%rbp)
0x7fec3fa2fa11 <+193>: movaps %xmm0, -0xca0(%rbp)
0x7fec3fa2fa18 <+200>: movaps %xmm0, -0xcb0(%rbp)
0x7fec3fa2fa1f <+207>: movaps %xmm0, -0xcc0(%rbp)
0x7fec3fa2fa26 <+214>: movaps %xmm0, -0xcd0(%rbp)
0x7fec3fa2fa2d <+221>: movaps %xmm0, -0xce0(%rbp)
0x7fec3fa2fa34 <+228>: movaps %xmm0, -0xcf0(%rbp)
0x7fec3fa2fa3b <+235>: movaps %xmm0, -0xd00(%rbp)
0x7fec3fa2fa42 <+242>: movaps %xmm0, -0xd10(%rbp)
0x7fec3fa2fa49 <+249>: movaps %xmm0, -0xd20(%rbp)
0x7fec3fa2fa50 <+256>: movaps %xmm0, -0xd30(%rbp)
0x7fec3fa2fa57 <+263>: movaps %xmm0, -0xd40(%rbp)
0x7fec3fa2fa5e <+270>: movaps %xmm0, -0xd50(%rbp)
0x7fec3fa2fa65 <+277>: leaq -0xd50(%rbp), %rax
0x7fec3fa2fa6c <+284>: leaq -0xd60(%rbp), %r10
0x7fec3fa2fa73 <+291>: leaq -0xd58(%rbp), %r9
0x7fec3fa2fa7a <+298>: movl $0x1, %edi
0x7fec3fa2fa7f <+303>: movq %r12, %rsi
0x7fec3fa2fa82 <+306>: movq %r14, %rdx
0x7fec3fa2fa85 <+309>: movq %r15, %rcx
0x7fec3fa2fa88 <+312>: leaq -0xc50(%rbp), %r8
0x7fec3fa2fa8f <+319>: pushq %rax
0x7fec3fa2fa90 <+320>: pushq %r10
0x7fec3fa2fa92 <+322>: callq 0x3e2f70 ; RtlVirtualUnwind_Wrapper at excepamd64.cpp:151
0x7fec3fa2fa97 <+327>: addq $0x10, %rsp
0x7fec3fa2fa9b <+331>: movq %rax, %rcx
0x7fec3fa2fa9e <+334>: testq %rax, %rax
0x7fec3fa2faa1 <+337>: sete %al
0x7fec3fa2faa4 <+340>: orq -0xca8(%rbp), %rcx
0x7fec3fa2faab <+347>: jne 0x207ab3 ; <+355> at excep.cpp:7306:1
0x7fec3fa2faad <+349>: movl $0x0, (%rbx)
0x7fec3fa2fab3 <+355>: movq %fs:0x28, %rcx
0x7fec3fa2fabc <+364>: cmpq -0x30(%rbp), %rcx
0x7fec3fa2fac0 <+368>: jne 0x207ad4 ; <+388> at excep.cpp
0x7fec3fa2fac2 <+370>: addq $0xd38, %rsp ; imm = 0xD38
0x7fec3fa2fac9 <+377>: popq %rbx
0x7fec3fa2faca <+378>: popq %r12
0x7fec3fa2facc <+380>: popq %r13
0x7fec3fa2face <+382>: popq %r14
0x7fec3fa2fad0 <+384>: popq %r15
0x7fec3fa2fad2 <+386>: popq %rbp
0x7fec3fa2fad3 <+387>: retq
0x7fec3fa2fad4 <+388>: callq 0x65e5f0 ; symbol stub for: __stack_chk_fail
(lldb)
Could you please dump the register values? (info registers)
but only once the interop has happened at least once. I've verified that if I stub out the interop calls, the app never segfaults.
Can we see the definitions of the C++ function and the .NET import definition?
Since this happens after an interop call has occurred, it usually means something in the interop corrupted something, and the difference in behavior between .NET 7 and 8 could just be due to timing, or to memory being arranged, rearranged, allocated or collected differently.
Hopefully this is the correct registers command:
(lldb) register read
General Purpose Registers:
rax = 0x4530139429087c00
rbx = 0x00007fec3cb51324
rcx = 0x00007fec3d3539a8
rdx = 0x00007fec3cb51324
rdi = 0x00007fec3cb51328
rsi = 0x00007fec3cb51328
rbp = 0x00007fec3cb51310
rsp = 0x00007fec3cb505b0
r8 = 0x0000000000000000
r9 = 0x00c9c9ce88340394
r10 = 0x0000000000000014
r11 = 0x0000000000000002
r12 = 0x00007fec3cb51328
r13 = 0x00007fec3cb52e60
r14 = 0x00007febc0bd7aea
r15 = 0x00007fec3d354618
rip = 0x00007fec3fa2f98a libcoreclr.so`IsIPInEpilog(_CONTEXT*, EECodeInfo*, int*) + 58 [inlined] IsIPInProlog(EECodeInfo*) + 3 at excep.cpp:7169:49
libcoreclr.so`IsIPInEpilog(_CONTEXT*, EECodeInfo*, int*) + 55 at excep.cpp:7236:9
rflags = 0x0000000000010202
cs = 0x0000000000000033
fs = 0x0000000000000000
gs = 0x0000000000000000
ss = 0x000000000000002b
ds = 0x0000000000000000
es = 0x0000000000000000
The only ways for the callq 0x243610 instruction to crash with a segmentation fault are a stack overflow, or (stack or code) memory getting unmapped, or (stack or code) memory protection being changed. It is not a stack overflow, since the rsp register is not at a page boundary. So the only possible explanation I can think of is memory getting unmapped or memory protection being changed.
I can share the interop code, but obviously I've had to obfuscate the names slightly. I can't share the C++ library itself.
The C++ headers expose a C-like interop:
extern "C" {
typedef void *handle_t;
handle_t CreateInstance();
void ReleaseInstance(handle_t instance);
int32_t AddOne(handle_t instance, int32_t arg1, double arg2, int32_t arg3, int32_t arg4);
void AddTwo(handle_t instance, int32_t arg1, int32_t arg2, int32_t arg3, double arg4);
void AddThree(handle_t instance, uint32_t arg1, char *arg2);
void AddFour(handle_t instance, uint32_t arg1, const char *arg2, uint32_t arg3);
void AddFive(handle_t instance, uint32_t arg1, double arg2, const char *arg3);
void AddSix(handle_t instance, uint32_t arg1, double arg2, double arg3, double arg4, double arg5, double arg6, bool arg7, char *arg8, double arg9, bool arg10);
void AddSeven(handle_t instance, uint32_t arg1, uint32_t arg2);
void Calculate(handle_t instance);
double GetResult(handle_t instance, uint32_t arg1);
}
And the .Net side interops to it using the new library import source generator:
internal static partial class Interop
{
private const string LibraryName = "mylibrary";
[LibraryImport(LibraryName)]
public static partial InteropSafeHandle CreateInstance();
[LibraryImport(LibraryName)]
public static partial void ReleaseInstance(IntPtr handle);
[LibraryImport(LibraryName, StringMarshalling = StringMarshalling.Utf8)]
public static partial int AddOne(InteropSafeHandle handle, int arg1, double arg2, int arg3, int arg4);
[LibraryImport(LibraryName, StringMarshalling = StringMarshalling.Utf8)]
public static partial void AddTwo(InteropSafeHandle handle, int arg1, int arg2, int arg3, double arg4);
[LibraryImport(LibraryName, StringMarshalling = StringMarshalling.Utf8)]
public static partial void AddThree(InteropSafeHandle handle, uint arg1, string arg2);
[LibraryImport(LibraryName, StringMarshalling = StringMarshalling.Utf8)]
public static partial void AddFour(InteropSafeHandle handle, uint arg1, string arg2, uint arg3);
[LibraryImport(LibraryName, StringMarshalling = StringMarshalling.Utf8)]
public static partial void AddFive(InteropSafeHandle handle, uint arg1, double arg2, string arg3);
[LibraryImport(LibraryName, StringMarshalling = StringMarshalling.Utf8)]
public static partial void AddSix(InteropSafeHandle handle, uint arg1, double arg2, double arg3, double arg4, double arg5, double arg6, [MarshalAs(UnmanagedType.I1)] bool arg7, string arg8, double arg9, [MarshalAs(UnmanagedType.I1)] bool arg10);
[LibraryImport(LibraryName)]
public static partial void AddSeven(InteropSafeHandle handle, uint arg1, uint arg2);
[LibraryImport(LibraryName)]
public static partial void Calculate(InteropSafeHandle handle);
[LibraryImport(LibraryName)]
public static partial double GetResult(InteropSafeHandle handle, uint arg1);
}
internal class InteropSafeHandle : SafeHandleZeroOrMinusOneIsInvalid
{
public InteropSafeHandle() : base(true)
{
}
protected override bool ReleaseHandle()
{
Interop.ReleaseInstance(this.handle);
return true;
}
}
It's worth noting that, despite some strings going across the interop boundary, the C++ side never keeps a reference to any of those strings. All the other args should be copied by value (and blittable) anyway because they're just simple ints/bools/doubles.
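For what it's worth, with StringMarshalling.Utf8 each string only lives on the native side for the duration of the call. A minimal sketch of roughly what that lifetime looks like, written as a manual DllImport equivalent (InteropStringSketch and AddThreeNative are hypothetical names; this is not the actual generated code):
using System;
using System.Runtime.InteropServices;
// A sketch of the string lifetime only, not the code generated by LibraryImport.
internal static class InteropStringSketch
{
    [DllImport("mylibrary", EntryPoint = "AddThree")]
    private static extern void AddThreeNative(IntPtr handle, uint arg1, IntPtr arg2);
    public static void AddThree(InteropSafeHandle handle, uint arg1, string arg2)
    {
        bool addedRef = false;
        IntPtr utf8 = IntPtr.Zero;
        try
        {
            handle.DangerousAddRef(ref addedRef);        // keep the native instance alive for the call
            utf8 = Marshal.StringToCoTaskMemUTF8(arg2);  // temporary native UTF-8 copy of the managed string
            AddThreeNative(handle.DangerousGetHandle(), arg1, utf8);
        }
        finally
        {
            Marshal.FreeCoTaskMem(utf8);                 // the copy is freed as soon as the call returns,
            if (addedRef) handle.DangerousRelease();     // so the C++ side must not hold on to the char*
        }
    }
}
The important part is the finally block: the UTF-8 copy is freed as soon as the call returns, which is fine as long as the C++ side copies whatever it needs.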
The C# code to call it looks roughly like:
while (true)
{
Calculate();
// Force a full GC. Added for debugging only
GC.Collect(GC.MaxGeneration, GCCollectionMode.Forced, blocking: true, compacting: true);
// The segfault happens during this call
GC.Collect(GC.MaxGeneration, GCCollectionMode.Aggressive, blocking: true, compacting: true);
}
private void Calculate()
{
// this loads a big JSON file from disk or calls a REST API or something
var data = LoadData();
InteropSafeHandle handle = Interop.CreateInstance();
foreach (var foo in data.Foo)
{
Interop.AddOne(handle, foo.One, foo.Two, foo.Three, foo.Four);
// and so on with all the other Add methods...
}
Interop.Calculate(handle);
Console.WriteLine($"{Interop.GetResult(handle, 0)}");
handle.Dispose();
}
There is no async or anything in this minimal test app; it's an entirely single-threaded console app.
Edit: Updated the example code to make it clear everything has definitely gone out of scope by the time the GC is initiated.
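Since the core dump above is named after the .NET Finalizer thread, one further debugging variation of the loop (a sketch on my part, not something already in the repro) is to also drain the finalizer queue each iteration, so that any crash while running finalizers surfaces at a known point:
// Debugging only: force collections and then wait for the finalizer thread to
// drain its queue before the next iteration.
GC.Collect(GC.MaxGeneration, GCCollectionMode.Forced, blocking: true, compacting: true);
GC.WaitForPendingFinalizers();
GC.Collect(GC.MaxGeneration, GCCollectionMode.Aggressive, blocking: true, compacting: true);
GC.WaitForPendingFinalizers();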
I've also left it running a few times and managed to get a segfault on the LoadData call as well (i.e. while we're doing lots of managed memory allocation). The backtrace and everything else look exactly the same as above, so I won't post it again; it's just extra evidence for why we think this is something to do with when GC is triggered.
I also tried the mcr.microsoft.com/dotnet/sdk:8.0-jammy image, just in case it was a Debian-only problem, but that also segfaults on Ubuntu in the same way with the same backtrace. I'd try an Alpine image as well, but the interop won't work there.
@adamrodger what kind of CPU are you seeing these crashes on? Are you able to reproduce the issue on other machines? (This may not matter; it's just that some of the symptoms here are vaguely similar to a GC crash I experienced, which turned out to actually be a CPU defect.)
@alexrp It crashes both on my local machine in Docker, running a recent Intel i7, and in Kubernetes in production. Those are also Intel, but I'm not sure exactly which model.
Looks like they run on Cascade Lake processors in Kubernetes.
Ok, the issue I had was with a 13900K, so if you're seeing these crashes on Cascade Lake, it seems quite unlikely that it would be the same issue. :thinking:
It's worth noting that, despite some strings going across the interop boundary, the C++ side never keeps a reference to any of those strings. All the other args should be copied by value (and blittable) anyway because they're just simple ints/bools/doubles.
Have you tried commenting out the AddXXX interop calls one at a time to see if any one particular call might be the culprit?
I've had a very similar interop issue to this one, where I accidentally wrote too far past a bounded object; it didn't cause an immediate crash, but it did trigger a crash during a GC operation.
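For illustration, the failure mode I mean looks something like this (FillBuffer is a made-up export, not one of the functions above): the native side writes more bytes than the managed buffer holds, nothing fails at the call site, and the corruption only shows up later when the GC walks or compacts the heap:
using System.Runtime.InteropServices;
internal static partial class BrokenInterop
{
    // Hypothetical native export that always writes 64 bytes into dst.
    [LibraryImport("mylibrary")]
    public static unsafe partial void FillBuffer(byte* dst);
    public static unsafe void Repro()
    {
        var tooSmall = new byte[16];
        fixed (byte* p = tooSmall)
        {
            // The native side writes 64 bytes; the extra 48 land on top of
            // neighbouring GC heap objects and headers. Nothing crashes here --
            // the process dies later, during a collection.
            FillBuffer(p);
        }
    }
}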
@rmsimpson That's a good suggestion, I'll give that a go.
I've managed to get it to segfault with literally just the CreateInstance and ReleaseInstance interop calls if I manually constrain the memory on the container to quite small (only 28MB, any smaller and I get OOM when loading the input data). The backtrace etc. are the same as the previously reported ones.
Those create/release calls really don't do very much, so that's really odd.
Are you running x86 or x64 on Linux? Or does it fail on both?
Are the functions declared in C++ with any explicit calling convention such as stdcall or cdecl? I know you obfuscated the functions to post here, but I'm wondering if the original code has explicit calling conventions that you may have neglected to post here. Compilations can sometimes be either cdecl or stdcall depending on CPU architecture and platform OS, unless the code explicitly declares one or the other.
It's running on x64 on Linux. I've not tried x86 because the C++ binary I have is built for x64 only. There are no calling conventions defined as far as I can see.
I think if the calling conventions were wrong it would fail immediately, wouldn't it? What I actually experience is quite intermittent - sometimes the loop gets through multiple iterations before it fails, and other times it fails on the first one or two.
You'd think. I don't suppose you recently switched from DllImport to LibraryImport?
I've tried it with DllImport, LibraryImport and raw function pointers, all with the same result. I suppose they're all just pretty much different ways to do the same thing though.
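For reference, a raw function pointer version of just the create/release pair looks roughly like this (a sketch, assuming the exports use the default cdecl convention and that NativeLibrary can resolve mylibrary):
using System;
using System.Runtime.InteropServices;
internal static unsafe class RawInterop
{
    private static readonly IntPtr Lib = NativeLibrary.Load("mylibrary");
    // Resolve the exports once and cache them as unmanaged function pointers.
    private static readonly delegate* unmanaged[Cdecl]<IntPtr> CreateInstancePtr =
        (delegate* unmanaged[Cdecl]<IntPtr>)(void*)NativeLibrary.GetExport(Lib, "CreateInstance");
    private static readonly delegate* unmanaged[Cdecl]<IntPtr, void> ReleaseInstancePtr =
        (delegate* unmanaged[Cdecl]<IntPtr, void>)(void*)NativeLibrary.GetExport(Lib, "ReleaseInstance");
    public static IntPtr CreateInstance() => CreateInstancePtr();
    public static void ReleaseInstance(IntPtr instance) => ReleaseInstancePtr(instance);
}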
it's just that some of the symptoms here are vaguely similar to a GC crash I experienced, which turned out to actually be a CPU defect.)
@alexrp My 13900K also started to hit defects this year, just after I returned from the New Year holidays. Compiling the CLR fails at random positions and can't complete a whole pass. Turning off the aggressive turbo solves the problem, but I can't reproduce it with y-cruncher or other stress tests.
Anyway, if it's a CPU defect, the failures should be totally random and only occur under really heavy load.
I've written an equivalent app in both Rust and C++ which calls the same library via interop calls and executed it the same way, and neither of those apps ever get a segfault.
I can make those apps pause when I run them with heavily constrained memory, and they unpause as soon as I increase the memory (using docker run -m and docker update -m), which is perhaps a clue? The segfault seems to happen in the .Net GC, so it makes sense to me that anything potentially causing weird behaviour around memory could trigger this problem in a managed runtime.
The segfault only happens in .Net and only since the upgrade to .Net 8. For now we've had to downgrade to .Net 7 and it's been stable since, but of course that version will go out of support very soon.
I've also run under Valgrind just to make sure there are no obvious memory problems with the library and nothing shows up there either.
I've revisited this now that .Net Runtime 8.0.3 is out, and I can still get it to fail. Every stack trace has this at the top, to do with handling suspended threads:
(lldb) thread backtrace
* thread #1, name = 'dotnet', stop reason = signal SIGSEGV
* frame #0: 0x00007f32654e594a libcoreclr.so`IsIPInEpilog(_CONTEXT*, EECodeInfo*, int*) [inlined] IsIPInProlog(pCodeInfo=0x00007f3265f7b328) at excep.cpp:7169:49
frame #1: 0x00007f32654e5947 libcoreclr.so`IsIPInEpilog(pContextToCheck=0x00007f3265f7ce60, pCodeInfo=0x00007f3265f7b328, pSafeToInjectThreadAbort=YES) at excep.cpp:7236:9
frame #2: 0x00007f326568fce4 libcoreclr.so`HandleSuspensionForInterruptedThread(interruptedContext=0x00007f3265f7ce60) at threadsuspend.cpp:5914:13
frame #3: 0x00007f32658c503c libcoreclr.so`inject_activation_handler(code=<unavailable>, siginfo=<unavailable>, context=0x00007f3265f7dac0) at signal.cpp:840:13
frame #4: 0x00007f3265aaf050 libc.so.6`___lldb_unnamed_symbol3252 + 1
I've also gone back and tried .Net runtimes 8.0.0, 8.0.1 and 8.0.2, but they all have the same intermittent segmentation fault with the same backtrace, always to do with suspended threads during garbage collection.
I've also added extra logging just to 100% confirm that the unmanaged code isn't somehow being released twice (e.g. if the finalizer somehow still ran even after Dispose was called on the SafeHandle), and I can confirm it definitely only releases the memory once.
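A debugging-only variant of the handle along these lines is enough to see every release (a sketch; the exact logging isn't important, and SafeHandle itself already guarantees ReleaseHandle runs at most once):
using System;
using System.Threading;
using Microsoft.Win32.SafeHandles;
internal sealed class LoggingInteropSafeHandle : SafeHandleZeroOrMinusOneIsInvalid
{
    private int _released;   // 0 = still owned, 1 = already released
    public LoggingInteropSafeHandle() : base(true) { }
    protected override bool ReleaseHandle()
    {
        // SafeHandle should only ever call this once per handle; the counter is
        // purely there to prove that while debugging.
        if (Interlocked.Exchange(ref _released, 1) != 0)
        {
            Console.Error.WriteLine($"Double release of native instance 0x{handle:X}!");
            return false;
        }
        Console.Error.WriteLine($"Releasing native instance 0x{handle:X}");
        Interop.ReleaseInstance(handle);
        return true;
    }
}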
if I manually constrain the memory on the container to quite small (only 28MB
It is likely that the OOM killer is crashing your app with a 28MB memory constraint. You can check the OOM killer logs to see whether it is responsible for the crash.
The OOM killer refuses to give the app a new page of memory at random points, and the .NET runtime is unable to reliably report Out Of Memory errors when the process is killed by the OOM killer.
The output is very different when the process is OOM-killed: the process has a different return code and it prints "Killed" to the console instead of segfault/core dumped.
Also it always happens at exactly the same line of code in the .Net Runtime, which would be remarkably coincidental if it were due to OOM.
This segfault happens even when there's a lot of memory, it just becomes more likely to happen as memory is limited. For the purposes of recreating it I run with constrained memory just to make it happen more quickly.
Edit: Also the very small memory was in the equivalent Rust app, not the .Net one. I run the .Net app that can reproduce this reliably with 128MB.
I've got a similar issue. The odd thing is that the segfault happens out of nowhere. The app is a simple web service with some interop involved, but at the time of the crash it's doing nothing (no requests, no background work) and it's nowhere near its memory limits. This keeps happening roughly once every day or two. I've got a bunch of memory dumps, but the picture is always the same:
(7.be): Signal SIGSEGV (Segmentation fault) code 128 at 0x0
*** WARNING: Unable to verify timestamp for libcoreclr.so
libcoreclr!GetThread+0x8 [inlined in libcoreclr!HandleSuspensionForInterruptedThread+0x2a]:
00007fd2`16502ada 666648e82ecb2a00 call libcoreclr+0x65e610 (00007fd2`167af610)
0:000> !dumpstack
*** WARNING: Unable to verify timestamp for doublemapper (deleted)
OS Thread Id: 0xbe (0)
TEB information is not available so a stack size of 0xFFFF is assumed
Current frame: libcoreclr!HandleSuspensionForInterruptedThread + 0x2a [/__w/1/s/src/coreclr/vm/threadsuspend.h:5852]
Child-SP RetAddr Caller, Callee
00007FD2108C16A0 00007fd216720491 libcoreclr!ExecutionManager::IsManagedCodeWorker + 0x101 [/__w/1/s/src/coreclr/vm/codeman.cpp:4608], calling libcoreclr!EEJitManager::FindMethodCode [/__w/1/s/src/coreclr/vm/codeman.cpp:3995]
00007FD2108C16C0 00007fd21672023a libcoreclr!ExecutionManager::IsManagedCode + 0x9a [/__w/1/s/src/coreclr/vm/codeman.cpp:4531], calling libcoreclr!ExecutionManager::IsManagedCodeWorker [/__w/1/s/src/coreclr/vm/codeman.cpp:4588]
00007FD2108C1710 00007fd21673808c libcoreclr!inject_activation_handler + 0x9c [/__w/1/s/src/coreclr/pal/src/exception/signal.cpp:841], calling libcoreclr!InvokeActivationHandler [/__w/1/s/src/coreclr/pal/src/exception/signal.cpp:788]
!threads
ThreadCount: 34
UnstartedThread: 0
BackgroundThread: 8
PendingThread: 0
DeadThread: 25
Hosted Runtime: no
Lock
DBG ID OSID ThreadOBJ State GC Mode GC Alloc Context Domain Count Apt Exception
7 1 7 0000558152DE87C0 2020020 Preemptive 0000000000000000:0000000000000000 0000558152e36ef0 -00001 Ukn
1 2 c 0000558152E2B320 21220 Preemptive 00007FD10A8036D0:00007FD10A804298 0000558152e36ef0 -00001 Ukn (Finalizer)
12 4 e 0000558152FEAFC0 2021220 Preemptive 0000000000000000:0000000000000000 0000558152e36ef0 -00001 Ukn
13 6 10 00007FD104006610 3021220 Preemptive 0000000000000000:0000000000000000 0000558152e36ef0 -00001 Ukn (Threadpool Worker)
2 8 12 00005581530949E0 2021220 Preemptive 0000000000000000:0000000000000000 0000558152e36ef0 -00001 Ukn
5 9 1a 00005581530A0500 2021220 Preemptive 00007FD10D875758:00007FD10D876770 0000558152e36ef0 -00001 Ukn
19 10 1b 00005581531B2DC0 21220 Preemptive 0000000000000000:0000000000000000 0000558152e36ef0 -00001 Ukn
XXXX 14 0 00007FD100251550 1031820 Preemptive 0000000000000000:0000000000000000 0000558152e36ef0 -00001 Ukn (Threadpool Worker)
XXXX 3 0 00007FD1000F2510 1031820 Preemptive 0000000000000000:0000000000000000 0000558152e36ef0 -00001 Ukn (Threadpool Worker)
XXXX 13 0 00007FD0F8006660 1031820 Preemptive 0000000000000000:0000000000000000 0000558152e36ef0 -00001 Ukn (Threadpool Worker)
XXXX 16 0 00007FD0F800A810 1031820 Preemptive 0000000000000000:0000000000000000 0000558152e36ef0 -00001 Ukn (Threadpool Worker)
XXXX 15 0 00007FD0F800B680 1031820 Preemptive 0000000000000000:0000000000000000 0000558152e36ef0 -00001 Ukn (Threadpool Worker)
XXXX 17 0 00007FD0F0053AA0 1031820 Preemptive 0000000000000000:0000000000000000 0000558152e36ef0 -00001 Ukn (Threadpool Worker)
XXXX 19 0 00007FD1000A9160 1031820 Preemptive 0000000000000000:0000000000000000 0000558152e36ef0 -00001 Ukn (Threadpool Worker)
XXXX 20 0 00007FD0FC0050C0 1031820 Preemptive 0000000000000000:0000000000000000 0000558152e36ef0 -00001 Ukn (Threadpool Worker)
XXXX 21 0 00007FD1001BB420 1031820 Preemptive 0000000000000000:0000000000000000 0000558152e36ef0 -00001 Ukn (Threadpool Worker)
XXXX 18 0 00007FD0F80BC670 1031820 Preemptive 0000000000000000:0000000000000000 0000558152e36ef0 -00001 Ukn (Threadpool Worker)
XXXX 22 0 00007FD0F80B99F0 1031820 Preemptive 0000000000000000:0000000000000000 0000558152e36ef0 -00001 Ukn (Threadpool Worker)
XXXX 23 0 00007FD0F80BAE70 1031820 Preemptive 0000000000000000:0000000000000000 0000558152e36ef0 -00001 Ukn (Threadpool Worker)
XXXX 25 0 00007FD0F800E3F0 1031820 Preemptive 0000000000000000:0000000000000000 0000558152e36ef0 -00001 Ukn (Threadpool Worker)
XXXX 26 0 00007FD1000CA940 1031820 Preemptive 0000000000000000:0000000000000000 0000558152e36ef0 -00001 Ukn (Threadpool Worker)
XXXX 24 0 00007FD1000CC0B0 1031820 Preemptive 0000000000000000:0000000000000000 0000558152e36ef0 -00001 Ukn (Threadpool Worker)
XXXX 27 0 00007FD0F0012700 1031820 Preemptive 0000000000000000:0000000000000000 0000558152e36ef0 -00001 Ukn (Threadpool Worker)
XXXX 28 0 00007FD0F006E380 1031820 Preemptive 0000000000000000:0000000000000000 0000558152e36ef0 -00001 Ukn (Threadpool Worker)
XXXX 29 0 00007FD0F0017800 1031820 Preemptive 0000000000000000:0000000000000000 0000558152e36ef0 -00001 Ukn (Threadpool Worker)
XXXX 30 0 00007FD0F002F510 1031820 Preemptive 0000000000000000:0000000000000000 0000558152e36ef0 -00001 Ukn (Threadpool Worker)
XXXX 31 0 00007FD0F0030C80 1031820 Preemptive 0000000000000000:0000000000000000 0000558152e36ef0 -00001 Ukn (Threadpool Worker)
XXXX 32 0 00007FD10024D9A0 1031820 Preemptive 0000000000000000:0000000000000000 0000558152e36ef0 -00001 Ukn (Threadpool Worker)
XXXX 33 0 00007FD1000CD820 1031820 Preemptive 0000000000000000:0000000000000000 0000558152e36ef0 -00001 Ukn (Threadpool Worker)
XXXX 34 0 00007FD1000CEF90 1031820 Preemptive 0000000000000000:0000000000000000 0000558152e36ef0 -00001 Ukn (Threadpool Worker)
XXXX 35 0 00007FD0F8009040 1031820 Preemptive 0000000000000000:0000000000000000 0000558152e36ef0 -00001 Ukn (Threadpool Worker)
XXXX 36 0 00007FD0F00323F0 1031820 Preemptive 0000000000000000:0000000000000000 0000558152e36ef0 -00001 Ukn (Threadpool Worker)
6 37 bd 00007FD0F0033B60 1021220 Preemptive 00007FD10D8804D8:00007FD10D8804D8 0000558152e36ef0 -00001 Ukn (GC) (Threadpool Worker)
0 38 be 00007FD0F80AE700 1021222 Cooperative 00007FD10D87F2E0:00007FD10D880178 0000558152e36ef0 -00001 Ukn (Threadpool Worker)
dotnet --info
Host:
Version: 8.0.2
Architecture: x64
Commit: 1381d5ebd2
RID: linux-x64
.NET SDKs installed:
No SDKs were found.
.NET runtimes installed:
Microsoft.AspNetCore.App 8.0.2 [/usr/share/dotnet/shared/Microsoft.AspNetCore.App]
Microsoft.NETCore.App 8.0.2 [/usr/share/dotnet/shared/Microsoft.NETCore.App]
I've got a similar issue too: the container crashes with a segmentation fault if the app makes interop calls, after migrating from net6 to net8. Docker image used: aspnet:8.0-bookworm-slim.
@teh13th Are you able to capture a stack trace or anything to see if it fails at the same point as my code?
@adamrodger, I only have a Linux core dump. How do I get the stack from it using WinDbg?
Ah I've only done that on Linux using lldb with the dotnet-symbol tool to download the symbols that make the backtrace meaningful.
@AaronRobinsonMSFT Any ideas/suggestions?
This doesn't look interop specific. Rather, this is about thread suspension management during a GC. The generated interop transitions here are idempotent - the same no matter the configuration or settings. The fact that this is inconsistent would imply that the generated interop code is either subtly wrong - perhaps missing some barrier or something - or, more likely, that something is amiss in the thread suspension logic path - signal handling down to the thread manager.
@adamrodger Would you be able to check whether the C++ library registers any signal handlers with the SA_ONSTACK flag?
The .NET runtime uses the SIGRTMIN signal for GC thread suspension. If some other component registers a handler for this signal and forwards it to the .NET runtime while running with a small alternative stack, it can lead to the crash that you are seeing.
I think something does happen with signals, but not that one. I've seen a SIGURG but no others as far as I know.
Thanks for continuing to look into this by the way 👍
Tagging subscribers to this area: @mangod9 See info in area-owners.md if you want to be subscribed.
Also @VSadov for threading
@adamrodger We've updated our documentation about golang support. Can you confirm there is no golang in this process?
This issue has been marked needs-author-action and may be missing some important information.
@adamrodger We've updated our documentation about golang support. Can you confirm there is no golang in this process?
There is in my case. I suspected the golang interop was the problem. Thanks for the confirmation.
There is Golang in the library with which we're doing the interop 👍
Description
I have a C# application that does interop with a native C++ library; it has started to intermittently crash with a segmentation fault since upgrading to .Net 8. The code is unmodified from before the upgrade, and has worked in production on .Net 5, 6 and 7 without a segfault.
Reproduction Steps
I have written a few smaller apps to reproduce the issue, but it's difficult to share them given they rely on calling the proprietary C++ library.
Essentially the apps load a large data set, create a native instance via the interop, push the data across the boundary with the Add calls, read the result back and dispose of the handle, in a loop (as in the Calculate example earlier in the thread).
The segfault always happens during the data load step (i.e. when a large amount of managed memory is being allocated), but only once the interop has happened at least once. I've verified that if I stub out the interop calls, the app never segfaults.
Expected behavior
The app works as it did on .Net 7.
Actual behavior
The app crashes with a segfault intermittently.
Regression?
Yes, the app worked on .Net 7, and if I change the version back to .Net 7 it continues to work fine.
Known Workarounds
No known workarounds other than not using .Net 8, although the problem does get worse the less memory the container has. Our theory is that this triggers more frequent garbage collections and this is where the segfault occurs.
Configuration
.Net Version: 8.0.2
OS: Linux (Debian bookworm)
Container image: mcr.microsoft.com/dotnet/sdk:8.0.2 (used for analysing core dumps) and mcr.microsoft.com/dotnet/runtime:8.0.2 in production
Architecture: x86-64
Other information
LLDB backtrace:
LLDB frame variable:
I've tried to configure createdump to capture core dumps when the application crashes, but it never seems to trigger (it does on other apps). I assume the segfault prevents createdump from triggering, so I only have the standard core dump written by Linux. I've tried to get the CLR stack output, but the core dump doesn't seem to work with SOS.
I've also run the test apps on Windows and I can never get them to segfault. They only seem to segfault on Linux.