SIGSEGV in netcoredbg when running on Linux with musl libc

koorogi commented 2 weeks ago

Description

Using netcoredbg (https://github.com/Samsung/netcoredbg) will crash with a segmentation fault when running on a musl-based Linux system.

Reproduction Steps

$ dotnet new console -o crash
The template "Console App" was created successfully.

Processing post-creation actions...
Restoring /home/koorogi/tmp/crash/crash/crash.csproj:
  Determining projects to restore...
  Restored /home/koorogi/tmp/crash/crash/crash.csproj (in 589 ms).
Restore succeeded.

$ cd crash
$ dotnet build
MSBuild version 17.8.5+b5265ef37 for .NET
  Determining projects to restore...
  All projects are up-to-date for restore.
  crash -> /home/koorogi/tmp/crash/crash/bin/Debug/net8.0/crash.dll

Build succeeded.
    0 Warning(s)
    0 Error(s)

Time Elapsed 00:00:01.73
$ netcoredbg
ncdb> file bin/Debug/net8.0/crash
ncdb> run
Segmentation fault

Expected behavior

Should be able to debug the program.

Actual behavior

The debugger crashes.

Regression?

No response

Known Workarounds

No response

Configuration

.NET 8.0.105
Gentoo Linux
Musl libc
x86_64

The issue is specific to the use of musl.

Other information

Backtrace from gdb:

(gdb) bt
#0  0x00007ffff72fd1f9 in EnsureStackSize (stackSize=1572864)
    at /var/tmp/portage/dev-dotnet/dotnet-sdk-8.0.105/work/dotnet-sdk-8.0.5/src/runtime/artifacts/source-build/self/src/src/coreclr/pal/src/init/pal.cpp:266
#1  0x00007ffff72fcbe3 in Initialize (argc=argc@entry=1, argv=argv@entry=0x7ffff67b7f78, 
    flags=flags@entry=255)
    at /var/tmp/portage/dev-dotnet/dotnet-sdk-8.0.105/work/dotnet-sdk-8.0.5/src/runtime/artifacts/source-build/self/src/src/coreclr/pal/src/init/pal.cpp:410
#2  0x00007ffff72fd35e in PAL_InitializeCoreCLR (
    szExePath=0x7ffff66333e0 "/usr/lib/netcoredbg/netcoredbg", runningInExe=<optimized out>)
    at /var/tmp/portage/dev-dotnet/dotnet-sdk-8.0.105/work/dotnet-sdk-8.0.5/src/runtime/artifacts/source-build/self/src/src/coreclr/pal/src/init/pal.cpp:778
#3  0x00007ffff6e58b13 in coreclr_initialize (exePath=0x7ffff66333e0 "/usr/lib/netcoredbg/netcoredbg", 
    appDomainFriendlyName=appDomainFriendlyName@entry=0x5555556ecdea "debugger", 
    propertyCount=propertyCount@entry=5, propertyKeys=propertyKeys@entry=0x7ffff67b81a0, 
    propertyValues=propertyValues@entry=0x7ffff67b81e0, 
    hostHandle=hostHandle@entry=0x555555746c60 <netcoredbg::Interop::(anonymous namespace)::hostHandle>, 
    domainId=0x555555746c58 <netcoredbg::Interop::(anonymous namespace)::domainId>)
    at /var/tmp/portage/dev-dotnet/dotnet-sdk-8.0.105/work/dotnet-sdk-8.0.5/src/runtime/artifacts/source-build/self/src/src/coreclr/dlls/mscoree/exports.cpp:275
#4  0x0000555555647baa in netcoredbg::Interop::Init (coreClrPath=...)
    at /usr/lib/gcc/x86_64-pc-linux-musl/14/include/g++-v14/bits/basic_string.h:227
#5  0x000055555561d2a0 in netcoredbg::ManagedCallback::CreateProcessW (this=0x7ffff67b9e90, 
    pProcess=0x7ffff7417b08)
    at /var/tmp/portage/dev-dotnet/netcoredbg-3.0.0.1018-r1/work/netcoredbg-3.0.0-1018/src/debugger/managedcallback.cpp:157
#6  0x00007ffff6b1006d in ShimProxyCallback::QueueCreateProcess(ICorDebugProcess*)::CreateProcessEvent::Dispatch(ManagedEvent::DispatchArgs) (this=0x7ffff7418b50, args=...)
    at /var/tmp/portage/dev-dotnet/dotnet-sdk-8.0.105/work/dotnet-sdk-8.0.5/src/runtime/artifacts/source-build/self/src/src/coreclr/debug/di/shimcallback.cpp:350
#7  0x00007ffff6af93fe in CordbProcess::DispatchRCEvent (this=this@entry=0x7ffff7417ae0)
    at /var/tmp/portage/dev-dotnet/dotnet-sdk-8.0.105/work/dotnet-sdk-8.0.5/src/runtime/artifacts/source-build/self/src/src/coreclr/debug/di/process.cpp:4701
#8  0x00007ffff6b025c8 in CordbRCEventThread::FlushQueuedEvents (this=this@entry=0x7ffff67c1460, 
    process=process@entry=0x7ffff7417ae0)
    at /var/tmp/portage/dev-dotnet/dotnet-sdk-8.0.105/work/dotnet-sdk-8.0.5/src/runtime/artifacts/source-build/self/src/src/coreclr/debug/di/process.cpp:10296
#9  0x00007ffff6b02e6c in CordbRCEventThread::ThreadProc (this=0x7ffff67c1460)
    at /var/tmp/portage/dev-dotnet/dotnet-sdk-8.0.105/work/dotnet-sdk-8.0.5/src/runtime/artifacts/source-build/self/src/src/coreclr/debug/di/process.cpp:10554
#10 0x00007ffff6b02f19 in CordbRCEventThread::ThreadProc (parameter=0x180000)
    at /var/tmp/portage/dev-dotnet/dotnet-sdk-8.0.105/work/dotnet-sdk-8.0.5/src/runtime/artifacts/source-build/self/src/src/coreclr/debug/di/process.cpp:10620
#11 0x00007ffff69f44a0 in CorUnix::CPalThread::ThreadEntry (pvParam=0x7ffff67bf740)
    at /var/tmp/portage/dev-dotnet/dotnet-sdk-8.0.105/work/dotnet-sdk-8.0.5/src/runtime/artifacts/source-build/self/src/src/coreclr/pal/src/thread/thread.cpp:1760
#12 0x00007ffff7fb69ae in start (p=0x7ffff67b89b0) at src/thread/pthread_create.c:207
#13 0x00007ffff7fc48b5 in __clone () at src/thread/x86_64/clone.s:22
Backtrace stopped: frame did not save the PC

The crash occurs in EnsureStackSize, which simply allocates 1.5MB on the stack and tries to write to it to ensure there is enough stack space available.

Musl by default allocates smaller thread stack sizes than glibc does, but that's actually not the root issue in this case. This thread was created by .NET itself, and a fixed stack size (larger than the musl default) is requested when the thread is created.

The value of g_defaultStackSize which determines the requested stack size is set to a fixed value of 1.5MB on musl.

This same size (g_defaultStackSize) is what is passed to EnsureStackSize.

So, a thread is created with a 1.5MB stack, and 13 stack frames deep, the code tries to verify that 1.5MB is still available on the stack, and it isn't because those 13 stack frames have already used some of it.

Based on https://github.com/dotnet/runtime/issues/9855#issuecomment-369389188, it sounds like EnsureStackSize is only intended to be used on the main thread, but the code that ensures this assumes that Initialize is called on the main thread, and that doesn't seem to be the case here.

I don't know if netcoredbg is doing something wrong by calling into Initialize from a different thread, but it looks like the thread where this call is happening is the CordbRCEventThread::ThreadProc thread.

dotnet-policy-service[bot] commented 2 weeks ago

Tagging subscribers to this area: @tommcdon See info in area-owners.md if you want to be subscribed.

janvorli commented 2 weeks ago

The coreclr_initialize is supposed to be called on the main thread only. Looking at the stack trace above, it is actually quite strange, as it calls coreclr_initialize on a thread that was created by the coreclr itself (the thread started in CorUnix::CPalThread::ThreadEntry). So it looks like a bug in the netcoredb - it must have already initialized coreclr before, otherwise it would not be able to start a thread under it.

janvorli commented 2 weeks ago

cc: @gbalykov

gbalykov commented 2 weeks ago

cc @viewizard

gbalykov commented 2 weeks ago

@koorogi please share your build command for netcoredbg and runtime

viewizard commented 2 weeks ago

In netcoredbg this coreclr_initialize called from ICorDebugManagedCallback::CreateProcess (that is callback of debuggee process). Note, this is not start of debuggee process, but internal managed part load of debugger itself (we call managed delegates from native debugger code): https://github.com/Samsung/netcoredbg/blob/e4512506fd211493864c4187fb52ff1384af8f5e/src/managed/interop.cpp#L214

Not really sure why this happens for musl only, can't reproduce thus issue on Ubuntu with .NET SDK 8.0.

am11 commented 2 weeks ago

can't reproduce thus issue on Ubuntu with .NET SDK 8.0.

ENSURE_PRIMARY_STACK_SIZE is only enabled on linux-musl: https://github.com/dotnet/runtime/blob/4a7fe654d798a372f5786f026006437444f14f1e/src/coreclr/pal/src/CMakeLists.txt#L93-L98

jkotas commented 2 weeks ago

You can workaround this by setting DOTNET_DefaultStackSize to something less than 1.5MB before core_initialize call and resetting it back to its original value after the call.

As for the proper fix, would it make sense to move the stack-committing to the host so that it is really only done on the main thread when the runtime owns the process?

jkotas commented 2 weeks ago

Somewhat related to #96347

janvorli commented 2 weeks ago

Note, this is not start of debuggee process, but internal managed part load of debugger itself (we call managed delegates from native debugger code):

What I don't understand though is that the coreclr_initialize is called on a thread that was created by coreclr. From the stack trace above

11 0x00007ffff69f44a0 in CorUnix::CPalThread::ThreadEntry (pvParam=0x7ffff67bf740)
    at /var/tmp/portage/dev-dotnet/dotnet-sdk-8.0.105/work/dotnet-sdk-8.0.5/src/runtime/artifacts/source-build/self/src/src/coreclr/pal/src/thread/thread.cpp:1760
#12 0x00007ffff7fb69ae in start (p=0x7ffff67b89b0) at src/thread/pthread_create.c:207
#13 0x00007ffff7fc48b5 in __clone () at src/thread/x86_64/clone.s:22

The CorUnix::CPalThread::ThreadEntry is what coreclr sets as the entry point for its threads. An external native thread would not have that there.

jkotas commented 2 weeks ago

This is a thread created by CoreCLR PAL linked into libmscordbi.so. libmscordbi.so is out-proc debugger component., not a full CoreCLR runtime.

viewizard commented 2 weeks ago

What I don't understand though is that the coreclr_initialize is called on a thread that was created by coreclr.

Yes, this thread created by coreclr (debugger part of it related to dbgshim) that call ICorDebugManagedCallback::CreateProcess (debugger native code).

janvorli commented 2 weeks ago

This is a thread created by CoreCLR PAL linked into libmscordbi.so.

Ah, I've missed that.

janvorli commented 2 weeks ago

As for the proper fix, would it make sense to move the stack-committing to the host so that it is really only done on the main thread when the runtime owns the process?

That would potentially break existing 3rd party hosts and it is also kind of weird to expose this PAL internal behavior. I would rather check for the primary thread and ran the EnsureStackSize only when it is executing on the primary thread. It doesn't make sense to run it on other threads anyways. Essentially, just checking gettid() == getpid().

jkotas commented 2 weeks ago

We should be setting the thread stack size iff we created the thread. If somebody else created the thread and hosting us, we should leave the thread stack size alone.

CoreCLR PAL does not have a reliable way to tell whether the main thread was created by us (the default dotnet host) or whether somebody else is hosting us. gettid() == getpid() condition is an approximation. When the condition is false, we definitely know that it is not our thread as you have pointed out. When this condition is true, we cannot tell whether we own the process or whether we are hosted by an app that happened to initialize us on the main thread.

am11 commented 2 weeks ago

Perhaps something like:

bool IsThreadOwnedByCoreCLRPal()
{
    return !pthread_getspecific(thObjKey);
}

#ifdef ENSURE_PRIMARY_STACK_SIZE
/*++
Function:
  EnsureStackSize

Abstract:
  This fixes a problem on MUSL where the initial stack size reported by the
  pthread_attr_getstack is about 128kB, but this limit is not fixed and
  the stack can grow dynamically. The problem is that it makes the
  functions ReflectionInvocation::[Try]EnsureSufficientExecutionStack
  to fail for real life scenarios like e.g. compilation of corefx.
  Since there is no real fixed limit for the stack, the code below
  ensures moving the stack limit to a value that makes reasonable
  real life scenarios work.

 --*/
__attribute__((noinline,NOOPT_ATTRIBUTE))
void
EnsureStackSize(SIZE_T stackSize)
{
+     // if somebody else created the thread and hosting us, we should leave the thread stack size alone. 
+    if (!IsThreadOwnedByCoreCLRPal()) return;
+
    volatile uint8_t *s = (uint8_t *)_alloca(stackSize);
    *s = 0;
}
#endif // ENSURE_PRIMARY_STACK_SIZE

janvorli commented 2 weeks ago

@am11 we need to do that for the primary thread only and never for secondary threads. That thread is always created by the OS. And even if we are hosted by a 3rd party host, we should ensure the stack size, because if this fails, there is not enough space on the stack left to ensure correct execution of the runtime.

janvorli commented 2 weeks ago

We could make the check more involved. We could allocate only stackSize minus the already used part of the stack and also trim the stackSize by the stack rlimit in case someone sets that. I think that should make it behave correctly in all scenarios.

am11 commented 2 weeks ago

@janvorli, thanks. A question: is it absolutely necessary to ensure the stack size like that? I mean this program:

#include <stdio.h>
#include <stdlib.h>
#include <sys/resource.h>
#include <errno.h>

void print_stack_limit() {
    struct rlimit rl;
    if (getrlimit(RLIMIT_STACK, &rl) != 0) {
        perror("getrlimit");
        exit(EXIT_FAILURE);
    }
    printf("Current stack size limit: soft=%llu, hard=%llu\n", (unsigned long long)rl.rlim_cur, (unsigned long long)rl.rlim_max);
}

void set_stack_limit(rlim_t new_limit) {
    struct rlimit rl;
    rl.rlim_cur = new_limit;
    rl.rlim_max = new_limit;

    if (setrlimit(RLIMIT_STACK, &rl) != 0) {
        perror("setrlimit");
        exit(EXIT_FAILURE);
    }
}

int main(int argc, char *argv[]) {
    printf("Initial stack size limit:\n");
    print_stack_limit();

    if (argc > 1) {
        rlim_t new_limit = strtoull(argv[1], NULL, 10) * 1024;
        printf("Setting new stack size limit to %llu KB\n", (unsigned long long)new_limit / 1024);
        set_stack_limit(new_limit);

        printf("Updated stack size limit:\n");
        print_stack_limit();
    } else {
        printf("Usage: %s <new_stack_size_in_KB>\n", argv[0]);
    }

    return 0;
}

with cc grow-stack.c && ./a.out 1024 prints:

Initial stack size limit:
Current stack size limit: soft=8388608, hard=18446744073709551615
Setting new stack size limit to 1024 KB
Updated stack size limit:
Current stack size limit: soft=1048576, hard=1048576

on ubuntu-arm64 and

Initial stack size limit:
Current stack size limit: soft=8388608, hard=18446744073709551615
Setting new stack size limit to 1024 KB
Updated stack size limit:
Current stack size limit: soft=1048576, hard=1048576

on alpine-arm64. So we can probably rely on the libc API without ensuring it (only) on linux-musl?

janvorli commented 2 weeks ago

The reason was to ensure that pthread_attr_getstack would return a realistic stack limit instead of just 128kB, so that the CPalThread::GetStackLimit can get it. But it seems that for primary thread, we can actually return the rlimit value minus the guard page size and we would not have to do the ensuring. Or it is even possible that current alpine now behaves the same as glibc, this stuff is from the early years of .NET core.

am11 commented 2 weeks ago

Or it is even possible that current alpine now behaves the same as glibc, this stuff is from the early years of .NET core.

cc @ayakael our resident Alpine Linux expert. 😉

janvorli commented 2 weeks ago

I've just verified that on the latest Alpine, it still behaves the same way. The pthread_attr_getstack returns 128kB stack limit.

dotnet / runtime