Open koorogi opened 2 weeks ago
Tagging subscribers to this area: @tommcdon See info in area-owners.md if you want to be subscribed.
The coreclr_initialize is supposed to be called on the main thread only. Looking at the stack trace above, it is actually quite strange, as it calls coreclr_initialize on a thread that was created by the coreclr itself (the thread started in CorUnix::CPalThread::ThreadEntry). So it looks like a bug in the netcoredb - it must have already initialized coreclr before, otherwise it would not be able to start a thread under it.
cc: @gbalykov
cc @viewizard
@koorogi please share your build command for netcoredbg and runtime
In netcoredbg this coreclr_initialize
called from ICorDebugManagedCallback::CreateProcess
(that is callback of debuggee process). Note, this is not start of debuggee process, but internal managed part load of debugger itself (we call managed delegates from native debugger code):
https://github.com/Samsung/netcoredbg/blob/e4512506fd211493864c4187fb52ff1384af8f5e/src/managed/interop.cpp#L214
Not really sure why this happens for musl only, can't reproduce thus issue on Ubuntu with .NET SDK 8.0.
can't reproduce thus issue on Ubuntu with .NET SDK 8.0.
ENSURE_PRIMARY_STACK_SIZE
is only enabled on linux-musl: https://github.com/dotnet/runtime/blob/4a7fe654d798a372f5786f026006437444f14f1e/src/coreclr/pal/src/CMakeLists.txt#L93-L98
You can workaround this by setting DOTNET_DefaultStackSize
to something less than 1.5MB before core_initialize
call and resetting it back to its original value after the call.
As for the proper fix, would it make sense to move the stack-committing to the host so that it is really only done on the main thread when the runtime owns the process?
Somewhat related to #96347
Note, this is not start of debuggee process, but internal managed part load of debugger itself (we call managed delegates from native debugger code):
What I don't understand though is that the coreclr_initialize is called on a thread that was created by coreclr. From the stack trace above
11 0x00007ffff69f44a0 in CorUnix::CPalThread::ThreadEntry (pvParam=0x7ffff67bf740)
at /var/tmp/portage/dev-dotnet/dotnet-sdk-8.0.105/work/dotnet-sdk-8.0.5/src/runtime/artifacts/source-build/self/src/src/coreclr/pal/src/thread/thread.cpp:1760
#12 0x00007ffff7fb69ae in start (p=0x7ffff67b89b0) at src/thread/pthread_create.c:207
#13 0x00007ffff7fc48b5 in __clone () at src/thread/x86_64/clone.s:22
The CorUnix::CPalThread::ThreadEntry
is what coreclr sets as the entry point for its threads. An external native thread would not have that there.
This is a thread created by CoreCLR PAL linked into libmscordbi.so. libmscordbi.so is out-proc debugger component., not a full CoreCLR runtime.
What I don't understand though is that the coreclr_initialize is called on a thread that was created by coreclr.
Yes, this thread created by coreclr (debugger part of it related to dbgshim) that call ICorDebugManagedCallback::CreateProcess
(debugger native code).
This is a thread created by CoreCLR PAL linked into libmscordbi.so.
Ah, I've missed that.
As for the proper fix, would it make sense to move the stack-committing to the host so that it is really only done on the main thread when the runtime owns the process?
That would potentially break existing 3rd party hosts and it is also kind of weird to expose this PAL internal behavior. I would rather check for the primary thread and ran the EnsureStackSize only when it is executing on the primary thread. It doesn't make sense to run it on other threads anyways. Essentially, just checking gettid() == getpid()
.
We should be setting the thread stack size iff we created the thread. If somebody else created the thread and hosting us, we should leave the thread stack size alone.
CoreCLR PAL does not have a reliable way to tell whether the main thread was created by us (the default dotnet host) or whether somebody else is hosting us. gettid() == getpid()
condition is an approximation. When the condition is false, we definitely know that it is not our thread as you have pointed out. When this condition is true, we cannot tell whether we own the process or whether we are hosted by an app that happened to initialize us on the main thread.
Perhaps something like:
bool IsThreadOwnedByCoreCLRPal()
{
return !pthread_getspecific(thObjKey);
}
#ifdef ENSURE_PRIMARY_STACK_SIZE
/*++
Function:
EnsureStackSize
Abstract:
This fixes a problem on MUSL where the initial stack size reported by the
pthread_attr_getstack is about 128kB, but this limit is not fixed and
the stack can grow dynamically. The problem is that it makes the
functions ReflectionInvocation::[Try]EnsureSufficientExecutionStack
to fail for real life scenarios like e.g. compilation of corefx.
Since there is no real fixed limit for the stack, the code below
ensures moving the stack limit to a value that makes reasonable
real life scenarios work.
--*/
__attribute__((noinline,NOOPT_ATTRIBUTE))
void
EnsureStackSize(SIZE_T stackSize)
{
+ // if somebody else created the thread and hosting us, we should leave the thread stack size alone.
+ if (!IsThreadOwnedByCoreCLRPal()) return;
+
volatile uint8_t *s = (uint8_t *)_alloca(stackSize);
*s = 0;
}
#endif // ENSURE_PRIMARY_STACK_SIZE
@am11 we need to do that for the primary thread only and never for secondary threads. That thread is always created by the OS. And even if we are hosted by a 3rd party host, we should ensure the stack size, because if this fails, there is not enough space on the stack left to ensure correct execution of the runtime.
We could make the check more involved. We could allocate only stackSize minus the already used part of the stack and also trim the stackSize by the stack rlimit in case someone sets that. I think that should make it behave correctly in all scenarios.
@janvorli, thanks. A question: is it absolutely necessary to ensure the stack size like that? I mean this program:
#include <stdio.h>
#include <stdlib.h>
#include <sys/resource.h>
#include <errno.h>
void print_stack_limit() {
struct rlimit rl;
if (getrlimit(RLIMIT_STACK, &rl) != 0) {
perror("getrlimit");
exit(EXIT_FAILURE);
}
printf("Current stack size limit: soft=%llu, hard=%llu\n", (unsigned long long)rl.rlim_cur, (unsigned long long)rl.rlim_max);
}
void set_stack_limit(rlim_t new_limit) {
struct rlimit rl;
rl.rlim_cur = new_limit;
rl.rlim_max = new_limit;
if (setrlimit(RLIMIT_STACK, &rl) != 0) {
perror("setrlimit");
exit(EXIT_FAILURE);
}
}
int main(int argc, char *argv[]) {
printf("Initial stack size limit:\n");
print_stack_limit();
if (argc > 1) {
rlim_t new_limit = strtoull(argv[1], NULL, 10) * 1024;
printf("Setting new stack size limit to %llu KB\n", (unsigned long long)new_limit / 1024);
set_stack_limit(new_limit);
printf("Updated stack size limit:\n");
print_stack_limit();
} else {
printf("Usage: %s <new_stack_size_in_KB>\n", argv[0]);
}
return 0;
}
with cc grow-stack.c && ./a.out 1024
prints:
Initial stack size limit:
Current stack size limit: soft=8388608, hard=18446744073709551615
Setting new stack size limit to 1024 KB
Updated stack size limit:
Current stack size limit: soft=1048576, hard=1048576
on ubuntu-arm64 and
Initial stack size limit:
Current stack size limit: soft=8388608, hard=18446744073709551615
Setting new stack size limit to 1024 KB
Updated stack size limit:
Current stack size limit: soft=1048576, hard=1048576
on alpine-arm64. So we can probably rely on the libc API without ensuring it (only) on linux-musl?
The reason was to ensure that pthread_attr_getstack
would return a realistic stack limit instead of just 128kB, so that the CPalThread::GetStackLimit can get it. But it seems that for primary thread, we can actually return the rlimit value minus the guard page size and we would not have to do the ensuring. Or it is even possible that current alpine now behaves the same as glibc, this stuff is from the early years of .NET core.
Or it is even possible that current alpine now behaves the same as glibc, this stuff is from the early years of .NET core.
cc @ayakael our resident Alpine Linux expert. 😉
I've just verified that on the latest Alpine, it still behaves the same way. The pthread_attr_getstack returns 128kB stack limit.
Description
Using netcoredbg (https://github.com/Samsung/netcoredbg) will crash with a segmentation fault when running on a musl-based Linux system.
Reproduction Steps
Expected behavior
Should be able to debug the program.
Actual behavior
The debugger crashes.
Regression?
No response
Known Workarounds
No response
Configuration
The issue is specific to the use of musl.
Other information
Backtrace from gdb:
The crash occurs in
EnsureStackSize
, which simply allocates 1.5MB on the stack and tries to write to it to ensure there is enough stack space available.Musl by default allocates smaller thread stack sizes than glibc does, but that's actually not the root issue in this case. This thread was created by .NET itself, and a fixed stack size (larger than the musl default) is requested when the thread is created.
The value of
g_defaultStackSize
which determines the requested stack size is set to a fixed value of 1.5MB on musl.This same size (
g_defaultStackSize
) is what is passed toEnsureStackSize
.So, a thread is created with a 1.5MB stack, and 13 stack frames deep, the code tries to verify that 1.5MB is still available on the stack, and it isn't because those 13 stack frames have already used some of it.
Based on https://github.com/dotnet/runtime/issues/9855#issuecomment-369389188, it sounds like
EnsureStackSize
is only intended to be used on the main thread, but the code that ensures this assumes thatInitialize
is called on the main thread, and that doesn't seem to be the case here.I don't know if netcoredbg is doing something wrong by calling into
Initialize
from a different thread, but it looks like the thread where this call is happening is theCordbRCEventThread::ThreadProc
thread.