dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License
15.38k stars 4.75k forks source link

Consider not using STACK_SIZE_PARAM_IS_A_RESERVATION anymore #45971

Open jhudsoncedaron opened 3 years ago

jhudsoncedaron commented 3 years ago

Description

We've been dealing with issues semi-periodically where someone will run the application server out of memory and hang or crash the application in interesting ways. In response (in order to ensure we can get a trace on any such crash) we've been systematically eliminating any ways memory allocation failures can bring the application down as these were either server hangs or meaningless crashes because some backgrounded task had no catchall at top level.

I think we've finally located the last one.

New managed threads are created in https://github.com/dotnet/runtime/blob/69e114c1abf91241a0eeecf1ecceab4711b8aa62/src/coreclr/vm/threads.cpp or possibly https://github.com/dotnet/runtime/blob/69e114c1abf91241a0eeecf1ecceab4711b8aa62/src/coreclr/vm/threads.cpp both of which use STACK_SIZE_PARAM_IS_A_RESERVATION .

The consequence of STACK_SIZE_PARAM_IS_A_RESERVATION is out of memory can in rare cases trigger an essentially randomly located stack overflow rather than a more reasonable OutOfMemoryException. As I said, we've been systematically tracking down and ensuring that OutOfMemoryException is handled. But we can't handle stack overflow exceptions from random places in the code.

Finally got sent this beauty:

Faulting application name: Cedaron.xxxxxxxxxxxxxx.exe, version: 10.0.1405.10004, time stamp: 0x5e597f86 Faulting module name: KERNELBASE.dll, version: 6.3.9600.19671, time stamp: 0x5e673d0b Exception code: 0xc00000fd Fault offset: 0x000000000000b3f7 Faulting process id: 0xde0 Faulting application start time: 0x01d6cda83754f5b2 Faulting application path: C:\Program Files\Cedaron xxxxxxxxxxxxxxxxxxx.exe Faulting module path: C:\Windows\system32\KERNELBASE.dll Report Id: 7d55039e-3bda-11eb-8106-005056bf56da Faulting package full name: Faulting package-relative application ID:

The application is 100% managed code, and it died with a native stack overflow. I would expect that stack memory is already allocated at thread creation, but it just isn't leading to the crashes being untraceable and unactionable. We can't even distinguish legit stackoverflows due to bugs in recursive functions from someone didn't give it enough memory again.

For the obvious reason, I can't include a small repro. Do you really want a memory bomb demo?

We are currently running net core 3.1, but the issue has existed for quite some time (potentially always existed) and still exists in trunk.

The only workaround I can imagine is hooking CreateThread and squashing the flag. Would be easier to build the runtime locally and remove the flag from the call sites.

Tangentially related to #8947 but fixing that will no longer help us.

danmoseley commented 3 years ago

@janvorli

janvorli commented 3 years ago

It seems that in majority of cases, the STACK_SIZE_PARAM_IS_A_RESERVATION is a good thing, as most threads would use just a small portion of the reserved stack space. If we stopped passing it, applications using e.g. many threadpool threads would crash with OOM even though they would never need to use that amount of memory. So I believe removing the STACK_SIZE_PARAM_IS_A_RESERVATION by default would hurt more than help. Moreover, on Unix, we don't have a way to control that, stack allocation is always a reservation (unless we would go and touch every stack page at the thread creation time). However, maybe we could add a new configuration setting that would cause the STACK_SIZE_PARAM_IS_A_RESERVATION to not to be passed in for the special cases like yours.

jhudsoncedaron commented 3 years ago

@janvorli Unix systems raise bus error instead of stack overflow when the stack memory can't be allocated. This makes the cases distinguishable. But if I turned off overcommit I would observe the memory being reserved immediately on clone() (not just the address space--this reserves space in the swap file, thus guaranteeing the demand can be met). But I'm pretty sure can't create thread in the threadpool is tolerated.