dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License

[Request] Clarification and info on COMPlus_EnableAlternateStackCheck #97308

Closed dqwork closed 8 months ago

dqwork commented 8 months ago

Hi Dotnet team,

We've been wrestling with an interesting issue for the last few weeks, and whilst I think we've found a solution, it's not a well-documented one and there are still some unknowns. I'd like to get some more information about it if possible.

Original Issue

OS: Red Hat Enterprise Linux Server release 7.9 (Maipo) Dotnet Info

Issue: When a null reference exception was thrown, the application would terminate immediately with an error about stack smashing, and the core dump that was created mentioned a segmentation fault.

Likely linked Issues: https://github.com/dotnet/runtime/issues/12891

Solutions I've found (this is what I was hoping for more information on)

dqwork commented 8 months ago

I found this PR which I think has cleared things up somewhat (I would appreciate any confirmation around this though)

https://github.com/dotnet/coreclr/pull/25972/files

It seems the env variable was introduced to allow some control for environments where the original fix didn't work. The PR discusses WSL - but I wonder if this also applies to RH7 (and hence why we see the difference in behavior).

Anyway, any information anyone could provide would be really useful to help us paint the full picture.

jkotas commented 8 months ago

cc @janvorli

janvorli commented 8 months ago

@dqwork let me try to clarify it in detail, please let me know if it is sufficient for you.

The .NET handler assumes that it is running on the alternate stack, because it registered its SIGSEGV handler with that option, unless COMPlus_EnableAlternateStackCheck is set. When the variable is set, the handler attempts to detect whether it is running on an alternate stack or not and to behave correctly in both cases (with the exception of a stack overflow when it is called on the original stack of the thread, where it has to live with the fact that it may crash). When the .NET SIGSEGV handler executes, then

When there is another SIGSEGV handler registered by some other component in the process, then it interacts with .NET handler depending on which of those two handlers was registered earlier.
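To illustrate why registration order matters, here is a minimal sketch of the general POSIX mechanism only. It is not the runtime's code, and the names `early_handler`, `late_handler` and `run_chain_demo` are hypothetical. It uses SIGUSR1 instead of SIGSEGV so that it can be run safely. The kernel only knows about the most recently registered action; an earlier handler runs only if the later one saved it from `sigaction` and calls it explicitly.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE   /* expose siginfo_t/SA_SIGINFO under strict -std modes */
#endif
#include <assert.h>
#include <signal.h>
#include <stddef.h>
#include <string.h>

static volatile sig_atomic_t g_order;   /* digits record the call order */
static struct sigaction g_previous;     /* action that was replaced by late_handler */

/* Stands in for the component that registered first. */
static void early_handler(int sig, siginfo_t *info, void *context)
{
    (void)sig; (void)info; (void)context;
    g_order = g_order * 10 + 1;
}

/* Stands in for the component that registered second; it chains to
   whatever was installed before it. */
static void late_handler(int sig, siginfo_t *info, void *context)
{
    g_order = g_order * 10 + 2;
    if (g_previous.sa_flags & SA_SIGINFO) {
        if (g_previous.sa_sigaction != NULL)
            g_previous.sa_sigaction(sig, info, context);
    } else if (g_previous.sa_handler != SIG_DFL &&
               g_previous.sa_handler != SIG_IGN) {
        g_previous.sa_handler(sig);
    }
}

static int install(void (*fn)(int, siginfo_t *, void *), struct sigaction *old)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = fn;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGUSR1, &sa, old);
}

/* Registers both handlers, raises the signal once and returns the call
   order as a number, or -1 on failure. */
int run_chain_demo(void)
{
    g_order = 0;
    if (install(early_handler, NULL) != 0)
        return -1;
    if (install(late_handler, &g_previous) != 0)
        return -1;
    /* sigaction handed back the earlier registration. */
    assert(g_previous.sa_sigaction == early_handler);
    if (raise(SIGUSR1) != 0)
        return -1;
    return g_order;   /* 21: late handler first, then the chained early one */
}
```

Because only the last registered `sa_flags` apply, a chained handler runs on whatever stack the outer handler was given. If the outer handler was registered without `SA_ONSTACK`, the inner one cannot safely assume it is on an alternate stack.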

dqwork commented 8 months ago

Thank you @janvorli for that detailed answer. I really appreciate it.

The fix that was introduced to try to handle this seemed not to work on WSL, and so that environment variable (COMPlus_EnableAlternateStackCheck) was introduced as a bit more of a blunt tool for cases where some Linux API couldn't be relied upon (correct me if I'm wrong there). Have you heard any reports of other Linux distros that didn't support the original fix? It seems RHEL7 has the same issue as WSL, but I'm always wary when I'm the only one reporting that.

My hope is that I can hunt down some fairly solid confirmation that the environment variable isn't required on RHEL8, but I'm a little unsure where I'd get that.

janvorli commented 8 months ago

I would be very surprised if the RHEL 7/8 had the same issue as WSL 1. The issue was that while the handler was executing on an alternate stack, the uc_stack member of the ucontext_t passed to the signal handler was filled with zeros instead of containing valid values of ss_sp, ss_size and ss_flags. So maybe in your case, the other SIGSEGV handler is executed first and fiddles with these values in a way that prevents us from correctly detecting whether we are running on an alternate stack or not.

If you'd be able to run your app under lldb or gdb, you can set a breakpoint at the .NET sigsegv_handler function and then check the contents of the uc_stack in the ucontext_t passed to the handler and also see if the current stack pointer register (rsp in case of x64) is on the original stack or somewhere else.
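The same check can also be approximated from inside a process. Below is a minimal sketch, not the runtime's implementation, that compares the address of a local variable in the handler against the range reported in `uc_stack`. The names `is_on_alternate_stack`, `probe_handler` and `probe_once`, the 64 KiB stack size, and the use of SIGUSR1 in place of SIGSEGV are all assumptions made for illustration.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE   /* expose sigaltstack/ucontext_t under strict -std modes */
#endif
#include <assert.h>
#include <signal.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <ucontext.h>

#define ALT_STACK_BYTES (64 * 1024)   /* arbitrary size chosen for this sketch */

static volatile sig_atomic_t g_handler_ran;
static volatile sig_atomic_t g_on_alt_stack;

/* Returns 1 if the current frame lies inside the range described by
   uc_stack, 0 otherwise (including when uc_stack is zeroed or disabled). */
static int is_on_alternate_stack(const ucontext_t *uc)
{
    char marker = 0;   /* a local, so its address is on the stack in use */
    uintptr_t here = (uintptr_t)&marker;
    uintptr_t lo = (uintptr_t)uc->uc_stack.ss_sp;
    uintptr_t hi = lo + uc->uc_stack.ss_size;
    return (uc->uc_stack.ss_flags & SS_DISABLE) == 0 && lo <= here && here < hi;
}

static void probe_handler(int sig, siginfo_t *info, void *context)
{
    (void)sig; (void)info;
    g_on_alt_stack = is_on_alternate_stack((const ucontext_t *)context);
    g_handler_ran = 1;
}

/* Installs an alternate stack, registers probe_handler for SIGUSR1 with or
   without SA_ONSTACK, raises the signal once and reports what was detected.
   Returns 1 (alternate stack), 0 (original stack) or -1 on setup failure. */
int probe_once(int use_sa_onstack)
{
    static void *alt_mem;   /* allocated once, deliberately never freed */
    if (alt_mem == NULL && (alt_mem = malloc(ALT_STACK_BYTES)) == NULL)
        return -1;

    stack_t ss;
    memset(&ss, 0, sizeof ss);
    ss.ss_sp = alt_mem;
    ss.ss_size = ALT_STACK_BYTES;
    if (sigaltstack(&ss, NULL) != 0)
        return -1;

    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = probe_handler;
    sa.sa_flags = SA_SIGINFO | (use_sa_onstack ? SA_ONSTACK : 0);
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGUSR1, &sa, NULL) != 0)
        return -1;

    g_handler_ran = 0;
    if (raise(SIGUSR1) != 0)
        return -1;
    assert(g_handler_ran);   /* raise() returns only after the handler has run */
    return g_on_alt_stack;
}
```

On a typical Linux kernel one would expect `probe_once(1)` to report the alternate stack and `probe_once(0)` the original stack. In an environment that zeroes `uc_stack`, as described above for WSL 1, the check would report the original stack in both cases.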

dqwork commented 8 months ago

Thanks for the info - I may try debugging that on RHEL7 to see if I can find what causes it. RHEL8 has no issue, which is what made me wonder if there was some kind of OS-level issue that was fixed between the versions.

Anyway, I really appreciate you taking the time to give me some info; it's been really useful.

janvorli commented 8 months ago

No problem, I am happy to help.