dqwork closed 8 months ago
I found this PR, which I think clears things up somewhat (though I would appreciate any confirmation):
https://github.com/dotnet/coreclr/pull/25972/files
It seems the environment variable was introduced to allow some control in environments where the original fix didn't work. The PR discusses WSL, but I wonder if this also applies to RH7 (which would explain the difference in behavior we see).
Anyway, any information anyone could provide would be really useful in helping us paint the full picture.
cc @janvorli
@dqwork let me try to clarify it in detail, please let me know if it is sufficient for you.
The .NET handler assumes that it is running on the alternate stack, because it registers its SIGSEGV handler with that option, unless COMPlus_EnableAlternateStackCheck is set. When that variable is set, the handler instead attempts to detect whether it is running on an alternate stack or not and to behave correctly in both cases (with the exception of a stack overflow when it is called on the original stack of the thread, where it has to live with the fact that it may crash). When the .NET SIGSEGV handler executes, then
When there is another SIGSEGV handler registered by some other component in the process, it interacts with the .NET handler depending on which of the two handlers was registered earlier.
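To make the ordering point concrete, here is a minimal C sketch. It is not the .NET runtime's code. One handler is registered first; a second handler is then registered with `SA_ONSTACK` on an alternate stack, and it saves the earlier handler so it can forward to it. The names `other_handler`, `our_handler`, `run_chain_demo` and the 64 KiB stack size are made up for illustration. The signal is delivered with `raise(SIGSEGV)` rather than a real fault so that execution can continue after the handlers return.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <signal.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define ALT_STACK_SIZE (64 * 1024)   /* illustrative size, not a .NET value */

static char *g_alt_base;                         /* base of our alternate stack */
static struct sigaction g_previous;              /* handler registered before ours */
static volatile sig_atomic_t g_ours_on_alt = -1; /* 1 if our handler ran on the alt stack */
static volatile sig_atomic_t g_previous_called = 0;

/* Returns non-zero if addr lies inside the alternate stack we allocated. */
static int on_alt_stack(const void *addr) {
    uintptr_t p = (uintptr_t)addr;
    uintptr_t base = (uintptr_t)g_alt_base;
    return p >= base && p < base + ALT_STACK_SIZE;
}

/* Stand-in for a handler installed earlier by some other component. */
static void other_handler(int sig) {
    (void)sig;
    g_previous_called = 1;
}

/* Stand-in for the later-registered handler: record which stack we are on,
   then forward to whatever was registered before us. */
static void our_handler(int sig, siginfo_t *info, void *context) {
    char marker;  /* a local variable, so its address is on the current stack */
    g_ours_on_alt = on_alt_stack(&marker);

    if (g_previous.sa_flags & SA_SIGINFO) {
        g_previous.sa_sigaction(sig, info, context);
    } else if (g_previous.sa_handler != SIG_DFL &&
               g_previous.sa_handler != SIG_IGN) {
        g_previous.sa_handler(sig);
    }
}

/* Returns 1 if our handler ran on the alternate stack and the earlier
   handler was reached through it, 0 if not, -1 on setup failure. */
int run_chain_demo(void) {
    stack_t ss;
    struct sigaction other, ours;

    g_alt_base = malloc(ALT_STACK_SIZE);
    if (g_alt_base == NULL) return -1;

    memset(&ss, 0, sizeof ss);
    ss.ss_sp = g_alt_base;
    ss.ss_size = ALT_STACK_SIZE;
    if (sigaltstack(&ss, NULL) != 0) return -1;

    /* 1. The "other component" registers first, without SA_ONSTACK. */
    memset(&other, 0, sizeof other);
    other.sa_handler = other_handler;
    sigemptyset(&other.sa_mask);
    if (sigaction(SIGSEGV, &other, NULL) != 0) return -1;

    /* 2. Our handler registers second, asks for the alternate stack,
          and keeps the old action so it can chain to it. */
    memset(&ours, 0, sizeof ours);
    ours.sa_sigaction = our_handler;
    ours.sa_flags = SA_SIGINFO | SA_ONSTACK;
    sigemptyset(&ours.sa_mask);
    if (sigaction(SIGSEGV, &ours, &g_previous) != 0) return -1;

    raise(SIGSEGV);             /* delivered before raise() returns */
    assert(g_ours_on_alt != -1);

    return g_ours_on_alt == 1 && g_previous_called == 1;
}
```

Because the kernel only invokes the most recently registered action, the `SA_ONSTACK` flag that takes effect is the one set by whichever component registered last. If the order were reversed, `other_handler` would be entered first on the thread's original stack, and any handler it forwarded to would run there too.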
Thank you @janvorli for that detailed answer. I really appreciate it.
The fix that was introduced to try to handle this seemed not to work on WSL, and so that environment variable (`COMPlus_EnableAlternateStackCheck`) was introduced to be a bit more of a blunt tool for cases where some Linux API couldn't be relied upon (correct me if I'm wrong there).
Have you heard any reports of other Linux distros that didn't support the original fix? It seems RHEL7 has the same issue as WSL but I'm always wary when I'm the only one reporting that.
My hope is that I can hunt down some fairly solid confirmation that the environment variable isn't required on RHEL8, but I'm a little unsure where I'd get that.
I would be very surprised if the RHEL 7/8 had the same issue as WSL 1. The issue was that while the handler was executing on an alternate stack, the `uc_stack` member of the `ucontext_t` passed to the signal handler was filled with zeros instead of containing valid values of `ss_sp`, `ss_size` and `ss_flags`. So maybe in your case, the other SIGSEGV handler is executed first and fiddles with these values in a way that prevents us from correctly detecting whether we are running on an alternate stack or not.
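As a rough illustration of the fields involved, the C sketch below installs a handler on an alternate stack and, inside the handler, copies out the `uc_stack` member of the `ucontext_t` and checks whether the handler's own stack address falls inside the range it describes. This is an assumption-laden sketch, not the runtime's detection logic: `stack_report`, `inspect_handler`, `run_uc_stack_demo` and the 64 KiB size are invented for the example, and the signal is delivered with `raise(SIGSEGV)` instead of a real fault.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <signal.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <ucontext.h>

#define ALT_STACK_SIZE (64 * 1024)   /* illustrative size */

/* What the handler saw in uc_stack, plus where it was actually running. */
struct stack_report {
    void  *ss_sp;          /* uc_stack.ss_sp as passed to the handler */
    size_t ss_size;        /* uc_stack.ss_size */
    int    ss_flags;       /* uc_stack.ss_flags */
    int    handler_inside; /* 1 if the handler's stack address is in that range */
};

static struct stack_report g_report;
static volatile sig_atomic_t g_handled = 0;

static void inspect_handler(int sig, siginfo_t *info, void *context) {
    const ucontext_t *uc = context;
    char marker;  /* lives on whichever stack the handler is running on */
    uintptr_t here = (uintptr_t)&marker;
    uintptr_t base = (uintptr_t)uc->uc_stack.ss_sp;
    (void)sig;
    (void)info;

    g_report.ss_sp = uc->uc_stack.ss_sp;
    g_report.ss_size = uc->uc_stack.ss_size;
    g_report.ss_flags = uc->uc_stack.ss_flags;
    /* If uc_stack is all zeros (the WSL 1 symptom described above),
       this check cannot succeed even when we are on the alternate stack. */
    g_report.handler_inside =
        base != 0 && here >= base && here < base + uc->uc_stack.ss_size;
    g_handled = 1;
}

/* Returns 0 and fills *out if the handler ran, -1 otherwise. */
int run_uc_stack_demo(struct stack_report *out) {
    stack_t ss;
    struct sigaction sa;
    void *alt = malloc(ALT_STACK_SIZE);
    if (alt == NULL) return -1;

    memset(&ss, 0, sizeof ss);
    ss.ss_sp = alt;
    ss.ss_size = ALT_STACK_SIZE;
    if (sigaltstack(&ss, NULL) != 0) return -1;

    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = inspect_handler;
    sa.sa_flags = SA_SIGINFO | SA_ONSTACK;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGSEGV, &sa, NULL) != 0) return -1;

    raise(SIGSEGV);   /* handler runs before raise() returns */
    if (!g_handled) return -1;

    *out = g_report;
    return 0;
}
```

Calling `run_uc_stack_demo` from a small `main` and printing the fields of the report on both RHEL 7 and RHEL 8 would be one way to compare what each kernel places in `uc_stack`. On a system that fills the structure in, `ss_sp` and `ss_size` should describe the registered alternate stack and `handler_inside` should be 1.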
If you'd be able to run your app under lldb or gdb, you can set a breakpoint at the .NET `sigsegv_handler` function and then check the contents of the `uc_stack` in the `ucontext_t` passed to the handler and also see if the current stack pointer register (rsp in case of x64) is on the original stack or somewhere else.
Thanks for the info. I may try debugging that on RHEL7 to see if I can find what causes it. RHEL8 has no issue, which is what made me wonder whether there was some kind of OS-level issue that was fixed between the versions.
Anyway, I really appreciate you taking the time to give me some info; it's been really useful.
No problem, I am happy to help.
Hi Dotnet team,
We've been wrestling with an interesting issue for the last few weeks, and whilst I think we've found a solution, it's not a well-documented one and there are still some unknowns. I'd like to get some more information around it if possible.
Original Issue
OS: Red Hat Enterprise Linux Server release 7.9 (Maipo) Dotnet Info
Issue: When a null reference exception was thrown, the application would terminate immediately with an error about stack smashing, and the core dump that was created mentions a segmentation fault.
Likely linked Issues: https://github.com/dotnet/runtime/issues/12891
Solutions I've found (this is what I was hoping for more information on)
I found here https://lists.apache.org/thread/yts5os3r5mwy0lqmh47d2cg76v45o3c7 a suggestion to run with this environment variable defined and set to 1: `COMPlus_EnableAlternateStackCheck=1`. There is also mention of a fix in this PR https://github.com/dotnet/coreclr/pull/25196/files but I can't see any mention of this env variable.
Running on RH8 (Red Hat Enterprise Linux release 8.9 (Ootpa)): we do not see the issue when running on RH8, and we are using the same version of the dotnet runtime (however, AspNetCore.App is not installed on this environment of ours). Any thoughts on what could be the reason behind this?