[Proposal] Change the default behavior for StackOverflowException

sharwell commented 9 years ago

I would like to propose an unspecified modification to the CLR which allows code to be executed in a manner where a StackOverflowException will not cause the process to terminate. To get the discussion started, here are some possibilities for alternative behavior:

Terminate the current thread.
Terminate the current AppDomain.
Provide an attribute like HandleProcessCorruptedStateExceptionsAttribute which allows specialized handlers to be added.
Provide a static event somewhere, which allows the exception handler to execute on a different thread.

The primary use case for this I've encountered is recursive-descent algorithms for creating and visiting parse trees, e.g. in ANTLR. In these cases, I am able to write all of the behavior for a background thread such that immediate termination of a thread does not result in an invalid execution state. For example, no operation on the thread calls Monitor.Enter. These algorithms are performance-sensitive, so stack depth checking throughout the code is not a desired solution. Providing the ability to handle an exceptional stack overflow situation without terminating the process (e.g. Visual Studio, if this was running as an extension) would be a much improved situation.

MattWhilden commented 9 years ago

tldr: We get this request a lot but we do not believe that it is a recoverable scenario. Further, making it "easy" to do the wrong thing seemed more harmful than tearing down from a dangerous situation.

While undergoing a stack overflow, your process is a hairs width from doing all sorts of Really, Really bad things(tm). For instance, causing any more stack to be consumed can start scribbling over heap space (since you’ve lost the hard guard page at the end of the stack). In addition as it’s basically an asynchronous exception, object state needs to be assumed trashed (how many user objects are really designed to operate correctly if their execution is terminated while manipulating state? And for value types it’s even worse, since their copy is non-atomic).

"Just re add the guard page and move on", you may say. However, it’s not possible to restore the guard page until after you’ve decided how you’re dispatching the exception so you can unwind some frames and do your work. When you get the exception, you’re already in the last page and you can’t restore the frame until you’ve taken stuff off. And you still have to run user code in the guard page to figure out how to handle it. Sure, that could be solved by ignoring filters during Stack Overflow, but you’re just building special case on top of special case at that point. (stuff gets really weird when you have managed->native transitions there as well). And what about any other SEH handlers that have registered on your thread?

For scenarios that want to run code beyond this exception, the path forward is "host clr from native" or better, "move the disruptive chunk of code to another process".

sharwell commented 9 years ago

While undergoing a stack overflow, your process is a hairs width from doing all sorts of Really, Really bad things™.

This is a rather bad argument. One of the CLR's greatest features is the way it uses processor exceptions for things like dynamic reallocation of the evaluation stack and omitting NullReferenceException checks. If we weren't a hair's width from Really, Really Bad Things™ we would all be bored.

how many user objects are really designed to operate correctly if their execution is terminated while manipulating state?

[Subjective content here:] This is the wrong question. The better question is whether or not developers are out there who could write objects that are designed for these scenarios. While individual programming languages are an ideal platform for eliminating features because they could be abused or are likely to be misunderstood, this is not that place. The CLR should only eliminate the ability to control application behavior if its own internal state would be corrupted and (potentially) unrecoverable, or if the implementation cost of a feature exceeds the benefits it would provide.

[Objective:] In this case, the CLR already has implemented behavior allowing applications to recover (in some manner) from a StackOverflowException, so whether or not that behavior is ideal it would be possible to simply change the default initial execution environment for a managed process to enable this handling instead of the default options used today.

MattWhilden commented 9 years ago

Okay I think we've made progress toward agreement. I was perhaps thrown off by the use of the word "default". It seems obvious to me that changing the out of the box experience isn't ideal. However, if we want to have a mode that is "I'm a developer who promises to think about these things in a serious way to my serious application," that's probably fine. If can agree to that then you can achieve your goals vai hosting the clr. Check out: https://msdn.microsoft.com/en-us/library/ms164395(v=vs.110).aspx

sharwell commented 9 years ago

If can agree to that then you can achieve your goals vai hosting the clr.

Except I can't do that from within a third-party extension to another application :frowning:

OtherCrashOverride commented 9 years ago

The solution to this is to change your code/algorithm from recursive to iterative. This gives you complete control over conditions of when and what happens when you exhaust your limits.

MattWhilden commented 9 years ago

As a library developer you cannot protect your consumers from using you improperly. If the high level of stack use in your library is overly problematic, it's likely that you'll need to do the work to re-implement things iteratively. Unfortunately, there's no magic here.

We've spent a great deal of effort over the years trying to make it possible for teams to be resilient these kinds of failures (hosting API changes, CriticalFinazerObjects etc etc). It's potentially possible that one could write a collection of objects that can handle these challenges but it's the classic "thar be dragons here". IMO: In essentially all cases it's better to re-architect your code than to expect this to work coherently.

MattWhilden commented 9 years ago

@sharwell Is any of that helpful/useful? Just discouraging? :-)

I'm somewhat at a loss for other good ideas. For what it's worth it spawned a bunch of hallway conversations... If you'd like, I can tag it as hard problem. Thoughts?

kangaroo commented 9 years ago

For what my $0.02 are worth, I would suggest only offering a configurable change to the API through the hosting API. That is to say, if someone wants to break the contract in their use case, they need to explicitly opt-in from the native side, and own the fallout from that.

Not that it resolves the 'hard problem' aspect, but I think it reaches a nice middle ground where people that are willing to say 'we'll break it and pay for it' can opt-in, but its non-default.

Thoughts?

MattWhilden commented 9 years ago

Typically our stance is "we don't light up features we don't intend to fully support." This generally entails thinking about things like: how likely are we to take patches to work around issues exposed by this feature? Will this feature incur maintenance costs because of it's interaction with current/as yet envisioned features? etc etc.

@kangaroo I believe that the hosting APIs already have enough surface to do more or less what is requested (but I'm no expert). However, @sharwell has another constraint in that he would prefer it exposed in an already running managed process.

FWIW other applications internally that try to have stackoverflow guarantees use stack probing. @sharwell indicated that this caused an unacceptable amount of overhead. That's a bit surprising to me but I have no reason to assume he hasn't profiled his scenario (or some prototype) before saying as much. :-)

sharwell commented 9 years ago

indicated that this caused an unacceptable amount of overhead

It's a bit more complicated than that. Much of the overhead can be avoided, but it would be nice to avoid all of the overhead and just let the guard page exception handler take care of the problem if and only if it becomes one.

A good solution for me would be changing the default policy for EClrFailure.FAIL_StackOverflow to be EPolicyAction.eRudeAbortThread.

jcdickinson commented 9 years ago

@sharwell: it would be nice to avoid all of the overhead and just let the guard page exception handler take care of the problem if and only if it becomes one.

How much overhead do you think 'pinging' (a single wasted read/write) ESP + sizeof(.maxstack) before each call would introduce? Basically every caller makes sure that the method being called won't run out of stack space, or every callee makes sure that it won't run out of stack space. You could possibly get the StackOverflowException one or two frames in advance, but at that point there is an immense probability that your app is going to encounter it anyway.

Edit: this is a change in behavior, possibly allow it as an option in e.g. AppDomainSetup.JitOptions. This would allow someone to cordon off 'user code' (plugins that might cause this exception), while skipping the check on their own code.

ayende commented 9 years ago

Is there at least a way to check what is the current stack depth / how much stack space remains in an efficient manner?

ayende commented 9 years ago

That would allow recursive algorithms to at least throw StackOverflowImmenientExcpetion

DemiMarie commented 9 years ago

Is it possible to change the behavior while managed code is executing from the current Unix hosting API? If so, it might be possible for managed code to P/Invoke the CLR's own hosting API.

Note: This could be Undefined Behavior – please let me know if it is.

masonwheeler commented 6 years ago

One of the worst things about StackOverflowException being unrecoverable is that it seems to make it also un-debuggable. If you get one of these in the Visual Studio debugger, it will inform you that you raised a SOE, but it doesn't provide a stack trace. Can we please at the very least fix whatever's causing that?

tannergooding commented 6 years ago

I've provided the picture above as a simple example.

The exception will be caught by a debugger and viewing either the Parallel Stacks or the Call Stack window will display the appropriate information.

However, the first place people may check ($exception.StackTrace) lists the stack trace as null. I agree with @masonwheeler that ensuring the exception stacktrace information is properly populated would be ideal.

ayende commented 6 years ago

In many cases, this is actually not working. It is common for this exception to actually kill the debugger.

Hibernating Rhinos Ltd

Oren Eini l CEO l Mobile: + 972-52-548-6969

Office: +972-4-622-7811 l Fax: +972-153-4-622-7811

On Fri, Nov 17, 2017 at 9:17 PM, Tanner Gooding notifications@github.com wrote:

[image: image] https://user-images.githubusercontent.com/10487869/32964520-5f7701c2-cb88-11e7-9281-8fc202b96775.png

I've provided the picture above as a simple example.

The exception will be caught by a debugger and viewing either the Parallel Stacks or the Call Stack window will display the appropriate information.

However, the first place people may check ($exception.StackTrace) lists the stack trace as null. I agree with @masonwheeler https://github.com/masonwheeler that ensuring the exception stacktrace information is properly populated would be ideal.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/dotnet/coreclr/issues/652#issuecomment-345339255, or mute the thread https://github.com/notifications/unsubscribe-auth/AAHIs8LEfwecEKhR1tfUIzIh6GfkRZcUks5s3dvIgaJpZM4D7TXI .

masonwheeler commented 6 years ago

@tannergooding Not sure what to say. This is not what happens to me, on a vanilla VS2017 installation, when I get a StackOverflowException. I end up with no stack trace, no ability to evaluate locals, and no useful information, and have to track down the problem manually.

pongba commented 6 years ago

I want to resurrect the thread by adding that, besides the debuggability issue mentioned above, what's more prominent for online services are that you would like to have the chance to dump some important information before the app crashes, you don't always get to hook up the debugger and repro the issue

madelson commented 4 years ago

Java has some level of support for this, right? Is there something .NET could learn from that? If not, what is wrong with what Java does in .NET's view?

jinek commented 3 years ago

I'm proposing to consider all "fatal" exceptions and get rid of the term "fatal" which in reality can not be determined.

We can only say, that the situation is fatal when OS closes the application and it's nothing to do here with exceptions

DemiMarie commented 3 years ago

The only inherent fatal fault is memory corruption, and that should not through an exception at all, but fail fast. Everything else (including out of memory and stack overflow) can and should be recoverable.

DemiMarie commented 3 years ago

For instance, an HTTP server can recover by aborting the request and returning 500 Internal Server Error to the user.

jcdickinson commented 3 years ago

@jinek unfortunately some of these exceptions are fatal for practical reasons.

For example, how would .Net recover from an out-of-memory situation? It has already attempted to do so with a full GC collection (which always happens before an OOM), and can't run finalizers (except those that have constrained execution) because they might (and generally will) attempt to allocate. This is why .Net allocates the OOM exception before running any user code, because in an OOM condition it wouldn't be able to.

Stack overflow is unlike OOM because you can computationally anticipate a stack overflow and throw an exception prior to actually reaching the stack overflow condition (effectively making it a AboutToStackOverflowException). If you actually let the stack overflow happen, you can't guarantee a successful unwinding because finally blocks may (and generally do) call other methods.

For instance, an HTTP server can recover by aborting the request and returning 500 Internal Server Error to the user.

This would require fallible allocators in .Net, similar to how malloc can return null. You'd have to write code like this everywhere:

var foo = new Bar();
if (foo != null) {
}

Even worse:

var foo = new Bar();
if (foo != null) {
  var ex = new Http500Exception();
  if (ex == null) { /* Now what? */ }
  throw ex;
}

DemiMarie commented 3 years ago

@jcdickinson The server would look like

try {
    await handleRequest();
} catch (Object e) {// I know this isn’t valid C#, but it is valid in IL
    log(e); // native code that will not allocate
    returnInternalServerError();
}

sharwell commented 3 years ago

For example, how would .Net recover from an out-of-memory situation?

This depends on the source of the allocation that triggers the exception. Application-specific logic can recover from OutOfMemoryException under controlled scenarios. Here's an example: https://github.com/microsoft/perfview/blob/5b70079cd02ce0c590eda8eb6a0e98993b3c2228/src/HeapDump/GCHeapDumper.cs#L1131-L1162

... It has already attempted to do so with a full GC collection (which always happens before an OOM), and can't run finalizers (except those that have constrained execution) because they might (and generally will) attempt to allocate. This is why .NET allocates the OOM exception before running any user code, because in an OOM condition it wouldn't be able to. ...

This generally makes the assumption that if one allocation fails, the next one will also fail. There are many cases where this assumption breaks down, some of which lead to error scenarios that an application could treat as recoverable:

Unwinding the stack could un-root and/or un-pin variables, which can free up memory.
Not all allocations are the same size. In the case of arrays, some allocations can be much larger than others.

In general, the view of OutOfMemoryException here is faulty for the same reason the CLR's current view of StackOverflowException is faulty: given that it is difficult to reason about these exceptions (and generally impossible without additional application-specific information) it incorrectly draws a conclusion that it is not possible to code an application that exposes behavior a user would expect following an internal condition that produced this exception.

jinek commented 3 years ago

how would .Net recover from an out-of-memory situation

try
{
var data = new byte[int.Max];
}
catch(OutOfMemoryException)
{
MessageBox("File is too big, Try to close other applications and run it again";
}

Now, please, show me how you recover from System.Exception ? (It's theoretical question, because as you don't know the type of exception here I can bring up my own type which meaning would be to break your idea of recovery)

jinek commented 3 years ago

you can't guarantee a successful unwinding

We can not guarantee success execution of any catch:

catch(FormatException)
{
  return ((object)null).ToString() + "Why would we think here will be no exceptions?";
}

jinek commented 3 years ago

you can computationally anticipate a stack overflow

Heap and stack historically grew in opposite directions to be flexible on its sizes. Still, we know the size of the stack, we have system functions to check the size, we can throw SOException before the overflow allocation happens. Here is recovery example:

        private void CanNotRecoverMethod()
        {
          LocalVarToKeepSpaceForMessageBox stackReserveDontUseIt;
            try
            {
                DoFileReadingWithStackOverflow();
            }
            catch (StackoverflowException)
            {
                MessageBox(
                    "Even what is the problem that catch can not be executed? Let it throw as for any other exception");
            }
        }

        private void CanRecoverMethod()
        {
            try
            {
                CanNotRecoverMethod();
            }
            catch (StackOverflowException)
            {
                MessageBox("File is too big, try to split it using zip");
            }
        }

jcdickinson commented 3 years ago

MessageBox("File is too big, Try to close other applications and run it again";

MessageBox allocates. That's the problem. How would it allocate when there is no more memory to allocate? It would re-throw OutOfMemoryException.

DemiMarie commented 3 years ago

@jcdickinson by the time MessageBox gets to run, the stack will have unwound, allowing the garbage collector to reclaim memory.

jinek commented 3 years ago

@jcdickinson by the time MessageBox gets to run, the stack will have unwound, allowing the garbage collector to reclaim memory.

Only not garbage collector, but stack will get unwind (GC works for heap only).

Still, what I'm saying is that you can never guarantee success of catch block. Never for any exception! Still because of this we don't deny the catch block, do we?

DemiMarie commented 3 years ago

@jcdickinson by the time MessageBox gets to run, the stack will have unwound, allowing the garbage collector to reclaim memory.

Only not garbage collector, but stack will get unwind (GC works for heap only).

Still, what I'm saying is that you can never guarantee success of catch block. Never for any exception! Still because of this we don't deny the catch block, do we?

GC is for heap only, but the stack unwinding will remove some GC roots.

jinek commented 3 years ago

I think this topics are very relevant https://github.com/dotnet/runtime/issues/45256

GSPP commented 2 years ago

I can confirm that stack overflow debugging does not work, as described in this thread. RuntimeHelpers.EnsureSufficientExecutionStack seems to do nothing in my case. I'm currently in a very difficult situation since I have essentially no way of proceeding to debug.

Some ideas on how to make this better (includes ideas collected from other contributors):

Add an opt-in 64 KB guard page region. This region would allow reasonably safe execution in case of stack overflow.
Add an "I don't care whatsoever" mode that just tries to limp along as far as possible ignoring all safety principles.
Add an opt-in spare thread that can be activated in case of stack overflow. That thread would take care to at least write the stack out to the console or to a file.
As stated above, fix the debugger. But that's of no use in production.
Print the stack to the console. Apparently, this feature exists but it never activates for me. I sometimes don't even get the "Stack overflow" print message.
Use an "alternate stack" as suggested in https://github.com/dotnet/runtime/issues/40302#issuecomment-1003840882.
Add an "emergency mode" that probes for remaining stack space at each function call. If it falls below 64LB, throw. This overhead would be tolerable as part of an emergency mode useful for production troubleshooting and unit testing.
Terminate the current thread to allow the process to keep limping along ("emergency mode" only). (Suggestion from OP.)
Provide an attribute like HandleProcessCorruptedStateExceptionsAttribute which allows specialized handlers to be added. (Suggestion from OP.)
Provide a static event somewhere, which allows the exception handler to execute on a different thread. (Suggestion from OP.)

It seems that (1) would be a great way to get a production-friendly solution. If a production crash happens, the developer would enable the special mode to obtain actionable logging. And most importantly, this would fix the outage quickly!

GSPP commented 2 years ago

Another idea for an emergency mode: At each function call, probe remaining stack space. If it falls below 64 KB, throw. This overhead would be tolerable as part of an emergency mode useful for production troubleshooting and unit testing.

dotnet / runtime

[Proposal] Change the default behavior for StackOverflowException #4113