dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License
15.46k stars 4.76k forks source link

Allow ignoring unhandled exceptions in UnhandledException event handler #42275

Closed jkotas closed 7 months ago

jkotas commented 4 years ago

Background and Motivation

Scenarios like designers or REPLs that host user provided code are not able to handle unhandled exceptions thrown by the user provided code. Unhandled exceptions on finalizer thread, threadpool threads or user created threads will take down the whole process. This is not desirable experience for these type of scenarios.

The discussion that lead to this proposal is in https://github.com/dotnet/runtime/issues/39587

Proposed API

Allow ignoring unhandled exceptions on threads created by the runtime from managed UnhandledException handler:

 namespace System
 {
     public class UnhandledExceptionEventArgs
     {
         // Existing property. Always true in .NET Core today. The behavior will be changed to:
+        // - `false` for exceptions that can be ignored (ie thread was created by the runtime)
+        // - `true` for exceptions that cannot be ignored (ie foreign thread or other situations when it is not reasonably possible to continue execution)
         // This behavior is close to .NET Framework 1.x behavior of this property.
         public bool IsTerminating { get; }

+        // The default value is false. The event handlers can set it to true to make
+        // runtime ignore the exception. It has effect only when IsTerminating is false.
+        // The documentation will come with usual disclaimer for bad consequences of ignoring exceptions
+        public bool Ignore { get; set; }
     }
 }

Usage Examples

AppDomain.CurrentDomain.UnhandledException += (sender, e) =>
{
    if (DesignMode && !e.IsTerminating)
    {
        DisplayException(e.ExceptionObject);
        e.Ignore = true;
    }
};

Alternative Designs

Unmanaged hosting API that enables this behavior. (CoreCLR has poorly documented and poorly tested configuration option for this today.)

Similar prior art:

Risks

This API can be abused to ignore unhandled exceptions in scenarios where it is not warranted.

VSadov commented 7 months ago

Here is what I would propose for the error handler if we want to pass all that could be useful.

There are many optionals. That is just the nature of this API, since there are many reasons to fail - FailFast, HW faults, exceptions, asserts, ... I am not sure if stashing arguments in a struct will make it much better.

// returns false if default handler is not desired.
extern "C" DLL_EXPORT bool __cdecl FatalErrorHandler(
         uint32_t exitCode,                       // Ex: COR_E_FAILFAST, COR_E_STACKOVERFLOW,...
         void* address = NULL,
     #ifdef WINDOWS
         PEXCEPTION_POINTERS pExceptionInfo = NULL,
     #else
         int sig = 0,
         siginfo_t* info = NULL,
         void* ucontext = NULL,
     #endif
         LPCWSTR argExceptionString=NULL,         // stack trace
         LPCWSTR message=NULL,                    // Ex: "SSE and SSE2 processor support required."
         LPCWSTR detailMessage=NULL               // Ex: "detailMessage that is passed to Assert(condition, message, detailMessage)"
         LPCWSTR errorSource=NULL                 // Ex: "Assertion Failed"
)  
{
      // native implementation with signal handler restrictions
}

As for what to do with signals if there are previous handlers - our current policy is to:

https://github.com/dotnet/runtime/blob/c1f91610c53681c7069ae1c1502a88e2644f2543/src/coreclr/nativeaot/Runtime/unix/HardwareExceptions.cpp#L546

  1. Take a look ourselves, in case we know what that is. If handled, this is not a crash and we return.
  2. Call previous handler, if exists
  3. Crash dump
  4. Return

Since our crash dumping happens after calling the previous helper, I think the overridden helper should run after as well. Basically - we should call the helper between steps 2 and 3, and if it returns false, we do not do 3.

jkotas commented 7 months ago

I am not sure if stashing arguments in a struct will make it much better.

I think we want to have a scheme where we can add more pieces of data in future. Passing the information as individual arguments in the signature makes it difficult. Instead, I think the argument should be a pointer to a struct and the first field of the struct should indicate the size. It will allow us to add more as necessary without breaking changes.

Some more comments about the details:

AaronRobinsonMSFT commented 7 months ago
     #ifdef WINDOWS
         PEXCEPTION_POINTERS pExceptionInfo = NULL,
     #else
         int sig = 0,
         siginfo_t* info = NULL,
         void* ucontext = NULL,
     #endif

The signature should be platform agnostic.

An example I would expect would be the following. Note the update return value is BOOL as opposed to C++ or C99's bool.

struct ExceptionData
{
    size_t size;
    // ...
    //  I would avoid Win32 data structures as fields.
    // Instead users can cast as needed, that should be contained within documentation.
    // ...
};

// returns false if default handler is not desired.
extern "C" DLL_EXPORT BOOL __cdecl FatalErrorHandler(int32_t error_code, struct ExceptionData* data);
VSadov commented 7 months ago

I think we want to have a scheme where we can add more pieces of data in future. Passing the information as individual arguments in the signature makes it difficult.

Passing as individual arguments was an attempt to leave door open to adding more parameters. I did not think of a struct including its size, that would work too.

VSadov commented 7 months ago

address - where does this address point to?

For faults it is the address of a fault. We have other cases (like FailFast) where we just call GetCurrentIP().

https://github.com/dotnet/runtime/blob/c1f91610c53681c7069ae1c1502a88e2644f2543/src/coreclr/vm/eepolicy.h#L55

It is basically a location in some general sense responsible for the crash.

VSadov commented 7 months ago

signo is part of siginfo_t so it is not strictly necessary to pass it as a separate field. It can be read from siginfo_t.

We have cases where we specify signo, but not the siginfo_t. https://github.com/dotnet/runtime/blob/c1f91610c53681c7069ae1c1502a88e2644f2543/src/coreclr/nativeaot/Runtime/unix/PalCreateDump.cpp#L447

Not sure if all the uses of this functionality make sense. I just tried to fit into existing patterns, thinking most are there for a reason.

VSadov commented 7 months ago

argExceptionString - if it is a stacktrace, should it be rather called stacktrace? Also, I am wondering whether it would be better to provide the stacktrace via enumerator-like callback.

HandleFatalError calls it argExceptionString. I think the only user is the Environment_FailFast and it fills it with what someException.ToString() produces. https://github.com/dotnet/runtime/blob/c1f91610c53681c7069ae1c1502a88e2644f2543/src/coreclr/classlibnative/bcltype/system.cpp#L168

So this is typically a stack trace, but ToString() can be overridden to make it not a stack trace.

Also that calls managed code, but at some point in the process of crashing we cannot call managed code any more. I guess we could still obtain and stash the stack trace as a bunch of native strings and give user an iterator, but if the most use cases will just put that all in one message or a binary dump anyways, maybe a string is good enough?

VSadov commented 7 months ago

Note the update return value is BOOL as opposed to C++ or C99's bool.

Good catch!

VSadov commented 7 months ago

So the signature could be:

// returns false if default handler is not desired.
extern "C" DLL_EXPORT BOOL __cdecl FatalErrorHandler(int32_t error_code, struct ExceptionData* data);

Now we need to settle on how ExceptionData looks like

VSadov commented 7 months ago

@AaronRobinsonMSFT Re: I would avoid Win32 data structures as fields.

Is that about PEXCEPTION_POINTERS? I think it would be ok on Windows.

As for LPWSTR - yes, we need something else. Perhaps just char* on Unix, or maybe on both Windows and Unix? UTF-8?

AaronRobinsonMSFT commented 7 months ago

Is that about PEXCEPTION_POINTERS? I think it would be ok on Windows.

I agree, but then the declarations become muddled and we end up requiring people to define complex structures. If we instead make the data structures simple the declarations are easy to copy/paste and users can simply cast in the implementation if needed. One could imagine something akin to the following:

struct ExceptionData
{
    size_t size;
    void* Context; // Cast to PEXCEPTION_POINTERS on Windows or ucontext_t* on non-Windows.
    // Define other fields and how they should be used.
};
jkotas commented 7 months ago

int32_t error_code

What kind of error_code is this?

Cast to PEXCEPTION_POINTERS on Windows or ucontext_t* on non-Windows.

EXCEPTION_POINTERS is bundle of EXCEPTION_RECORD and CONTEXT. It corresponds to siginfo_t and ucontext_t. We may want to split it into two fields like this:

struct ExceptionData
{
...
    void* info; // Cast to PEXCEPTION_RECORD on Windows or siginfo_t* on non-Windows.
    void* ucontext; // Cast to PCONTEXT on Windows or ucontext_t* on non-Windows.
...
};

ExceptionData

We should call it FatalErrorData (or FatalErrorInfo) to match SetFatalErrorHandler.

VSadov commented 7 months ago

int32_t error_code

What kind of error_code is this?

The error codes are from: C:\Program Files (x86)\Windows Kits\NETFXSDK\4.8\Include\um\CorError.h

or, in PAL: https://github.com/dotnet/runtime/blob/2e75aac8dbbae5420a93f6fbe9568812dbeb3472/src/coreclr/pal/prebuilt/inc/corerror.h

Ex: COR_E_FAILFAST, COR_E_STACKOVERFLOW

jkotas commented 7 months ago

I would call it int32_t hresult then (to match Exception.HResult that is the same kind of error code).

VSadov commented 7 months ago

What do we do with strings, exception strings in particular?

if hostfxr as an example, the strings should be const pal::char_t*

         const pal::char_t* argExceptionString=NULL,         // whatever exception.ToString() produced.
         const pal::char_t* message=NULL,                    // Ex: "SSE and SSE2 processor support required."
         const pal::char_t* detailMessage=NULL               // detailMessage that is passed to Assert(condition, message, detailMessage)
         const pal::char_t* errorSource=NULL                 // Ex: "Assertion Failed"

The most contentious is argExceptionString:

I think we can only provide the stack when this was an unhandled managed exception and only as much as we can get from new StackTrace (Exception e).

At one extreme we could capture a snapshot of the stack and build entire infrastructure similar to StackTrace/StackFrame on top of that (modulo System.Reflection.MethodBase GetMethod()). On the other hand we can say that what StackTrace.ToString produces is parseable enough and give that out as one string.

jkotas commented 7 months ago

if hostfxr as an example, the strings should be const pal::char_t*

Yeah, char16_t vs char_t is pain to deal with. char_t is better for portability, but it may require conversions. Do you have an opinion?

The most contentious is argExceptionString: is it worth including?

I think it depends on the handler. Some handlers may want to get more detailed information (even if it means that secondary crashes can occur while gathering this extra information), some may want to get minimal information, some may want to get more detailed information only for certain types of crashes. It makes me think that we may want to have some sort of callback-based pattern to optionally gather the more detailed information. If we do that, we can start with something very minimal and evolve it based on feedback.

VSadov commented 7 months ago

Yeah, char16_t vs char_t is pain to deal with. char_t is better for portability, but it may require conversions. Do you have an opinion?

I'd prefer char_t. For portability reasons and similarity with hosting APIs.

Unlike old APIs influenced by COM and Win32, hostfxr is relatively new and some thinking about portability must have been done. I am not aware of const char_t* in its APIs causing any friction.
Crash handler is not exactly a hosting component, but in a way it is related to app's life cycle. - Another reason I looked at hosting for API shape examples.

AaronRobinsonMSFT commented 7 months ago

I'd prefer char_t. For portability reasons and similarity with hosting APIs.

I don't think this is a good idea. The hosting APIs have very different UX and rules. The hosting APIs are expecting strings that are created by the user and provided. This API is entirely in the domain of runtime provided strings, so we should not introduce any additional munging in that space and provide what we have. char16_t is far more preferable to yet another set of PAL definitions that users will need to manage.

VSadov commented 7 months ago

I don't think this is a good idea. The hosting APIs have very different UX and rules. The hosting APIs are expecting strings that are created by the user and provided. This API is entirely in the domain of runtime provided strings, so we should not introduce any additional munging in that space and provide what we have. char16_t is far more preferable to yet another set of PAL definitions that users will need to manage.

Makes sense. I did notice that unlike hosting, the strings go in a different direction (from runtime to the handler).

By char16_t - do you mean on both Windows and Unix? Any thoughts about encoding?

AaronRobinsonMSFT commented 7 months ago

By char16_t - do you mean on both Windows and Unix? Any thoughts about encoding?

My preference would be keep them in UTF-16. If we wanted to marshal them to some canonical form on a per OS basis, then I think we should go with our previous conversation and use void* and document what users should cast the pointer to. My main quibble here would be defining any PAL sounding types/APIs for such a narrow and low-level feature.

VSadov commented 7 months ago

The most contentious is argExceptionString: is it worth including?

I think it depends on the handler. Some handlers may want to get more detailed information (even if it means that secondary crashes can occur while gathering this extra information), some may want to get minimal information, some may want to get more detailed information only for certain types of crashes. It makes me think that we may want to have some sort of callback-based pattern to optionally gather the more detailed information. If we do that, we can start with something very minimal and evolve it based on feedback.

I think the managed stack trace is important and hard to get in other ways, we should add it, and should add it separately from argExceptionString even if it would often overlap, because argExceptionString comes from ToString() and could be truncated or overriden to not have stack in it.

I would produce the stacktrace in just one string though. Any kind of iterator or object model on top of that could be overthinking. It would still be in some textual form, just chunked up. So maybe more structural stack info could come in the future improvements, if there is a need.

jkotas commented 7 months ago

I think the managed stacktrace needs to be optional and only computed if the handler wants it given that computing it is a very complex operation. I do not have a strong opinion about how it should be formatted.

hard to get in other ways

This may change as we make progress on cDAC.

VSadov commented 7 months ago

I think the managed stacktrace needs to be optional and only computed if the handler wants it given that computing it is a very complex operation. I do not have a strong opinion about how it should be formatted.

Would that apply to ex.ToString() as well?

That is what we currently do in QCALLTYPE Environment_FailFast where we still can run managed code. After that it comes to the EEPolicy::HandleFatalError in a form of a string for which we've already paid.

jkotas commented 7 months ago

If the fail-fast is originating as Environment.FailFast, I do not have concerns about collecting the stacktrace upfront like we do that today.

I have concerns about computing the stacktrace automatically for failfasts that are e.g. crashes in the middle of GC. We do print the managed stacktrace to console for these today, but it is not clear whether it is a good idea. We have seen situations where it caused secondary crashes and made the original crash undiagnosable.

VSadov commented 7 months ago

Right. A managed stack is safe to get if it was a managed unhandled exception, not a random corruption. And even that may need to exclude cases like OOM.
Since Environment_FailFast covers the "safe" unhandled managed case anyways, if we get argExceptionString from it, we will pass it to the user, but otherwise can avoid any additional complex calls.

VSadov commented 7 months ago

The latest iteration that should capture all the suggestions so far. Let me know if I missed/misunderstood something or there are some additional thoughts.

The API definition:

Managed API to set up the handler

public static class ExceptionHandling
{
    /// <summary>
    /// .NET runtime is going to call `fatalErrorHandler` set by this method before its own
    /// fatal error handling (creating .NET runtime-specific crash dump, etc.).
    /// </summary>
    /// <exception cref="ArgumentNullException">If fatalErrorHandler is null</exception>
    /// <exception cref="InvalidOperationException">If a handler is already set</exception>
    public static void SetFatalErrorHandler(delegate* unmanaged<uint, void*, int> fatalErrorHandler);
}

The shape of the FatalErrorHandler, if implemented in c++ (the default calling convention for the given platform is used)

// expected signature of the handler
FatalErrorHandlerResult FatalErrorHandler(int32_t hresult, struct FatalErrorInfo* data);

With FatalErrorHandlerResult and FatalErrorInfo defined in "TBD.h" as:

enum FatalErrorHandlerResult : int32_t
{
    RunDefaultHandler = 0,
    SkipDefaultHandler = 1,
};

struct FatalErrorInfo
{
    size_t size;    // size of the FatalErrorInfo instance
    void*  address; // code location correlated with the failure (i.e. location where FailFast was called)

    // exception/signal information, if available
    void* info;     // Cast to PEXCEPTION_RECORD on Windows or siginfo_t* on non-Windows.
    void* context;  // Cast to PCONTEXT on Windows or ucontext_t* on non-Windows.

    // UTF-16 strings with additional information as appropriate
    char16_t* errorSource;     // example: "Assertion Failed"
    char16_t* message;         // error message. For example the one passed to "FailFast(message)"
    char16_t* detailMessage;   // detailMessage that was passed to "Assert(condition, message, detailMessage)"
    char16_t* exceptionString; // for unhandled managed exceptions the result of "ex.ToString()"
};

Typical use:

Setting up the handler for the app (C# code in the actual app):

internal unsafe class Program
{
    [DllImport("myCustomCrashHandler.dll")]
    public static extern delegate* unmanaged<uint, void*, int> GetFatalErrorHandler();

    static void Main(string[] args)
    {
        ExceptionHandling.SetFatalErrorHandler(GetFatalErrorHandler());

        RunMyProgram();
    }
}

The handler. (c++ in myCustomCrashHandler.dll)

#include "TBD.h"

static FatalErrorHandlerResult FatalErrorHandler(int32_t hresult, struct FatalErrorInfo* data)
{
    // this is a special handler that analyzes OOM crashes
    if (hresult != COR_E_OUTOFMEMORY)
    {
        return FatalErrorHandlerResult::RunDefaultHandler;
    }

    DoSomeCustomProcessingOfOOM(data);

    // no need for a huge crash dump after an OOM.
    return FatalErrorHandlerResult::SkipDefaultHandler;
}

extern "C" DLL_EXPORT void* GetFatalErrorHandler()
{
    return &FatalErrorHandler;
}
jkotas commented 7 months ago
    [UnmanagedCallersOnly]
    [DllImport("myCustomCrashHandler.dll")]

I do not think we allow a single method to be both UnmanagedCallersOnly and DllImport. I should be more like this:

    [DllImport("myCustomCrashHandler.dll")]
    public static extern delegate* unmanaged<...> GetFatalErrorHandler();

// returns false if default handler is not desired.

I think this can use clarification. One possible behavior:

Is it what you meant?

We may want to make it an enum to allow extending the possible actions in future. It would also avoid the need for x-plat definition of BOOL.

__cdecl

The explicit __cdecl is not in the managed signature. I would avoid the explicit calling convention, and just say that it is the platform default.

struct ExceptionData

Nit: This should be FatalErrorInfo.

VSadov commented 7 months ago

Yes to all the above. I will update the prototype in place.

VSadov commented 7 months ago

I do not think we allow a single method to be both UnmanagedCallersOnly and DllImport.

When I try that without UnmanagedCallersOnly, I get a compile error when taking an address. (&)

Error   CS8786  Calling convention of 'Program.FatalErrorHandler(uint, void*)' is not compatible with 'Unmanaged'.  

With the attribute it compiles. At least C# thinks it is needed for this scenario. (I did not check what it emits)

jkotas commented 7 months ago

Try to run:

using System.Runtime.InteropServices;

unsafe
{
    delegate* unmanaged<uint, void*, int> p = &FatalErrorHandler;
    p(0, null);

    [UnmanagedCallersOnly]
    [DllImport("myCustomCrashHandler.dll")]
    static extern int FatalErrorHandler(uint exitCode, void* fatalErrorHandler);
}

You will get: "Unhandled exception. System.NotSupportedException: Method's type signature is not PInvoke compatible."

VSadov commented 7 months ago

You will get: "Unhandled exception. System.NotSupportedException: Method's type signature is not PInvoke compatible."

I saw that, but thought it is at the call site failure - since the caller is not unmanaged. That is just an example of use though. We can come up with something else. There must be a way to fetch a pointer to a native function from managed code. Perhaps more than one.

Although I kind of wonder if we need to encode the native method signature in the API, or whether it is a function at all.

public static void SetFatalErrorHandler(delegate* unmanaged<uint, void*, int> fatalErrorHandler);

vs.

public static void SetFatalErrorHandler(void* fatalErrorHandler);

I am thinking back and forth about this.

We can assign a specific delegate type to the handler, but not sure if it brings a lot of benefits. There is void* in the parameters anyways, so it is already weakly typed.

jkotas commented 7 months ago

There must be a way to fetch a pointer to a native function from managed code. Perhaps more than one.

Although I kind of wonder if we need to encode the native method signature in the API, or whether it is a function at all.

https://learn.microsoft.com/en-us/dotnet/api/system.runtime.interopservices.objectivec.objectivecmarshal.initialize?view=net-8.0 is recently added API with similar shape. It has the function pointers in the signature.

VSadov commented 7 months ago

I have applied all the suggestions. I think this is ready to be put in one proposal (together with managed handler part) and send to API review.

jkotas commented 7 months ago

LGTM

Nit: The example has GetFatalErrorHandler on the managed side, but it does not show it on the unmanaged side.

VSadov commented 7 months ago

Nit: The example has GetFatalErrorHandler on the managed side, but it does not show it on the unmanaged side.

And with GetFatalErrorHandler pattern the handler itself does not need to be DLL_EXPORT

VSadov commented 7 months ago

I think we want to put FatalErrorHandlerResult and FatalErrorInfo in a header file. Any existing files this that can go into?

jkotas commented 7 months ago

I would create a new small header file for this. It can live in https://github.com/dotnet/runtime/tree/main/src/native/public. I would not worry about shipping it - just tell people to copy it over from the repo if they need it.

(We have header files for debugger, profiler or hosting. Each of these are handled differently, there is not much consistency.)

VSadov commented 7 months ago

The combined proposal is at https://github.com/dotnet/runtime/issues/101560

@jkotas @AaronRobinsonMSFT - huge thanks for helping with this!!

jkotas commented 7 months ago

Superseded by https://github.com/dotnet/runtime/issues/101560