dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License

Crash when NullReferenceException occurs in background thread on x64 #107026

Open rolfbjarne opened 2 weeks ago

rolfbjarne commented 2 weeks ago

Description

The app crashes after a NullReferenceException thrown on a background thread has been caught and handled.

Reproduction Steps

using System;
using System.Threading;

static class MainClass {
    static int Main (string [] args)
    {
        var thread = new Thread (() =>
        {
            try {
                Crash.Me ();
            } catch (Exception e) {
                Console.WriteLine ($"E: {e.Message}");
            }
            Console.WriteLine ("C");
        });
        thread.Start ();
        thread.Join ();
        Console.WriteLine ("D");

        return 0;
    }
}

public class Crash {
    public static void Me ()
    {
        Console.WriteLine ("A");
        ((object) null).ToString ();
        Console.WriteLine ("B");
    }
}

Project file:

<?xml version="1.0" encoding="utf-8"?>
<Project Sdk="Microsoft.NET.Sdk">
  <PropertyGroup>
    <TargetFramework>net8.0</TargetFramework>
    <RuntimeIdentifier>osx-x64</RuntimeIdentifier>
    <OutputType>Exe</OutputType>
    <SelfContained>true</SelfContained>
  </PropertyGroup>
</Project>

Run like this:

$ dotnet run
A
E: Object reference not set to an instance of an object.
$ echo $?
138

Two points of note here:

  1. Neither "C" nor "D" from the code is printed.
  2. The exit code is 138 (128 + 10), which indicates that the executable was terminated by signal 10 (SIGBUS).
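As a sanity check on the exit-code arithmetic: POSIX shells such as bash report a process killed by signal N as status 128 + N, so 138 maps to signal 10. Signal numbers are platform-specific (10 is SIGBUS on macOS, but on Linux SIGBUS is 7 and 10 is SIGUSR1), so this minimal sketch hard-codes a few Darwin numbers only. The `ExitStatus` class and its members are hypothetical helpers, not part of any .NET API.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper for decoding a shell exit status such as "$?".
public static class ExitStatus
{
    // A few Darwin (macOS) signal numbers. These differ on Linux,
    // where SIGBUS is 7 and signal 10 is SIGUSR1.
    static readonly Dictionary<int, string> DarwinSignals = new Dictionary<int, string> {
        [6] = "SIGABRT",
        [10] = "SIGBUS",
        [11] = "SIGSEGV",
    };

    // Shells report "killed by signal N" as 128 + N (statuses are 0..255).
    // Returns null for statuses that look like a normal exit code.
    public static int? SignalFromShellStatus (int status) =>
        status > 128 && status < 256 ? (int?) (status - 128) : null;

    public static string Describe (int status)
    {
        int? signal = SignalFromShellStatus (status);
        if (signal is null)
            return $"exited with code {status}";
        string name = DarwinSignals.TryGetValue (signal.Value, out var n) ? n : "unknown signal";
        return $"terminated by signal {signal} ({name})";
    }
}
```

For the run above, `ExitStatus.Describe (138)` yields `terminated by signal 10 (SIGBUS)`, which matches the crash report.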

macOS also creates a crash report: https://gist.github.com/rolfbjarne/4b6ba90b127d180a07414c18fef4b17e (which corroborates the SIGBUS termination).

The crashing thread:

Thread 1:: com.apple.rosetta.exceptionserver
0   runtime                             0x7ff7ffc97414 0x7ff7ffc93000 + 17428

While I was creating a smaller test case, the crashing stack trace was usually slightly different: https://gist.github.com/rolfbjarne/6d0d1ee838cdae83cfddc8970afe01ec

Thread 2 Crashed:
0   <translation info unavailable>         0x100d69ba0 ???
1   libsystem_platform.dylib            0x7ff80aafaff3 _sigtramp + 51
2   libcoreclr.dylib                       0x109cff74c SEHExceptionThread(void*) + 1580
3   libsystem_pthread.dylib             0x7ff80aacc18b _pthread_start + 99
4   libsystem_pthread.dylib             0x7ff80aac7ae3 thread_start + 15

Hopefully it's the same issue though.

Expected behavior

No crash.

Actual behavior

The process crashes with SIGBUS after the catch block has run, so neither "C" nor "D" is printed.

Regression?

Yes.

This started happening in a maestro bump here: https://github.com/xamarin/xamarin-macios/pull/21021, which at the moment is a bump from 8.0.109-servicing.24407.6 to 8.0.109-servicing.24419.10.

Known Workarounds

No response

Configuration

dotnet --info
.NET SDK:
 Version:           8.0.109
 Commit:            6e9002c2ef
 Workload version:  8.0.100-manifests.70d157ca

Runtime Environment:
 OS Name:     Mac OS X
 OS Version:  14.6
 OS Platform: Darwin
 RID:         osx-arm64
 Base Path:   /Users/rolf/work/maccore/main/xamarin-macios/builds/downloads/dotnet-sdk-8.0.109-servicing.24419.10/sdk/8.0.109/

.NET workloads installed:
 Workload version: 8.0.100-manifests.70d157ca
 [macos]
   Installation Source: SDK 8.0.100
   Manifest Version:    14.5.8059-ci.darc-main-ba8b4a5c-703d-4d22-97b2-7323315a2e65/8.0.100
   Manifest Path:       /Users/rolf/work/maccore/main/xamarin-macios/builds/downloads/dotnet-sdk-8.0.109-servicing.24419.10/sdk-manifests/8.0.100/microsoft.net.sdk.macos/WorkloadManifest.json
   Install Type:        FileBased

 [maccatalyst]
   Installation Source: SDK 8.0.100
   Manifest Version:    17.5.8059-ci.darc-main-ba8b4a5c-703d-4d22-97b2-7323315a2e65/8.0.100
   Manifest Path:       /Users/rolf/work/maccore/main/xamarin-macios/builds/downloads/dotnet-sdk-8.0.109-servicing.24419.10/sdk-manifests/8.0.100/microsoft.net.sdk.maccatalyst/WorkloadManifest.json
   Install Type:        FileBased

Host:
  Version:      8.0.8
  Architecture: arm64
  Commit:       08338fcaa5

.NET SDKs installed:
  8.0.109 [/Users/rolf/work/maccore/main/xamarin-macios/builds/downloads/dotnet-sdk-8.0.109-servicing.24419.10/sdk]

.NET runtimes installed:
  Microsoft.AspNetCore.App 8.0.8 [/Users/rolf/work/maccore/main/xamarin-macios/builds/downloads/dotnet-sdk-8.0.109-servicing.24419.10/shared/Microsoft.AspNetCore.App]
  Microsoft.NETCore.App 8.0.8 [/Users/rolf/work/maccore/main/xamarin-macios/builds/downloads/dotnet-sdk-8.0.109-servicing.24419.10/shared/Microsoft.NETCore.App]

Other information

I'm on an M1, and this only happens when building for x64. I haven't tested on an x64 machine, but it's possible that this is specific to Rosetta.

This only happens when using CoreCLR, not with MonoVM.
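Since the crash is narrowed down to x64 builds running under Rosetta on CoreCLR, it can help if the repro prints which configuration it is actually running in. This sketch assumes macOS's `sysctl.proc_translated` sysctl (1 for a Rosetta-translated process, 0 for a native one, absent where Rosetta isn't supported), read via P/Invoke to `sysctlbyname` in libSystem, and uses the presence of the `Mono.RuntimeStructs` type as a heuristic for MonoVM. The `RuntimeProbe` class is hypothetical and not part of the original repro.

```csharp
using System;
using System.Runtime.InteropServices;

// Hypothetical diagnostic helper; not part of the original repro.
public static class RuntimeProbe
{
    // int sysctlbyname (const char *name, void *oldp, size_t *oldlenp, void *newp, size_t newlen);
    [DllImport ("/usr/lib/libSystem.dylib")]
    static extern int sysctlbyname (string name, out int oldp, ref nuint oldlenp, IntPtr newp, nuint newlen);

    // True when the current process is being translated by Rosetta 2.
    // The call fails (non-zero return) if the sysctl doesn't exist.
    public static bool IsRosettaTranslated ()
    {
        if (!OperatingSystem.IsMacOS ())
            return false;
        nuint size = (nuint) sizeof (int);
        return sysctlbyname ("sysctl.proc_translated", out int value, ref size, IntPtr.Zero, (nuint) 0) == 0
            && value == 1;
    }

    // Heuristic: MonoVM's System.Private.CoreLib defines Mono.RuntimeStructs; CoreCLR's doesn't.
    public static bool IsMono () => Type.GetType ("Mono.RuntimeStructs") != null;

    public static string Describe () =>
        $"process={RuntimeInformation.ProcessArchitecture} " +
        $"rosetta={IsRosettaTranslated ()} " +
        $"runtime={(IsMono () ? "MonoVM" : "CoreCLR")}";
}
```

Calling `Console.WriteLine (RuntimeProbe.Describe ());` at the top of `Main` should, in the configuration described here, report an X64 process under Rosetta on CoreCLR; a MonoVM build of the same app would report `runtime=MonoVM`.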

lewing commented 2 weeks ago

Sounds like a CoreCLR regression between runtime 8.0.7 and 8.0.8? The diff in https://github.com/xamarin/xamarin-macios/pull/21021 is confusing because the ref packs appear to be trailing the SDK version.

mangod9 commented 2 weeks ago

Believe there are no guarantees around unhandled exceptions. @janvorli ?

rolfbjarne commented 2 weeks ago

> Believe there are no guarantees around unhandled exceptions. @janvorli ?

It's handled:

} catch (Exception e) {

janvorli commented 2 weeks ago

The problem happens only under Rosetta. It was introduced by #104818. We incorrectly leave CONTEXT_XSTATE set on the context even when the context returned by the OS doesn't contain any AVX state. Later, when execution resumes after the catch, our RtlRestoreContext attempts to set the ymm registers because CONTEXT_XSTATE is present. That crashes with SIGBUS, as Rosetta doesn't support AVX instructions (which are used to set the ymm registers). The issue doesn't occur on .NET 9 because ~we have added stripping the CONTEXT_XSTATE from the context before we start unwinding from it during EH recently.~ we are using a ClrRestoreNonVolatileContext which doesn't restore the ymm registers.
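The failure mode described above can be modeled in a few lines. This is a simplified C# model, not CoreCLR code: it borrows the AMD64 `CONTEXT_*` flag values from winnt.h (where `CONTEXT_XSTATE` is `CONTEXT_AMD64 | 0x40`), and the hypothetical `RestoreModel` method only lists the register groups a full context restore would write back. It shows how a context whose flags still include XSTATE, with no AVX state saved by the OS, leads to the ymm registers being written. `StripXStateIfNoAvx` is one hypothetical way to drop the flag, not the actual fix; note that it clears only the `0x40` feature bit, because masking with `~ContextFlags.XState` would also clear the `CONTEXT_AMD64` architecture bit.

```csharp
using System;
using System.Collections.Generic;

// Simplified model of AMD64 CONTEXT flags (values as in winnt.h).
[Flags]
public enum ContextFlags : uint
{
    Amd64         = 0x00100000,
    Control       = Amd64 | 0x01,
    Integer       = Amd64 | 0x02,
    FloatingPoint = Amd64 | 0x08,
    XState        = Amd64 | 0x40,
}

public static class ContextModel
{
    static bool Has (ContextFlags flags, ContextFlags f) => (flags & f) == f;

    // Hypothetical: drop XSTATE when the OS-provided context carries no AVX state.
    // Only the 0x40 feature bit is cleared, so the Amd64 bit stays intact.
    public static ContextFlags StripXStateIfNoAvx (ContextFlags flags, bool osSavedAvxState) =>
        osSavedAvxState ? flags : flags & ~(ContextFlags.XState & ~ContextFlags.Amd64);

    // Hypothetical stand-in for a full context restore: lists the register
    // groups that would be written when resuming execution.
    public static List<string> RestoreModel (ContextFlags flags)
    {
        var restored = new List<string> ();
        if (Has (flags, ContextFlags.Integer))
            restored.Add ("integer");
        if (Has (flags, ContextFlags.FloatingPoint))
            restored.Add ("xmm");
        // Writing the upper ymm halves needs AVX instructions, which Rosetta doesn't support.
        if (Has (flags, ContextFlags.XState))
            restored.Add ("ymm");
        return restored;
    }
}
```

With `flags = Control | Integer | FloatingPoint | XState`, `RestoreModel (flags)` lists `integer`, `xmm`, and `ymm`; after `StripXStateIfNoAvx (flags, false)` it lists only `integer` and `xmm`. Masking with `flags & ~ContextFlags.XState` instead returns an empty list, because the shared `Amd64` bit is lost.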