dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License
14.98k stars 4.66k forks source link

NRE in System.Diagnostics.Tracing.EventPipePayloadDecoder if both GC + Threading event keywords are enabled #65630

Closed sandersaares closed 2 years ago

sandersaares commented 2 years ago

Description

An unhandled NullReferenceException is thrown in System.Diagnostics.Tracing.EventPipePayloadDecoder.DecodePayload() because metadata.Parameters is null.

Reproduction Steps

Create a .NET 6 ASP.NET Core web app and replace Program.cs with:

using System.Diagnostics.Tracing;

public static class Program
{
    public static void Main(string[] args)
    {
        new Listener();

        // Precondition: web app - does not reproduce in console app (perhaps just due to fewer events happening?).
        var builder = WebApplication.CreateBuilder(args);

        var app = builder.Build();

        app.Run();
    }
}

public class Listener : EventListener
{
    private const string EventSourceName = "Microsoft-Windows-DotNETRuntime";

    const int KeywordGC = 1;
    const int KeywordThreading = 65536;

    // Precondition: both of these keywords.
    private const EventKeywords Keywords = (EventKeywords)(KeywordGC | KeywordThreading);

    protected override void OnEventSourceCreated(EventSource es)
    {
        if (es.Name == EventSourceName)
        {
            EnableEvents(es, EventLevel.Verbose, Keywords);
        }
    }
}

Ensure that no filtering is happening in IDE (e.g. Just My Code) and press F5 to run the app.

Expected behavior

App runs normally, returns a 404 result as the web pipeline is empty. Debugger does not break on an exception.

Actual behavior

Debugger immediately breaks on exception at startup.

System.NullReferenceException
  HResult=0x80004003
  Message=Object reference not set to an instance of an object.
  Source=System.Private.CoreLib
  StackTrace:
   at System.Diagnostics.Tracing.EventPipePayloadDecoder.DecodePayload(EventMetadata& metadata, ReadOnlySpan`1 payload)

  This exception was originally thrown at this call stack:
    System.Diagnostics.Tracing.EventPipePayloadDecoder.DecodePayload(ref System.Diagnostics.Tracing.EventSource.EventMetadata, System.ReadOnlySpan<byte>)

>   System.Private.CoreLib.dll!System.Diagnostics.Tracing.EventPipePayloadDecoder.DecodePayload(ref System.Diagnostics.Tracing.EventSource.EventMetadata metadata, System.ReadOnlySpan<byte> payload) Line 19   C#
    System.Private.CoreLib.dll!System.Diagnostics.Tracing.NativeRuntimeEventSource.ProcessEvent(uint eventID, uint osThreadID, System.DateTime timeStamp, System.Guid activityId, System.Guid childActivityId, System.ReadOnlySpan<byte> payload) Line 55   C#
    System.Private.CoreLib.dll!System.Diagnostics.Tracing.EventPipeEventDispatcher.DispatchEventsToEventListeners() Line 175    C#
    System.Private.CoreLib.dll!System.Threading.Tasks.Task.InnerInvoke() Line 2397  C#
    System.Private.CoreLib.dll!System.Threading.Tasks.Task..cctor.AnonymousMethod__272_0(object obj) Line 2376  C#
    System.Private.CoreLib.dll!System.Threading.ExecutionContext.RunInternal(System.Threading.ExecutionContext executionContext, System.Threading.ContextCallback callback, object state) Line 183  C#
    System.Private.CoreLib.dll!System.Threading.Tasks.Task.ExecuteWithThreadLocal(ref System.Threading.Tasks.Task currentTaskSlot, System.Threading.Thread threadPoolThread) Line 2333  C#
    System.Private.CoreLib.dll!System.Threading.Tasks.Task.ExecuteEntryUnsafe(System.Threading.Thread threadPoolThread) Line 2277   C#
    System.Private.CoreLib.dll!System.Threading.Tasks.ThreadPoolTaskScheduler..cctor.AnonymousMethod__10_0(object s) Line 36    C#
    System.Private.CoreLib.dll!System.Threading.Thread.StartCallback() Line 105 C#

The entire metadata object passed to DecodePayload looks uninitialized to my eye. The broken event always seems to occupy index 65 in the array of events.

image

Exception only occurs once on startup. If I continue execution to skip past the exception, all seems well.

Original report: https://github.com/djluck/prometheus-net.DotNetRuntime/issues/65

Regression?

Comments in original report (https://github.com/djluck/prometheus-net.DotNetRuntime/issues/65) indicate it worked fine in .NET 5.

Known Workarounds

None.

Configuration

.NET 6, Windows 10 x64

Other information

No response

tommcdon commented 2 years ago

@josalem

smhoff commented 2 years ago

I see the exact same behavior, however it does not happen if I disable metrics. I turn metrics back on and every time I get the error. If I just ignore the error the application continues and works fine.

josalem commented 2 years ago

Checking in to confirm I can reproduce this behavior on Win11 with VS2022. Just to rule out this being related to VS changes, what version of VS were you using?

smhoff commented 2 years ago

VS 2022 on windows 10 and 11.

Sent from my T-Mobile 5G Device Get Outlook for Androidhttps://aka.ms/AAb9ysg


From: John Salem @.> Sent: Wednesday, March 23, 2022 10:47:54 AM To: dotnet/runtime @.> Cc: Steve Hoff @.>; Manual @.> Subject: Re: [dotnet/runtime] NRE in System.Diagnostics.Tracing.EventPipePayloadDecoder if both GC + Threading event keywords are enabled (Issue #65630)

CAUTION: This message originated externally, please use caution when clicking on links or opening attachments!

Checking in to confirm I can reproduce this behavior on Win11 with VS2022. Just to rule out this being related to VS changes, what version of VS were you using?

— Reply to this email directly, view it on GitHubhttps://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fdotnet%2Fruntime%2Fissues%2F65630%23issuecomment-1076503073&data=04%7C01%7Cshoff%40manageamerica.com%7Cb65ee91c99794873b6cc08da0ce47f4b%7Cc018dadddf3246b5a934014aff27b8fe%7C1%7C0%7C637836472782296804%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&sdata=0LsXCiWAd8%2By9Zppa3gYcrwlu%2B02skByd3EQK%2Fx96r4%3D&reserved=0, or unsubscribehttps://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fnotifications%2Funsubscribe-auth%2FARQY4C5XX3IUECDDUEX62RTVBM4KVANCNFSM5O33GKOQ&data=04%7C01%7Cshoff%40manageamerica.com%7Cb65ee91c99794873b6cc08da0ce47f4b%7Cc018dadddf3246b5a934014aff27b8fe%7C1%7C0%7C637836472782296804%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&sdata=64i4aYuh89R659%2Ba5%2FK8hKT8qwkp%2BdmFbUYDsySYRsM%3D&reserved=0. You are receiving this because you are subscribed to this thread.Message ID: @.***>

josalem commented 2 years ago

Okay, I've root caused this. It is happening regardless of the debugger being attached. The lack of a using statement on the EventListener in the sample was just causing the exception to go unnoticed because it happens on a task that is only awaited in the teardown of the listening logic. If you add a manual Dispose() or a using statement, the NRE is realized without the debugger. This has the added repercussion that you shouldn't be receiving any events on the EventListener after this happens because the Task will be in an exception state till it is awaited.

In .net6 we added a managed implementation of the thread pool. To send the same events under the same provider we had to do some duplication of event definitions so that managed code had the ability to send the same events. Every event needs to be represented in managed code for EventListener to know how to parse the events. To that end, there is a python script that generates managed function stubs with the correct event IDs and parameters. For components that transition from native to managed, these stubs would be filled in by hand. For example, the thread pool had all the events sent from managed code filled out.

After that conversion, it looks like any event with ThreadPool in its name gets left out of the code generation phase: https://github.com/dotnet/runtime/blob/0d59046238d36178e8d1f60a6509fbd6b73fc697/src/coreclr/scripts/genRuntimeEventSources.py#L88-L91

This specific event, however, is only sent inside an FCALL related to overlapped IO and wasn't implemented in managed code:

<event value="65" version="0" level="win:Verbose"  template="ThreadPoolIOWork"
                           keywords="ThreadingKeyword"
                           task="ThreadPool"
                           opcode="IOPack"
                           symbol="ThreadPoolIOPack"
                           message="$(string.RuntimePublisher.ThreadPoolIOPackEventMessage)"/>

sent from: https://github.com/dotnet/runtime/blob/3fe2959f3e0c7255a90098946e57e6e7294476b1/src/coreclr/vm/nativeoverlapped.cpp#L91-L149

As a result, managed code doesn't have any metadata for this event. When the event is dispatched to managed code, EventSource attempts to read the struct in m_eventData which was zero initialized, so it NREs.

The reason this was only repro-ing for a web app is that ASP.NET Core on Windows uses a lot of overlapped IO so it sends these events while a basic console app does not.

I have a patch that fixes this and will hopefully have a PR out very soon. We will need to backport this to .net6.

edit: This issue only presents for an in-process EventListener. Out-of-process listeners will not be causing NREs in the target runtime.

CC @kouvel @dotnet/dotnet-diag - We are going to need a better system of keeping track of what events need to be represented where. Right now, there is no validation ensuring all events are represented. There may be other events getting filtered out by genruntimeeventsource.py that aren't being represented inNativeRuntimeEventSource.*.cs. I'm not sure we should add such validation to the initial fix to keep it small for back porting, but we should have something going forward. I'm inclined to say we should check in the outputs of the gen*.py scripts and have developers manually re-run them if ClrEtwAll.man is changed.

josalem commented 2 years ago

Reopening to track backporting to .net6

tommcdon commented 2 years ago

Backported to 6.0 in https://github.com/dotnet/runtime/pull/68876