stephentoub opened this issue 3 years ago
Tagging subscribers to this area: @tommcdon See info in area-owners.md if you want to be subscribed.
Author: | stephentoub |
---|---|
Assignees: | - |
Labels: | `area-Diagnostics-coreclr` |
Milestone: | - |
Work has been done in the past to reduce the overhead of exceptions while running under the debugger, for example: https://github.com/dotnet/diagnostics/blob/0d78fe5fa58c88d4524f7698f9a1abbeba31f33a/src/inc/cordebug.idl#L3291. Just out of curiosity, is the impact measurably smaller if the exception is thrown/caught in non-user code and with Just My Code enabled?
Visual Studio logs the exception to the Output window.
> is the impact measurably smaller if the exception is thrown/caught in non-user code and with Just My Code enabled?
I can say yes.
> Just out of curiosity, is the impact measurably smaller if the exception is thrown/caught in non-user code and with Just My Code enabled?
On my machine, with JMC on:
Two thoughts --
> Add background thread(s), as I suspect the time to reach GC Suspend is much better on this trivial app than in a real thing
Sure... doing what? The picture doesn't change all that much, for example, if I have a bunch of thread pool threads just looping over sleeps:
```csharp
using System;
using System.Diagnostics;
using System.Threading;
using System.Threading.Tasks;

for (int i = 0; i < Environment.ProcessorCount; i++)
{
    Task.Run(() =>
    {
        while (true) Thread.Sleep(1);
    });
}

while (true)
{
    var sw = Stopwatch.StartNew();
    for (int i = 0; i < 100; i++)
    {
        try { throw new Exception(); } catch { }
    }
    Console.WriteLine(sw.Elapsed);
}
```
but if I make those threads sit in tight loops:
```csharp
using System;
using System.Diagnostics;
using System.Threading.Tasks;

for (int i = 0; i < Environment.ProcessorCount; i++)
{
    Task.Run(() =>
    {
        while (true) { }
    });
}

while (true)
{
    var sw = Stopwatch.StartNew();
    for (int i = 0; i < 100; i++)
    {
        try { throw new Exception(); } catch { }
    }
    Console.WriteLine(sw.Elapsed);
}
```
the gap actually grows, going from ~0.0007s to ~7.3s, so a 10,000x slowdown, presumably because each debugger stopping event has to suspend the runtime, and threads spinning in tight managed loops take longer to reach a suspend point than sleeping ones.
> Make the call stacks deeper
Part of the problem I've seen in some apps is that the number of exceptions is significantly amplified because of deeper "call stacks", and more specifically, when async is involved, because the exception ends up getting caught and thrown again at every level of the chain. For example, if I change the app to be:
```csharp
using System;
using System.Diagnostics;
using System.Threading.Tasks;

while (true)
{
    var sw = Stopwatch.StartNew();
    for (int i = 0; i < 100; i++)
    {
        try { await Run(10); } catch { }
    }
    Console.WriteLine(sw.Elapsed);
}

static async Task Run(int depth)
{
    if (depth > 0)
    {
        await Run(depth - 1);
    }
    throw new Exception();
}
```
now ctrl-F5 shows numbers like ~0.01s and F5 shows numbers like ~28.7s.
But even with deeper non-async call stacks, the gap is still huge. I changed it to this:
```csharp
using System;
using System.Diagnostics;
using System.Threading.Tasks;

while (true)
{
    var sw = Stopwatch.StartNew();
    for (int i = 0; i < 100; i++)
    {
        try { Run(10); } catch { }
    }
    Console.WriteLine(sw.Elapsed);
}

static void Run(int depth)
{
    if (depth > 0)
    {
        Run(depth - 1);
    }
    throw new Exception();
}
```
and now ctrl-F5 shows numbers like ~0.001s and F5 shows numbers like ~2.05s, so down to a ~2000x difference. The gap does drop further the deeper the frames go, though: at 100 frames, the difference drops to ~400x.
RE: Background threads - from GC investigations and/or telemetry do you have a sense for what suspend times for 'typical' apps look like? If so, I might try to mimic that time in whatever app we will use as we start to dig into the numbers.
> and now ctrl-F5 shows numbers like ~0.001s and F5 shows numbers like ~2.05s, so down to a ~2000x difference. The gap does drop further the deeper the frames go, though: at 100 frames, the difference drops to ~400x.
To make sure I understand, you are saying that it takes significantly more time as call stack size increases (~2.05s in your last example vs. ~1.9s in your first), but the multiplier actually falls (~2000x vs. ~4500x) since the Ctrl-F5 time goes up as well. Correct?
> Correct?
Yup
I don't have a repro that has as impressive a penalty as 4000x but I do have a real-world case.
It's a champion scenario for the Azure EventHubs Processor SDK that has multi-threading and network calls.
In total the startup process throws ~180 exceptions (which are in fact 32 distinct exceptions but multiplied by layers of async state machines re-throwing) and with JMC off the startup time is:
- Without debugger: ~7.5 sec
- With debugger: ~20 sec
What was surprising to me is how large the impact of the debugging is compared to the actual networking. The app does 112 blob requests during startup plus various TCP connections to the EventHubs service, so throwing an exception in an async method costs as much as making multiple real service calls over the network.
I'm happy to provide the app with all connection strings ready or record a profile.
I spent a little time with a profiler to assess where we are spending our time. I profiled Steve's example that uses depth 10 callstacks without async:
```csharp
using System;
using System.Diagnostics;
using System.Threading.Tasks;

while (true)
{
    var sw = Stopwatch.StartNew();
    for (int i = 0; i < 100; i++)
    {
        try { Run(10); } catch { }
    }
    Console.WriteLine(sw.Elapsed);
}

static void Run(int depth)
{
    if (depth > 0)
    {
        Run(depth - 1);
    }
    throw new Exception();
}
```
On my particular machine (Windows, .NET Core 3.1.11, VS 16.8.4), Ctrl-F5 is ~1.5ms per iteration of the outer while loop and F5 is ~1.5s, so about a 1000x overhead for me. During the ~15ms handling of each exception (1.5s / 100 throws), there are 3 events dispatched from the runtime to the debugger:
[profiler screenshot: breakdown of the ~15ms window for one exception]
Each event logically traverses across 3 components:
My take is there are two different approaches we could follow at this point: (1) keep incrementally optimizing the existing stopping-event path, or (2) make more major changes, such as building in-proc extensibility points so exception events can be filtered without a cross-process stop.
Personally my preference is that if we are going to spend time on this we should look into option (2), the major changes. I worry that we don't have much runway left to optimize incrementally given how easy it is for apps to start throwing yet more exceptions. I also expect building in-proc extensibility points opens the door to a bunch of other useful diagnostic scenarios. For example in the debugger space it could contribute towards other high performance events like conditional breakpoints, or being able to do more sophisticated analysis of rethrown exceptions for async. Outside debugging another example is having a dump collector that can capture dumps on 1st chance exceptions.
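As a rough illustration of what in-proc filtering looks like today (using the existing `AppDomain.FirstChanceException` event rather than the new extensibility point being proposed), the sketch below runs the filter on the throwing thread with no cross-process stop:

```csharp
// Sketch: in-proc first-chance exception filtering via the existing
// AppDomain.FirstChanceException event (illustrative only; the extensibility
// point discussed above would be a new mechanism, this just shows the shape).
using System;
using System.Runtime.ExceptionServices;

AppDomain.CurrentDomain.FirstChanceException += (sender, e) =>
{
    // Runs synchronously on the throwing thread before any catch handler,
    // so filtering here costs a delegate call instead of a cross-process
    // debugger round trip with runtime suspension.
    if (e.Exception is TimeoutException)
        Console.WriteLine($"first chance: {e.Exception.GetType().Name}");
};

try { throw new TimeoutException(); } catch { }
```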
Priority-wise the runtime diagnostics team already has months of committed feature work for .NET 6 and a hearty backlog of bugs. I'd guess this would land in the grey area where we might be able to get it within the release timeframe but it is far enough down that we shouldn't be making any promises at this point.
Thoughts on all of this? In particular @gregg-miskelly I'd be curious to hear what you think feasibility/interest is for doing a meatier change here?
I would agree with you that there is appeal in approach 2. On the other hand, that makes the change way way more expensive. So I think it is worth at least thinking about how much we could improve without going that far.
Here are the inefficiencies that I see in the current system --

- We should be able to often eliminate the stack walk (it seems to exist largely to supply the `ICorDebugFrame` argument of the exception callback).
- The user first chance notification exists to check for `[NonUserCode]`/`[DebuggerHidden]` frames in the extremely rare chance they exist. It seems like we could turn off this notification altogether, and instead allow user first chance to fire more than once in the unlikely case that we stop at user first chance and discover the stopping frame should actually be considered non-user code.

If you don't see any reason I am missing that we need these things -- if we took out the stack walks entirely, and reduced the number of stopping events from three to two, can you guess from the trace about what 'x' we are on? Is the main thing left synchronizing to send stopping events?
> We should be able to often eliminate the stack walk
Just to confirm you are saying it would be OK if we defined a new exception callback that didn't provide the ICorDebugFrame argument or we kept the current callback and return null for that parameter? If you use that parameter for anything we probably still have to do a stackwalk in DBI.
> if we took out the stack walks entirely, and reduced the number of stopping events from three to two, can you guess from the trace about what 'x' we are on? Is the main thing left synchronizing to send stopping events?
It is quite a bit more than I was anticipating we'd be able to cut. I'm a little surprised we hadn't already cut those in the past : ) Rough guess maybe a 70% savings if this all held up? If we still need the stackwalk but we can drop the first chance notification, that might be a 50% savings.
One nice part is that an API that disables debugger exception events is part of what we'd need to build to make the in-proc filtering case useful anyways.
> Rough guess maybe a 70% savings if this all held up?
I'll take it :smile:
> Just to confirm you are saying it would be OK if we defined a new exception callback that didn't provide the ICorDebugFrame argument or we kept the current callback and return null for that parameter? If you use that parameter for anything we probably still have to do a stackwalk in DBI.
First, let me try to be precise for what the VS debugger would need:
Assuming that basic level of decoding can be done without a stack walk, adding a new callback interface is fine with me.
Thanks Gregg! It's possible to get the frame information you need without a stackwalk, but it would take a good bit more refactoring. Effectively we'd be building a new set of parallel exception event handlers that capture each piece of information as the runtime does exception dispatch and then route it all the way up to a new public ICorDebug API. I'm going to propose that if we are trying for the cheap and simple, we should just add an API that lets you turn off the first chance exception events and then assess whether it is worth trying to optimize away the stackwalks as round 2.
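A purely hypothetical sketch of that "round 1" knob (the names below are illustrative only; the real change would be a new interface on the ICorDebug surface consumed by VS/vsdbg, and none of these names exist today):

```csharp
// Hypothetical shape only: letting a debugger opt out of first chance
// exception events. Not a real ICorDebug API.
public interface IExceptionEventControl
{
    // When false, the runtime would skip dispatching first chance exception
    // events to the debugger, avoiding a per-exception suspend/resume cycle.
    void SetFirstChanceExceptionEventsEnabled(bool enabled);
}
```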
Just to be clear, we need two items for 'round 1':
I have opened VS Task#1274770 to track the VS/vsdbg side of consuming this.
> if the top frame is in a user module, but it turns out that the method has JMC attributes that make it non-user code
Are you saying that despite the presence of the NonUserCode attribute in the assembly that VS would not have previously called ICorDebugFunction2::SetJmcStatus to mark it as non-user code? If VS had previously marked it as non-user then the runtime should never generate a USER_FIRST_CHANCE event in that frame.
And thanks for mentioning it again, somehow I missed the multiple USER_FIRST_CHANCE requirement and felt silly when I re-read what you wrote above : )
Correct. VS doesn't attempt to scan through a module during module load and pre-emptively call SetJmcStatus. We do it as we encounter those methods.
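For reference, `[NonUserCode]` in this exchange refers to `System.Diagnostics.DebuggerNonUserCodeAttribute`; a minimal sketch of a method the JMC machinery would classify as non-user code once marked:

```csharp
// Sketch: a method carrying one of the attributes the user-first-chance scan
// checks for. With Just My Code enabled, the debugger treats Helper as
// non-user code and should not stop for the exception thrown/caught inside it.
using System;
using System.Diagnostics;

Wrapper.Helper();

static class Wrapper
{
    [DebuggerNonUserCode]
    public static void Helper()
    {
        try { throw new InvalidOperationException(); } catch { }
    }
}
```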
Reconsider for .NET 9 per https://github.com/dotnet/runtime/issues/92856
Repro:
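A minimal sketch of the repro, reconstructed from the measurements quoted below (100 throw/catch pairs per timed iteration):

```csharp
// Reconstructed repro (an assumption based on the numbers below): time 100
// throw/catch pairs per iteration of the outer loop and print the elapsed time.
using System;
using System.Diagnostics;

while (true)
{
    var sw = Stopwatch.StartNew();
    for (int i = 0; i < 100; i++)
    {
        try { throw new Exception(); } catch { }
    }
    Console.WriteLine(sw.Elapsed);
}
```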
When I run that with ctrl-F5, I get output like:
When I run that with F5, I get output like:
That's a 4500x slowdown, with every exception throw/catch consuming ~20ms. And this does scale with exception count: if I increase the repro to 1000 exceptions, without the debugger it takes ~0.004s and with the debugger it takes ~18s.
While ideally well-behaved apps wouldn't throw lots of exceptions, this has shown to be a significant cause of slowdown for real apps while being developed, with a noticeable impact on developer inner loop performance especially if the exceptions occur at app startup.
cc: @noahfalk, @mikem8361, @gregg-miskelly