stephentoub opened this issue 3 years ago
Tagging subscribers to this area: @tommcdon See info in area-owners.md if you want to be subscribed.
Author: | stephentoub |
---|---|
Assignees: | - |
Labels: | `area-Diagnostics-coreclr` |
Milestone: | - |
Work has been done in the past to reduce the overhead of exceptions while running under the debugger, for example: https://github.com/dotnet/diagnostics/blob/0d78fe5fa58c88d4524f7698f9a1abbeba31f33a/src/inc/cordebug.idl#L3291. Just out of curiosity, is the impact measurably smaller if the exception is thrown/caught in non-user code and with Just My Code enabled?
Visual Studio logs the exception to the Output window.
> is the impact measurably smaller if the exception is thrown/caught in non-user code and with Just My Code enabled?
I can say yes.
> Just out of curiosity, is the impact measurably smaller if the exception is thrown/caught in non-user code and with Just My Code enabled?
On my machine, with JMC on:
Two thoughts --
> Add background thread(s), as I suspect the time to reach GC Suspend is much better on this trivial app than in a real thing
Sure... doing what? The picture doesn't change all that much, for example, if I have a bunch of thread pool threads just looping over sleeps:
```csharp
using System;
using System.Diagnostics;
using System.Threading;
using System.Threading.Tasks;

for (int i = 0; i < Environment.ProcessorCount; i++)
{
    Task.Run(() =>
    {
        while (true) Thread.Sleep(1);
    });
}

while (true)
{
    var sw = Stopwatch.StartNew();
    for (int i = 0; i < 100; i++)
    {
        try { throw new Exception(); } catch { }
    }
    Console.WriteLine(sw.Elapsed);
}
```
but if I make those threads sit in tight loops:
```csharp
using System;
using System.Diagnostics;
using System.Threading.Tasks;

for (int i = 0; i < Environment.ProcessorCount; i++)
{
    Task.Run(() =>
    {
        while (true) { }
    });
}

while (true)
{
    var sw = Stopwatch.StartNew();
    for (int i = 0; i < 100; i++)
    {
        try { throw new Exception(); } catch { }
    }
    Console.WriteLine(sw.Elapsed);
}
```
the gap actually grows, going from ~0.0007s to ~7.3s, so a 10,000x slowdown, presumably because each debugger stopping event has to suspend the runtime, and threads spinning in tight managed loops take longer to reach a suspend point than sleeping ones.
> Make the call stacks deeper
Part of the problem I've seen in some apps is that the number of exceptions is significantly amplified because of deeper "call stacks", and more specifically, when async is involved, because the exception ends up getting caught and thrown again at every level of the chain. For example, if I change the app to be:
```csharp
using System;
using System.Diagnostics;
using System.Threading.Tasks;

while (true)
{
    var sw = Stopwatch.StartNew();
    for (int i = 0; i < 100; i++)
    {
        try { await Run(10); } catch { }
    }
    Console.WriteLine(sw.Elapsed);
}

static async Task Run(int depth)
{
    if (depth > 0)
    {
        await Run(depth - 1);
    }
    throw new Exception();
}
```
now ctrl-F5 shows numbers like ~0.01s and F5 shows numbers like ~28.7s.
But even with deeper non-async call stacks, the gap is still huge. I changed it to this:
```csharp
using System;
using System.Diagnostics;
using System.Threading.Tasks;

while (true)
{
    var sw = Stopwatch.StartNew();
    for (int i = 0; i < 100; i++)
    {
        try { Run(10); } catch { }
    }
    Console.WriteLine(sw.Elapsed);
}

static void Run(int depth)
{
    if (depth > 0)
    {
        Run(depth - 1);
    }
    throw new Exception();
}
```
and now ctrl-F5 shows numbers like ~0.001s and F5 shows numbers like ~2.05s, so down to a ~2000x difference. The gap does drop further the deeper the frames go, though: at 100 frames, the difference drops to ~400x.
RE: Background threads - from GC investigations and/or telemetry do you have a sense for what suspend times for 'typical' apps look like? If so, I might try to mimic that time in whatever app we will use as we start to dig into the numbers.
> and now ctrl-F5 shows numbers like ~0.001s and F5 shows numbers like ~2.05s, so down to a ~2000x difference. The gap does drop further the deeper the frames go, though: at 100 frames, the difference drops to ~400x.
To make sure I understand, you are saying that it takes significantly more time as call stack size increases (~2.05s in your last example vs. ~1.9s in your first), but the multiplier actually falls (~2000x vs. ~4500x) since the Ctrl-F5 time goes up as well. Correct?
> Correct?
Yup
I don't have a repro that has as impressive a penalty as 4000x but I do have a real-world case.
It's a champion scenario for the Azure EventHubs Processor SDK that has multi-threading and network calls.
In total the startup process throws ~180 exceptions (which are in fact 32 distinct exceptions but multiplied by layers of async state machines re-throwing) and with JMC off the startup time is:
- Without debugger: ~7.5 sec
- With debugger: ~20 sec
What was surprising to me is how large the impact of the debugging is compared to the actual networking. The app does 112 blob requests during startup plus various TCP connections to the EventHubs service, so throwing an exception in an async method costs as much as making multiple real service calls over the network.
I'm happy to provide the app with all connection strings ready or record a profile.
I spent a little time with a profiler to assess where we are spending our time. I profiled Steve's example that uses depth 10 callstacks without async:
```csharp
using System;
using System.Diagnostics;
using System.Threading.Tasks;

while (true)
{
    var sw = Stopwatch.StartNew();
    for (int i = 0; i < 100; i++)
    {
        try { Run(10); } catch { }
    }
    Console.WriteLine(sw.Elapsed);
}

static void Run(int depth)
{
    if (depth > 0)
    {
        Run(depth - 1);
    }
    throw new Exception();
}
```
On my particular machine (Windows, .NET Core 3.1.11, VS 16.8.4), Ctrl-F5 is ~1.5ms per iteration of the outer while loop and F5 is ~1.5s, so about a 1000x overhead for me. During the ~15ms handling of each exception (1.5s / 100 throws), there are 3 events dispatched from the runtime to the debugger:
[profiler screenshot: breakdown of the ~15ms window for one exception]
Each event logically traverses across 3 components:
My take is there are two different approaches we could follow at this point: (1) keep incrementally optimizing the existing stopping-event path, or (2) make more major changes, such as building in-proc extensibility points so exception events can be filtered without a cross-process stop.
Personally my preference is that if we are going to spend time on this we should look into option (2), the major changes. I worry that we don't have much runway left to optimize incrementally given how easy it is for apps to start throwing yet more exceptions. I also expect building in-proc extensibility points opens the door to a bunch of other useful diagnostic scenarios. For example in the debugger space it could contribute towards other high performance events like conditional breakpoints, or being able to do more sophisticated analysis of rethrown exceptions for async. Outside debugging another example is having a dump collector that can capture dumps on 1st chance exceptions.
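As a rough illustration of what in-proc filtering looks like today (using the existing `AppDomain.FirstChanceException` event rather than the new extensibility point being proposed), the sketch below runs the filter on the throwing thread with no cross-process stop:

```csharp
// Sketch: in-proc first-chance exception filtering via the existing
// AppDomain.FirstChanceException event (illustrative only; the extensibility
// point discussed above would be a new mechanism, this just shows the shape).
using System;
using System.Runtime.ExceptionServices;

AppDomain.CurrentDomain.FirstChanceException += (sender, e) =>
{
    // Runs synchronously on the throwing thread before any catch handler,
    // so filtering here costs a delegate call instead of a cross-process
    // debugger round trip with runtime suspension.
    if (e.Exception is TimeoutException)
        Console.WriteLine($"first chance: {e.Exception.GetType().Name}");
};

try { throw new TimeoutException(); } catch { }
```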
Priority-wise the runtime diagnostics team already has months of committed feature work for .NET 6 and a hearty backlog of bugs. I'd guess this would land in the grey area where we might be able to get it within the release timeframe but it is far enough down that we shouldn't be making any promises at this point.
Thoughts on all of this? In particular @gregg-miskelly I'd be curious to hear what you think feasibility/interest is for doing a meatier change here?
I would agree with you that there is appeal in approach 2. On the other hand, that makes the change way way more expensive. So I think it is worth at least thinking about how much we could improve without going that far.
Here are the inefficiencies that I see in the current system --

- We should be able to often eliminate the stack walk (it seems to exist largely to supply the `ICorDebugFrame` argument of the exception callback).
- The user first chance notification exists to check for `[NonUserCode]`/`[DebuggerHidden]` frames in the extremely rare chance they exist. It seems like we could turn off this notification altogether, and instead allow user first chance to fire more than once in the unlikely case that we stop at user first chance and discover the stopping frame should actually be considered non-user code.

If you don't see any reason I am missing that we need these things -- if we took out the stack walks entirely, and reduced the number of stopping events from three to two, can you guess from the trace about what 'x' we are on? Is the main thing left synchronizing to send stopping events?
> We should be able to often eliminate the stack walk
Just to confirm you are saying it would be OK if we defined a new exception callback that didn't provide the ICorDebugFrame argument or we kept the current callback and return null for that parameter? If you use that parameter for anything we probably still have to do a stackwalk in DBI.
> if we took out the stack walks entirely, and reduced the number of stopping events from three to two, can you guess from the trace about what 'x' we are on? Is the main thing left synchronizing to send stopping events?
It is quite a bit more than I was anticipating we'd be able to cut. I'm a little surprised we hadn't already cut those in the past : ) Rough guess maybe a 70% savings if this all held up? If we still need the stackwalk but we can drop the first chance notification, that might be a 50% savings.
One nice part is that an API that disables debugger exception events is part of what we'd need to build to make the in-proc filtering case useful anyways.
> Rough guess maybe a 70% savings if this all held up?
I'll take it :smile:
> Just to confirm you are saying it would be OK if we defined a new exception callback that didn't provide the ICorDebugFrame argument or we kept the current callback and return null for that parameter? If you use that parameter for anything we probably still have to do a stackwalk in DBI.
First, let me try to be precise for what the VS debugger would need:
Assuming that basic level of decoding can be done without a stack walk, adding a new callback interface is fine with me.
Thanks Gregg! It's possible to get the frame information you need without a stackwalk, but it would take a good bit more refactoring. Effectively we'd be building a new set of parallel exception event handlers that capture each piece of information as the runtime does exception dispatch and then route it all the way up to a new public ICorDebug API. I'm going to propose that if we are trying for the cheap and simple, we should just add an API that lets you turn off the first chance exception events and then assess whether it is worth trying to optimize away the stackwalks as round 2.
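A purely hypothetical sketch of that "round 1" knob (the names below are illustrative only; the real change would be a new interface on the ICorDebug surface consumed by VS/vsdbg, and none of these names exist today):

```csharp
// Hypothetical shape only: letting a debugger opt out of first chance
// exception events. Not a real ICorDebug API.
public interface IExceptionEventControl
{
    // When false, the runtime would skip dispatching first chance exception
    // events to the debugger, avoiding a per-exception suspend/resume cycle.
    void SetFirstChanceExceptionEventsEnabled(bool enabled);
}
```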
Just to be clear, we need two items for 'round 1':
I have opened VS Task#1274770 to track the VS/vsdbg side of consuming this.
> if the top frame is in a user module, but it turns out that the method has JMC attributes that make it non-user code
Are you saying that despite the presence of the NonUserCode attribute in the assembly that VS would not have previously called ICorDebugFunction2::SetJmcStatus to mark it as non-user code? If VS had previously marked it as non-user then the runtime should never generate a USER_FIRST_CHANCE event in that frame.
And thanks for mentioning it again, somehow I missed the multiple USER_FIRST_CHANCE requirement and felt silly when I re-read what you wrote above : )
Correct. VS doesn't attempt to scan through a module during module load and pre-emptively call SetJmcStatus. We do it as we encounter those methods.
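For reference, `[NonUserCode]` in this exchange refers to `System.Diagnostics.DebuggerNonUserCodeAttribute`; a minimal sketch of a method the JMC machinery would classify as non-user code once marked:

```csharp
// Sketch: a method carrying one of the attributes the user-first-chance scan
// checks for. With Just My Code enabled, the debugger treats Helper as
// non-user code and should not stop for the exception thrown/caught inside it.
using System;
using System.Diagnostics;

Wrapper.Helper();

static class Wrapper
{
    [DebuggerNonUserCode]
    public static void Helper()
    {
        try { throw new InvalidOperationException(); } catch { }
    }
}
```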
Reconsider for .NET 9 per https://github.com/dotnet/runtime/issues/92856
Repro:
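A minimal sketch of the repro, reconstructed from the measurements quoted below (100 throw/catch pairs per timed iteration):

```csharp
// Reconstructed repro (an assumption based on the numbers below): time 100
// throw/catch pairs per iteration of the outer loop and print the elapsed time.
using System;
using System.Diagnostics;

while (true)
{
    var sw = Stopwatch.StartNew();
    for (int i = 0; i < 100; i++)
    {
        try { throw new Exception(); } catch { }
    }
    Console.WriteLine(sw.Elapsed);
}
```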
When I run that with ctrl-F5, I get output like:
When I run that with F5, I get output like:
That's a 4500x slowdown, with every exception throw/catch consuming ~20ms. And this does scale with exception count: if I increase the repro to 1000 exceptions, without the debugger it takes ~0.004s and with the debugger it takes ~18s.
While ideally well-behaved apps wouldn't throw lots of exceptions, this has shown to be a significant cause of slowdown for real apps while being developed, with a noticeable impact on developer inner loop performance especially if the exceptions occur at app startup.
cc: @noahfalk, @mikem8361, @gregg-miskelly