dotnet / BenchmarkDotNet

Behaviour of the dotMemory (and dotTrace) diagnosers #2628

Open stevejgordon opened 2 months ago

stevejgordon commented 2 months ago

I recently experimented with the new JetBrains diagnosers. I love the concept. However, I was surprised by how they are implemented. Right now, they attach before the WorkloadActual phase and detach after it, which means they record every operation across all workload iterations, potentially millions of invocations.

// BeforeActualRun
Target snapshot file: C:\Code\Personal\benchmarks\Test.App.Benchmarks\bin\Release\net9.0\BenchmarkDotNet.Artifacts\snapshots\Test.App.Benchmarks.AuthenticationBenchmarks.Serialize-20240829-141917.dmw
Attaching dotMemory to the process...
dotMemory is successfully attached
WorkloadActual   1: 2097152 op, 1256188200.00 ns, 598.9972 ns/op
WorkloadActual   2: 2097152 op, 1208487600.00 ns, 576.2518 ns/op
WorkloadActual   3: 2097152 op, 1206498200.00 ns, 575.3032 ns/op
WorkloadActual   4: 2097152 op, 1222480300.00 ns, 582.9240 ns/op
WorkloadActual   5: 2097152 op, 1204453500.00 ns, 574.3282 ns/op
WorkloadActual   6: 2097152 op, 1207906100.00 ns, 575.9745 ns/op
WorkloadActual   7: 2097152 op, 1216496400.00 ns, 580.0707 ns/op
WorkloadActual   8: 2097152 op, 1215363000.00 ns, 579.5302 ns/op
WorkloadActual   9: 2097152 op, 1211069100.00 ns, 577.4827 ns/op
WorkloadActual  10: 2097152 op, 1204306800.00 ns, 574.2582 ns/op
WorkloadActual  11: 2097152 op, 1218283800.00 ns, 580.9230 ns/op
WorkloadActual  12: 2097152 op, 1218864300.00 ns, 581.1998 ns/op
WorkloadActual  13: 2097152 op, 1201439400.00 ns, 572.8909 ns/op
WorkloadActual  14: 2097152 op, 1213103800.00 ns, 578.4530 ns/op
WorkloadActual  15: 2097152 op, 1191225400.00 ns, 568.0205 ns/op

// AfterActualRun
Taking dotMemory snapshot...
dotMemory snapshot is successfully taken
Detaching dotMemory from the process...
dotMemory is successfully detached

This makes the information they produce useful but hard to apply in my typical workflow. Most often, I benchmark first and then, if I need to figure out where I could potentially save allocations, I run dotMemory over the same code. That inner loop is a little slow. I was expecting the diagnosers to perform a single invocation of the benchmark method so that the results show the allocations for just that invocation. With the current behaviour, I have to scale everything down by the number of operations, and it also results in larger dotTrace and dotMemory files. Is it possible to limit the number of operations that these diagnosers analyse?

cc @AndreyAkinshin and @martinothamar
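
For illustration, here is roughly the kind of scaled-down configuration I mean when I say I have to scale things down. Treat it as a minimal sketch: the DotMemoryDiagnoser type name and the exact job settings are assumptions about the BenchmarkDotNet.Diagnostics.dotMemory package rather than verified usage.

```csharp
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Configs;
using BenchmarkDotNet.Diagnostics.dotMemory; // assumed namespace of the dotMemory diagnoser package
using BenchmarkDotNet.Engines;
using BenchmarkDotNet.Jobs;

public class ProfilingConfig : ManualConfig
{
    public ProfilingConfig()
    {
        // Scale the measured phase down so the profiler records as few operations as possible:
        // one measured iteration, one invocation, no warmup. The timing results become
        // meaningless, but the snapshot then covers roughly a single invocation.
        AddJob(Job.Default
            .WithStrategy(RunStrategy.Monitoring)
            .WithWarmupCount(0)
            .WithIterationCount(1)
            .WithInvocationCount(1)
            .WithUnrollFactor(1));

        AddDiagnoser(new DotMemoryDiagnoser()); // assumed type name from the diagnoser package
    }
}

[Config(typeof(ProfilingConfig))]
public class AuthenticationBenchmarks
{
    [Benchmark]
    public void Serialize()
    {
        // benchmark body elided
    }
}
```

That defeats the purpose of getting timing and profiling data from the same run, which is why a built-in way to limit what the diagnosers analyse would be preferable.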

timcassell commented 2 months ago

It looks like the diagnosers hook into the HostSignal.BeforeActualRun and HostSignal.AfterActualRun events. MemoryDiagnoser, ThreadingDiagnoser, and ExceptionDiagnoser are currently hard-coded into Engine.GetExtraStats(), which runs immediately after HostSignal.AfterActualRun. Perhaps we could add another hook point so diagnosers can plug into the GetExtraStats method (and provide a way for them to get the totalOperationsCount to divide their results by), and at the same time decouple those built-in diagnosers.
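
To make that concrete, here is a simplified sketch of the two pieces: the host-signal callback diagnosers get today, and a hypothetical extra-stats hook. Only Handle, HostSignal, and DiagnoserActionParameters mirror the existing IDiagnoser surface; the ProcessExtraStats name and signature are made up for illustration.

```csharp
using BenchmarkDotNet.Diagnosers;
using BenchmarkDotNet.Engines;

public class SnapshotDiagnoserSketch
{
    // Today: the only engine-level extension point such a diagnoser uses is the host-signal
    // callback, so attaching/detaching has to bracket the entire WorkloadActual phase.
    public void Handle(HostSignal signal, DiagnoserActionParameters parameters)
    {
        switch (signal)
        {
            case HostSignal.BeforeActualRun:
                // attach the profiler to parameters.Process
                break;
            case HostSignal.AfterActualRun:
                // take a snapshot and detach
                break;
        }
    }

    // Hypothetical: an additional hook invoked around Engine.GetExtraStats(), receiving the
    // total operation count so the diagnoser could report its numbers per operation.
    public void ProcessExtraStats(long totalOperationsCount)
    {
        // e.g. bytesAllocatedInMeasuredPhase / totalOperationsCount
    }
}
```

The attach/detach window could then stay where it is, while out-of-process profilers would gain the same per-operation scaling that the built-in diagnosers get today.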

martinothamar commented 2 months ago

I agree it would be nice to have both scenarios supported 👍 There is sampling to consider, I guess. Depending on the workload under benchmark, a CPU tracer might not yield very interesting results for individual operations. Does dotMemory sample as well?

luithefirst commented 1 month ago

I also tried the new dotMemory diagnoser in my latest benchmarking session and likewise found the current implementation not ideal in scenarios where there is an IterationSetup, which is then also included in the dotMemory recording. It is not easy to select the time range of a single iteration precisely, so I would also like an option that tracks only one iteration. Creating a snapshot per iteration might be another option.
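
For example (names are illustrative), with a benchmark like the sketch below, dotMemory stays attached across the whole measured phase, so the allocations from Setup() are recorded for every iteration alongside the allocations the benchmark itself makes:

```csharp
using System;
using BenchmarkDotNet.Attributes;

public class SetupHeavyBenchmarks
{
    private byte[] _input = Array.Empty<byte>();

    [IterationSetup]
    public void Setup()
    {
        // Runs before every iteration while dotMemory is attached, so these
        // allocations end up in the recording as well.
        _input = new byte[1024 * 1024];
    }

    [Benchmark]
    public string Serialize()
    {
        // Only the allocations made here are of interest, but in the snapshot
        // they are mixed with the Setup() allocations above.
        return Convert.ToBase64String(_input);
    }
}
```

Per-iteration snapshots, as suggested above, would at least provide clear boundaries for separating the setup allocations from the benchmark's own.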

My second wish for future improvements would be an option to enable full allocation tracking (instead of sampled tracking), as in some cases I would prefer to run a precise analysis.