dotnet / BenchmarkDotNet

Powerful .NET library for benchmarking
https://benchmarkdotnet.org
MIT License

Add support to deduct overhead from benchmarks #654

Open skleanthous opened 6 years ago

skleanthous commented 6 years ago

Introduction

I'm trying to benchmark some larger operations in my library that unfortunately cannot be isolated in a way that gives any confidence the results will match production. This is similar to benchmarking an "AddToCollection" operation, which would need a setup that clears the collection before every invocation.

Programmability proposal

What I would love is to be able to do the following in my benchmark:

    [MemoryDiagnoser]
    public class DependencyBenchmark
    {
        private IManageSessionOf<MyTempState> session;
        private object myEvent;

        public DependencyBenchmark()
            => myEvent = new
            {
                Count = 2,
                Name = "Hey"
            };

        [Overhead(RunBefore = true)] // Default: if no RunXXX properties are set, the behaviour defaults to running before
        public void PreCallRequirements()
        {
            session = SessionFor<MyTempState>.StartSession();
        }

        [Overhead(RunAfter = true)]
        public void PostCallRequirements()
        {
            session.Dispose();
        }

        [Benchmark]
        public void AddEventBenchmark()
        {
            session.AddEvent(myEvent);
        }
    }

Possible process

This would be conceptually similar to a per-method-invocation setup, I guess. However, I think it can provide consistent and accurate results with the following process:

  1. If any overhead methods exist in a benchmark class, then benchmark each overhead method on its own, as if it were marked with the Benchmark attribute. In the example above, BenchmarkDotNet would run benchmarks on both the PreCallRequirements and PostCallRequirements methods independently. These would result in benchmarkPre and benchmarkPost.
  2. Benchmark the AddEventBenchmark method normally, calling the two overhead methods as method-invocation setup and teardown. Let's call this benchmarkMain.
  3. For the benchmark runs above, subtract the times found in the overhead benchmark runs from the time needed to run the method with the overheads, to isolate the time taken by the actual method (a sketch of this math follows below). In essence, display only one row with the following data:

Type | Method | Mean | Error | StdDev | Gen 0 | Gen 1 | Allocated
---- | ------ | ---- | ----- | ------ | ----- | ----- | ---------
DependencyBenchmark | AddEventBenchmark | AddEventBenchmark.Mean - PreCallRequirements.Mean - PostCallRequirements.Mean | AddEventBenchmark.Error - PreCallRequirements.Error - PostCallRequirements.Error (maybe?) | AddEventBenchmark.StdDev - PreCallRequirements.StdDev - PostCallRequirements.StdDev (maybe?) | AddEventBenchmark.Gen0 - PreCallRequirements.Gen0 - PostCallRequirements.Gen0 | AddEventBenchmark.Gen1 - PreCallRequirements.Gen1 - PostCallRequirements.Gen1 | AddEventBenchmark.Allocated - PreCallRequirements.Allocated - PostCallRequirements.Allocated

GlobalSetup and IterationSetup methods would still be called normally in all of the above cases.
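
A minimal, library-agnostic sketch of the step-3 deduction (the Stats record and Deduct method are hypothetical names, not BenchmarkDotNet API):

    using System;

    // Per-benchmark results produced by steps 1 and 2.
    public record Stats(double MeanNs, double AllocatedBytes);

    public static class OverheadMath
    {
        // Step 3: deduct both overhead means from the main benchmark's mean,
        // clamping at zero so noise cannot produce negative values.
        public static Stats Deduct(Stats main, Stats pre, Stats post) =>
            new(Math.Max(main.MeanNs - pre.MeanNs - post.MeanNs, 0),
                Math.Max(main.AllocatedBytes - pre.AllocatedBytes - post.AllocatedBytes, 0));
    }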

Benefits

  1. Allows for better isolation
  2. Allows testing operations whose measurements are normally inflated/polluted by setup

Possible problems

  1. I have no idea (though I could dig into my old measurement-systems-analysis textbooks to find out :| ) what the correct mathematical way is to derive standard deviation and error in the above scenario (see the note after this list). I understand that if this proposal is implemented, these values really have to be correct and reliable.
  2. Longer benchmarking times for benchmarks using this feature
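
For reference, the standard result for independent measurements addresses point 1: variances add under subtraction,

$$\operatorname{Var}(X - Y) = \operatorname{Var}(X) + \operatorname{Var}(Y), \qquad \sigma_{X-Y} = \sqrt{\sigma_X^2 + \sigma_Y^2},$$

so the StdDev (and Error) of a deducted result combines the component values in quadrature rather than by subtraction.
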
PonchoPowers commented 6 years ago

I take it, as you have mentioned them yourself, that you are already aware of GlobalSetup and IterationSetup.

Do you also realise that each test is run in isolation, in a separate application generated by BenchmarkDotNet?

So by the sounds of things, what you really need to do is just refactor your code, as BenchmarkDotNet already does what you are asking for.

Just do the following:

    [MemoryDiagnoser]
    public class DependencyBenchmark
    {
        private IManageSessionOf<MyTempState> session;
        private object myEvent;

        public DependencyBenchmark()
            => myEvent = new
            {
                Count = 2,
                Name = "Hey"
            };

        [GlobalSetup]
        public void GlobalSetup()
        {
            PreCallRequirements();
        }

        [Benchmark]
        public void PreCallRequirements()
        {
            session = SessionFor<MyTempState>.StartSession();
        }

        [Benchmark]
        public void PostCallRequirements()
        {
            session.Dispose();
        }

        [Benchmark]
        public void AddEventBenchmark()
        {
            session.AddEvent(myEvent);
        }

        [GlobalCleanup]
        public void GlobalCleanup()
        {
            PostCallRequirements();
        }
    }
skleanthous commented 6 years ago

Hello @Matthew-Bonner and thank you for your reply.

What you posted doesn't quite work for benchmarking AddEventBenchmark: with an in-memory collection, the results will be skewed, since each iteration is affected by the events already added to the session in earlier iterations (given that the setup you suggest is global, not per-invocation).
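
A toy illustration of the skew, with a plain List<object> standing in for the hypothetical IManageSessionOf<MyTempState>:

    using System.Collections.Generic;

    // One session created by [GlobalSetup] and reused by every invocation:
    var session = new List<object>();

    for (int i = 0; i < 1_000; i++)
    {
        // Invocation i adds to a collection that already holds i events,
        // so the measured cost of "add" drifts upward as the run proceeds.
        session.Add(new { Count = 2, Name = "Hey" });
    }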

So, in summary, what I need is a per-test/per-invocation setup, and I suggested one way in which the inaccuracy introduced by such a setup could be "subtracted" out. To compensate for its absence, I am currently doing this:

        [Benchmark]
        public IManageSessionOf<MyTempState> Overhead()
        {
            session = SessionFor<MyTempState>.StartSession();
            session.Dispose();
            return session;
        }

        [Benchmark]
        public IManageSessionOf<MyTempState> AddEventBenchmark()
        {
            session = SessionFor<MyTempState>.StartSession();
            session.AddEvent(myEvent);
            session.Dispose();
            return session;
        }

And I manually subtract one from the other. I would really like not to pollute my report with two unnecessary results.

skleanthous commented 6 years ago

Hello,

@Matthew-Bonner or @AndreyAkinshin: sorry for @-mentioning you guys, but if you let me know that something like this is OK from your side, and that this feature can/will be considered if I create a PR, I am more than keen to do all the work and create a PR ASAP. I also have some knowledge and a bit of experience of measurement theory (although both from some years ago), so I'll probably be able to find the proper way to calculate the StdDev & error in these cases.

PonchoPowers commented 6 years ago

I've made a minor contribution to the project, so I was merely seeing if what I suggested was an option, to try to be helpful. :)

AndreyAkinshin commented 6 years ago

@skleanthous, @Matthew-Bonner sorry for the late reply. I don't like the idea of [Overhead(RunBefore = true)]. If you have a macrobenchmark, [IterationSetup] should work for you. If you have a microbenchmark, you should try to design it without side effects. I don't believe that math magic with separate statistics for setup/benchmark/cleanup will work. If we are going to implement something like that, we need an example which shows that such an approach provides better results than the classic one. We are not going to add such features without proof demonstrating that we really need them.

On the other hand, the last example by @skleanthous (with the Overhead method which includes StartSession and Dispose) looks pretty nice. I guess that it's one of the best ways to solve your problem.

And I manually subtract one from the other.

We already have an issue for this problem: https://github.com/dotnet/BenchmarkDotNet/issues/431 (Option to Normalize Results by an Additive Baseline). It would be great if you created a PR which solves it. Before you start implementing this feature, we should discuss the following two topics:

  1. API. We should choose a public API that is obvious.
  2. Summary table. We should decide how to show such results in the summary table. It should be obvious what's going on in the summary, even for people who haven't seen the original code of the benchmarks.
adamsitnik commented 6 years ago

I have an idea for how we could implement this:

Instead of generating the empty Idle method which measures the overhead of BDN itself, we could tell the Engine to use the specified method as Idle (in BDN, the reported result = (result + overhead) - overhead). It should be a very simple change.

[Overhead(Target = nameof(AddEventBenchmark))]
public IManageSessionOf<MyTempState> Overhead()
{
    session = SessionFor<MyTempState>.StartSession();
    session.Dispose();
    return session;
}

[Benchmark]
public IManageSessionOf<MyTempState> AddEventBenchmark()
{
    session = SessionFor<MyTempState>.StartSession();
    session.AddEvent(myEvent);
    session.Dispose();
    return session;
}

@AndreyAkinshin what do you think?

skleanthous commented 6 years ago

Hey @AndreyAkinshin & @adamsitnik, thank you for your replies :)

Basically, as I understand things (please correct me if I'm wrong), jobs/classes that contain benchmarks are designed to benchmark cohesive behaviour with shared setup & teardown. Because of this, I personally think it is best to have overhead methods apply to all benchmarks defined in a class/job by default, although I do agree that also being able to specify a target (as in @adamsitnik's example) would be highly beneficial.

My recommendation would be to define it as such:

    public class JobA
    {
        [Overhead]
        public IManageSessionOf<MyTempState> Overhead()
        {
            session = SessionFor<MyTempState>.StartSession();
            session.Dispose();
            return session;
        }

        [Benchmark]
        public IManageSessionOf<MyTempState> AddEventBenchmark()
        {
            session = SessionFor<MyTempState>.StartSession();
            session.AddEvent(myEvent);
            session.Dispose();
            return session;
        }
    }

    public class JobB
    {
        [Benchmark]
        public IManageSessionOf<MyTempState> FullAddEventBenchmark()
        {
            session = SessionFor<MyTempState>.StartSession();
            session.AddEvent(myEvent);
            session.Dispose();
            return session;
        }
    }

With report showing:

               Method |      Mean | With Overhead |    Error |   StdDev |
--------------------- |----------:|--------------:|---------:|---------:|
    AddEventBenchmark |  995.9 ns |       1500 us | 51.03 ns | 52.40 ns |
FullAddEventBenchmark |   2500 ns |          0 us | 51.03 ns | 52.40 ns |

The With Overhead column would only appear if an overhead method is defined. This way, reports stay easily understandable.

adamsitnik commented 6 years ago

@skleanthous our IterationSetup and IterationCleanup attributes allow you to set Target. If Target == null, the attribute applies to all benchmarks in the given class. More in our docs.
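
For example (reusing the hypothetical session types from earlier in the thread; the method names here are illustrative):

    [IterationSetup(Target = nameof(AddEventBenchmark))]
    public void StartSessionForAdd() => session = SessionFor<MyTempState>.StartSession();

    [IterationCleanup(Target = nameof(AddEventBenchmark))]
    public void DisposeSessionForAdd() => session.Dispose();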

skleanthous commented 6 years ago

@adamsitnik maybe I didn't write it very clearly, but what I wanted to say is that I agree with having a target, but that it shouldn't be required. (That was unclear to me from your message.)

adamsitnik commented 6 years ago

I wanted to say is that I agree with having a target, but that it shouldn't be required.

@skleanthous so we both agree, and we now just need @AndreyAkinshin's confirmation about my Idle idea. Andrey is getting back from the MVP summit; I am not sure when he is going to respond.

AndreyAkinshin commented 6 years ago

@adamsitnik, @skleanthous, this idea sounds pretty interesting. However, it will not be so easy to implement. Right now we apply special rules to the Idle methods. Here are a few problems:

  1. The Pilot stage is used only for the Main method; Idle skips the Pilot stage and uses the same number of invocations as the Main method. I'm not sure that this is an acceptable trick for non-empty methods.
  2. The Idle method has known statistical properties. That allows getting the Result iterations via Result[i] = Main[i] - Average(Idle[1..N]) (see the sketch after this list). But we can't use the average value for a non-empty method with custom user logic. We would then have to work with two complex distributions, each of which could have a tricky form. We can calculate basic metrics by known formulas, but I don't know how to calculate "advanced" metrics like P99 or Kurtosis. Memory statistics (like Gen 0/1/2) can also be spoiled. An example: both the Overhead and Main methods allocate memory; the Main method allocates a little more, but both report the same Gen 0/1/2 stats because of the allocation quantum; if we then just subtract one value from the other, we get zero stats, which is not a correct result.
  3. What should we do if the Overhead method takes more time than the Main method? Print a warning? Use our usual Max(value, 0) rule? We have issue https://github.com/dotnet/BenchmarkDotNet/issues/601 which assumes that "impossible" values (less than 1 CPU clock cycle) like 0.0088 ns should be replaced by 0 ns. Should we disable this trick for custom overhead methods?
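
A plain C# sketch of the per-iteration rule from points 2 and 3 (illustrative only, not the actual engine code; EngineMath is a hypothetical name):

    using System;
    using System.Linq;

    internal static class EngineMath
    {
        // Result[i] = Main[i] - Average(Idle[1..N]); safe for the generated
        // empty Idle method because its distribution is well understood.
        public static double[] ApplyIdleDeduction(double[] mainNs, double[] idleNs)
        {
            double idleMean = idleNs.Average();

            // Negative differences fall under the usual Max(value, 0) rule.
            return mainNs.Select(m => Math.Max(m - idleMean, 0.0)).ToArray();
        }
    }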

Thus, there are two main disadvantages:

  1. An honest implementation would be very complicated.
  2. We would have to emulate all metrics for the difference between the two distributions (each method can have a huge StdDev and allocate memory). I suspect such "emulated" values would confuse people in many places.

I still think that it's better to just introduce a few additional columns without any changes in the execution infrastructure.

skleanthous commented 6 years ago

@AndreyAkinshin @adamsitnik so, from what I understand, in order to avoid these issues the overhead methods need to run as if they were normal benchmarks, but deduct their mean from the mean of the normal benchmarks and add a column as per my reply here (the With Overhead column at the bottom of the message): https://github.com/dotnet/BenchmarkDotNet/issues/654#issuecomment-372300435. Can't this be handled at the report-generation level?

AndreyAkinshin commented 6 years ago

Can't this be handled at the report-generation level?

Yes, we should just introduce an additional attribute and an additional column in the summary table. Probably we should replace basic columns like Mean (or other columns which we can calculate for the difference between two distributions). In order to avoid any confusion, we can denote it as Mean* or Mean^, or Mean with some other additional symbol (and explain it in the legend section).

I'm also not sure about the [Overhead] name. Probably it makes sense to extend the Baseline feature set. In https://github.com/dotnet/BenchmarkDotNet/issues/299#issuecomment-261061598, I suggested a few possible options for extending baseline features and column replacement/hiding policies.

The first thing we should settle (before actually starting to implement this feature) is what the API should look like and what the summary table should look like. We have too many different users, and each of them has their own needs. It's really hard to design an API which satisfies everyone.

@adamsitnik, what do you think?

skleanthous commented 6 years ago

Personally, I feel the [Overhead] name is very good because it conveys a lot of information both about what it is and about what to expect of it functionally. I understand that the relationship between this new attribute and Baseline would be an is-a relationship (as in: overhead, as proposed, is a type of baseline, specifically DiffAbsolute), but naming things for what they do rather than what they are results in better readability, imo.

Obviously, the above is just my opinion. I understand that there are already established standards in place and I will be grateful to see this functionality eventually deployed regardless of naming :)

adamsitnik commented 6 years ago

@AndreyAkinshin @skleanthous If the implementation can be simple (no changes to the engine, just an extra column and a few math operations before printing the results), then I am OK with it.

skleanthous commented 6 years ago

@adamsitnik & @AndreyAkinshin So is this agreed? If so, I would like to work on this feature.

adamsitnik commented 6 years ago

@skleanthous 👍

@AndreyAkinshin ?

AndreyAkinshin commented 6 years ago

@skleanthous, 👍 I'm still not sure about naming, but let's try your idea with [Overhead].

skleanthous commented 6 years ago

Awesome! Thank you. I'll start ASAP. Considering that I have tons of work at my day job, my own open-source project to maintain, and (most importantly) a family, any pointers on where to start/focus would be greatly appreciated.

Just assign me this ticket and I'll start immediately.

AndreyAkinshin commented 6 years ago

@skleanthous,

  1. Let's start with something like [Benchmark, Overhead]. I think we should keep the [Benchmark] attribute on such methods because we are actually going to benchmark them (otherwise, it could be confusing). You can inherit [Overhead] from TargetedAttribute, so it will be possible to specify Target (meaning we "apply" this overhead to a specific method); a rough sketch follows after this list.
  2. Next, you should modify BenchmarkConverter: find the benchmark methods with the [Overhead] attribute and pass them (with the target value) to our "intermediate benchmark representation".
  3. You should introduce new columns (e.g., Mean^) and implement the math (use BaselineScaledColumn as a sample).
  4. As soon as you get something that works, please create a WIP PR; @adamsitnik and I will check that everything is fine and provide some first feedback.
  5. Don't forget about unit tests and the documentation.
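
A rough sketch of step 1, assuming [Overhead] derives from BenchmarkDotNet's TargetedAttribute (the same base that gives [IterationSetup] its optional Target property); the exact shape would be settled in the PR:

    using System;
    using BenchmarkDotNet.Attributes;

    // Hypothetical attribute: marks a benchmarked method whose results are
    // deducted from the targeted benchmark's results at report time.
    [AttributeUsage(AttributeTargets.Method)]
    public class OverheadAttribute : TargetedAttribute
    {
    }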

Let me know if you have any questions.

skleanthous commented 6 years ago

@AndreyAkinshin thank you for the info. I'll check it out over the weekend and reply here if I need anything.

billwert commented 5 years ago

Did anyone ever do a PR for this? I was just reading some documentation, @adamsitnik, and imagined [Overhead] as well, and was happy to find it had already been discussed in depth. :)

JohannesDeml commented 3 years ago

I'm hitting some overhead due to method calls + dictionary lookups. From the dates on this issue and #431, I guess there hasn't been any progress on them, but I feel that this additional logic could help generate more meaningful/correct statistics. Has anyone ever started working on it? If so, please let me know; I would like to help make this feature a reality.

I think the notion by @adamsitnik is still very sound.

[Overhead(Target = nameof(AddEventBenchmark))]
public IManageSessionOf<MyTempState> Overhead()
{
  session = SessionFor<MyTempState>.StartSession();
  session.Dispose();
  return session;
}

[Benchmark]
public IManageSessionOf<MyTempState> AddEventBenchmark()
{
  session = SessionFor<MyTempState>.StartSession();
  session.AddEvent(myEvent);
  session.Dispose();
  return session;
}

What are the current thoughts on [Benchmark, Overhead] vs. [Overhead]? I think both notations are fine, so let me know which you prefer. If no one else wants to grab the issue, I'll work on it after Christmas :)