overhead and workload invocation sequences diverge

AndyAyersMS commented 1 year ago

If these don't match then benchmark times will be either under or over estimated.

Context https://github.com/dotnet/runtime/issues/86033#issuecomment-1546306621

This is the "good" case where the jit isn't messing up the workload invoke codegen (which just makes the problem worse).

        public delegate System.Single OverheadDelegate();

        private void OverheadActionUnroll(System.Int64 invokeCount)
        {

            for (System.Int64 i = 0; i < invokeCount; i++)
            {
                consumer.Consume(overheadDelegate());

        public delegate  System.Numerics.Vector3 WorkloadDelegate();

        private void WorkloadActionUnroll(System.Int64 invokeCount)
        {

            for (System.Int64  i = 0; i < invokeCount; i++)
            {
                consumer.Consume(workloadDelegate().X);

;; overhead

       mov      rbp, gword ptr [rsi+38H]
       mov      rax, gword ptr [rsi+28H]
       mov      rcx, gword ptr [rax+08H]
       call     [rax+18H]BenchmarkDotNet.Autogenerated.Runnable_0+OverheadDelegate:Invoke():float:this
       vmovss   dword ptr [rbp+48H], xmm0

;; workload

       mov      rbp, gword ptr [rsi+38H]
       mov      rax, gword ptr [rsi+30H]
       mov      rcx, gword ptr [rax+08H]
       lea      rdx, [rsp+110H]
       call     [rax+18H]BenchmarkDotNet.Autogenerated.Runnable_0+WorkloadDelegate:Invoke():System.Numerics.Vector3:this
       vmovss   xmm0, dword ptr [rsp+110H]
       vmovss   dword ptr [rbp+48H], xmm0

May be restricted to certain return types like Vector; I haven't seen this elsewhere.

AndreyAkinshin commented 1 year ago

@AndyAyersMS Thanks for the bug report! I can confirm that it is a severe issue that affects all the benchmarks that return structs (except IntPtr, UIntPtr). The minimal repro:

public struct MyStruct
{
    public int Value;
}

public class Benchmarks
{
    [Benchmark]
    public MyStruct Foo() => new MyStruct();
}

However, it's not clear to me how to properly fix this problem. Below are some of my thoughts on this.

Firstly, we have two primary ways to consume the return value of the workload method:

consumer.Consume(workloadDelegate().Value); // Workload:1
consumer.Consume(workloadDelegate()); // Workload:2

The first one (Workload:1) is the current default. We don't use the second one (Workload:2) right now because it has object wrapping as a side-effect (Consume can natively consume only primitive types, IntPtr, UIntPtr, object).

Secondly, we have several ways to define the overhead delegate:

// Overhead:1
private System.Int32 __Overhead()
{
    return default(System.Int32);
}

// Overhead:2
private MyStruct __Overhead()
{
    return new MyStruct();
}

// Overhead:3
private MyStruct value;

private System.Int32 __Overhead()
{
    return value;
}

Here are some comments about all of these options:

Overhead:1 It is the current default: it always prefers a type that can be natively consumed by Consumer. Unfortunately, it doesn't match either Workload:1 (which has an additional .Value accessor) or Workload:2.
Overhead:2 It matches Workload:2, but it has a call of the MyStruct constructor as a side effect. If Workload doesn't create a new struct (e.g., it reads a value from a field), such an Overhead implementation is incorrect.
Overhead:3 It matches Workload:2, but it has a field reading as a side effect. If Workload doesn't read a field (e.g., it creates a new MyStruct instance), such an Overhead implementation is incorrect.

Since we don't know in advance how the given benchmark obtains the struct value, it's quite hard to provide a proper Overhead implementation that matches the Workload implementation. Therefore, BenchmarkDotNet uses Overhead:1 to reduce the risk of mismatched implementation. Unfortunately, it leads to other issues that are presented in https://github.com/dotnet/runtime/issues/86033

At the moment, I have only one idea of how to provide a proper baseline for benchmarks that return structs:

We introduce our own single-field struct in the BenchmarkDotNet template:

public struct OverheadStruct
{
    public int Value;
}

The overhead method returns a new instance of this struct:

private OverheadStruct __Overhead()
{
    return new OverheadStruct();
}

Since OverheadStruct contains a single field, it should provide a nice baseline for struct creation (at least for structs that contain at least one field).

For both overheadDelegate and overheadDelegate we pass a field value to a consumer (whenever field consuming is applicable):

        public delegate OverheadStruct OverheadDelegate();

        private void OverheadActionUnroll(System.Int64 invokeCount)
        {

            for (System.Int64 i = 0; i < invokeCount; i++)
            {
                consumer.Consume(overheadDelegate().Value);

        public delegate System.Numerics.Vector3 WorkloadDelegate();

        private void WorkloadActionUnroll(System.Int64 invokeCount)
        {
            for (System.Int64  i = 0; i < invokeCount; i++)
            {
                consumer.Consume(overheadDelegate().X);

I don't feel like it is a perfect solution, but it should (seemingly) provide a more accurate baseline for benchmarks that return structs.

@AndyAyersMS @adamsitnik What do you think?

timcassell commented 1 year ago

@AndreyAkinshin What about adding a Consumer<T> type to consume the struct as a whole? (Obviously would not work with ref struct, but neither does the current consumer.)

timcassell commented 1 year ago

it has object wrapping as a side-effect

Also, if you use Consume<T>(in T value), it does not box the value (nullables fixed in #2191).

AndreyAkinshin commented 1 year ago

@timcassell With such an approach, Consume will spend some time copying the fields of the given value to the holder struct instance. It would be another unpleasant side effect that has to be reproduced in Overhead.

timcassell commented 1 year ago

@timcassell With such an approach, Consume will spend some time copying the fields of the given value to the holder struct instance. It would be another unpleasant side effect that has to be reproduced in Overhead.

I would assume that any consume code would be replicated in overhead regardless, no?

Anyway, I was thinking of simplifying it to this.

public unsafe void Consume<T>(in T value)
    => ptrHolder = (IntPtr) Unsafe.AsPointer(ref Unsafe.AsRef(in value));

Then it won't matter how large the struct is, it's only grabbing its reference. What do you think? (current implementation passes it to DeadCodeEliminationHelper.KeepAliveWithoutBoxingReadonly)

timcassell commented 1 year ago

And I think the overhead method can be changed to this:

private MyStruct __Overhead()
{
    Unsafe.SkipInit(out MyStruct value);
    return value;
}

That matches the workload type, and avoids the cost of field reading and constructor call and zero-initializing.

AndreyAkinshin commented 1 year ago

Then it won't matter how large the struct is, it's only grabbing its reference.

Sounds good.

That matches the workload type, and avoids the cost of field reading and constructor call and zero-initializing.

It's a great idea, I like it!

Do you want to send a PR?

timcassell commented 1 year ago

Do you want to send a PR?

Sure thing.

MichalPetryka commented 1 year ago

Why is a consumer even needed here? Isn't it enough to just do a NoInline method call?

AndyAyersMS commented 1 year ago

Another thought is to not optimize these methods, that way the return value can be unconsumed but presumably would always still be produced. But it would make overhead higher, which is probably less desirable.

timcassell commented 1 year ago

Why is a consumer even needed here? Isn't it enough to just do a NoInline method call?

I thought the same thing. There was a short discussion about that in #2173.

dotnet / BenchmarkDotNet

overhead and workload invocation sequences diverge #2305