dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License
14.91k stars 4.63k forks source link

[API Proposal]: Blackhole #104877

Open timcassell opened 1 month ago

timcassell commented 1 month ago

Background and motivation

A JIT intrinsic API that does nothing. Similar to Java's Blackhole. Blackhole is used to prevent dead code elimination and out-of-order instructions generated by a compiler (not by the CPU, so it doesn't emit any barriers). The actual call does nothing, and is stripped out of the assembly.

The purpose of this API is to make it easier/possible to benchmark sensitive operations. Without this API, heavier alternatives are required like calling a method with NoInlining, or writing to a field (see Consumer class in BDN), which can spoil the results of the measurement. The cost of this has already been observed (#89940) (also credit to @EgorBo for this idea in that issue).

API Proposal

namespace System.Runtime.CompilerServices;

public static class Blackhole
{
    public static void Consume<T>(scoped T value) where T : allows ref struct;
    public static void Consume<T>(scoped ref readonly T value) where T : allows ref struct;
    public static unsafe void Consume(void* value);
}

API Usage

Old code:

public class HalfCast
{
    private readonly Consumer consumer = new();

    // 4 operations per invoke averages the cost of the 4 operations
    [Benchmark(OperationsPerInvoke = 4)]
    [Arguments(6.097555E-05f, 12345.0f, 65520.0f, float.NaN)]
    public void SingleToHalfAverage(float arg1, float arg2, float arg3, float arg4)
    {
        // Oh no, the consumer's volatile write interferes with the results!
        consumer.Consume((Half) arg1);
        consumer.Consume((Half) arg2);
        consumer.Consume((Half) arg3);
        consumer.Consume((Half) arg4);
    }
}

New code:

public class HalfCast
{
    // 4 operations per invoke averages the cost of the 4 operations
    [Benchmark(OperationsPerInvoke = 4)]
    [Arguments(6.097555E-05f, 12345.0f, 65520.0f, float.NaN)]
    public void SingleToHalfAverage(float arg1, float arg2, float arg3, float arg4)
    {
        // Yay, we get accurate results!
        Blackhole.Consume((Half) arg1);
        Blackhole.Consume((Half) arg2);
        Blackhole.Consume((Half) arg3);
        Blackhole.Consume((Half) arg4);
    }
}

Alternative Designs

Without any new APIs - Pass the value to a method with NoInlining, or write to a field. Both of these have some overhead.

Alternative new APIs (discussed in comments below):

namespace System.Runtime.CompilerServices;

public static class BlackBox
{
    public static T Identity<T>(scoped T value) where T : allows ref struct;
    public static ref readonly T Identity<T>(scoped ref readonly T value) where T : allows ref struct;
    public static unsafe void* Identity(void* value);
}

Risks

No response

dotnet-policy-service[bot] commented 1 month ago

Tagging subscribers to this area: @dotnet/area-system-numerics See info in area-owners.md if you want to be subscribed.

hamarb123 commented 1 month ago

I don't think there'd be any point to a T* version if we have a void* version (since the conversion is a no-op at IL level). It would also be good to have a version that takes ref T though (still with the where T : allows ref struct) so we can consume references without having to create a ByRef struct just to pass to this API or similar.

dotnet-policy-service[bot] commented 1 month ago

Tagging subscribers to this area: @dotnet/area-system-runtime-compilerservices See info in area-owners.md if you want to be subscribed.

timcassell commented 1 month ago

I don't think there'd be any point to a T* version if we have a void* version (since the conversion does not need to be performed at IL level).

This is true, but the T* version is just more ergonomic. Otherwise you'd have to add the cast to your code. [Edit] I just tried it, and I was wrong. No cast is necessary. I will update the proposal to remove that.

Ideally it would just be a single API with an extra allows pointer anti-constraint, but that doesn't exist currently.

It would also be good to have a version that takes ref T though (still with the where T : allows ref struct) so we can consume references without having to create a ByRef struct just to pass to this API or similar.

That's not necessary.

ref int target = ref myArray[0];
Blackhole.Consume(target);
hamarb123 commented 1 month ago

This is true, but the T* version is just more ergonomic. Otherwise you'd have to add the cast to your code. Ideally it would just be a single API with an extra allows pointer anti-constraint, but that doesn't exist currently.

That's not true - you don't have to explicitly cast to void* (it's an implicit cast in c#).

That's not necessary.

ref int target = ref myArray[0];
Blackhole.Consume(target);

This is not behaviourally or conceptually the same as Blackhole.Consume(ref target) - the former explicitly dereferences the byref (which is not a no-op necessarily, for example it can throw in some cases), and is equivalent to (ignoring array covariance) Blackhole.Consume(myArray[0]) from the runtime's perspective. Blackhole.Consume(ref x) would consume the by-reference itself, whereas Blackhole.Consume(x) just consumes x.

timcassell commented 1 month ago

You are correct, I updated the proposal.

timcassell commented 1 month ago

Should it be in T rather than ref T? ref can go to in, but not vice-versa without Unsafe.

hamarb123 commented 1 month ago

From the runtime's perspective, in T and ref T are the same. This may lead to the issue where you pass something by in and expect the method to treat it as if the ref can't be mutated, but the runtime will most likely not implement it this way (as they're all just T& to it). I would add it to the open questions section, and leave it with ref (since it's technically more correct).

Also, ref readonly would be better suited for this than in.

jkotas commented 1 month ago

prevent dead code elimination and out-of-order instructions generated by a compiler (not by the CPU

Is this actually useful if it does not prevent out-of-order execution done by the CPU?

I think writing the value into a globally visible field to establish data dependency has better fidelity for micro-benchmarking purposes.

timcassell commented 1 month ago

Is this actually useful if it does not prevent out-of-order execution done by the CPU?

I think writing the value into a globally visible field to establish data dependency has better fidelity for micro-benchmarking purposes.

The idea is for this to be zero cost. As I understand it, barriers that prevent cpu instruction reordering are not free. Anyway, I don't think it's unrealistic for real-world code to do calculations without barriers (storing results in locals instead of fields), so the micro-benchmarks should be fine, I think.

Also, refs and ref structs cannot be written to globally visible fields, so even if we were to have that behavior, the runtime would need to treat them the same. That is currently not possible without runtime support.

jkotas commented 1 month ago

The idea is for this to be zero cost.

Why does it matter whether this is zero cost? The benchmarking frameworks typically subtract the setup costs from the absolute number. The cost of establishing the global data dependency can be included in the setup costs.

Also, refs and ref structs cannot be written to globally visible fields

Benchmark.NET can cast ref into a pointer or write the ref struct as raw bytes into a globally visible field to establish the data dependency. Yes, microbenchmarking is hard.

timcassell commented 1 month ago

Why does it matter whether this is zero cost? The benchmarking frameworks typically subtract the setup costs from the absolute number. The cost of establishing the global data dependency can be included in the setup costs.

Because this is meant to be used in the benchmark method itself, not just as part of the benchmark framework. I put an example in the OP. Here's another simple one:

foreach (var item in collection)
{
    Blackhole.Consume(item);
}

Yes, microbenchmarking is hard.

And this API should make it slightly easier.

write the ref struct as raw bytes

How?

jkotas commented 1 month ago

What are the chances that the speculative execution done by the processor is going to skew the results a lot if this API is not establishing data dependency? I suspect that the patterns you are proposing would not produce reliable microbenchmarks for modern processors.

write the ref struct as raw bytes

How?

I was originally thinking about taking address of the struct as unmanaged pointer, but it is probably not the best idea. Establishing data dependency for each field of the ref struct is probably a better solution.

timcassell commented 1 month ago

What are the chances that the speculative execution done by the processor is going to skew the results a lot if this API is not establishing data dependency? I suspect that the patterns you are proposing would not produce reliable microbenchmarks for modern processors.

I don't understand how speculative execution impacts this. In my mind, it's the same as writing the value to a local instead of a field.

Anyway, I'm curious what @EgorBo thinks of having a barrier, since this was originally his idea.

julealgon commented 1 month ago

Really dislike this name. Isn't there something more semantic that could be used instead?

Windows10CE commented 1 month ago

For further prior art (and an argument for a different signature) I'd like to point to Rust's std::hint::black_box which has a similar goal of inhibiting some kinds of optimizations while benchmarking.

I think this API doubling as the identity function is very useful, and allows things Blackhole does not, for example BlackBox(5u / BlackBox(2u)); generates a div instead of being completely optimized away or using a shift, or the more complex example in the Rust docs. (You can also just ignore the return value to get the same effect as the original proposed API)

I also happen to think it's a better name, but that's purely subjective.

Windows10CE commented 1 month ago

I'd also like to mention you can recreate this behavior without using NoInlining as long as the type you're talking about is supported by Volatile, see this example

timcassell commented 1 month ago

@Windows10CE That's interesting, I didn't think about the effects of constant optimization in the benchmark. Tbh, I think it's simply better to pass those constants in as arguments to the benchmark method, but I can see the usefulness of that for someone who wants to write their own raw benchmark without using a benchmarking framework.

We can't use global functions in C#, so what should the C# API be? BlackBox.Identity?

Also, too bad we don't have where T : allows pointer anti-constraint. T*** Identity<T>(T*** value). 😅

I'd also like to mention you can recreate this behavior without using NoInlining as long as the type you're talking about is supported by Volatile, see this example

Cool. I bet we could use the new Volatile.WriteBarrier() API when it gets implemented to make it work with any type (#98837).

Windows10CE commented 1 month ago

Cool. I bet we could use the new Volatile.WriteBarrier() API when it gets implemented to make it work with any type

I tried with Interlocked.MemoryBarrier() (which is strictly stronger than WriteBarrier) and it didn't work, so I'd assume that won't work either. Even that trick is liable to being optimized out at some point I think, it just works right now for anyone looking for a solution in this issue (and I believe all past versions)

I like BlackBox.Identity with the same three overloads in the original proposal if we're open to this becoming an intrinsic