ValueTask doesn't inline well

benaadams commented 7 years ago

None of the method builder or methods on ValueTask seem to inline very well

Inlinee: System.Runtime.CompilerServices.AsyncValueTaskMethodBuilder`1[System.Decimal].Create - value class System.Runtime.CompilerServices.AsyncValueTaskMethodBuilder`1<!0> () Fail Reason: Native estimate for function size exceeds threshold.

Inlinee: System.Runtime.CompilerServices.AsyncValueTaskMethodBuilder`1[System.Decimal].get_Task - instance value class System.Threading.Tasks.ValueTask`1<!0> () Fail Reason: Native estimate for function size exceeds threshold.

Inlinee: System.Runtime.CompilerServices.ValueTaskAwaiter`1[System.Decimal].GetResult - instance !0 () Fail Reason: Native estimate for function size exceeds threshold.

Inliner: TheAwaitingGame.Order+d__5.MoveNext - instance void () Inlinee: instance value class System.Threading.Tasks.ValueTask`1 () Fail Reason: Native estimate for function size exceeds threshold.

Using a ternary check will give better results than a direct await

e.g. (task.IsCompleted) ? task.Result : await task;

Note: should be IsCompletedSuccessfully though the Task.Status path doesn't inline either https://github.com/dotnet/corefx/issues/16745#issuecomment-292728779

Should the no-wait sync path be marked with aggressive inlines?

Or is the state machine going to automatically add more weight than the ternary check?

/cc @stephentoub @AndyAyersMS

benaadams commented 7 years ago

It seems awaiting .AsTask() rather than the ValueTask gives similar results to the ternary check; other than allocations shooting up https://github.com/mgravell/protobuf-net/pull/234

stephentoub commented 7 years ago

@benaadams, can you share a small code sample where it's not getting inlined? In a quick one I just threw together, the hot path with ValueTask/ValueTaskAwaiter for a synchronously completed ValueTask is entirely inlined.

benaadams commented 7 years ago

@stephentoub its this async call chain https://github.com/mgravell/protobuf-net/blob/async/src/TheAwaitingGame/Program.cs#L144-L152

Perhaps because its ValueTask<decimal>?

benaadams commented 7 years ago

Also was looking at netcoreapp1.1 rather than 2.0 if something has been resolved.

benaadams commented 7 years ago

Might be different issue than not inlining; have annotated the methods; testing all completed paths.

Type              |                 Method |       Mean |     Op/s | Allocated |
----------------- |----------------------- |-----------:|---------:|----------:|
Straight Call     |                   Sync | 12.4577 us | 80271.89 |      0 kB |
Completed Task    |              TaskAsync | 19.5186 us | 51233.10 |  24.25 kB |
ValueTask         |         ValueTaskAsync | 24.1389 us | 41426.95 |      0 kB |
Ternary Check     |  ValueTaskCheckedAsync | 20.5660 us | 48623.86 |      0 kB |
Manual Await Skip |       HandCrankedAsync | 16.3431 us | 61187.91 |      0 kB |

stephentoub commented 7 years ago

Might be different issue

Why is the expectation that ValueTask will be faster than a completed task, or am I misunderstanding the concern? I fully expect that there are overheads associated with a ValueTask, e.g. it stores multiple fields that are returned, it has an extra branch, etc. This is why when I've been asked whether every method should just return a ValueTask<T> instead of a Task<T>, my answer has never been yes and rather it depends on how often an asynchronous completion is expected, whether all T results could be cached, etc., as well as partially why there's no non-generic ValueTask (i.e. it's no better than just returning a completed task).

mattnischan commented 7 years ago

If Task<T>.FromResult() is faster than ValueTask<T>, in what cases would it be useful to use ValueTask<T>?

stephentoub commented 7 years ago

If Task.FromResult() is faster than ValueTask, in what cases would it be useful to use ValueTask?

"faster" is very subjective. FromResult() allocates a Task<T>, so while it may be "faster" to get the object then and there, you're building up GC debt that'll be paid in spades later.

rogeralsing commented 7 years ago

Has there been any benchmarks done where the GC is under pressure from other work while still benchmarking the Task vs ValueTask?

I imagine that your GC will have a max throughput on your system, and if possible, it might be worth offloading this to the stack. I imagine that it will also have a positive effect on long tail latency, as the value task perf will be more consistent with less variations over time. e.g. in Kestrel webserver.

Would that be a valid assumption to make?

stephentoub commented 7 years ago

Would that be a valid assumption to make?

That's the whole reason ValueTask<T> exists: to minimize allocation in cases where it's expected that an asynchronous method will frequently complete synchronously with arbitrary result values (e.g. T is an Int32 and possible results span the Int32 range such that it would be infeasible to cache a Task<T> for all of them). Common example would be Stream.ReadAsync: if we had it to do over today, I expect we'd have defined this to return ValueTask<int> rather than Task<int>.

mattnischan commented 7 years ago

I suspected that would be the answer, but it still strikes me that most use cases are within this particular area, where ValueTask<T> is a representation of some kind of mostly-always-fast-but-occasionally-not type of value, from, say, a data access layer with caching or something like that.

It's possible this test is just not very illustrative of when ValueTask<T> normally wins, i.e. when you're hitting the GC a lot harder than this benchmark is.

stephentoub commented 7 years ago

It's possible this test is just not very illustrative

GC impact compounds and often doesn't show up in full force in microbenchmarks. The more work you have going on in a real scenario, the more allocation affects things globally in the app.

most use cases are within this particular area

There are definitely many of them, which is why ValueTask<T> was introduced. But there are also common cases where it's not valuable, e.g. it's pretty common to have an asynchronous operation that returns a bool (imagine an asynchronous enumerable with a MoveNextAsync), and for that, both values (true and false) can be cached in Task<bool>s such that there's no benefit to using a ValueTask<bool>.

stephentoub commented 7 years ago

(All that said, if there are ways we can improve the performance of ValueTask<T>, of course we want to do that. @benaadams, I'm looking at you :smile:)

AndyAyersMS commented 7 years ago

Heuristic inlining cost estimates for ValueTask<T> where T is a value class will be dependent on T in various complex ways (sizeof(T), whether the jit thinks a T is promotable, etc). So decimal vs some other type may show inlining differences.

Heuristics should be more or less the same for 1.1 vs 2.0.

Inlining of methods that take or return value classes should provide some extra benefit because quite often caller-side struct copy/init code can be removed. But the jit's optimizer is not always able to clean these things up and so the inliner can't yet anticipate that it will do so.

benaadams commented 7 years ago

There are 4 methods that don't like inlining; am investigating best approach.

Fighting with the cli and publishing 2.0 atm 😢 , so I can look at the breakout

mattnischan commented 7 years ago

Yep, changing the ValueTask type to a reference type instead brings the inlining back to life: https://github.com/mgravell/protobuf-net/pull/235

imranbaloch commented 7 years ago

@stephentoub It will great if you write some blogs about ValueTask at https://blogs.msdn.microsoft.com/pfxteam/. Just a humble request 😀

benaadams commented 7 years ago

Can see why its doing what its doing, some of this stuff is pretty big 😄

benaadams commented 7 years ago

Will take a while to resolve this as isn't striaghtforward; @karelz want to assign to me? @stephentoub has :trollface:'d my ego into it

stephentoub commented 7 years ago

Mwahahahaha

mgravell commented 7 years ago

@stephentoub I refined and polished the tests, and I still think there's some big wins yet to be had on await for ValueTask<T>. Here's my gist that describes the scenario, and here's the headline numbers:

                                       Method |          Mean |     StdDev |       Op/s | Scaled | Scaled-StdDev |  Gen 0 | Allocated |
--------------------------------------------- |--------------:|-----------:|-----------:|-------:|--------------:|-------:|----------:|
                          int/await/valuetask | 7,056.9673 ns | 20.9213 ns |  141703.93 |  10.68 |          0.07 |      - |      0 kB |
             int/manual/valuetask/iscompleted |   712.3976 ns |  2.2813 ns | 1403710.43 |   1.08 |          0.01 |      - |      0 kB |
 int/manual/valuetask/iscompletedsuccessfully |   698.7291 ns |  1.7101 ns | 1431169.82 |   1.06 |          0.01 |      - |      0 kB |
                                     int/sync |   660.6417 ns |  4.0141 ns | 1513679.87 |   1.00 |          0.00 |      - |      0 kB |

int/sync is regular classic code without any tasks etc; the two int/manual/valuetask/* versions are sync with a check on IsCompleted / IsCompletedSuccessfully, returning an async Awaited(...) local function if not complete, otherwise using .Result. You can see they work great at 1.06-1.08 scaled cost - that's fine, perfectly acceptable.

However, the await version (int/await/valuetask) has a scaled cost of 10.68. That's... huge. Now, I can totally get around that by writing complex C#, but it really feels like there's something missing here - that cost is too high, IMO, compared to manually hacking ValueTask<T> - meaning, it isn't just ValueTask<T> here, but: the interplay with the await machinery.

stephentoub commented 7 years ago

What do you get for await'ing an already completed Task<int>?

I still think there's some big wins yet to be had on await for ValueTask

Great. Let's find them. I never said there weren't places that could be improved.

mgravell commented 7 years ago

@stephentoub that's in the gist linked - I snipped those rows here

stephentoub commented 7 years ago

that's in the gist linked

I see int/await/task, but it's obviously allocating a new object on each call. Do you have one that's not? That's what I was looking for.

the interplay with the await machinery

Have you profiled? For example, are you sure the overhead is coming from the await rather than from, say, the overhead of invoking an async method?

mgravell commented 7 years ago

I'll add a version that returns the same cached static readonly Task.FromResult(42) each time - will update in a few minutes

mgravell commented 7 years ago

with:

    const int FixedResultValue = 42;
    static readonly Task<int> FixedResultTask = Task.FromResult(FixedResultValue);
    static readonly ValueTask<int> FixedResultValueTask = new ValueTask<int>(FixedResultValue);
    internal int GetLineWorthInt32Sync() => FixedResultValue;
    internal Task<int> GetLineWorthInt32TaskAsync() => FixedResultTask;
    internal ValueTask<int> GetLineWorthInt32ValueTaskAsync() => FixedResultValueTask;

I get (net462):

                                       Method |           Mean |     StdDev |       Op/s | Scaled | Scaled-StdDev |  Gen 0 | Allocated |
--------------------------------------------- |---------------:|-----------:|-----------:|-------:|--------------:|-------:|----------:|
                               int/await/task |  5,887.0756 ns | 45.8721 ns |  169863.62 |   8.99 |          0.08 | 0.6354 |      4 kB |
                  int/manual/task/iscompleted |  1,332.3687 ns | 15.3116 ns |     750543 |   2.04 |          0.02 | 0.6357 |      4 kB |
      int/manual/task/iscompletedsuccessfully |  2,112.0864 ns |  8.5444 ns |  473465.47 |   3.23 |          0.02 | 0.6354 |      4 kB |
                          int/await/valuetask | 10,150.5720 ns | 34.2319 ns |   98516.62 |  15.50 |          0.08 |      - |      0 kB |
             int/manual/valuetask/iscompleted |  2,194.9392 ns | 10.8205 ns |  455593.48 |   3.35 |          0.02 |      - |      0 kB |
 int/manual/valuetask/iscompletedsuccessfully |  2,273.9742 ns |  8.4562 ns |  439758.72 |   3.47 |          0.02 |      - |      0 kB |
                                     int/sync |    654.6944 ns |  2.8883 ns | 1527430.11 |   1.00 |          0.00 |      - |      0 kB |

and obviously, I can't cache all the Int32 -> Task<int> possibilities :) oddly, it looks to be slower using a static readonly ValueTask<int> - cheaper to init on the stack than to copy from static field? meh, who knows!

Edit: I went back to internal ValueTask<int> GetLineWorthInt32ValueTaskAsync() => new ValueTask<int>(FixedResultValue); and the bottom lines recover:

                          int/await/valuetask | 7,121.0357 ns | 14.1383 ns |  140429.01 |  11.33 |          0.04 |      - |      0 kB |
             int/manual/valuetask/iscompleted |   667.0673 ns |  6.2433 ns |  1499099.1 |   1.06 |          0.01 |      - |      0 kB |
 int/manual/valuetask/iscompletedsuccessfully |   663.6910 ns |  3.9172 ns | 1506725.32 |   1.06 |          0.01 |      - |      0 kB |
                                     int/sync |   628.3083 ns |  1.7743 ns | 1591575.31 |   1.00 |          0.00 |      - |      0 kB |

if I had to speculate - some kind of beforefieldinit cost making it cheaper (more inlineable) to do new ValueTask<int>(const) than to touch the static field, but: that doesn't matter for this!

benaadams commented 7 years ago

await task is quite a complicated beast compared to task.IsCompletedSuccessfully as it involves creating MethodBuilder, copying stack into MethodBuilder creating awaiter, getting result from awaiter and saving and restoring the execution context vs a quick check.

Having said that, I think it can be improved, but it needs some juggling

mgravell commented 7 years ago

@benaadams I guess the point I'm getting at here is: you and me - we can hack around this and make our code work fine as though it wasn't an issue; I'm thinking of "regular .NET Jo" - who is just going to use await and wonder why they aren't getting the performance they expected

mattnischan commented 7 years ago

Right. Another part is the new .Net conventional wisdom to "async all the things", which I know I've been guilty of spreading the past couple of years as sort of a simplified version of the what we probably should be telling people. The problem is, AsyncMethodBuilder is an expensive construct. If you have a simple REST API that calls into your data access layer, you might be 5 or 6 async calls deep just to get a value that is almost always cached on the heap someplace, for instance.

The decision whether or not to use an async path is not always straightforward and has the same trade-offs as any other performance decision, which you can make if you know how it works. But we've kinda told people "if this call might do I/O even once, make it async". I've had quite a few perf tuning sessions end at the realization that AsyncMethodBuilder is the most expensive thing(s) on the call stack.

benaadams commented 7 years ago

@mattnischan AsyncMethodBuilder is already better in 2.0 https://github.com/dotnet/coreclr/pull/9471 though VaueTask doesn't have these changes; also the code paths are a little more complicated for ValueTask as its a superposition of things.

Few things to watch for:

how many locals you are using in the async function scope as that's a direct driver of the size of the async state machine struct.
once you've gone async you've paid the cost so its better push up and collect as many awaits in a function that is already async rather than push them down and be very fine grained; then the cost is only paid once.

The async cost isn't very high; but if a function isn't doing much then it will dominate.

While I do think the await performance can be still improved; I must also marvel at the masterpiece of engineering it is. If I make it better, it must be recognised its only through standing on the shoulders of giants.

benaadams commented 7 years ago

Wondering if a level of exception handling can be removed to allow inlining of AsyncMethodBuilder.Start https://github.com/dotnet/coreclr/issues/11156

benaadams commented 7 years ago

Have tried lots of variations; some gains though for increased complexity. I don't know if it can be done in coreclr/fx (short of Jit magic); I think the state machine roslyn generates might need tweaking instead 😢

benaadams commented 7 years ago

Best pattern is local function for async; passing the parameters to it via params rather than locals, as that currently allocates (example pattern and issue pattern: https://github.com/dotnet/roslyn/issues/18946)

                               Method |          Mean |     StdDev |       Op/s | Scaled | Allocated |
------------------------------------- |--------------:|-----------:|-----------:|-------:|----------:|
                                 Sync |   599.9187 ns |  2.1857 ns | 1666892.40 |   1.00 |      0 kB |
 'ValueTask + Local Async via Params' |   668.2087 ns |  4.8867 ns | 1496538.39 |   1.11 |      0 kB |
 'ValueTask + Local Async via Locals' | 1,525.9064 ns |  9.9561 ns |  655348.19 |   2.54 |   2.45 kB |
               'ValueTask Pure Async' | 5,606.6241 ns | 18.2806 ns |  178360.45 |   9.35 |      0 kB |

However, I think some aggressive inlines need to be added to cope with large structs (which is where this started with decimal)

benaadams commented 7 years ago

@mgravell random things make benchmarking comparisons hard 😛

benaadams commented 7 years ago

Closing this as forcing the inlines doesn't seem to improve anything really 😢

karelz commented 7 years ago

Pity :(, thanks for pushing on it though @benaadams!

benaadams commented 7 years ago

Its within noise (+/-) for decimal and large structs and there are much larger factors which dominate; e.g. statemachine (create->start->movenext) but I haven't been able to find a general automatic solution.

benaadams commented 7 years ago

My best guess currently on what causes the drop is the async doing more work (obviously); but then that extra work being on memory; whereas the non-async and manual fast-path are working off registers.

An inlining doesn't solve it because the state is a very large struct beyond enregistering size and shape.

benaadams commented 6 years ago

Looks to be fixed in 2.1; ValueTask is no longer an outlier

                                 Method |     Mean |     Error |    StdDev |     Op/s |  Gen 0 | Allocated |
--------------------------------------- |---------:|----------:|----------:|---------:|-------:|----------:|
                           Sync_Decimal | 10.99 us | 0.0153 us | 0.0136 us | 90,990.3 |      - |       0 B |
                      TaskAsync_Decimal | 17.08 us | 0.0868 us | 0.0812 us | 58,557.6 | 2.1973 |  121680 B |
               TaskCheckedAsync_Decimal | 17.61 us | 0.0695 us | 0.0616 us | 56,795.1 | 2.1973 |  121680 B |
                 ValueTaskAsync_Decimal | 17.87 us | 0.0259 us | 0.0229 us | 55,972.8 |      - |       0 B |
          ValueTaskCheckedAsync_Decimal | 17.14 us | 0.0323 us | 0.0302 us | 58,327.0 |      - |       0 B |
               HandCrankedAsync_Decimal | 14.57 us | 0.0472 us | 0.0441 us | 68,617.4 |      - |       0 B |
           AssertCompletedAsync_Decimal | 15.43 us | 0.0283 us | 0.0236 us | 64,792.3 |      - |       0 B |
                       TaskAsync_Double | 15.49 us | 0.1381 us | 0.1224 us | 64,569.5 | 1.9531 |  109512 B |
                  ValueTaskAsync_Double | 15.59 us | 0.0250 us | 0.0234 us | 64,133.0 |      - |       0 B |

ghost commented 6 years ago

Out of curiosity, someone knows what fixed the issue?

dotnet / runtime

ValueTask doesn't inline well #21153