dotnet/runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License

running Task.WhenAll + ConcurrentQueue + Task.Yield() in netcoreapp2.1 is slower than netcoreapp2.0 #26121

Closed: itn3000 closed this issue 4 years ago

itn3000 commented 6 years ago

I ran the benchmark with the following cases:

  1. Task.WhenAll over tasks that only do await Task.Yield()
  2. Task.WhenAll over tasks that only return Task.CompletedTask
  3. Task.WhenAll over tasks that await Task.Yield() and then call ConcurrentQueue<int>.Enqueue 10,000 times
  4. Task.WhenAll over tasks that call ConcurrentQueue<int>.Enqueue 10,000 times and then return Task.CompletedTask

Cases 1, 2, and 4 in netcoreapp2.1 are faster than or about the same as in netcoreapp2.0, but case 3 is 2-3 times slower than in netcoreapp2.0.
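(The actual benchmark code is linked below; the following is only a minimal sketch of what case 3 amounts to, with the task count, type name, and iteration count chosen as placeholders rather than taken from the original code.)

using System.Collections.Concurrent;
using System.Threading.Tasks;

public static class Case3Sketch
{
    // N tasks are started; each yields to the thread pool and then enqueues
    // 10,000 items into a single shared ConcurrentQueue<int>, so the enqueues
    // from all tasks contend on the same queue.
    public static Task RunAsync(int taskCount = 4)
    {
        var queue = new ConcurrentQueue<int>();
        var tasks = new Task[taskCount];
        for (int t = 0; t < taskCount; t++)
        {
            tasks[t] = EnqueueAsync(queue);
        }
        return Task.WhenAll(tasks);
    }

    private static async Task EnqueueAsync(ConcurrentQueue<int> queue)
    {
        await Task.Yield(); // hop to a thread pool thread so all tasks run concurrently
        for (int i = 0; i < 10_000; i++)
        {
            queue.Enqueue(i); // the contended enqueues are where the spinning happens
        }
    }
}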

Here are my benchmark code, results, and dotnet --info output:

.NET Core SDK (reflecting global.json):
 Version:   2.1.300-rc1-008673
 Commit:    f5e3ddbe73

Runtime Environment:
 OS Name:     Windows
 OS Version:  6.3.9600
 OS Platform: Windows
 RID:         win81-x64
 Base Path:   C:\Program Files\dotnet\sdk\2.1.300-rc1-008673\

Host (useful for support):
  Version: 2.1.0-rc1
  Commit:  eb9bc92051

.NET Core SDKs installed:
  2.1.100 [C:\Program Files\dotnet\sdk]
  2.1.104 [C:\Program Files\dotnet\sdk]
  2.1.300-rc1-008673 [C:\Program Files\dotnet\sdk]

.NET Core runtimes installed:
  Microsoft.NETCore.App 2.0.0 [C:\Program Files\dotnet\shared\Microsoft.NETCore.App]
  Microsoft.NETCore.App 2.0.3 [C:\Program Files\dotnet\shared\Microsoft.NETCore.App]
  Microsoft.NETCore.App 2.0.4 [C:\Program Files\dotnet\shared\Microsoft.NETCore.App]
  Microsoft.NETCore.App 2.0.5 [C:\Program Files\dotnet\shared\Microsoft.NETCore.App]
  Microsoft.NETCore.App 2.0.6 [C:\Program Files\dotnet\shared\Microsoft.NETCore.App]
  Microsoft.NETCore.App 2.1.0-rc1 [C:\Program Files\dotnet\shared\Microsoft.NETCore.App]

To install additional .NET Core runtimes or SDKs:
  https://aka.ms/dotnet-download
stephentoub commented 6 years ago

From a quick look, it seems you're hitting differences in spinning behavior (with multiple threads all enqueueing to the ConcurrentQueue at the same time, there is contention, and the queue ends up doing some spinning).

Repro:

using System;
using System.Collections.Concurrent;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;
using BenchmarkDotNet.Configs;
using BenchmarkDotNet.Jobs;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Attributes.Jobs;
using BenchmarkDotNet.Running;

[Config(typeof(MultipleRuntime))]
public class Benchmarks
{
    static void Main() => BenchmarkRunner.Run<Benchmarks>();

    [Benchmark]
    public void Repro()
    {
        // A single SpinWait and counter are shared by both tasks, so the CAS
        // below fails under contention and falls back to SpinWait.SpinOnce,
        // similar to what ConcurrentQueue does internally.
        var sw = new SpinWait();
        var tasks = new Task[2];

        long value = 0;
        for (int t = 0; t < tasks.Length; t++)
        {
            tasks[t] = Task.Run(() =>
            {
                for (int i = 0; i < 10_000; i++)
                {
                    // Retry the increment until the CAS succeeds, spinning on failure.
                    while (true)
                    {
                        long cur = value;
                        if (Interlocked.CompareExchange(ref value, cur + 1, cur) == cur) break;
                        sw.SpinOnce();
                    }
                }
            });
        }
        Task.WaitAll(tasks);
    }
}

class MultipleRuntime : ManualConfig
{
    public MultipleRuntime()
    {
        Add(Job.Default.With(BenchmarkDotNet.Toolchains.CsProj.CsProjCoreToolchain.NetCoreApp21).WithTargetCount(5).WithWarmupCount(5));
        Add(Job.Default.With(BenchmarkDotNet.Toolchains.CsProj.CsProjCoreToolchain.NetCoreApp20).WithTargetCount(5).WithWarmupCount(5));
    }
}

On my machine, this results in output like:

 Method |     Toolchain |     Mean |    Error |    StdDev |
------- |-------------- |---------:|---------:|----------:|
  Repro | .NET Core 2.0 | 298.5 us | 14.24 us |  3.698 us |
  Repro | .NET Core 2.1 | 511.0 us | 68.73 us | 17.852 us |

so an ~1.75x regression in 2.1, but note that when I increase the number of tasks from 2 to 4, I get this:

 Method |     Toolchain |        Mean |       Error |      StdDev |
------- |-------------- |------------:|------------:|------------:|
  Repro | .NET Core 2.0 | 26,159.2 us | 7,210.60 us | 1,872.93 us |
  Repro | .NET Core 2.1 |    567.2 us |    70.50 us |    18.31 us |

so an ~50x improvement in 2.1.

It's very difficult to call a single microbenchmark like this an improvement or a regression when spinning is involved; as the example above highlights, a small change can have a massive impact and shift the balance significantly one way or the other.

@kouvel, could you take a look, too?

kouvel commented 6 years ago

I think at least part of the issue is that the SpinWait usage in ConcurrentQueue is reaching the Sleep(1) threshold more easily because 2.1 does less spinning than 2.0, and the Sleep(1) delay is the main culprit for the difference in time. I think the spinning done in ConcurrentQueue (and in other similar cases) shouldn't do Sleep(1) at all because the delay is far too long. Either it should just spin/Sleep(0), or it should do a proper wait. I think in this case spin/Sleep(0) would be sufficient and a proper wait is not necessary. I'll do some more investigation.
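To illustrate the suggestion (this is not the actual ConcurrentQueue code, and the spin threshold below is an arbitrary placeholder), a contended CAS loop that only spins and yields via Thread.Yield()/Sleep(0), never Sleep(1), could look roughly like:

using System.Threading;

static class SpinWithoutSleep1Sketch
{
    // Retry a CAS increment under contention without ever paying the Sleep(1)
    // delay: busy-spin a bounded number of times, then yield the time slice.
    public static void Increment(ref long value)
    {
        int spins = 0;
        while (true)
        {
            long observed = Volatile.Read(ref value);
            if (Interlocked.CompareExchange(ref value, observed + 1, observed) == observed)
                return;

            if (spins < 20) // placeholder threshold, not a tuned value
            {
                Thread.SpinWait(4 * (spins + 1)); // brief busy-wait, growing slightly each retry
                spins++;
            }
            else
            {
                Thread.Yield(); // or Thread.Sleep(0): yield without the long Sleep(1) delay
            }
        }
    }
}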

kouvel commented 6 years ago

Yeah, that seems to be the problem. Since no single Sleep(1) threshold is going to be good for every spinning scenario, maybe the overload of SpinWait.SpinOnce that allows setting the Sleep(1) threshold should be exposed, and it should also allow disabling Sleep(1).
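For context, an overload of roughly that shape was later exposed as SpinWait.SpinOnce(int sleep1Threshold) in .NET Core 3.0 and later, where passing -1 disables Sleep(1). A sketch of how a contended retry loop would use it (trySomething is a placeholder delegate, not an API from this issue):

using System;
using System.Threading;

static class SpinOnceThresholdSketch
{
    // trySomething stands in for whatever contended operation is being retried.
    public static void RetryWithoutSleep1(Func<bool> trySomething)
    {
        var spinner = new SpinWait();
        while (!trySomething())
        {
            // -1 means "never escalate to Sleep(1)"; the spinner still
            // progresses through spinning, Thread.Yield, and Sleep(0) as usual.
            spinner.SpinOnce(-1);
        }
    }
}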

itn3000 commented 6 years ago

dotnet/corefx#29623 seems to be targeted at 2.2.0; is it difficult to fix in 2.1? (It is not urgent for me, so there is no problem.)

kouvel commented 6 years ago

It can probably be fixed by having a separate implementation of SpinWait in CoreFX; there are a bunch of such cases that would all need to be tested in a similar way. If I remember correctly, Sleep(1) improves throughput at high contention by removing some threads from the contention, though at the cost of significantly delaying those threads. There will probably be some tradeoff, but I think at least increasing the Sleep(1) threshold so that the issue doesn't occur at relatively low thread counts would be worthwhile.

kouvel commented 6 years ago

There are narrow cases where 2.1 takes many times longer than 2.0, so we could consider back-porting.

kouvel commented 6 years ago

It should be fixed in master first, then maybe ported back to 2.1.

danmoseley commented 6 years ago

@kouvel, if you decide to request a backport, please mark this with the "2.1.1 consider" label.

itn3000 commented 6 years ago

I found another similar case in System.Threading.Channels. The WriterTask:ReaderTask = N:1 case in netcoreapp2.1 is faster than or about the same as in netcoreapp2.0, but the W:R = 1:N case in netcoreapp2.1 is slower than netcoreapp2.0 (by 1-2 times). Is it the same reason?

Here are my updated test code and results.
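(The linked test is the authoritative version; below is only a rough sketch of the 1 writer : N readers shape, with the reader and item counts as arbitrary placeholders.)

using System.Threading.Channels;
using System.Threading.Tasks;

static class OneWriterManyReadersSketch
{
    // One writer task pushes items into an unbounded channel while N reader
    // tasks drain it concurrently.
    public static async Task RunAsync(int readerCount = 4, int itemCount = 100_000)
    {
        var channel = Channel.CreateUnbounded<int>();

        var writer = Task.Run(async () =>
        {
            for (int i = 0; i < itemCount; i++)
            {
                await channel.Writer.WriteAsync(i);
            }
            channel.Writer.Complete();
        });

        var readers = new Task[readerCount];
        for (int r = 0; r < readerCount; r++)
        {
            readers[r] = Task.Run(async () =>
            {
                // Drain until the writer completes the channel.
                while (await channel.Reader.WaitToReadAsync())
                {
                    while (channel.Reader.TryRead(out _))
                    {
                    }
                }
            });
        }

        await writer;
        await Task.WhenAll(readers);
    }
}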

stephentoub commented 6 years ago

Is it the same reason?

That's using ConcurrentQueue under the covers, so it's effectively the same thing.

danmoseley commented 5 years ago

Removing servicing-consider, which is for PRs. @kouvel, if you think this should be serviced, please create a PR against 2.1.x and follow the usual process...

kouvel commented 5 years ago

Still need to fix this in master; after that I'll consider servicing 2.2. It wouldn't meet 2.1's bar.