dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License

Parallel Performance #43136

Closed sukney closed 3 years ago

sukney commented 4 years ago

The same code takes noticeably different amounts of time on .NET Framework 4.7.2, .NET Core 3.1, and Rust.

code:

Stopwatch watch = new Stopwatch();
watch.Start();

// Run 10 iterations in parallel; each one counts to 5,000,000.
Parallel.For(0, 10, i =>
{
    var x = 0;
    for (int j = 0; j < 5000000; j++)
    {
        x += 1;
    }
    Console.WriteLine("Thread {0} finished counting", Thread.CurrentThread.ManagedThreadId);
});

watch.Stop();
Console.WriteLine("Elapsed: {0} seconds", watch.Elapsed.TotalSeconds);
Console.Read();

Tested on a Surface Pro 6.

.NET Framework 4.7.2:
[image: 2 net]

.NET Core 3.1:
[image: 1 netcore]

Rust:

extern crate time;

use std::thread;
use time::*;

fn main() {
    let start = time::now(); // get the start time
    let handles: Vec<_> = (0..10)
        .map(|_| {
            thread::spawn(|| {
                let mut x = 0;
                for _ in 0..5_000_000 {
                    x += 1
                }
                x
            })
        })
        .collect();

    for h in handles {
        println!(
            "Thread finished with count={}",
            h.join().map_err(|_| "Could not join a thread!").unwrap()
        );
    }

    let end = time::now(); // get the end time
    let duration = end - start;

    println!("Elapsed: {}", duration);
}

[image]

.NET Core 3.1 is almost twice as slow as .NET Framework 4.7.2 and nearly twice as slow as Rust.

ghost commented 4 years ago

Tagging subscribers to this area: @eiriktsarpalis, @jeffhandley See info in area-owners.md if you want to be subscribed.

Clockwork-Muse commented 4 years ago

The comparison with Rust isn't necessarily valid, because that's a native target (which enables its compiler to perform optimizations we can't). Depending on how you compiled it, it may actually be able to fold the loop into a constant expression.

Just running this benchmark once isn't sufficient, though. Could you rewrite your tests using something like BenchmarkDotNet? It will run them multiple times and give us some other stats as well.

mcDandy commented 3 years ago

This is especially apparent when working recursively with trees (creating an absurd number of short-lived threads).

stephentoub commented 3 years ago

I don't see a regression here. I just tried the repro refactored as a BenchmarkDotNet test:

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using BenchmarkDotNet.Diagnosers;
using System.Threading.Tasks;

public class Program
{
    public static void Main(string[] args) => BenchmarkSwitcher.FromAssembly(typeof(Program).Assembly).Run(args);

    [Benchmark]
    public void Repro()
    {
        Parallel.For(0, 10, i =>
        {
            var x = 0;
            for (int j = 0; j < 5000000; j++)
            {
                x += 1;
            }
        });
    }
}

Running it with:

dotnet run -c Release -f net48 --filter ** --runtimes net48 net60

produces:

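| Method |            Runtime | Toolchain |     Mean |     Error |    StdDev | Ratio |
|------- |------------------- |---------- |---------:|----------:|----------:|------:|
|  Repro |           .NET 6.0 |     net60 | 2.435 ms | 0.0133 ms | 0.0118 ms |  0.95 |
|  Repro | .NET Framework 4.8 |     net48 | 2.555 ms | 0.0427 ms | 0.0357 ms |  1.00 |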

I expect what you're seeing in your original repro is that .NET Core 3.1 has tiered compilation on by default, and the delegate containing the bulk of the work is only going to be executed a few times, so most of the invocations were likely with unoptimized code.
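
One rough way to probe the tiered-compilation hypothesis without BenchmarkDotNet is to time the same workload once cold and again after a warm-up phase. The sketch below is only an illustration, not a replacement for a real benchmark: the names Work and TimeOnce, the warm-up count of 100, and the Interlocked.Add call (added so the method returns a checkable total) are assumptions that are not part of the original repro. How much the two timings differ, if at all, depends on the runtime version and its tiering settings.

```csharp
using System;
using System.Diagnostics;
using System.Threading;
using System.Threading.Tasks;

public static class Program
{
    // Same loop as the original repro. The Interlocked.Add is an addition
    // for illustration, so the method can return the combined count.
    public static long Work()
    {
        long total = 0;
        Parallel.For(0, 10, i =>
        {
            var x = 0;
            for (int j = 0; j < 5000000; j++)
            {
                x += 1;
            }
            Interlocked.Add(ref total, x);
        });
        return total;
    }

    // Times a single call to Work() and returns the elapsed milliseconds.
    public static double TimeOnce()
    {
        var watch = Stopwatch.StartNew();
        Work();
        watch.Stop();
        return watch.Elapsed.TotalMilliseconds;
    }

    public static void Main()
    {
        // First call: includes JIT cost and may run code the runtime has not
        // yet re-optimized.
        Console.WriteLine("Cold run: {0:F3} ms", TimeOnce());

        // Hypothetical warm-up count, chosen only to give the runtime a
        // chance to promote hot methods to optimized code.
        for (int i = 0; i < 100; i++)
        {
            Work();
        }

        Console.WriteLine("Warm run: {0:F3} ms", TimeOnce());
    }
}
```

Another way to check is to turn tiering off for the test project, for example by setting the TieredCompilation MSBuild property to false in the project file, and then rerun the original single-shot timing. If the .NET Core 3.1 number then moves close to the .NET Framework one, that would be consistent with the explanation above.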