dotnet / core

.NET news, announcements, release notes, and more!
https://dot.net
MIT License

Poor benchmark performance all .NET versions #9239

Open johnnygiter opened 7 months ago

johnnygiter commented 7 months ago

URL(s)

https://github.com/dotnet/core/tree/main

Description

Almost last place, concerning all versions. A shame that a compiled language is slower than Python.

https://programming-language-benchmarks.vercel.app/problem/edigits

A few other benchmarks are not impressive either:

https://programming-language-benchmarks.vercel.app/problem/pidigits

tannergooding commented 7 months ago

Benchmarks like these have to be taken with a grain of salt.

Both e-digits and pi-digits are not representative of real-world work; they are things one would do purely for fun. No non-specialized use of pi needs more than around 39 digits (the number needed to compute the circumference of the observable universe to within one atom of precision). Moreover, there are better ways to get the first n digits of these irrational numbers: highly fine-tuned applications running on actual supercomputers have already computed 105 trillion digits, and that code looks nothing like any of the given implementations.

Additionally, the initial code for a platform is often authored in the most naive way possible. It doesn't account for things like whether the functionality is built in, whether it is actually executing in the language in question, or whether it defers down to another language where the actual core logic runs. The community is then responsible for contributing alternative implementations that do better, more optimal things, and different languages get different levels of interest. This means that the actual implementations for two languages can be very different, making them not directly comparable.

The way they measure the code also has to be accounted for, as they are not always measuring it in identical scenarios. Some languages are compiled AOT with all possible optimizations enabled, and some of those compilers have built-in optimizations that recognize these patterns and effectively compile the code down to a near no-op (which is only possible in scenarios like this, where everything is constant). Others are compiled JIT, and the timings include that overhead while not correctly accounting for other features like Tiered Compilation.


Now, the main reason the C# implementation of these two benchmarks is much slower effectively comes down to the BigInteger implementation and how the code uses it.

In Python, every number is a "bignum" and so they've spent much more time fine tuning this handling to ensure that the common cases remain performant. Their big integer type is designed to work with any code, to account for people writing the most naive possible algorithms, and has specialized compiler support to do optimizations and other logic to make it as optimized as possible.

In C#, we instead have both fixed-sized types (int, long, Int128, etc) and the arbitrarily sized System.Numerics.BigInteger. We expect that the fixed-sized types are the typical ones used and that BigInteger is only used in specialized scenarios. We likewise have reasonable assumptions in BigInteger around the maximum size we'll see and while we do support larger sizes, they've not received as much care or optimization work to make those scenarios the most efficient possible.

Correspondingly, this means that BigInteger carries some design decisions, such as being immutable. As a result, (a + b) / 2 involves allocating twice and traversing the memory of the inputs four times: once for a, once for b, once to create the temporary that (a + b) produces, and once more for that temporary as part of the tmp / 2 operation. The naive implementation is therefore not very efficient, and there are often better ways to handle this.
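To make the allocation behavior concrete, here is a small sketch (the names and operand sizes are illustrative, not taken from the benchmark) that uses GC.GetAllocatedBytesForCurrentThread to observe the temporaries a naive (a + b) / 2 produces:

```csharp
using System;
using System.Numerics;

class AllocationDemo
{
    static void Main()
    {
        // Operand sizes are arbitrary; anything large enough to live on the heap works.
        BigInteger a = BigInteger.Pow(10, 10_000);
        BigInteger b = a + 1;

        // Warm up so JIT-related allocations don't pollute the measurement.
        _ = (a + b) / 2;

        long before = GC.GetAllocatedBytesForCurrentThread();

        // (a + b) allocates one temporary BigInteger; dividing that temporary
        // by 2 allocates a second one for the result.
        BigInteger avg = (a + b) / 2;

        long after = GC.GetAllocatedBytesForCurrentThread();
        Console.WriteLine($"Bytes allocated by (a + b) / 2: {after - before}");
    }
}
```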

There's also some nuance in that, because we have many different integer types, BigInteger doesn't get as much focus, since it's primarily intended for specialized scenarios rather than everyday ones. This does mean we have some known optimization opportunities to make it perform better on modern hardware, and even to allow users to more easily write naive logic (like the benchmark does) without seeing terrible performance. There hasn't been an opportunity to do this optimization work yet, namely because there is higher-priority work, more impactful to real-world workloads, to be completed in other areas first.

The other factor is that the code doesn't actually take into account how C# operates, what functionality is built into the BCL, etc. It misses opportunities to use constants like double.Tau (which it does use for Python and Rust), it misses opportunities to do more efficient string formatting using spans/slices that avoids many copies and allocations, and so on.
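As an illustration of the span-based approach (a sketch, not the benchmark's code): BigInteger implements ISpanFormattable on recent .NET versions, so its digits can be written into a caller-supplied buffer without the intermediate string that ToString() would allocate:

```csharp
using System;
using System.Numerics;

class SpanFormatDemo
{
    static void Main()
    {
        BigInteger value = BigInteger.Pow(10, 50) + 12345;

        // TryFormat writes the digits directly into the stack buffer,
        // avoiding the string allocation of value.ToString().
        Span<char> buffer = stackalloc char[64];
        if (value.TryFormat(buffer, out int written))
        {
            // Slice to the characters actually produced, then slice again
            // to take the leading digits -- no copies, no allocations.
            ReadOnlySpan<char> digits = buffer[..written];
            Console.WriteLine($"{written} digits, first ten: {digits[..10]}");
        }
    }
}
```

Interpolated strings can consume ReadOnlySpan&lt;char&gt; directly, which is what makes the slicing approach composable with normal formatting code.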

johnnygiter commented 6 months ago

A simple Pow on BigIntegers is horribly slow compared to other languages too (even Python is over 300% faster)... Checked on .NET 6 and .NET 8; almost the same.

using System.Numerics;

class Program
{
    static void Main()
    {

        string baseStr = "1000";
        string exponentStr = "50000";

        BigInteger baseNum = BigInteger.Parse(baseStr);
        BigInteger exponentNum = BigInteger.Parse(exponentStr);
        BigInteger result = BigInteger.Pow(baseNum, (int)exponentNum);
        Console.WriteLine($"Result: {result}");
    }
}
tannergooding commented 6 months ago

That program is:

  1. Including the overhead of parsing
  2. Not accounting for Tiered Compilation or DPGO (and therefore likely running unoptimized debug code)
  3. Doing unnecessary extra work by converting the big integer exponent back to an int

As explained above, there are known cases where the .NET BigInteger code could be improved. It was also explained that there's a lot of nuance to benchmarking, real-world perf considerations, etc. It is generally not sufficient to take a single simple program and use it as proof that X is better or faster than Y.
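For a one-shot measurement, a fairer version of the earlier program might look like the following sketch (illustrative only; BenchmarkDotNet remains the right tool for real comparisons). It parses nothing inside the timed region, uses a plain int exponent (which BigInteger.Pow takes anyway), and warms the code up so tiered compilation has promoted it before timing starts:

```csharp
using System;
using System.Diagnostics;
using System.Numerics;

class PowTiming
{
    static void Main()
    {
        // Constants, not parsed strings, so parsing cost stays out of the timing.
        BigInteger baseNum = 1000;
        const int exponent = 50_000;

        // Warm-up iterations so tiered compilation has produced optimized
        // code before the measured run.
        for (int i = 0; i < 5; i++)
        {
            _ = BigInteger.Pow(baseNum, exponent);
        }

        var sw = Stopwatch.StartNew();
        BigInteger result = BigInteger.Pow(baseNum, exponent);
        sw.Stop();

        Console.WriteLine($"Digits: {result.ToString().Length}, elapsed: {sw.ElapsedMilliseconds} ms");
    }
}
```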

johnnygiter commented 5 months ago

I've chosen this benchmark for a reason: it's very poor. .NET 9 Preview 4 makes it even 12% worse.

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using BenchmarkDotNet.Configs;
using System;
using System.Numerics;
using BenchmarkDotNet.Jobs;

[DisassemblyDiagnoser]
[SimpleJob(RuntimeMoniker.Net80, baseline: true)]
[SimpleJob(RuntimeMoniker.Net90)]
[GroupBenchmarksBy(BenchmarkLogicalGroupRule.ByMethod)]
public class PerformanceTest
{
    [Benchmark]
    [Arguments(80000)]

    public string Run(int n)
    {
        var result = new System.Text.StringBuilder();
        var k = BinarySearch(n);
        var (p, q) = SumTerms(0, k - 1);
        p += q;
        var a = BigInteger.Pow(new BigInteger(10), n - 1);
        var answer = p * a / q;
        var answerStr = answer.ToString();
        Span<char> sb = stackalloc char[10];
        for (var i = 0; i < n; i += 10)
        {
            var count = i + 10;
            if (count > n)
            {
                count = n;
            }
            for (var j = i; j < i + 10; j++)
            {
                if (j < n)
                {
                    sb[j - i] = answerStr[j];
                }
                else
                {
                    sb[j - i] = ' ';
                }
            }

            result.AppendLine($"{new string(sb)}\t:{count}");
        }
        return result.ToString();
    }

    static (BigInteger, BigInteger) SumTerms(int a, int b)
    {
        if (b == a + 1)
        {
            return (new BigInteger(1), new BigInteger(b));
        }
        var mid = (a + b) / 2;
        var (pLeft, qLeft) = SumTerms(a, mid);
        var (pRight, qRight) = SumTerms(mid, b);
        return (pLeft * qRight + pRight, qLeft * qRight);
    }

    static int BinarySearch(int n)
    {
        var a = 0;
        var b = 1;
        while (!TestK(n, b))
        {
            a = b;
            b *= 2;
        }
        while (b - a > 1)
        {
            var m = (a + b) / 2;
            if (TestK(n, m))
            {
                b = m;
            }
            else
            {
                a = m;
            }
        }
        return b;
    }

    static bool TestK(int n, int k)
    {
        if (k < 0)
        {
            return false;
        }
        var lnKFactorial = k * (Math.Log((double)k) - 1) + 0.5 * Math.Log(Math.PI * 2);
        var log10KFactorial = lnKFactorial / Math.Log(10);
        return log10KFactorial >= (double)(n + 50);
    }

    public static void Main(string[] args)
    {
        var summary = BenchmarkRunner.Run<PerformanceTest>();
        Console.WriteLine(summary);
    }
}

.NET SDK 9.0.100-preview.4.24267.66
  [Host]   : .NET 8.0.5 (8.0.524.21615), X64 RyuJIT AVX2
  .NET 8.0 : .NET 8.0.5 (8.0.524.21615), X64 RyuJIT AVX2
  .NET 9.0 : .NET 9.0.0 (9.0.24.26619), X64 RyuJIT AVX2

| Method | Job | Runtime | n | Mean | Error | StdDev | Ratio | Code Size |
|---|---|---|---|---|---|---|---|---|
| Run | .NET 8.0 | .NET 8.0 | 80000 | 326.4 ms | 0.55 ms | 0.49 ms | 1.00 | 9,827 B |
| Run | .NET 9.0 | .NET 9.0 | 80000 | 364.4 ms | 0.56 ms | 0.49 ms | 1.12 | 10,435 B |
tannergooding commented 5 months ago

> I've chosen this benchmark for a reason: it's very poor.

There is a certain level of responsibility on the code author to write good code when they want performance.

> .NET 9 Preview 4 makes it even 12% worse.

This regression is from ToString, which, as detailed above, has known issues. It is being worked on, but it isn't a huge priority compared to all the other work that could be done, and a benchmark like this isn't going to change that, since it isn't representative of a real-world scenario someone would be likely to ship in production. The primary fix for such code would be to not write it in such a naive/unoptimized fashion when performance is a consideration.


On my box, the exact code you've given (+ [MemoryDiagnoser]) reports:

| Method | Job | Runtime | n | Mean | Error | StdDev | Median | Ratio | RatioSD | Code Size | Allocated | Alloc Ratio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Run | .NET 8.0 | .NET 8.0 | 80000 | 272.8 ms | 3.10 ms | 6.53 ms | 269.7 ms | 1.00 | 0.00 | 11,489 B | 3.41 MB | 1.00 |
| Run | .NET 9.0 | .NET 9.0 | 80000 | 314.9 ms | 6.12 ms | 6.28 ms | 315.9 ms | 1.14 | 0.03 | 12,257 B | 3.23 MB | 0.94 |

It's then worth noting that you're skewing the benchmark (a little extra overhead, code size, and total bytes allocated) by using the StringBuilder in a fairly naive fashion. Changing it like this fixes that and brings it closer to the Python implementation by utilizing string interpolation, formatting, and slicing:

public IEnumerable<object[]> RunArgs()
{
    const int n = 80000;

    var lineCount = (n / 10) + 1;
    var output = new StringBuilder(
          n                                           // Number of digits
        + lineCount                                   // Number of tabs
        + lineCount                                   // Number of colons
        + (int)(lineCount * float.Log10(n))           // Number of markers (approx)
        + (lineCount * Environment.NewLine.Length)    // Number of newlines
    );

    yield return [n, output];
}

[Benchmark]
[ArgumentsSource(nameof(RunArgs))]
public void Run(int n, StringBuilder output)
{
    var k = BinarySearch(n);
    var (p, q) = SumTerms(0, k - 1);
    p += q;
    var a = BigInteger.Pow(new BigInteger(10), n - 1);
    var answer = p * a / q;
    var s = answer.ToString().AsSpan();

    for (var i = 0; i < n; i += 10)
    {
        if ((i + 10) <= n)
        {
            output.AppendLine($"{s[i..(i + 10)]}\t:{i + 10}");
        }
        else
        {
            output.AppendLine($"{s[i..],-10}\t:{n}");
        }
    }
}
Which gives us:

| Method | Job | Runtime | n | output | Mean | Error | StdDev | Median | Ratio | RatioSD | Code Size | Allocated | Alloc Ratio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Run | .NET 8.0 | .NET 8.0 | 80000 | | 273.5 ms | 4.36 ms | 10.69 ms | 269.7 ms | 1.00 | 0.00 | 16,196 B | 2.76 MB | 1.00 |
| Run | .NET 9.0 | .NET 9.0 | 80000 | | 313.6 ms | 5.20 ms | 4.61 ms | 313.6 ms | 1.10 | 0.07 | 6,419 B | 2.57 MB | 0.93 |

Now, this reduced allocations and code size for .NET 9, but it didn't really change the execution time, because StringBuilder itself is very fast. So we can experiment a bit and find out where the cost is coming from:

[Benchmark]
[ArgumentsSource(nameof(RunArgs))]
public string Run(int n, StringBuilder output)
{
    var k = BinarySearch(n);
    var (p, q) = SumTerms(0, k - 1);
    p += q;
    var a = BigInteger.Pow(new BigInteger(10), n - 1);
    var answer = p * a / q;
    return answer.ToString();
}
Gives us:

| Method | Job | Runtime | n | output | Mean | Error | StdDev | Ratio | RatioSD | Code Size | Allocated | Alloc Ratio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Run | .NET 8.0 | .NET 8.0 | 80000 | | 272.4 ms | 4.73 ms | 9.46 ms | 1.00 | 0.00 | 10,090 B | 2.47 MB | 1.00 |
| Run | .NET 9.0 | .NET 9.0 | 80000 | | 312.7 ms | 6.24 ms | 6.13 ms | 1.13 | 0.05 | 4,707 B | 2.28 MB | 0.92 |

While:

[Benchmark]
[ArgumentsSource(nameof(RunArgs))]
public BigInteger Run(int n, StringBuilder output)
{
    var k = BinarySearch(n);
    var (p, q) = SumTerms(0, k - 1);
    p += q;
    var a = BigInteger.Pow(new BigInteger(10), n - 1);
    return p * a / q;
}
Gives us:

| Method | Job | Runtime | n | output | Mean | Error | StdDev | Ratio | RatioSD | Code Size | Allocated | Alloc Ratio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Run | .NET 8.0 | .NET 8.0 | 80000 | | 233.2 ms | 3.85 ms | 3.60 ms | 1.00 | 0.00 | 4,627 B | 2.13 MB | 1.00 |
| Run | .NET 9.0 | .NET 9.0 | 80000 | | 205.6 ms | 1.20 ms | 1.93 ms | 0.88 | 0.02 | 9,144 B | 2.13 MB | 1.00 |

We can then break apart the last few bits as well, to find the following:

BinarySearch(n)

| Method | Job | Runtime | n | output | Mean | Error | StdDev | Ratio | RatioSD | Code Size | Allocated | Alloc Ratio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Run | .NET 8.0 | .NET 8.0 | 80000 | | 125.1 ns | 1.79 ns | 1.67 ns | 1.00 | 0.00 | 254 B | - | NA |
| Run | .NET 9.0 | .NET 9.0 | 80000 | | 125.6 ns | 1.21 ns | 1.13 ns | 1.00 | 0.02 | 244 B | - | NA |

SumTerms(0, k - 1)

| Method | Job | Runtime | k | output | Mean | Error | StdDev | Ratio | Code Size | Gen0 | Gen1 | Allocated | Alloc Ratio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Run | .NET 8.0 | .NET 8.0 | 20631 | | 14.09 ms | 0.230 ms | 0.215 ms | 1.00 | 1,010 B | 109.3750 | 15.6250 | 1.97 MB | 1.00 |
| Run | .NET 9.0 | .NET 9.0 | 20631 | | 10.01 ms | 0.092 ms | 0.086 ms | 0.71 | 995 B | 109.3750 | 15.6250 | 1.97 MB | 1.00 |

p += q

| Method | Job | Runtime | p | q | output | Mean | Error | StdDev | Ratio | RatioSD | Code Size | Gen0 | Gen1 | Allocated | Alloc Ratio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Run | .NET 8.0 | .NET 8.0 | 260(...)701 [80052] | 151(...)000 [80052] | | 5.631 us | 0.0991 us | 0.0927 us | 1.00 | 0.00 | 310 B | 1.9836 | 0.2441 | 32.49 KB | 1.00 |
| Run | .NET 9.0 | .NET 9.0 | 260(...)701 [80052] | 151(...)000 [80052] | | 5.518 us | 0.0532 us | 0.0472 us | 0.98 | 0.02 | 306 B | 1.9836 | 0.2441 | 32.49 KB | 1.00 |

BigInteger.Pow(new BigInteger(10), n - 1)

| Method | Job | Runtime | n | output | Mean | Error | StdDev | Ratio | Code Size | Allocated | Alloc Ratio |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Run | .NET 8.0 | .NET 8.0 | 80000 | | 7.274 ms | 0.1437 ms | 0.1476 ms | 1.00 | 1,771 B | 32.47 KB | 1.00 |
| Run | .NET 9.0 | .NET 9.0 | 80000 | | 2.903 ms | 0.0358 ms | 0.0335 ms | 0.40 | 2,653 B | 32.47 KB | 1.00 |

var pa = p * a

| Method | Job | Runtime | p | a | output | Mean | Error | StdDev | Ratio | RatioSD | Code Size | Allocated | Alloc Ratio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Run | .NET 8.0 | .NET 8.0 | 143(...)400 [80005] | 100(...)000 [80000] | | 4.594 ms | 0.0898 ms | 0.0840 ms | 1.00 | 0.00 | 2,374 B | 64.91 KB | 1.00 |
| Run | .NET 9.0 | .NET 9.0 | 143(...)400 [80005] | 100(...)000 [80000] | | 4.607 ms | 0.0586 ms | 0.0548 ms | 1.00 | 0.02 | 3,575 B | 64.91 KB | 1.00 |

return pa / q

| Method | Job | Runtime | pa | q | output | Mean | Error | StdDev | Ratio | RatioSD | Code Size | Allocated | Alloc Ratio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Run | .NET 8.0 | .NET 8.0 | 143(...)000 [160004] | 527(...)000 [80004] | | 184.8 ms | 1.62 ms | 2.88 ms | 1.00 | 0.00 | 1,616 B | 32.71 KB | 1.00 |
| Run | .NET 9.0 | .NET 9.0 | 143(...)000 [160004] | 527(...)000 [80004] | | 192.0 ms | 2.70 ms | 3.88 ms | 1.04 | 0.02 | 2,565 B | 32.6 KB | 1.00 |

So this gives us a breakdown of where most of the time is being spent (division and ToString). There are some optimization opportunities in other areas too, but they don't represent the bulk of the work.
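For reference, a single step can be isolated along these lines; this is a minimal sketch (not the exact harness used for the tables above) that benchmarks the final division on its own. The operand sizes roughly mimic the [160004]- and [80004]-digit values shown above, but the exact inputs are illustrative:

```csharp
using System.Numerics;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

public class DivideOnly
{
    private BigInteger _pa;
    private BigInteger _q;

    [GlobalSetup]
    public void Setup()
    {
        // Operands built once, outside the measured method, with digit
        // counts comparable to the e-digits workload (illustrative values).
        _pa = BigInteger.Pow(10, 160_000) + 12345;
        _q = BigInteger.Pow(10, 80_000) + 678;
    }

    [Benchmark]
    public BigInteger Divide() => _pa / _q;

    public static void Main() => BenchmarkRunner.Run<DivideOnly>();
}
```

Keeping the operands in fields populated by [GlobalSetup] ensures the measured method contains only the division itself.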