dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License
15.17k stars 4.72k forks source link

Regression over .net 8.0 #108934

Open johnnygiter opened 1 week ago

johnnygiter commented 1 week ago
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using BenchmarkDotNet.Configs;
using System;
using System.Numerics;
using BenchmarkDotNet.Jobs;

[DisassemblyDiagnoser]
[SimpleJob(RuntimeMoniker.Net80, baseline: true)]
[SimpleJob(RuntimeMoniker.Net90)]
[GroupBenchmarksBy(BenchmarkLogicalGroupRule.ByMethod)]
public class PerformanceTest
{
    [Benchmark]
    [Arguments(80000)]

    public string Run(int n)
    {
        var result = new System.Text.StringBuilder();
        var k = BinarySearch(n);
        var (p, q) = SumTerms(0, k - 1);
        p += q;
        var a = BigInteger.Pow(new BigInteger(10), n - 1);
        var answer = p * a / q;
        var answerStr = answer.ToString();
        Span<char> sb = stackalloc char[10];
        for (var i = 0; i < n; i += 10)
        {
            var count = i + 10;
            if (count > n)
            {
                count = n;
            }
            for (var j = i; j < i + 10; j++)
            {
                if (j < n)
                {
                    sb[j - i] = answerStr[j];
                }
                else
                {
                    sb[j - i] = ' ';
                }
            }

            result.AppendLine($"{new string(sb)}\t:{count}");
        }
        return result.ToString();
    }

    static (BigInteger, BigInteger) SumTerms(int a, int b)
    {
        if (b == a + 1)
        {
            return (new BigInteger(1), new BigInteger(b));
        }
        var mid = (a + b) / 2;
        var (pLeft, qLeft) = SumTerms(a, mid);
        var (pRight, qRight) = SumTerms(mid, b);
        return (pLeft * qRight + pRight, qLeft * qRight);
    }

    static int BinarySearch(int n)
    {
        var a = 0;
        var b = 1;
        while (!TestK(n, b))
        {
            a = b;
            b *= 2;
        }
        while (b - a > 1)
        {
            var m = (a + b) / 2;
            if (TestK(n, m))
            {
                b = m;
            }
            else
            {
                a = m;
            }
        }
        return b;
    }

    static bool TestK(int n, int k)
    {
        if (k < 0)
        {
            return false;
        }
        var lnKFactorial = k * (Math.Log((double)k) - 1) + 0.5 * Math.Log(Math.PI * 2);
        var log10KFactorial = lnKFactorial / Math.Log(10);
        return log10KFactorial >= (double)(n + 50);
    }

    public static void Main(string[] args)
    {
        var summary = BenchmarkRunner.Run<PerformanceTest>();
        Console.WriteLine(summary);
    }
}
.NET SDK 9.0.100-rc.2.24474.11
  [Host]   : .NET 8.0.10 (8.0.1024.46610), X64 RyuJIT AVX2
  .NET 8.0 : .NET 8.0.10 (8.0.1024.46610), X64 RyuJIT AVX2
  .NET 9.0 : .NET 9.0.0 (9.0.24.47305), X64 RyuJIT AVX2

| Method | Job      | Runtime  | n     | Mean     | Error   | StdDev  | P95      | Ratio | RatioSD | Code Size |
|------- |--------- |--------- |------ |---------:|--------:|--------:|---------:|------:|--------:|----------:|
| Run    | .NET 8.0 | .NET 8.0 | 80000 | 314.1 ms | 2.51 ms | 5.62 ms | 333.0 ms |  1.00 |    0.02 |  11,613 B |
| Run    | .NET 9.0 | .NET 9.0 | 80000 | 371.6 ms | 0.57 ms | 0.53 ms | 372.5 ms |  1.18 |    0.02 |   9,354 B |
AndyAyersMS commented 1 week ago

@amanasifkhalid seems quite likely layout related, can you take a look?

dotnet-policy-service[bot] commented 1 week ago

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch See info in area-owners.md if you want to be subscribed.

amanasifkhalid commented 1 week ago

From the .NET 9 JitDump of PerformanceTest::Run, here's the initial RPO layout:

---------------------------------------------------------------------------------------------------------------------------------------------------------------------
BBnum BBid ref try hnd preds           weight    IBC [IL range]   [jump]                            [EH region]        [flags]
---------------------------------------------------------------------------------------------------------------------------------------------------------------------
BB01 [0091]  1                             1     100 [???..???)-> BB02(1)                 (always)                     i IBC internal
BB02 [0000]  1       BB01                  1     100 [000..03D)-> BB05(0.387),BB04(0.613) ( cond )                     i IBC hascall gcsafe newobj
BB04 [0020]  1       BB02                  0.61   61 [03C..03D)-> BB12(1)                 (always)                     i IBC hascall gcsafe
BB05 [0021]  1       BB02                  0.39   39 [03C..03D)-> BB07(0.544),BB06(0.456) ( cond )                     i IBC
BB07 [0027]  1       BB05                  0.21   21 [03C..03D)-> BB08(1)                 (always)                     i IBC idxlen nullcheck
BB06 [0026]  1       BB05                  0.18   18 [03C..03D)-> BB08(1)                 (always)                     i IBC
BB08 [0028]  2       BB06,BB07             0.39   39 [03C..03D)-> BB10(0.544),BB09(0.456) ( cond )                     i IBC
BB10 [0032]  1       BB08                  0.21   21 [03C..03D)-> BB11(1)                 (always)                     i IBC idxlen nullcheck
BB09 [0031]  1       BB08                  0.18   18 [03C..03D)-> BB11(1)                 (always)                     i IBC
BB11 [0033]  2       BB09,BB10             0.39   39 [03C..03D)-> BB12(1)                 (always)                     i IBC hascall gcsafe
BB12 [0022]  2       BB04,BB11             1     100 [03C..071)-> BB34(0.1),BB13(0.9)     ( cond )                     i IBC hascall gcsafe newobj
BB13 [0094]  2       BB12,BB33             9.00  900 [071..07D)-> BB16(0.48),BB15(0.52)   ( cond )                     i IBC loophead bwd
BB15 [0002]  1       BB13                  4.68  468 [07D..080)-> BB16(1)                 (always)                     i IBC bwd
BB16 [0003]  2       BB13,BB15             9.00  900 [080..086)-> BB22(0.1),BB17(0.9)     ( cond )                     i IBC bwd
BB17 [0095]  1       BB16                  8.10  810 [086..???)-> BB18(1)                 (always)                     IBC internal
BB18 [0004]  2       BB17,BB21            81.00 8100 [086..08B)-> BB20(0.48),BB19(0.52)   ( cond )                     i IBC loophead bwd bwd-target
BB19 [0005]  1       BB18                 42.12 4212 [08B..0A3)-> BB21(1)                 (always)                     i IBC idxlen bwd
BB20 [0006]  1       BB18                 38.88 3888 [0A3..0B2)-> BB21(1)                 (always)                     i IBC bwd
BB21 [0007]  2       BB19,BB20            81.00 8100 [0B2..0C1)-> BB18(0.9),BB22(0.1)     ( cond )                     i IBC bwd
BB22 [0096]  2       BB16,BB21             9.00  900 [000..0E4)-> BB26(0),BB24(1)         ( cond )                     i IBC newobj bwd
BB24 [0050]  1       BB22                  9.00  900 [000..000)-> BB27(0.00533),BB25(0.995)   ( cond )                     i IBC internal nullcheck bwd
BB25 [0054]  1       BB24                  8.95  895 [000..000)-> BB27(1)                 (always)                     i IBC internal hascall gcsafe idxlen nullcheck bwd
BB26 [0051]  1       BB22                  0       0 [000..000)-> BB27(1)                 (always)                     i IBC rare internal hascall gcsafe bwd
BB27 [0052]  3       BB24,BB25,BB26        9.00  900 [0E4..0E5)-> BB29(0.0319),BB28(0.968)  ( cond )                     i IBC idxlen newobj nullcheck bwd
BB28 [0066]  1       BB27                  1.86  186 [0E4..0E5)-> BB30(1)                 (always)                     i IBC nullcheck bwd
BB29 [0071]  1       BB27                  0.28   28 [0E4..0E5)-> BB30(1)                 (always)                     i IBC hascall gcsafe bwd
BB30 [0073]  2       BB28,BB29             9.00  900 [0E4..0FA)-> BB32(0.0319),BB31(0.968)  ( cond )                     i IBC hascall gcsafe idxlen newobj nullcheck bwd
BB31 [0083]  1       BB30                  1.86  186 [0F9..0FA)-> BB33(1)                 (always)                     i IBC nullcheck bwd
BB32 [0088]  1       BB30                  0.28   28 [0F9..0FA)-> BB33(1)                 (always)                     i IBC hascall gcsafe bwd
BB33 [0090]  2       BB31,BB32             9.00  900 [0F9..110)-> BB13(0.9),BB34(0.1)     ( cond )                     i IBC newobj nullcheck bwd
BB34 [0097]  2       BB12,BB33             1     100 [110..117)                           (return)                     i IBC gcsafe
BB36 [0098]  0                             0         [???..???)                           (throw )                     i rare keep internal
---------------------------------------------------------------------------------------------------------------------------------------------------------------------

There's plenty of opportunity to improve fallthrough on the hot paths, but this looks reasonable. The problem is fgMoveHotJumps is too aggressive in creating fallthrough for BB19 -> BB21, such that once we've moved BB20 out of the way, we give up on keeping it in the loop body, so we keep pushing BB20 down the method to create fallthrough on less important paths. Here's the final layout:

---------------------------------------------------------------------------------------------------------------------------------------------------------------------
BBnum BBid ref try hnd preds           weight    IBC [IL range]   [jump]                            [EH region]        [flags]
---------------------------------------------------------------------------------------------------------------------------------------------------------------------
BB01 [0091]  1                             1     100 [???..???)-> BB02(1)                 (always)                     i IBC internal
BB02 [0000]  1       BB01                  1     100 [000..03D)-> BB05(0.387),BB04(0.613) ( cond )                     i IBC hascall gcsafe newobj
BB04 [0020]  1       BB02                  0.61   61 [03C..03D)-> BB12(1)                 (always)                     i IBC hascall gcsafe
BB12 [0022]  2       BB04,BB11             1     100 [03C..071)-> BB34(0.1),BB13(0.9)     ( cond )                     i IBC hascall gcsafe newobj
BB13 [0094]  2       BB12,BB33             9.00  900 [071..07D)-> BB16(0.48),BB15(0.52)   ( cond )                     i IBC loophead bwd
BB15 [0002]  1       BB13                  4.68  468 [07D..080)-> BB16(1)                 (always)                     i IBC bwd
BB16 [0003]  2       BB13,BB15             9.00  900 [080..086)-> BB22(0.1),BB17(0.9)     ( cond )                     i IBC bwd
BB17 [0095]  1       BB16                  8.10  810 [086..???)-> BB18(1)                 (always)                     IBC internal
BB18 [0004]  2       BB17,BB21            81.00 8100 [086..08B)-> BB20(0.48),BB19(0.52)   ( cond )                     i IBC loophead bwd bwd-target
BB19 [0005]  1       BB18                 42.12 4212 [08B..0A3)-> BB21(1)                 (always)                     i IBC idxlen bwd
BB21 [0007]  2       BB19,BB20            81.00 8100 [0B2..0C1)-> BB18(0.9),BB22(0.1)     ( cond )                     i IBC bwd
BB22 [0096]  2       BB16,BB21             9.00  900 [000..0E4)-> BB26(0),BB24(1)         ( cond )                     i IBC newobj bwd
BB24 [0050]  1       BB22                  9.00  900 [000..000)-> BB27(0.00533),BB25(0.995)   ( cond )                     i IBC internal nullcheck bwd
BB25 [0054]  1       BB24                  8.95  895 [000..000)-> BB27(1)                 (always)                     i IBC internal hascall gcsafe idxlen nullcheck bwd
BB27 [0052]  3       BB24,BB25,BB26        9.00  900 [0E4..0E5)-> BB29(0.0319),BB28(0.968)  ( cond )                     i IBC idxlen newobj nullcheck bwd
BB28 [0066]  1       BB27                  1.86  186 [0E4..0E5)-> BB30(1)                 (always)                     i IBC nullcheck bwd
BB30 [0073]  2       BB28,BB29             9.00  900 [0E4..0FA)-> BB32(0.0319),BB31(0.968)  ( cond )                     i IBC hascall gcsafe idxlen newobj nullcheck bwd
BB31 [0083]  1       BB30                  1.86  186 [0F9..0FA)-> BB33(1)                 (always)                     i IBC nullcheck bwd
BB33 [0090]  2       BB31,BB32             9.00  900 [0F9..110)-> BB13(0.9),BB34(0.1)     ( cond )                     i IBC newobj nullcheck bwd
BB34 [0097]  2       BB12,BB33             1     100 [110..117)                           (return)                     i IBC gcsafe
BB05 [0021]  1       BB02                  0.39   39 [03C..03D)-> BB07(0.544),BB06(0.456) ( cond )                     i IBC
BB07 [0027]  1       BB05                  0.21   21 [03C..03D)-> BB08(1)                 (always)                     i IBC idxlen nullcheck
BB08 [0028]  2       BB06,BB07             0.39   39 [03C..03D)-> BB10(0.544),BB09(0.456) ( cond )                     i IBC
BB10 [0032]  1       BB08                  0.21   21 [03C..03D)-> BB11(1)                 (always)                     i IBC idxlen nullcheck
BB11 [0033]  2       BB09,BB10             0.39   39 [03C..03D)-> BB12(1)                 (always)                     i IBC hascall gcsafe
BB06 [0026]  1       BB05                  0.18   18 [03C..03D)-> BB08(1)                 (always)                     i IBC
BB09 [0031]  1       BB08                  0.18   18 [03C..03D)-> BB11(1)                 (always)                     i IBC
BB20 [0006]  1       BB18                 38.88 3888 [0A3..0B2)-> BB21(1)                 (always)                     i IBC bwd
BB29 [0071]  1       BB27                  0.28   28 [0E4..0E5)-> BB30(1)                 (always)                     i IBC hascall gcsafe bwd
BB32 [0088]  1       BB30                  0.28   28 [0F9..0FA)-> BB33(1)                 (always)                     i IBC hascall gcsafe bwd
BB26 [0051]  1       BB22                  0       0 [000..000)-> BB27(1)                 (always)                     i IBC rare internal hascall gcsafe bwd
BB36 [0098]  0                             0         [???..???)                           (throw )                     i rare keep internal
---------------------------------------------------------------------------------------------------------------------------------------------------------------------

Disabling the new layout with DOTNET_JitDoReversePostOrderLayout=0 doesn't do much better:

---------------------------------------------------------------------------------------------------------------------------------------------------------------------
BBnum BBid ref try hnd preds           weight    IBC [IL range]   [jump]                            [EH region]        [flags]
---------------------------------------------------------------------------------------------------------------------------------------------------------------------
BB01 [0091]  1                             1     100 [???..???)-> BB02(1)                 (always)                     i IBC internal
BB02 [0000]  1       BB01                  1     100 [000..03D)-> BB05(0.387),BB04(0.613) ( cond )                     i IBC hascall gcsafe newobj
BB04 [0020]  1       BB02                  0.61   61 [03C..03D)-> BB12(1)                 (always)                     i IBC hascall gcsafe
BB12 [0022]  2       BB04,BB11             1     100 [03C..071)-> BB39(0.1),BB13(0.9)     ( cond )                     i IBC hascall gcsafe newobj
BB13 [0094]  2       BB12,BB33             9.00  900 [071..07D)-> BB16(0.48),BB15(0.52)   ( cond )                     i IBC loophead bwd
BB15 [0002]  1       BB13                  4.68  468 [07D..080)-> BB16(1)                 (always)                     i IBC bwd
BB16 [0003]  2       BB13,BB15             9.00  900 [080..086)-> BB22(0.1),BB17(0.9)     ( cond )                     i IBC bwd
BB17 [0095]  1       BB16                  8.10  810 [086..???)-> BB18(1)                 (always)                     IBC internal
BB18 [0004]  2       BB17,BB21            81.00 8100 [086..08B)-> BB20(0.48),BB19(0.52)   ( cond )                     i IBC loophead bwd bwd-target
BB19 [0005]  1       BB18                 42.12 4212 [08B..0A3)-> BB21(1)                 (always)                     i IBC idxlen bwd
BB21 [0007]  2       BB19,BB20            81.00 8100 [0B2..0C1)-> BB18(0.9),BB22(0.1)     ( cond )                     i IBC bwd
BB22 [0096]  2       BB16,BB21             9.00  900 [000..0E4)-> BB26(0),BB24(1)         ( cond )                     i IBC newobj bwd
BB24 [0050]  1       BB22                  9.00  900 [000..000)-> BB27(0.00533),BB25(0.995)   ( cond )                     i IBC internal nullcheck bwd
BB25 [0054]  1       BB24                  8.95  895 [000..000)-> BB27(1)                 (always)                     i IBC internal hascall gcsafe idxlen nullcheck bwd
BB27 [0052]  3       BB24,BB25,BB26        9.00  900 [0E4..0E5)-> BB29(0.0319),BB37(0.968)  ( cond )                     i IBC idxlen newobj nullcheck bwd
BB37 [0099]  1       BB27                  1.86  186 [0E4..0E5)-> BB30(1)                 (always)                     i IBC nullcheck bwd
BB30 [0073]  2       BB29,BB37             9.00  900 [0E4..0FA)-> BB32(0.0319),BB38(0.968)  ( cond )                     i IBC hascall gcsafe idxlen newobj nullcheck bwd
BB38 [0100]  1       BB30                  1.86  186 [0F9..0FA)-> BB33(1)                 (always)                     i IBC nullcheck bwd
BB33 [0090]  2       BB32,BB38             9.00  900 [0F9..110)-> BB13(0.9),BB39(0.1)     ( cond )                     i IBC newobj nullcheck bwd
BB39 [0101]  2       BB12,BB33             1     100 [110..117)                           (return)                     i IBC gcsafe
BB20 [0006]  1       BB18                 38.88 3888 [0A3..0B2)-> BB21(1)                 (always)                     i IBC bwd
BB05 [0021]  1       BB02                  0.39   39 [03C..03D)-> BB06(0.456),BB07(0.544) ( cond )                     i IBC
BB07 [0027]  1       BB05                  0.21   21 [03C..03D)-> BB08(1)                 (always)                     i IBC idxlen nullcheck
BB08 [0028]  2       BB06,BB07             0.39   39 [03C..03D)-> BB09(0.456),BB10(0.544) ( cond )                     i IBC
BB10 [0032]  1       BB08                  0.21   21 [03C..03D)-> BB11(1)                 (always)                     i IBC idxlen nullcheck
BB11 [0033]  2       BB09,BB10             0.39   39 [03C..03D)-> BB12(1)                 (always)                     i IBC hascall gcsafe
BB29 [0071]  1       BB27                  0.28   28 [0E4..0E5)-> BB30(1)                 (always)                     i IBC hascall gcsafe bwd
BB32 [0088]  1       BB30                  0.28   28 [0F9..0FA)-> BB33(1)                 (always)                     i IBC hascall gcsafe bwd
BB09 [0031]  1       BB08                  0.18   18 [03C..03D)-> BB11(1)                 (always)                     i IBC
BB06 [0026]  1       BB05                  0.18   18 [03C..03D)-> BB08(1)                 (always)                     i IBC
BB26 [0051]  1       BB22                  0       0 [000..000)-> BB27(1)                 (always)                     i IBC rare internal hascall gcsafe bwd
BB36 [0098]  0                             0         [???..???)                           (throw )                     i rare keep internal
---------------------------------------------------------------------------------------------------------------------------------------------------------------------

With a .NET 10 build of my 3-opt layout prototype in #103450, the 3-opt pass makes the loop blocks contiguous again, though the ordering is weird:

Creating fallthrough for BB21 -> BB18 (cost=5832.000000, improvement=7290.000000)
Creating fallthrough for BB20 -> BB21 (cost=0.000000, improvement=3888.000000)
Creating fallthrough for BB33 -> BB13 (cost=405.000000, improvement=810.000000)
Creating fallthrough for BB31 -> BB33 (cost=0.000000, improvement=112.500000)
Creating fallthrough for BB12 -> BB34 (cost=0.000000, improvement=10.000000)

*************** Before renumbering the basic blocks

---------------------------------------------------------------------------------------------------------------------------------------------------------------------
BBnum BBid ref try hnd preds           weight    IBC [IL range]   [jump]                            [EH region]        [flags]
---------------------------------------------------------------------------------------------------------------------------------------------------------------------
BB01 [0106]  1                             1     100 [???..???)-> BB02(1)                 (always)                     i IBC internal
BB02 [0000]  1       BB01                  1     100 [000..03D)-> BB05(0.584),BB04(0.416) ( cond )                     i IBC hascall gcsafe newobj
BB05 [0021]  1       BB02                  0.58   58 [03C..03D)-> BB07(0.519),BB06(0.481) ( cond )                     i IBC
BB07 [0027]  1       BB05                  0.30   30 [03C..03D)-> BB08(1)                 (always)                     i IBC idxlen nullcheck
BB08 [0028]  2       BB06,BB07             0.58   58 [03C..03D)-> BB10(0.519),BB09(0.481) ( cond )                     i IBC
BB10 [0032]  1       BB08                  0.30   30 [03C..03D)-> BB11(1)                 (always)                     i IBC idxlen nullcheck
BB11 [0033]  2       BB09,BB10             0.58   58 [03C..03D)-> BB12(1)                 (always)                     i IBC hascall gcsafe
BB12 [0022]  2       BB04,BB11             1     100 [03C..071)-> BB34(0.1),BB13(0.9)     ( cond )                     i IBC hascall gcsafe newobj
BB34 [0112]  2       BB12,BB33             1     100 [110..117)                           (return)                     i IBC gcsafe
BB31 [0093]  1       BB30                  1.13  113 [0F9..0FA)-> BB33(1)                 (always)                     i IBC nullcheck bwd
BB33 [0100]  2       BB31,BB32             9.00  900 [0F9..110)-> BB13(0.9),BB34(0.1)     ( cond )                     i IBC newobj nullcheck bwd
BB13 [0109]  2       BB12,BB33             9.00  900 [071..07D)-> BB16(0.48),BB15(0.52)   ( cond )                     i IBC loophead bwd
BB15 [0002]  1       BB13                  4.68  468 [07D..080)-> BB16(1)                 (always)                     i IBC bwd
BB16 [0003]  2       BB15,BB13             9.00  900 [080..086)-> BB22(0.1),BB17(0.9)     ( cond )                     i IBC bwd
BB17 [0110]  1       BB16                  8.10  810 [086..???)-> BB18(1)                 (always)                     IBC internal
BB20 [0006]  1       BB18                 38.88 3888 [0A3..0B2)-> BB21(1)                 (always)                     i IBC bwd
BB21 [0007]  2       BB19,BB20            81.00 8100 [0B2..0C1)-> BB18(0.9),BB22(0.1)     ( cond )                     i IBC bwd
BB18 [0004]  2       BB21,BB17            81.00 8100 [086..08B)-> BB20(0.48),BB19(0.52)   ( cond )                     i IBC loophead bwd bwd-target
BB19 [0005]  1       BB18                 42.12 4212 [08B..0A3)-> BB21(1)                 (always)                     i IBC idxlen bwd
BB22 [0111]  2       BB16,BB21             9.00  900 [000..0E4)-> BB26(0),BB24(1)         ( cond )                     i IBC newobj bwd
BB24 [0055]  1       BB22                  9.00  900 [000..000)-> BB27(0.0105),BB25(0.989)  ( cond )                     i IBC internal nullcheck bwd
BB25 [0059]  1       BB24                  8.91  891 [000..000)-> BB27(1)                 (always)                     i IBC internal hascall gcsafe idxlen nullcheck bwd
BB27 [0057]  3       BB24,BB26,BB25        9.00  900 [0E4..0E5)-> BB29(0.5),BB28(0.5)     ( cond )                     i IBC idxlen newobj nullcheck bwd
BB28 [0071]  1       BB27                  1.13  113 [0E4..0E5)-> BB30(1)                 (always)                     i IBC nullcheck bwd
BB06 [0026]  1       BB05                  0.28   28 [03C..03D)-> BB08(1)                 (always)                     i IBC
BB09 [0031]  1       BB08                  0.28   28 [03C..03D)-> BB11(1)                 (always)                     i IBC
BB04 [0020]  1       BB02                  0.42   42 [03C..03D)-> BB12(1)                 (always)                     i IBC hascall gcsafe
BB29 [0076]  1       BB27                  2.25  225 [0E4..0E5)-> BB30(1)                 (always)                     i IBC hascall gcsafe bwd
BB30 [0078]  2       BB28,BB29             9.00  900 [0E4..0FA)-> BB32(0.5),BB31(0.5)     ( cond )                     i IBC hascall gcsafe idxlen newobj nullcheck bwd
BB32 [0098]  1       BB30                  2.25  225 [0F9..0FA)-> BB33(1)                 (always)                     i IBC hascall gcsafe bwd
BB26 [0056]  1       BB22                  0       0 [000..000)-> BB27(1)                 (always)                     i IBC rare internal hascall gcsafe bwd
BB36 [0113]  0                             0         [???..???)                           (throw )                     i rare keep internal
---------------------------------------------------------------------------------------------------------------------------------------------------------------------

Running 3-opt again, or skipping fgMoveHotJumps beforehand, doesn't fix the branchy loop block ordering, probably because the cost model doesn't differentiate between jump types -- this example is a good motivator for developing a cost model that doesn't rely on edge weights alone.

AndyAyersMS commented 1 week ago

Running 3-opt again, or skipping fgMoveHotJumps beforehand, doesn't fix the branchy loop block ordering, probably because the cost model doesn't differentiate between jump types -- this example is a good motivator for developing a cost model that doesn't rely on edge weights alone

By this you mean differing costs for BBJ_ALWAYS vs BBJ_COND?

AndyAyersMS commented 1 week ago

Also do we know if the perf is a consequence of the new layout, or are we running into JCC errata or the like?

amanasifkhalid commented 1 week ago

Also do we know if the perf is a consequence of the new layout, or are we running into JCC errata or the like?

I see multiple instances of JCC errata in both .NET 8 and .NET 9's disasms, and both incur the mitigation penalty in the method's hottest loop. .NET 8:

G_M000_IG15:                ;; offset=0x0261
       cmp      ecx, ebx
       jge      G_M000_IG24

G_M000_IG16:                ;; offset=0x0269
       mov      eax, ecx
       sub      eax, r15d
       cmp      eax, 10
       jae      G_M000_IG40
       cmp      ecx, dword ptr [rdi+0x08]
       jae      G_M000_IG40
; ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ (jae: 0 ; jcc erratum) 32B boundary ...............................
       mov      edx, ecx
       movzx    rdx, word  ptr [rdi+2*rdx+0x0C]
       mov      word  ptr [r14+2*rax], dx
       jmp      G_M000_IG25

G_M000_IG17:                ;; offset=0x0291
       mov      rdx, gword ptr [rbp-0x68]
       test     rdx, rdx
       jne      SHORT G_M000_IG19

.NET 9 (Checked build, hence the bbWeights):

G_M61133_IG16:  ;; offset=0x0240
       cmp      ecx, ebx
       jge      G_M61133_IG37
                        ;; size=8 bbWeight=81.00 PerfScore 101.25
G_M61133_IG17:  ;; offset=0x0248
       mov      eax, ecx
       sub      eax, r12d
       cmp      eax, 10
       jae      G_M61133_IG41
       cmp      ecx, dword ptr [r15+0x08]
       jae      G_M61133_IG41
; ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ (jae: 0 ; jcc erratum) 32B boundary ...............................
       movzx    rdx, word  ptr [r15+2*rcx+0x0C]
       mov      word  ptr [r13+2*rax], dx
                        ;; size=36 bbWeight=42.12 PerfScore 368.55
G_M61133_IG18:  ;; offset=0x026C
       inc      ecx
       cmp      ecx, edi
       jl       SHORT G_M61133_IG16
                        ;; size=6 bbWeight=81.00 PerfScore 121.50

The .NET 9 layout has one more JCC erratum site overall, but it's not in a consequential place, so I'd expect the sites in the above loop to dominate the overall JCC erratum penalty. Here are the benchmark results on my AMD machine, which doesn't suffer from JCC erratum penalties:

| Method | Job      | Runtime  | n     | Mean     | Error   | StdDev  | Ratio | Code Size |
|------- |--------- |--------- |------ |---------:|--------:|--------:|------:|----------:|
| Run    | .NET 8.0 | .NET 8.0 | 80000 | 496.9 ms | 0.46 ms | 0.41 ms |  1.00 |  10,012 B |
| Run    | .NET 9.0 | .NET 9.0 | 80000 | 536.2 ms | 0.36 ms | 0.33 ms |  1.08 |   9,321 B |

@johnnygiter could you please share what CPU you ran this benchmark on? Thanks!

johnnygiter commented 1 week ago

@amanasifkhalid, Intel Core i5-14600K, 1 CPU, 20 logical and 14 physical cores, Win 11 latest. I had similar difference on i9 14900K on other workstation,

Not only it's not impressive regarding past .net core versions (low or no improvement) but using same algo even slow Python is much much faster(over 300%), I know it's hard to compare anyway .net suffer somehow in this type of bechmark.

I wonder, where is this great improvements of BigInteger then? (works even slower than 8)