dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License

RyuJIT unsafe code performance regressions (includes repro-code) #4400

Closed redknightlois closed 3 years ago

redknightlois commented 9 years ago

TLDR: With the exception of a single scenario, calling native code (which I am very glad was optimized), RyuJIT is slower in all the rest, sometimes by a big margin.

Since I discovered https://github.com/dotnet/coreclr/issues/1306 I have been testing some of the most performance-sensitive operations we have. I tested three different optimized routines: an unsafe optimized memory copy, an unsafe memory compare, and an unsafe xxHash32 hashing algorithm.

The code for all of those is available on github.

The buffer sizes and the rest of the setup are in the benchmark code: https://gist.github.com/redknightlois/f71700d2705a7f9fe312

Hashing with xxHash32

// BenchmarkDotNet=v0.7.6.0
// OS=Microsoft Windows NT 6.2.9200.0
// Processor=Intel(R) Core(TM) i5-3470 CPU @ 3.20GHz, ProcessorCount=4
// CLR=MS.NET 4.0.30319.42000, Arch=32-bit
Common: Type=HashingBenchmark Mode=Throughput Platform=X64 .NET=Current

| Method | Jit | AvrTime | StdDev | op/s |
|---|---|---|---|---|
| HashHuge | LegacyJit | 3.47 ms | 18.01 us | 288.12 |
| HashHuge | RyuJit | 4.00 ms | 16.09 us | 249.97 |
| HashBlock | LegacyJit | 110.30 ns | 1.85 ns | 9066050.21 |
| HashBlock | RyuJit | 128.70 ns | 0.709 ns | 7770162.29 |
| HashSmall | LegacyJit | 19.98 ns | 0.129 ns | 50038564.91 |
| HashSmall | RyuJit | 22.71 ns | 0.105 ns | 44037116.02 |
| HashVerySmall | LegacyJit | 10.27 ns | 0.0658 ns | 97396785.55 |
| HashVerySmall | RyuJit | 12.30 ns | 1.01 ns | 81326622.54 |
| HashTiny | LegacyJit | 5.78 ns | 0.128 ns | 173011900.05 |
| HashTiny | RyuJit | 8.68 ns | 0.0472 ns | 115268519.76 |

For 16 MB buffers the throughput drops from 4.61 GB per second with LegacyJit to 3.99 GB per second with RyuJIT. That is roughly a 600 MB per second difference, which is not a minimal difference.

Memory Compare

// BenchmarkDotNet=v0.7.6.0
// OS=Microsoft Windows NT 6.2.9200.0
// Processor=Intel(R) Core(TM) i5-3470 CPU @ 3.20GHz, ProcessorCount=4
// CLR=MS.NET 4.0.30319.42000, Arch=32-bit
Common: Type=MemoryCompareBenchmark Mode=Throughput Platform=X64 .NET=Current

| Method | Jit | AvrTime | StdDev | op/s |
|---|---|---|---|---|
| CompareHuge | LegacyJit | 2.91 ms | 45.50 us | 343.48 |
| CompareHuge | RyuJit | 3.04 ms | 137.93 us | 328.72 |
| CompareBlock | LegacyJit | 326.74 ns | 0.851 ns | 3060564.93 |
| CompareBlock | RyuJit | 339.95 ns | 1.07 ns | 2941580.42 |
| CompareSmall | LegacyJit | 12.51 ns | 0.0534 ns | 79917385.04 |
| CompareSmall | RyuJit | 15.20 ns | 0.0826 ns | 65798760.77 |
| CompareVerySmall | LegacyJit | 5.38 ns | 0.0461 ns | 185792782.3 |
| CompareVerySmall | RyuJit | 6.24 ns | 0.0288 ns | 160192381.38 |
| CompareTiny | LegacyJit | 5.06 ns | 0.0507 ns | 197789356.92 |
| CompareTiny | RyuJit | 5.07 ns | 0.0453 ns | 197293614.74 |

Again, in the case of memory compare the difference still exists, but it is not as big: 5.24 GB/sec for RyuJIT vs. 5.48 GB/sec for LegacyJit.

Memory Copy

// BenchmarkDotNet=v0.7.6.0
// OS=Microsoft Windows NT 6.2.9200.0
// Processor=Intel(R) Core(TM) i5-3470 CPU @ 3.20GHz, ProcessorCount=4
// CLR=MS.NET 4.0.30319.42000, Arch=32-bit
Common: Type=MemoryCopyBenchmark Mode=Throughput Platform=X64 .NET=Current

| Method | Jit | AvrTime | StdDev | op/s |
|---|---|---|---|---|
| CopyHuge | LegacyJit | 2.47 ms | 65.96 us | 404.97 |
| CopyHuge | RyuJit | 2.22 ms | 58.07 us | 449.92 |
| CopyBlock | LegacyJit | 37.60 ns | 0.371 ns | 26598625.77 |
| CopyBlock | RyuJit | 26.63 ns | 0.221 ns | 37548121.15 |
| CopySmall | LegacyJit | 9.04 ns | 1.04 ns | 110636584.56 |
| CopySmall | RyuJit | 10.39 ns | 0.0851 ns | 96216641.11 |
| CopyVerySmall | LegacyJit | 4.46 ns | 0.0206 ns | 224426710.4 |
| CopyVerySmall | RyuJit | 7.14 ns | 0.0328 ns | 140078221.54 |
| CopyTiny | LegacyJit | 4.45 ns | 0.0353 ns | 224550382.94 |
| CopyTiny | RyuJit | 7.72 ns | 0.0625 ns | 129580962.59 |

It appears that RyuJit is worse for very small buffers (16 bytes) and mid-size ones (64 bytes), while LegacyJit is worse for big buffers (4096+ bytes).

For buffers of size 16 MB the throughput is 7.18 GB/sec for RyuJIT vs. 6.4 GB/sec for LegacyJit, but we lose on the small buffers. The large-buffer win suggests that the marshaling of calls to unmanaged code has been greatly improved (kudos on that).

category:cq theme:loop-opt skill-level:intermediate cost:medium

leppie commented 9 years ago

It would be interesting to see NGEN results too; it seems to be on the conservative side. Does it even use the 'current' JIT? Can you switch it? If so, I can test a lot more code with it ;p

redknightlois commented 9 years ago

You will have to ask @AndreyAkinshin if he supports NGEN on BenchmarkDotNet. :)

masonwheeler commented 9 years ago

@leppie NGEN doesn't use a JIT at all, pretty much by definition.

The purpose of a JIT is to build quick-and-dirty executable code ASAP, at the expense of quality. An AOT compiler takes its time to produce a higher-quality build.

leppie commented 9 years ago

@masonwheeler I understand that, but they presumably support a large common codebase.

Eyas commented 9 years ago

[snip: corrected] I thought NGEN was native code generation.

mikedn commented 9 years ago

@leppie @masonwheeler @Eyas NGEN (aka crossgen in CoreCLR) uses the same compiler, RyuJIT in this case. In general there are no differences between code that's generated at runtime and code that's generated by NGEN.

@redknightlois I took a quick look at the hash case, and the difference in performance may be caused by an optimization related to the buffer variable: the separate `buffer += sizeof(uint)` increments can be folded into a single addition and into the memory operands generated for the pointer dereferences.

Changing code like below (hope I didn't break it :sleeping:) results in better performance with RyuJIT while keeping the same performance with JIT64:

uint* bfi = (uint*)buffer;

do {
    // Indexed addressing: the four constant offsets can be folded into
    // the memory operands, leaving a single pointer bump per iteration.
    v1 += bfi[0] * PRIME32_2;
    v2 += bfi[1] * PRIME32_2;
    v3 += bfi[2] * PRIME32_2;
    v4 += bfi[3] * PRIME32_2;
    bfi += 4;

    v1 = RotateLeft32(v1, 13);
    v2 = RotateLeft32(v2, 13);
    v3 = RotateLeft32(v3, 13);
    v4 = RotateLeft32(v4, 13);

    v1 *= PRIME32_1;
    v2 *= PRIME32_1;
    v3 *= PRIME32_1;
    v4 *= PRIME32_1;
}
while (bfi <= limit);

redknightlois commented 9 years ago

@mikedn That should work. Originally I hadn't used indexing because JIT64 is a bit unpredictable about when it hits an optimized path and when it doesn't, so I learned to avoid that pattern altogether.

AndreyAkinshin commented 9 years ago

@leppie, @redknightlois, BenchmarkDotNet doesn't support NGEN. However,

  1. The NGEN results shouldn't differ from the usual results, because BenchmarkDotNet does a multistep warm-up and the target measurements don't include jitting time.
  2. In any case, each benchmark method is transformed into a separate project, so you can run it with NGEN manually.

mikedn commented 9 years ago

It looks like the issue with the CopySmall, CopyVerySmall and CopyTiny cases is that CopyInline is not inlined by RyuJIT but it is inlined by JIT64.

mikedn commented 9 years ago

The compare cases appear to be affected by loop head/branch target alignment. There's also a potential issue in the tail loop, which contains two redundant loads that RyuJIT does not eliminate. It could be changed to:

   TAIL:
            while (last > 0)
            {
                byte x = *bpx;
                byte y = *bpy;
                if (x != y)
                    return x - y;

                bpx++;
                bpy++;
                last--;
            }

redknightlois commented 9 years ago

It's definitely an improvement, but with 16 MB buffers the difference is still noticeable.

// BenchmarkDotNet=v0.7.6.0
// OS=Microsoft Windows NT 6.2.9200.0
// Processor=Intel(R) Core(TM) i5-3470 CPU @ 3.20GHz, ProcessorCount=4
// CLR=MS.NET 4.0.30319.42000, Arch=32-bit
Common: Type=HashingBenchmark Mode=Throughput Platform=X64 .NET=Current

| Method | Jit | AvrTime | StdDev | op/s |
|---|---|---|---|---|
| HashBlock | LegacyJit | 110.98 ns | 1.57 ns | 9010673.07 |
| HashBlock | RyuJit | 127.91 ns | 0.948 ns | 7818076.43 |
| HashBlockAlt | LegacyJit | 108.08 ns | 2.88 ns | 9252102.73 |
| HashBlockAlt | RyuJit | 110.14 ns | 0.754 ns | 9079637.52 |
| HashHuge | LegacyJit | 3.46 ms | 17.55 us | 288.93 |
| HashHuge | RyuJit | 4.01 ms | 21.41 us | 249.09 |
| HashHugeAlt | LegacyJit | 3.51 ms | 26.32 us | 284.6 |
| HashHugeAlt | RyuJit | 3.78 ms | 341.05 us | 264.66 |

BruceForstall commented 3 years ago

Given that the benchmark code appears to no longer be available, and there's nothing obviously actionable here, I'm going to go ahead and close this.