dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License

RyuJIT unsafe code performance regressions (includes repro-code) #4400

Closed redknightlois closed 3 years ago

redknightlois commented 9 years ago

TLDR: With the exception of a single scenario, calling native code (which I am very glad was optimized), RyuJIT is slower in all the rest, sometimes by a big margin.

Since I discovered https://github.com/dotnet/coreclr/issues/1306 I have been testing some of the most performance-sensitive operations we have. I tested three different optimized routines: an unsafe optimized memory copy, an unsafe memory compare, and an unsafe xxHash32 hashing algorithm.

The code for all of those is available on github.

The buffer sizes and the rest of the setup are in the benchmark code: https://gist.github.com/redknightlois/f71700d2705a7f9fe312

Hashing with xxHash32

// BenchmarkDotNet=v0.7.6.0
// OS=Microsoft Windows NT 6.2.9200.0
// Processor=Intel(R) Core(TM) i5-3470 CPU @ 3.20GHz, ProcessorCount=4
// CLR=MS.NET 4.0.30319.42000, Arch=32-bit
Common: Type=HashingBenchmark Mode=Throughput Platform=X64 .NET=Current

| Method | Jit | AvrTime | StdDev | op/s |
|---|---|---|---|---|
| HashHuge | LegacyJit | 3.47 ms | 18.01 us | 288.12 |
| HashHuge | RyuJit | 4.00 ms | 16.09 us | 249.97 |
| HashBlock | LegacyJit | 110.30 ns | 1.85 ns | 9066050.21 |
| HashBlock | RyuJit | 128.70 ns | 0.709 ns | 7770162.29 |
| HashSmall | LegacyJit | 19.98 ns | 0.129 ns | 50038564.91 |
| HashSmall | RyuJit | 22.71 ns | 0.105 ns | 44037116.02 |
| HashVerySmall | LegacyJit | 10.27 ns | 0.0658 ns | 97396785.55 |
| HashVerySmall | RyuJit | 12.30 ns | 1.01 ns | 81326622.54 |
| HashTiny | LegacyJit | 5.78 ns | 0.128 ns | 173011900.05 |
| HashTiny | RyuJit | 8.68 ns | 0.0472 ns | 115268519.76 |

For 16 MB buffers the throughput drops from 4.61 GB per second with LegacyJit to 3.99 GB per second with RyuJIT. That is roughly a 600 MB per second difference, which is not a minimal difference.

Memory Compare

// BenchmarkDotNet=v0.7.6.0
// OS=Microsoft Windows NT 6.2.9200.0
// Processor=Intel(R) Core(TM) i5-3470 CPU @ 3.20GHz, ProcessorCount=4
// CLR=MS.NET 4.0.30319.42000, Arch=32-bit
Common: Type=MemoryCompareBenchmark Mode=Throughput Platform=X64 .NET=Current

| Method | Jit | AvrTime | StdDev | op/s |
|---|---|---|---|---|
| CompareHuge | LegacyJit | 2.91 ms | 45.50 us | 343.48 |
| CompareHuge | RyuJit | 3.04 ms | 137.93 us | 328.72 |
| CompareBlock | LegacyJit | 326.74 ns | 0.851 ns | 3060564.93 |
| CompareBlock | RyuJit | 339.95 ns | 1.07 ns | 2941580.42 |
| CompareSmall | LegacyJit | 12.51 ns | 0.0534 ns | 79917385.04 |
| CompareSmall | RyuJit | 15.20 ns | 0.0826 ns | 65798760.77 |
| CompareVerySmall | LegacyJit | 5.38 ns | 0.0461 ns | 185792782.3 |
| CompareVerySmall | RyuJit | 6.24 ns | 0.0288 ns | 160192381.38 |
| CompareTiny | LegacyJit | 5.06 ns | 0.0507 ns | 197789356.92 |
| CompareTiny | RyuJit | 5.07 ns | 0.0453 ns | 197293614.74 |

Again, in the case of memory compare the difference still exists, but it is not as big: 5.24 GB/sec for RyuJIT vs. 5.48 GB/sec for LegacyJit.

Memory Copy

// BenchmarkDotNet=v0.7.6.0
// OS=Microsoft Windows NT 6.2.9200.0
// Processor=Intel(R) Core(TM) i5-3470 CPU @ 3.20GHz, ProcessorCount=4
// CLR=MS.NET 4.0.30319.42000, Arch=32-bit
Common: Type=MemoryCopyBenchmark Mode=Throughput Platform=X64 .NET=Current

| Method | Jit | AvrTime | StdDev | op/s |
|---|---|---|---|---|
| CopyHuge | LegacyJit | 2.47 ms | 65.96 us | 404.97 |
| CopyHuge | RyuJit | 2.22 ms | 58.07 us | 449.92 |
| CopyBlock | LegacyJit | 37.60 ns | 0.371 ns | 26598625.77 |
| CopyBlock | RyuJit | 26.63 ns | 0.221 ns | 37548121.15 |
| CopySmall | LegacyJit | 9.04 ns | 1.04 ns | 110636584.56 |
| CopySmall | RyuJit | 10.39 ns | 0.0851 ns | 96216641.11 |
| CopyVerySmall | LegacyJit | 4.46 ns | 0.0206 ns | 224426710.4 |
| CopyVerySmall | RyuJit | 7.14 ns | 0.0328 ns | 140078221.54 |
| CopyTiny | LegacyJit | 4.45 ns | 0.0353 ns | 224550382.94 |
| CopyTiny | RyuJit | 7.72 ns | 0.0625 ns | 129580962.59 |

It appears that RyuJit is worse for very small buffers (16 bytes) and mid-size ones (64 bytes), while LegacyJit is worse for big buffers (4096+ bytes).

For buffers of size 16 MB the throughput is 7.18 GB/sec for RyuJIT vs. 6.4 GB/sec for LegacyJit, but we lose on the small buffers. The large-buffer win suggests that the marshaling of calls to unmanaged code has been greatly improved (kudos on that).

category:cq theme:loop-opt skill-level:intermediate cost:medium

leppie commented 9 years ago

It would be interesting to see NGEN results too; it seems to be on the conservative side. Does it even use the 'current' JIT? Can you switch it? If so, I can test a lot more code with it ;p

redknightlois commented 9 years ago

You will have to ask @AndreyAkinshin if he supports NGEN on BenchmarkDotNet. :)

masonwheeler commented 9 years ago

@leppie NGEN doesn't use a JIT at all, pretty much by definition.

The purpose of a JIT is to build quick-and-dirty executable code ASAP, at the expense of quality. An AOT compiler takes its time to produce a higher-quality build.

leppie commented 9 years ago

@masonwheeler I understand that, but they presumably support a large common codebase.

Eyas commented 9 years ago

[snip: corrected] I thought NGEN was native code generation.

mikedn commented 9 years ago

@leppie @masonwheeler @Eyas NGEN (aka crossgen in CoreCLR) uses the same compiler, RyuJIT in this case. In general there are no differences between code that's generated at runtime and code that's generated by NGEN.

@redknightlois I took a quick look at the hash case, and the difference in performance may be caused by an optimization related to the buffer variable: the separate `buffer += sizeof(uint)` increments can be folded into a single addition and into the memory operands generated for the pointer dereferences.

Changing code like below (hope I didn't break it :sleeping:) results in better performance with RyuJIT while keeping the same performance with JIT64:

uint* bfi = (uint*)buffer;

do {
    // Indexed addressing: the four constant offsets can be folded into
    // the memory operands, leaving a single pointer bump per iteration.
    v1 += bfi[0] * PRIME32_2;
    v2 += bfi[1] * PRIME32_2;
    v3 += bfi[2] * PRIME32_2;
    v4 += bfi[3] * PRIME32_2;
    bfi += 4;

    v1 = RotateLeft32(v1, 13);
    v2 = RotateLeft32(v2, 13);
    v3 = RotateLeft32(v3, 13);
    v4 = RotateLeft32(v4, 13);

    v1 *= PRIME32_1;
    v2 *= PRIME32_1;
    v3 *= PRIME32_1;
    v4 *= PRIME32_1;
}
while (bfi <= limit);

redknightlois commented 9 years ago

@mikedn That should work. Originally I hadn't used indexing because JIT64 is a bit unpredictable about when it hits an optimized path and when it doesn't, so I learned to avoid that pattern altogether.

AndreyAkinshin commented 9 years ago

@leppie, @redknightlois, BenchmarkDotNet doesn't support NGEN. However,

  1. The NGEN results shouldn't differ from the usual results, because BenchmarkDotNet does a multistep warm-up and the target measurements don't include jitting time.
  2. In any case, each benchmark method is transformed into a separate project, so you can run it with NGEN manually.

mikedn commented 9 years ago

It looks like the issue with the CopySmall, CopyVerySmall and CopyTiny cases is that CopyInline is not inlined by RyuJIT but it is inlined by JIT64.

mikedn commented 9 years ago

The compare cases appear to be affected by loop head/branch target alignment. There's also a potential issue in the tail loop, which contains two redundant loads that RyuJIT does not eliminate. It could be changed to:

   TAIL:
            while (last > 0)
            {
                byte x = *bpx;
                byte y = *bpy;
                if (x != y)
                    return x - y;

                bpx++;
                bpy++;
                last--;
            }

redknightlois commented 9 years ago

It's definitely an improvement, but with 16 MB buffers the difference is still noticeable.

// BenchmarkDotNet=v0.7.6.0
// OS=Microsoft Windows NT 6.2.9200.0
// Processor=Intel(R) Core(TM) i5-3470 CPU @ 3.20GHz, ProcessorCount=4
// CLR=MS.NET 4.0.30319.42000, Arch=32-bit
Common: Type=HashingBenchmark Mode=Throughput Platform=X64 .NET=Current

| Method | Jit | AvrTime | StdDev | op/s |
|---|---|---|---|---|
| HashBlock | LegacyJit | 110.98 ns | 1.57 ns | 9010673.07 |
| HashBlock | RyuJit | 127.91 ns | 0.948 ns | 7818076.43 |
| HashBlockAlt | LegacyJit | 108.08 ns | 2.88 ns | 9252102.73 |
| HashBlockAlt | RyuJit | 110.14 ns | 0.754 ns | 9079637.52 |
| HashHuge | LegacyJit | 3.46 ms | 17.55 us | 288.93 |
| HashHuge | RyuJit | 4.01 ms | 21.41 us | 249.09 |
| HashHugeAlt | LegacyJit | 3.51 ms | 26.32 us | 284.6 |
| HashHugeAlt | RyuJit | 3.78 ms | 341.05 us | 264.66 |

BruceForstall commented 3 years ago

Given that the benchmark code appears to no longer be available, and there's nothing obviously actionable here, I'm going to go ahead and close this.