Open saucecontrol opened 6 years ago
cc: @dotnet/jit-contrib
I'll take a quick look. Given that it is an x86 regression my guess it might be similar to what we see over in dotnet/runtime#9833.
Thanks, @AndyAyersMS. I did some further optimization of that inner loop code in my main project that altered results slightly. Don't know if it will help track down the regression. Basically, I added more local variables to hold the data block being hashed. Of course those end up getting spilled, but having them and the accumulator variables in the same stack memory area speeds things up slightly. I also re-ordered some of the operations to reduce register dependencies.
Anyway, the updated version of the function now performs the same on 1.1 and 2.0 on the x64 runtimes but has a consistent 4% slowdown in 2.1. x86 shows the same major slowdown between 1.1 and 2.0 and again between 2.0 and 2.1.
Method | Job | Platform | IsBaseline | Mean | Error | StdDev | Scaled | Allocated |
---|---|---|---|---|---|---|---|---|
Blake2bFast | netcoreapp1.1 | X64 | True | 10.69 ms | 0.0147 ms | 0.0123 ms | 1.00 | 0 B |
Blake2bFast | netcoreapp2.0 | X64 | Default | 10.68 ms | 0.0610 ms | 0.0571 ms | 1.00 | 0 B |
Blake2bFast | netcoreapp2.1 | X64 | Default | 11.14 ms | 0.0383 ms | 0.0359 ms | 1.04 | 0 B |
Blake2bFast | netcoreapp1.1 | X86 | True | 69.34 ms | 0.4273 ms | 0.3997 ms | 1.00 | 0 B |
Blake2bFast | netcoreapp2.0 | X86 | Default | 108.07 ms | 0.6709 ms | 0.6275 ms | 1.56 | 0 B |
Blake2bFast | netcoreapp2.1 | X86 | Default | 129.12 ms | 0.7999 ms | 0.7483 ms | 1.86 | 0 B |
One would hope x86 is ~2x slower than x64 as primary data type is long/ulong. But no...
Looking at x86, 2.0 vs 2.1 there is a large size regression. Likely RA.
2.0
; Total bytes of code 49767, prolog size 31 for method Blake2bContext:mixManualInline(int):this
2.1
; Total bytes of code 60608, prolog size 31 for method Blake2bContext:mixManualInline(int):this
Suffice to say this is something of a torture test for an x86 allocator, leading to long stretches of stack shuffling:
89B574FCFFFF mov dword ptr [ebp-38CH], esi
8BB578FCFFFF mov esi, dword ptr [ebp-388H]
81F66BBD41FB xor esi, 0xFB41BD6B
89B578FCFFFF mov dword ptr [ebp-388H], esi
8BB574FCFFFF mov esi, dword ptr [ebp-38CH]
81F6ABD9831F xor esi, 0x1F83D9AB
89B574FCFFFF mov dword ptr [ebp-38CH], esi
8BB578FCFFFF mov esi, dword ptr [ebp-388H]
89B594FEFFFF mov dword ptr [ebp-16CH], esi
8BB574FCFFFF mov esi, dword ptr [ebp-38CH]
89B590FEFFFF mov dword ptr [ebp-170H], esi
8BF1 mov esi, ecx
03B5F4FEFFFF add esi, dword ptr [ebp-10CH]
89B574FCFFFF mov dword ptr [ebp-38CH], esi
8BF2 mov esi, edx
13B5F0FEFFFF adc esi, dword ptr [ebp-110H]
89B578FCFFFF mov dword ptr [ebp-388H], esi
8B75F0 mov esi, dword ptr [ebp-10H]
8B8574FCFFFF mov eax, dword ptr [ebp-38CH]
0306 add eax, dword ptr [esi]
898574FCFFFF mov dword ptr [ebp-38CH], eax
8B8578FCFFFF mov eax, dword ptr [ebp-388H]
134604 adc eax, dword ptr [esi+4]
8BB574FCFFFF mov esi, dword ptr [ebp-38CH]
The methods are too big to make sense of as is (especially given the semi-arbitrary IG breaks) so we should cut them down to a representative sample. Probably just one round of mixing would be plenty to surface the challenges the jit must overcome.
Also need to look into what kind of code jit32 produced in 1.1.
@saucecontrol Unrolled hash computations are often difficult cases for the allocator because everything ends up highly interdependent. But clearly we have a lot of room for improvement on x86 both on a release/release basis and by comparison to where we are on x64.
Speaking of comparisons, do you have a native (C/C++) implementation that you use as a reference point? It also would be interesting to see how far off our x86/x64 performance is from pure native code.
Sure do. There's a project called Blake2.Bench in that same repo that has comparisons with the reference C version called via PInvoke. My C# version started out with about the same performance on x64, but after optimization, it's quite a bit faster. Performance on x86 has never been as good, but 1.1 did admirably.
Here's another run of that last bench with the native results inlcuded
Method | Job | Platform | IsBaseline | Mean | Error | StdDev | Scaled | Allocated |
---|---|---|---|---|---|---|---|---|
Blake2bFast | netcoreapp1.1 | X64 | True | 10.73 ms | 0.0682 ms | 0.0638 ms | 1.00 | 0 B |
Blake2bFast | netcoreapp2.0 | X64 | Default | 10.79 ms | 0.0524 ms | 0.0490 ms | 1.01 | 0 B |
Blake2bFast | netcoreapp2.1 | X64 | Default | 11.19 ms | 0.0609 ms | 0.0569 ms | 1.04 | 0 B |
Blake2bRefNative | netcoreapp2.1 | X64 | Default | 16.12 ms | 0.0334 ms | 0.0279 ms | 1.50 | 0 B |
Blake2bFast | netcoreapp1.1 | X86 | True | 69.10 ms | 0.2009 ms | 0.1781 ms | 1.00 | 0 B |
Blake2bFast | netcoreapp2.0 | X86 | Default | 108.28 ms | 0.6423 ms | 0.5694 ms | 1.57 | 0 B |
Blake2bFast | netcoreapp2.1 | X86 | Default | 128.76 ms | 0.3564 ms | 0.3159 ms | 1.86 | 0 B |
Blake2bRefNative | netcoreapp2.1 | X86 | Default | 50.00 ms | 0.2553 ms | 0.2132 ms | 0.72 | 0 B |
Since register allocation is such an issue, I reckon it's expected x86 would be more than 2x slower since it has less than half the registers to work with. Combine that with the ulong calculations, and 4x slower is probably about right for a target.
I don't know if this is useful or not, but I also have an implementation of the Blake2s algorithm in the main project in that repo. It's the exact same thing but operates on uint words instead of ulong and does 10 mixing rounds instead of 12.
Method | Job | Platform | IsBaseline | Mean | Error | StdDev | Scaled | Allocated |
---|---|---|---|---|---|---|---|---|
Blake2sFast | netcoreapp1.1 | X64 | True | 17.06 ms | 0.2094 ms | 0.1959 ms | 1.00 | 0 B |
Blake2sFast | netcoreapp2.0 | X64 | Default | 17.34 ms | 0.1110 ms | 0.1038 ms | 1.02 | 0 B |
Blake2sFast | netcoreapp2.1 | X64 | Default | 17.23 ms | 0.0149 ms | 0.0107 ms | 1.01 | 0 B |
Blake2sRefNative | netcoreapp2.1 | X64 | Default | 25.89 ms | 0.0216 ms | 0.0168 ms | 1.52 | 0 B |
Blake2sFast | netcoreapp1.1 | X86 | True | 67.97 ms | 0.5795 ms | 0.5420 ms | 1.00 | 0 B |
Blake2sFast | netcoreapp2.0 | X86 | Default | 26.92 ms | 0.0720 ms | 0.0638 ms | 0.40 | 0 B |
Blake2sFast | netcoreapp2.1 | X86 | Default | 30.73 ms | 0.2364 ms | 0.2211 ms | 0.45 | 0 B |
Blake2sRefNative | netcoreapp2.1 | X86 | Default | 32.10 ms | 0.0524 ms | 0.0464 ms | 0.47 | 0 B |
Interesting that RyuJIT-32 does so much better between 1.1 and 2.0 here, but there's a definite regression between 2.0 and 2.1.
Core 1.1 used the older "jit32" jit for x86, and Core 2.0 and later uses RyuJit.
Generally speaking this switch to RyuJit resulted in improved performance for x86. But not always.
I think the perf issues in the initial long
version of your code that show up in 2.0 and 2.1 are related to how RyuJit handles the 64 bit long
computations on x86. The main challenge here is that these need to be decomposed into two 32 bit computations and various complications arise depending on when and how this is done.
Long decomposition doubles the number of things that deserve registers (vs same computation on x64) while at the same time the allocator is trying to cope with the fact that on x86 there are only ~6 allocatable registers, while on x64 there are ~13. Twice as much demand and half as much supply.
I don't have anything specific to call out just yet though, or a good explanation of why it started off poorly and has gotten worse.
Core 1.1 used the older "jit32" jit for x86, and Core 2.0 and later uses RyuJit.
Ohhhhhhh… I feel like I knew that at one point and it fell out of my brain. Doesn't help that BDN will only run Job.RyuJitX86
with their netcore toolchains 😉. That does at least explain some of it.
I found another case that might be related and is much more simple to follow.
BenchmarkDotNet=v0.10.14, OS=Windows 10.0.17134
Intel Xeon CPU E3-1505M v6 3.00GHz, 1 CPU, 8 logical and 4 physical cores
Frequency=2929685 Hz, Resolution=341.3336 ns, Timer=TSC
.NET Core SDK=2.1.301
[Host] : .NET Core 2.1.1 (CoreCLR 4.6.26606.02, CoreFX 4.6.26606.05), 32bit RyuJIT
net46 : .NET Framework 4.7.1 (CLR 4.0.30319.42000), 32bit LegacyJIT-v4.7.3110.0
netcoreapp1.1 : .NET Core 1.1.8 (CoreCLR 4.6.26328.01, CoreFX 4.6.24705.01), 32bit RyuJIT
netcoreapp2.0 : .NET Core 2.0.7 (CoreCLR 4.6.26328.01, CoreFX 4.6.26403.03), 32bit RyuJIT
netcoreapp2.1 : .NET Core 2.1.1 (CoreCLR 4.6.26606.02, CoreFX 4.6.26606.05), 32bit RyuJIT
Method | Job | Runtime | IsBaseline | Mean | Error | StdDev | Scaled |
---|---|---|---|---|---|---|---|
FixedPoint | net46 | Clr | Default | 17.10 us | 0.0742 us | 0.0694 us | 1.00 |
FixedPoint | netcoreapp1.1 | Core | True | 17.15 us | 0.0985 us | 0.0921 us | 1.00 |
FixedPoint | netcoreapp2.0 | Core | Default | 18.37 us | 0.0978 us | 0.0915 us | 1.07 |
FixedPoint | netcoreapp2.1 | Core | Default | 18.56 us | 0.1381 us | 0.1292 us | 1.08 |
This one shows a consistent 6-8% regression between jit32 and RyuJIT on this machine and more like 10-20% on another I tested.
class GreyConverter
{
unsafe static void ToGreyFixedPoint(byte* prgb, byte* pgrey, int cb)
{
const ushort scale = 1 << 15;
const ushort round = 1 << 14;
const ushort rw = (ushort)(0.299f * scale + 0.5f);
const ushort gw = (ushort)(0.587f * scale + 0.5f);
const ushort bw = (ushort)(0.114f * scale + 0.5f);
byte* end = prgb + cb;
while (prgb < end)
{
int val = prgb[0] * rw + prgb[1] * gw + prgb[2] * bw;
*pgrey = (byte)((val + round) >> 15);
prgb += 3;
pgrey++;
}
}
byte[] Grey;
byte[] Rgb;
public GreyConverter()
{
Grey = new byte[16384];
Rgb = new byte[Grey.Length * 3];
new Random(42).NextBytes(Rgb);
}
[Benchmark]
unsafe public void FixedPoint()
{
fixed (byte* prgb = &Rgb[0], pgrey = &Grey[0])
ToGreyFixedPoint(prgb, pgrey, Rgb.Length);
}
}
Here's the BDN assembly dump from the legacy JIT:
081f9690 GreyConverter.ToGreyFixedPoint(Byte*, Byte*, Int32)
081f9697 8b7d08 mov edi,dword ptr [ebp+8]
081f969a 03f9 add edi,ecx
081f969c 3bcf cmp ecx,edi
081f969e 7334 jae 081f96d4
081f96a0 0fb611 movzx edx,byte ptr [ecx]
081f96a3 69d246260000 imul edx,edx,2646h
081f96a9 0fb64101 movzx eax,byte ptr [ecx+1]
081f96ad 69c0234b0000 imul eax,eax,4B23h
081f96b3 03d0 add edx,eax
081f96b5 0fb64102 movzx eax,byte ptr [ecx+2]
081f96b9 69c0980e0000 imul eax,eax,0E98h
081f96bf 03d0 add edx,eax
081f96c1 81c200400000 add edx,4000h
081f96c7 c1fa0f sar edx,0Fh
081f96ca 8816 mov byte ptr [esi],dl
081f96cc 83c103 add ecx,3
081f96cf 46 inc esi
081f96d0 3bcf cmp ecx,edi
081f96d2 72cc jb 081f96a0
081f96d4 5e pop esi
And RyuJIT:
05649198 GreyConverter.ToGreyFixedPoint(Byte*, Byte*, Int32)
0564919e 8bc1 mov eax,ecx
056491a0 034508 add eax,dword ptr [ebp+8]
056491a3 3bc8 cmp ecx,eax
056491a5 7336 jae 056491dd
056491a7 0fb619 movzx ebx,byte ptr [ecx]
056491aa 69f346260000 imul esi,ebx,2646h
056491b0 0fb65901 movzx ebx,byte ptr [ecx+1]
056491b4 69fb234b0000 imul edi,ebx,4B23h
056491ba 03f7 add esi,edi
056491bc 0fb65902 movzx ebx,byte ptr [ecx+2]
056491c0 69fb980e0000 imul edi,ebx,0E98h
056491c6 03f7 add esi,edi
056491c8 81c600400000 add esi,4000h
056491ce 8bde mov ebx,esi
056491d0 c1fb0f sar ebx,0Fh
056491d3 881a mov byte ptr [edx],bl
056491d5 83c103 add ecx,3
056491d8 42 inc edx
056491d9 3bc8 cmp ecx,eax
056491db 72ca jb 056491a7
056491dd 5b pop ebx
The two differences that jump out at me:
1) RyuJIT uses an extra register and has that extra mov ebx, esi
2) jit32 sets up the first two movzx
imul
combinations so they can run in parallel whereas RyuJIT uses ebx
for both.
jit32 sets up the first two movzx imul combinations so they can run in parallel whereas RyuJIT uses ebx for both.
The fact that the same register is used for both shouldn't make any differences, CPUs handle such cases by register renaming.
RyuJIT uses an extra register and has that extra mov ebx, esi
That may be. In general, register to register moves should have very low cost, in many cases being handled by register renaming and thus generating no actual μops. Still, they can degrade performance, if only because of the increased code size.
Hmmm... I didn't think about the register renaming, but if the CPU is doing it in both cases, I'm surprised a 2 byte code size increase would have such a large perf impact. I'll add more cases if I see anything different come up.
Perf difference might be caused by loop top alignment, note the backedge target addresses:
;;; legacy jit
081f96d2 72cc jb 081f96a0
;;; ryjit
056491db 72ca jb 056491a7
There is a perf benefit to having frequent branch targets be addresses == 0 mod 16 (or these days mod 32).
Not sure if legacy jit gets lucky here or would pad above the loops with a slightly different example.
Looks like the alignment was purely coincidental. I added an extra increment before the loop to offset each of them by 1. The RyuJIT version still came out 8% slower with its loop top 16-byte aligned
Method | Job | IsBaseline | Mean | Error | StdDev | Scaled |
---|---|---|---|---|---|---|
FixedPoint | netcoreapp1.1 | True | 17.37 us | 0.0726 us | 0.0679 us | 1.00 |
FixedPoint | netcoreapp2.1 | Default | 18.78 us | 0.1559 us | 0.1458 us | 1.08 |
jit32 asm
074e9690 GreyConverter.ToGreyFixedPoint(Byte*, Byte*, Int32)
074e9697 41 inc ecx
074e9698 8b7d08 mov edi,dword ptr [ebp+8]
074e969b 03f9 add edi,ecx
074e969d 3bcf cmp ecx,edi
074e969f 7334 jae 074e96d5
074e96a1 0fb611 movzx edx,byte ptr [ecx]
074e96a4 69d246260000 imul edx,edx,2646h
074e96aa 0fb64101 movzx eax,byte ptr [ecx+1]
074e96ae 69c0234b0000 imul eax,eax,4B23h
074e96b4 03d0 add edx,eax
074e96b6 0fb64102 movzx eax,byte ptr [ecx+2]
074e96ba 69c0980e0000 imul eax,eax,0E98h
074e96c0 03d0 add edx,eax
074e96c2 81c200400000 add edx,4000h
074e96c8 c1fa0f sar edx,0Fh
074e96cb 8816 mov byte ptr [esi],dl
074e96cd 83c103 add ecx,3
074e96d0 46 inc esi
074e96d1 3bcf cmp ecx,edi
074e96d3 72cc jb 074e96a1
074e96d5 5e pop esi
and RyuJIT
084e91a0 GreyConverter.ToGreyFixedPoint(Byte*, Byte*, Int32)
084e91a6 41 inc ecx
084e91a7 8bc1 mov eax,ecx
084e91a9 034508 add eax,dword ptr [ebp+8]
084e91ac 3bc8 cmp ecx,eax
084e91ae 7336 jae 084e91e6
084e91b0 0fb619 movzx ebx,byte ptr [ecx]
084e91b3 69f346260000 imul esi,ebx,2646h
084e91b9 0fb65901 movzx ebx,byte ptr [ecx+1]
084e91bd 69fb234b0000 imul edi,ebx,4B23h
084e91c3 03f7 add esi,edi
084e91c5 0fb65902 movzx ebx,byte ptr [ecx+2]
084e91c9 69fb980e0000 imul edi,ebx,0E98h
084e91cf 03f7 add esi,edi
084e91d1 81c600400000 add esi,4000h
084e91d7 8bde mov ebx,esi
084e91d9 c1fb0f sar ebx,0Fh
084e91dc 881a mov byte ptr [edx],bl
084e91de 83c103 add ecx,3
084e91e1 42 inc edx
084e91e2 3bc8 cmp ecx,eax
084e91e4 72ca jb 084e91b0
084e91e6 5b pop ebx
In newer processors crossing a 32-byte boundary in a loop makes a difference because of µop cache. It's definitely the case for loops that are 32 bytes long or smaller. Here the loops are larger than 32 bytes but they cross 32-byte boundary once in jit32 case and twice in RyuJit case. That may affect perf. Can you try aligning the loops at 32 bytes in both cases?
See 9.3 in http://www.agner.org/optimize/microarchitecture.pdf for details about µop cache and 32-bytes alignment.
I managed to get both loop tops 32-byte aligned, but I'm not seeing any changes to the perf numbers. And the perf delta holds up even if I force the jit32 output to be misaligned.
Method | Job | IsBaseline | Mean | Error | StdDev | Scaled |
---|---|---|---|---|---|---|
FixedPoint | netcoreapp1.1 | True | 17.36 us | 0.1503 us | 0.1406 us | 1.00 |
FixedPoint | netcoreapp2.1 | Default | 18.64 us | 0.1099 us | 0.0974 us | 1.07 |
jit32
07a59690 GreyConverter.ToGreyFixedPoint(Byte*, Byte*, Int32)
07a59697 8b7d08 mov edi,dword ptr [ebp+8]
07a5969a 03f9 add edi,ecx
07a5969c 3bcf cmp ecx,edi
07a5969e 7334 jae 07a596d4
07a596a0 0fb611 movzx edx,byte ptr [ecx]
07a596a3 69d246260000 imul edx,edx,2646h
07a596a9 0fb64101 movzx eax,byte ptr [ecx+1]
07a596ad 69c0234b0000 imul eax,eax,4B23h
07a596b3 03d0 add edx,eax
07a596b5 0fb64102 movzx eax,byte ptr [ecx+2]
07a596b9 69c0980e0000 imul eax,eax,0E98h
07a596bf 03d0 add edx,eax
07a596c1 81c200400000 add edx,4000h
07a596c7 c1fa0f sar edx,0Fh
07a596ca 8816 mov byte ptr [esi],dl
07a596cc 83c103 add ecx,3
07a596cf 46 inc esi
07a596d0 3bcf cmp ecx,edi
07a596d2 72cc jb 07a596a0
07a596d4 5e pop esi
RyuJIT
07a191a0 GreyConverter.ToGreyFixedPoint(Byte*, Byte*, Int32)
07a191a9 81c100010000 add ecx,100h
07a191af 81c200010000 add edx,100h
07a191b5 83c010 add eax,10h
07a191b8 8d440110 lea eax,[ecx+eax+10h]
07a191bc 3bc8 cmp ecx,eax
07a191be 7336 jae 07a191f6
07a191c0 0fb619 movzx ebx,byte ptr [ecx]
07a191c3 69f346260000 imul esi,ebx,2646h
07a191c9 0fb65901 movzx ebx,byte ptr [ecx+1]
07a191cd 69fb234b0000 imul edi,ebx,4B23h
07a191d3 03f7 add esi,edi
07a191d5 0fb65902 movzx ebx,byte ptr [ecx+2]
07a191d9 69fb980e0000 imul edi,ebx,0E98h
07a191df 03f7 add esi,edi
07a191e1 81c600400000 add esi,4000h
07a191e7 8bde mov ebx,esi
07a191e9 c1fb0f sar ebx,0Fh
07a191ec 881a mov byte ptr [edx],bl
07a191ee 83c103 add ecx,3
07a191f1 42 inc edx
07a191f2 3bc8 cmp ecx,eax
07a191f4 72ca jb 07a191c0
07a191f6 5b pop ebx
json-benchmark (java&.net) Java‘s DSL-JSON is so fast ...
The benchmark almost include the fastest .net json library,like Jil,NetJson.... There was hardly any .NET JSON library can be higher.
when could be CLR can be faster more than JVM?
@sgf that benchmark is not using .net core though. It might still be slower. But just pointing out.
I saw there were some recent LSRA changes and decided to revisit this to see if there was any improvement. I managed to distill my original repro down to a much more reasonable size, with only half a round of hash mixing and with just enough math to touch all the variables. Current code is here and the current numbers look like this:
BenchmarkDotNet=v0.10.14, OS=Windows 10.0.17134
Intel Core i7-6700K CPU 4.00GHz (Skylake), 1 CPU, 8 logical and 4 physical cores
.NET Core SDK=3.0.100-preview-010184
[Host] : .NET Core 2.1.7 (CoreCLR 4.6.27129.04, CoreFX 4.6.27129.04), 64bit RyuJIT
netcoreapp1.1 : .NET Core 1.1.10 (CoreCLR 4.6.26906.01, CoreFX 4.6.24705.01), 64bit RyuJIT
netcoreapp2.0 : .NET Core 2.0.9 (CoreCLR 4.6.26614.01, CoreFX 4.6.26614.01), 64bit RyuJIT
netcoreapp2.1 : .NET Core 2.1.7 (CoreCLR 4.6.27129.04, CoreFX 4.6.27129.04), 64bit RyuJIT
netcoreapp3.0 : .NET Core 3.0.0-preview-27324-5 (CoreCLR 4.6.27322.0, CoreFX 4.7.19.7311), 64bit RyuJIT
Method | Job | Platform | Mean | Error | StdDev | Scaled | ScaledSD |
---|---|---|---|---|---|---|---|
SimplifiedHash | netcoreapp1.1 | X64 | 1.054 ms | 0.0135 ms | 0.0126 ms | 1.00 | 0.00 |
SimplifiedHash | netcoreapp2.0 | X64 | 1.070 ms | 0.0157 ms | 0.0147 ms | 1.02 | 0.02 |
SimplifiedHash | netcoreapp2.1 | X64 | 1.070 ms | 0.0170 ms | 0.0159 ms | 1.02 | 0.02 |
SimplifiedHash | netcoreapp3.0 | X64 | 1.068 ms | 0.0137 ms | 0.0128 ms | 1.01 | 0.02 |
SimplifiedHash | netcoreapp1.1 | X86 | 3.005 ms | 0.0224 ms | 0.0199 ms | 1.00 | 0.00 |
SimplifiedHash | netcoreapp2.0 | X86 | 5.881 ms | 0.0203 ms | 0.0170 ms | 1.96 | 0.01 |
SimplifiedHash | netcoreapp2.1 | X86 | 6.705 ms | 0.0364 ms | 0.0322 ms | 2.23 | 0.02 |
SimplifiedHash | netcoreapp3.0 | X86 | 6.739 ms | 0.0519 ms | 0.0485 ms | 2.24 | 0.02 |
The x64 regression between 1.1 and 2.0 is barely visible, but there is a code size increase to back it up. The x86 regression is quite clear both in the numbers and in the asm. I went ahead and captured JitDisasm and JitDump output for the simplified mixing method on every runtime version (3.0 is from current master) to make this easier to review. Dumps are here: jitdumps.zip
I get similar numbers:
BenchmarkDotNet=v0.10.14, OS=Windows 10.0.18323
Intel Core i7-4770HQ CPU 2.20GHz (Haswell), 1 CPU, 8 logical and 4 physical cores
.NET Core SDK=3.0.100-preview4-010539
[Host] : .NET Core 3.0.0-preview4-27427-5 (CoreCLR 4.6.27427.71, CoreFX 4.7.19.12613), 64bit RyuJIT
netcoreapp1.1 : .NET Core 1.1.2 (CoreCLR 4.6.25211.01, CoreFX 4.6.24705.01), 64bit RyuJIT
netcoreapp2.0 : .NET Core 2.0.9 (CoreCLR 4.6.26614.01, CoreFX 4.6.26614.01), 64bit RyuJIT
netcoreapp2.1 : .NET Core 2.1.7 (CoreCLR 4.6.27129.04, CoreFX 4.6.27129.04), 64bit RyuJIT
netcoreapp3.0 : .NET Core 3.0.0-preview4-27427-5 (CoreCLR 4.6.27427.71, CoreFX 4.7.19.12613), 64bit RyuJIT
Method | Job | Platform | IsBaseline | Mean | Error | StdDev | Scaled | ScaledSD |
---|---|---|---|---|---|---|---|---|
SimplifiedHash | netcoreapp1.1 | X64 | True | 1.259 ms | 0.0142 ms | 0.0111 ms | 1.00 | 0.00 |
SimplifiedHash | netcoreapp2.0 | X64 | Default | 1.342 ms | 0.0126 ms | 0.0099 ms | 1.07 | 0.01 |
SimplifiedHash | netcoreapp2.1 | X64 | Default | 1.316 ms | 0.0229 ms | 0.0214 ms | 1.05 | 0.02 |
SimplifiedHash | netcoreapp3.0 | X64 | Default | 1.351 ms | 0.0141 ms | 0.0125 ms | 1.07 | 0.01 |
SimplifiedHash | netcoreapp1.1 | X86 | True | 4.063 ms | 0.0655 ms | 0.0613 ms | 1.00 | 0.00 |
SimplifiedHash | netcoreapp2.0 | X86 | Default | 8.563 ms | 0.0485 ms | 0.0379 ms | 2.11 | 0.03 |
SimplifiedHash | netcoreapp2.1 | X86 | Default | 8.597 ms | 0.0266 ms | 0.0236 ms | 2.12 | 0.03 |
SimplifiedHash | netcoreapp3.0 | X86 | Default | 8.534 ms | 0.1033 ms | 0.0915 ms | 2.10 | 0.04 |
For the latest RyuJit on x86, we've gotten ourselves into a situation where the only available register for long stretches of code is esi
. Not surprisingly, this leads to a lot of shuffling...
lea esi, [ecx+176]
mov dword ptr [ebp-D8H], esi
mov esi, dword ptr [esi]
mov dword ptr [ebp-ECH], esi
mov esi, dword ptr [ebp-D8H]
G_M44156_IG04:
mov esi, dword ptr [esi+4]
mov dword ptr [ebp-F0H], esi
mov esi, dword ptr [ebp-ECH]
mov dword ptr [ebp-78H], esi
mov esi, dword ptr [ebp-F0H]
mov dword ptr [ebp-7CH], esi
lea esi, [ecx+184]
mov dword ptr [ebp-DCH], esi
mov esi, dword ptr [esi]
mov dword ptr [ebp-F0H], esi
mov esi, dword ptr [ebp-DCH]
mov esi, dword ptr [esi+4]
mov dword ptr [ebp-ECH], esi
mov esi, dword ptr [ebp-F0H]
mov dword ptr [ebp-80H], esi
mov esi, dword ptr [ebp-ECH]
mov dword ptr [ebp-84H], esi
lea esi, [ecx+192]
mov dword ptr [ebp-E0H], esi
mov esi, dword ptr [esi]
mov dword ptr [ebp-ECH], esi
mov esi, dword ptr [ebp-E0H]
mov esi, dword ptr [esi+4]
mov dword ptr [ebp-F0H], esi
mov esi, dword ptr [ebp-ECH]
xor esi, 0xD1FFAB1E
mov dword ptr [ebp-ECH], esi
mov esi, dword ptr [ebp-F0H]
xor esi, 0xD1FFAB1E
I think part of the issue here is that when we decompose a long GT_IND, we save the address to a temp, even if that address is something we could trivially recompute. So we see
lea esi, [ecx+176] // address of low part of some long
mov dword ptr [ebp-D8H], esi // spill the address
mov esi, dword ptr [esi] // load low part
mov dword ptr [ebp-ECH], esi // spill the low part
mov esi, dword ptr [ebp-D8H] // reload the address
G_M44156_IG04:
mov esi, dword ptr [esi+4] // bump address to point at high part
mov dword ptr [ebp-F0H], esi // load the high part
when we could just as easily have the following, and save a bit of register pressure:
mov esi, dword ptr [ecx + 176] // load low part
mov dword ptr [ebp-ECH], esi // spill the low part
G_M44156_IG04:
mov esi, dword ptr [ecx + 180] // bump to point at high part
mov dword ptr [ebp-F0H], esi // load the high part
The other thing I saw was that when a long load is dead but might cause an exception, we don't simplify to something like a null check like jit32 does. Again this saves needing a register.
Let me look into long decomposition and see how hard it would be to teach it not to create a temp for simple address forms.
I have a prototype here OptimizeDecomposeInd... it helps somewhat but we're still nowhere close to jit32.
PMI Diffs for D:\repos\Blake2Fast\tests\JitRegression\bin\Release\netcoreapp3.0\publish\JitRegression.dll for x86 default jit
Summary:
(Lower is better)
Total bytes of diff: -181 (-2.09% of base)
diff is an improvement.
Top file improvements by size (bytes):
-181 : JitRegression.dasm (-2.09% of base)
1 total files with size differences (1 improved, 0 regressed), 0 unchanged.
Top method improvements by size (bytes):
-168 (-5.65% of base) : JitRegression.dasm - Blake2bContext:mixSimplified(int,int)
-6 (-1.19% of base) : JitRegression.dasm - Blake2bContext:Update(ref):this
-3 (-4.11% of base) : JitRegression.dasm - Blake2bContext:addLength(int):this
-2 (-0.45% of base) : JitRegression.dasm - Blake2bContext:Init(int,ref):this
-2 (-0.70% of base) : JitRegression.dasm - Blake2bContext:Finish():ref:this
Top method improvements by size (percentage):
-168 (-5.65% of base) : JitRegression.dasm - Blake2bContext:mixSimplified(int,int)
-3 (-4.11% of base) : JitRegression.dasm - Blake2bContext:addLength(int):this
-6 (-1.19% of base) : JitRegression.dasm - Blake2bContext:Update(ref):this
-2 (-0.70% of base) : JitRegression.dasm - Blake2bContext:Finish():ref:this
-2 (-0.45% of base) : JitRegression.dasm - Blake2bContext:Init(int,ref):this
5 total methods with size differences (5 improved, 0 regressed), 36 unchanged.
Can do similar for stores but since those are all at the end I don't expect that to have a large impact in this case.
Also reasonably broad impact across the frameworks....
PMI Diffs for System.Private.CoreLib.dll, framework assemblies for x86 default jit
Summary:
(Lower is better)
Total bytes of diff: -13859 (-0.05% of base)
diff is an improvement.
Top file regressions by size (bytes):
20 : System.Net.WebSockets.dasm (0.06% of base)
Top file improvements by size (bytes):
-1621 : System.Private.CoreLib.dasm (-0.04% of base)
-1526 : Microsoft.Diagnostics.Tracing.TraceEvent.dasm (-0.07% of base)
-1146 : CommandLine.dasm (-0.41% of base)
-826 : Microsoft.CodeAnalysis.dasm (-0.07% of base)
-791 : System.Memory.dasm (-0.38% of base)
83 total files with size differences (82 improved, 1 regressed), 46 unchanged.
Top method regressions by size (bytes):
381 (11.88% of base) : System.Private.CoreLib.dasm - DecCalc:VarDecDiv(byref,byref)
137 ( 6.24% of base) : Microsoft.Diagnostics.Tracing.TraceEvent.dasm - Microsoft.Diagnostics.Tracing.Analysis.TraceGarbageCollector:Calculate():this
89 ( 4.55% of base) : System.Private.CoreLib.dasm - Grisu3:TryDigitGenShortest(byref,byref,byref,struct,byref,byref):bool
64 ( 5.74% of base) : Microsoft.CodeAnalysis.dasm - Microsoft.CodeAnalysis.Compilation:ConstructModuleSerializationProperties(ref,ref,struct):ref:this
46 (12.23% of base) : System.Linq.Parallel.dasm - LongMinMaxAggregationOperatorEnumerator`1[Double][System.Double]:MoveNextCore(byref):bool:this
Top method improvements by size (bytes):
-183 (-4.11% of base) : System.Private.Uri.dasm - System.Uri:CheckAuthorityHelper(int,ushort,ushort,byref,byref,ref,byref):ushort:this
-137 (-2.20% of base) : System.Reflection.Metadata.dasm - System.Reflection.PortableExecutable.PEHeader:.ctor(byref):this
-65 (-7.58% of base) : jit-analyze.dasm - <>f__AnonymousType0`2[Int64,Int64][System.Int64,System.Int64]:ToString():ref:this
-47 (-1.88% of base) : System.Private.Xml.dasm - System.Xml.Serialization.XmlSerializationWriterCodeGen:WriteEnumMethod(ref):this
-43 (-17.77% of base) : jit-analyze.dasm - <>f__AnonymousType0`2[Int64,Int64][System.Int64,System.Int64]:Equals(ref):bool:this
Top method regressions by size (percentage):
13 (24.07% of base) : System.Linq.Expressions.dasm - <>c__3`2[Int32,Int64][System.Int32,System.Int64]:<RunOnEmptyStack>b__3_0(ref):ref:this
46 (12.23% of base) : System.Linq.Parallel.dasm - LongMinMaxAggregationOperatorEnumerator`1[Double][System.Double]:MoveNextCore(byref):bool:this
381 (11.88% of base) : System.Private.CoreLib.dasm - DecCalc:VarDecDiv(byref,byref)
27 ( 9.68% of base) : System.Linq.Parallel.dasm - System.Linq.Parallel.RepeatEnumerable`1[Int64][System.Int64]:GetPartitions(int):ref:this
6 ( 7.50% of base) : System.Collections.Immutable.dasm - System.Linq.ImmutableArrayExtensions:First(struct):long
Top method improvements by size (percentage):
-11 (-37.93% of base) : Microsoft.Diagnostics.Tracing.TraceEvent.dasm - Microsoft.Diagnostics.Tracing.Etlx.TraceModuleFile:get_ImageEnd():long:this
-15 (-28.85% of base) : jit-analyze.dasm - <>f__AnonymousType0`2[__Canon,Int64][System.__Canon,System.Int64]:get_index():long:this
-15 (-28.85% of base) : jit-analyze.dasm - <>f__AnonymousType0`2[Int32,Int64][System.Int32,System.Int64]:get_index():long:this
-15 (-28.85% of base) : jit-analyze.dasm - <>f__AnonymousType0`2[Double,Int64][System.Double,System.Int64]:get_index():long:this
-15 (-28.85% of base) : jit-analyze.dasm - <>f__AnonymousType0`2[Vector`1,Int64][System.Numerics.Vector`1[System.Single],System.Int64]:get_index():long:this
4642 total methods with size differences (4418 improved, 224 regressed), 221321 unchanged.
Will need to spend more time looking into this to figure out what else must happen to get x86 up to par (and perhaps arm32 too).
Moving this to future as I don't see any quick resolution that would fit into 3.0.
Was just giving this another look with 3.1 RTM, and whatever was going wrong with LSRA on x86 is now impacting x64 as well. The code size really exploded in 3.1.
Runtime | x86 Code Size | x64 Code Size |
---|---|---|
netcoreapp1.1 | 1534 | 577 |
netcoreapp2.0 | 2906 | 602 |
netcoreapp2.1 | 2956 | 604 |
netcoreapp3.1 | 3926 | 2505 |
And the benchmarks are showing the expected perf drop.
BenchmarkDotNet=v0.10.14, OS=Windows 10.0.18362
Intel Core i7-6700K CPU 4.00GHz (Skylake), 1 CPU, 8 logical and 4 physical cores
.NET Core SDK=3.1.100
[Host] : .NET Core 3.1.0 (CoreCLR 4.700.19.56402, CoreFX 4.700.19.56404), 64bit RyuJIT
netcoreapp1.1 : .NET Core 1.1.13 (CoreCLR 4.6.27618.02, CoreFX 4.6.24705.01), 64bit RyuJIT
netcoreapp2.0 : .NET Core 2.0.9 (CoreCLR 4.6.26614.01, CoreFX 4.6.26614.01), 64bit RyuJIT
netcoreapp2.1 : .NET Core 2.1.14 (CoreCLR 4.6.28207.04, CoreFX 4.6.28208.01), 64bit RyuJIT
netcoreapp3.1 : .NET Core 3.1.0 (CoreCLR 4.700.19.56402, CoreFX 4.700.19.56404), 64bit RyuJIT
Runtime | Platform | Mean | Error | StdDev | Scaled |
---|---|---|---|---|---|
netcoreapp1.1 | X64 | 1.002 ms | 0.0090 ms | 0.0084 ms | 1.00 |
netcoreapp2.0 | X64 | 1.039 ms | 0.0182 ms | 0.0152 ms | 1.04 |
netcoreapp2.1 | X64 | 1.039 ms | 0.0200 ms | 0.0178 ms | 1.04 |
netcoreapp3.1 | X64 | 1.154 ms | 0.0222 ms | 0.0256 ms | 1.15 |
netcoreapp1.1 | X86 | 2.944 ms | 0.0154 ms | 0.0144 ms | 1.00 |
netcoreapp2.0 | X86 | 5.780 ms | 0.0375 ms | 0.0313 ms | 1.96 |
netcoreapp2.1 | X86 | 6.545 ms | 0.0329 ms | 0.0292 ms | 2.22 |
netcoreapp3.1 | X86 | 6.768 ms | 0.0422 ms | 0.0374 ms | 2.30 |
JitDisasm and JitDump output for all versions are here: simplifiedhash_jitdumps.zip
cc @CarolEidt -- some interesting stress cases for LSRA
The x64 disassembly for 3.1 shows that tier 0 was used to generate code...
; Tier-0 compilation
; compiler->opts.MinOpts() is true
D'oh. Been a while since that got me.
Updated code sizes still show another increase but not nearly so extreme. Benchmark numbers were correct.
Runtime | x86 Code Size | x64 Code Size |
---|---|---|
netcoreapp1.1 | 1534 | 577 |
netcoreapp2.0 | 2906 | 602 |
netcoreapp2.1 | 2956 | 604 |
netcoreapp3.1 | 3057 | 632 |
And here are the updated dumps: simplifiedhash_jitdumps.zip
Thanks, a quick look over the disassembly diff would indicate that this is more likely to be an issue with null check elimination. I see lots of extra null checks in various places, like here:
- mov r14, qword ptr [rcx+144]
- mov r15, qword ptr [rcx+152]
- mov r12, qword ptr [rcx+160]
- mov r13, qword ptr [rcx+168]
- mov rbx, qword ptr [rcx+176]
- mov rdi, qword ptr [rcx+184]
- mov rsi, 0xD1FFAB1E
- xor rsi, qword ptr [rcx+192]
+ mov rbp, rdx
+ cmp dword ptr [rcx], ecx
+ mov r14, qword ptr [rcx+136]
+ cmp dword ptr [rcx], ecx
+ mov r15, qword ptr [rcx+144]
+ cmp dword ptr [rcx], ecx
+ mov r12, qword ptr [rcx+152]
+ cmp dword ptr [rcx], ecx
+ mov r13, qword ptr [rcx+160]
+ cmp dword ptr [rcx], ecx
+ mov rbx, qword ptr [rcx+168]
+ cmp dword ptr [rcx], ecx
+ mov rdi, qword ptr [rcx+176]
+ cmp dword ptr [rcx], ecx
+ mov rsi, qword ptr [rcx+184]
Thanks for looking into it. You're right, of course.
I actually ran across that issue a few months back and updated the main project (from which the sample is distilled) to replace the struct pointer argument with an interior pointer to the hash state. Running the benchmarks against the main project, there doesn't appear to be a new regression in 3.1... but the drop-off from legacy JIT is still severe.
The x86 issue is the same as dotnet/runtime#8846, and is something I plan to be working on soon (see dotnet/runtime#9399). The former is a loop, and may well need additional work beyond what's proposed in dotnet/runtime#9399, but it suffers from the same fundamental issue that the register allocator will use a bad register that's free (even one that will immediately have to be spilled) rather than spill a register that's occupied.
I've looked a bit more into those null checks and the reason they're not removed is pretty hilarious:
[ 0] 161 (0x0a1) ldarg.0
[ 1] 162 (0x0a2) ldflda 0400000A
[ 1] 167 (0x0a7) ldflda 04000014
[ 1] 172 (0x0ac) ldc.i4.2 2 ; constant
[ 2] 173 (0x0ad) conv.i ; constant
[ 2] 174 (0x0ae) ldc.i4.8 8 ; constant
[ 3] 175 (0x0af) mul ; constant
[ 2] 176 (0x0b0) add
[ 1] 177 (0x0b1) ldind.i8
[ 1] 178 (0x0b2) stloc.s 10
So the C# compiler somehow manages to emit a constant tree and the JIT gets rid of it too late:
fgMorphTree BB01, STMT00018 (before)
[000183] -A-XG------- * ASG long
[000182] D------N---- +--* LCL_VAR long V12 loc10
[000181] *--XG------- \--* IND long
[000180] ---XG------- \--* ADD long
[000174] ---XG------- +--* ADDR long
[000173] ---XG------- | \--* FIELD long FixedElementField
[000172] ---XG------- | \--* ADDR long
[000171] ---XG------- | \--* FIELD struct h
[000170] ------------ | \--* LCL_VAR long V00 arg0
[000179] ------------ \--* MUL long
[000176] ------------ +--* CAST long <- int
[000175] ------------ | \--* CNS_INT int 2
[000178] ------------ \--* CAST long <- int
[000177] ------------ \--* CNS_INT int 8
Before calling fgAddFieldSeqForZeroOffset:
[000173] ---XG--N---- * IND long
[000172] ---XG------- \--* ADDR long
[000171] ---XG------- \--* FIELD struct h
[000170] ------------ \--* LCL_VAR long V00 arg0
fgAddFieldSeqForZeroOffset for Fseq[FixedElementField]
addr (Before)
[000172] ---XG------- ADDR long
(After)
[000172] ---XG------- ADDR long Zero Fseq[FixedElementField]
Before explicit null check morphing:
[000171] ---XG--N---- * FIELD struct h
[000170] ------------ \--* LCL_VAR long V00 arg0
After adding explicit null check:
[000171] ---XG--N---- * IND struct
[000694] ---X-------- \--* COMMA long
[000690] ---X---N---- +--* NULLCHECK byte
[000689] ------------ | \--* LCL_VAR long V00 arg0
The reason for the null check is mac->m_allConstantOffsets
being false
due to that constant tree that's not yet a CNS_INT
.
I haven't checked how come this worked before 3.0 nor if there's any quick fix for this. It looks to me that the main reason for the MAC's mess is the existence of ADDR
nodes so... back to "get rid of ADDR" work.
The generated code also contains a bunch of extra LEA
s that I suspect my trivial forward substitution experiment may eliminate, I'll check that another time.
@CarolEidt you might want to look at this one again as at the heart of it all there are likely LSRA challenges (made worse by clunky long decomposition).
Hash computations tend to be pretty stressful for the allocator.
I've been trying to come up with a simple repro for a perf regression I've been fighting, and I think I have it trimmed down as much as possible while still showing the real-world impact. What I have here is a stripped-down version of my Blake2b hashing algorithm implementation.
The
mixSimplified
method here shows the problem. Code size has grown with each new JIT since 1.1, and performance has dropped.JitDisasm and JitDump output for all versions are here: simplifiedhash_jitdumps.zip
category:cq theme:register-allocator skill-level:expert cost:large