dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License

Create HW intrinsics performance tests to track performance change and validate implementation #9908

Open 4creators opened 6 years ago

4creators commented 6 years ago

The HW intrinsics project is approaching a point where it will be possible to write real-life code using the available ISAs. This should allow us to validate the intrinsics implementation with complex functionality such as cryptographic, media, and scientific algorithms and applications.

Possible candidates include the SHA-3 Keccak algorithm, JPEG 2000 / JPEG algorithms, FFT, some regex algorithms, and many others.

This issue could be used for discussion on performance tests and their implementation.
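One pattern such tests could follow is timing an intrinsic code path against a scalar reference and asserting that both produce identical results. A minimal sketch of that shape, using a trivial XOR-fold as a stand-in kernel (the `XorFold` name and reduction are illustrative, not from any real test suite):

```csharp
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

public static class XorFold
{
    // Scalar reference implementation: XOR all elements together.
    public static ulong Scalar(ulong[] data)
    {
        ulong acc = 0;
        foreach (ulong v in data) acc ^= v;
        return acc;
    }

    // SSE2 path: fold two 64-bit lanes at a time, then combine the lanes.
    // Falls back to the scalar reference when SSE2 is unavailable or the
    // input length is not a multiple of two.
    public static ulong Vectorized(ulong[] data)
    {
        if (!Sse2.IsSupported || data.Length % 2 != 0)
            return Scalar(data);

        Vector128<ulong> acc = Vector128<ulong>.Zero;
        for (int i = 0; i < data.Length; i += 2)
            acc = Sse2.Xor(acc, Vector128.Create(data[i], data[i + 1]));
        return acc.GetElement(0) ^ acc.GetElement(1);
    }
}
```

A performance test can then run both paths over the same buffer, compare the timings, and fail if the results ever diverge.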

@CarolEidt @fiigii @mikedn @sdmaclea @tannergooding @AndyAyersMS

category:testing theme:hardware-intrinsics skill-level:intermediate cost:medium

CarolEidt commented 6 years ago

@4creators - thanks for creating this issue. I would love to see some performance tests.

saucecontrol commented 6 years ago

I've ported the SSE4.1 implementation of the Blake2 hashing algorithms from here using the new intrinsics support. The results are impressive, particularly on 32-bit.

Here are some benchmark results comparing the SSE version with the scalar version across a bunch of JIT versions. Benchmark code is here.

My setup

```
BenchmarkDotNet=v0.10.14, OS=Windows 10.0.17134
Intel Xeon CPU E3-1505M v6 3.00GHz, 1 CPU, 8 logical and 4 physical cores
Frequency=2929692 Hz, Resolution=341.3328 ns, Timer=TSC
  [Host]        : .NET Framework 4.7.1 (CLR 4.0.30319.42000), 64bit RyuJIT-v4.7.3101.0
  net46         : .NET Framework 4.7.1 (CLR 4.0.30319.42000), 64bit LegacyJIT/clrjit-v4.7.3101.0;compatjit-v4.7.3101.0
  netcoreapp1.1 : .NET Core 1.1.8 (CoreCLR 4.6.26328.01, CoreFX 4.6.24705.01), 64bit RyuJIT
  netcoreapp2.0 : .NET Core 2.0.7 (CoreCLR 4.6.26328.01, CoreFX 4.6.26403.03), 64bit RyuJIT
  netcoreapp2.1 : .NET Core 2.1.0-rc1 (CoreCLR 4.6.26426.02, CoreFX 4.6.26426.04), 64bit RyuJIT
```

And the results; the netcoreapp2.1 runs here use the SSE code.

| Method | Job | Jit | Platform | IsBaseline | Mean | Error | StdDev | Scaled | ScaledSD |
|--------|-----|-----|----------|------------|-----:|------:|-------:|-------:|---------:|
| Blake2bFast | net46 | LegacyJit | X64 | Default | 29.32 ms | 0.1556 ms | 0.1379 ms | 2.08 | 0.01 |
| Blake2bFast | netcoreapp1.1 | RyuJit | X64 | True | 14.11 ms | 0.0630 ms | 0.0589 ms | 1.00 | 0.00 |
| Blake2bFast | netcoreapp2.0 | RyuJit | X64 | Default | 14.71 ms | 0.0887 ms | 0.0830 ms | 1.04 | 0.01 |
| Blake2bFast | netcoreapp2.1 | RyuJit | X64 | Default | 11.86 ms | 0.0676 ms | 0.0632 ms | 0.84 | 0.01 |
| Blake2bFast | net46 | LegacyJit | X86 | Default | 93.65 ms | 0.4501 ms | 0.3990 ms | 1.00 | 0.01 |
| Blake2bFast | netcoreapp1.1 | RyuJit | X86 | True | 93.50 ms | 0.4214 ms | 0.3736 ms | 1.00 | 0.00 |
| Blake2bFast | netcoreapp2.0 | RyuJit | X86 | Default | 159.46 ms | 0.9692 ms | 0.8592 ms | 1.71 | 0.01 |
| Blake2bFast | netcoreapp2.1 | RyuJit | X86 | Default | 16.24 ms | 0.0809 ms | 0.0757 ms | 0.17 | 0.00 |
| Blake2sFast | net46 | LegacyJit | X64 | Default | 54.65 ms | 0.2645 ms | 0.2474 ms | 2.47 | 0.02 |
| Blake2sFast | netcoreapp1.1 | RyuJit | X64 | True | 22.13 ms | 0.2169 ms | 0.2029 ms | 1.00 | 0.00 |
| Blake2sFast | netcoreapp2.0 | RyuJit | X64 | Default | 23.12 ms | 0.3318 ms | 0.3104 ms | 1.04 | 0.02 |
| Blake2sFast | netcoreapp2.1 | RyuJit | X64 | Default | 15.62 ms | 0.1362 ms | 0.1274 ms | 0.71 | 0.01 |
| Blake2sFast | net46 | LegacyJit | X86 | Default | 71.30 ms | 0.1913 ms | 0.1597 ms | 0.99 | 0.00 |
| Blake2sFast | netcoreapp1.1 | RyuJit | X86 | True | 71.72 ms | 0.2880 ms | 0.2694 ms | 1.00 | 0.00 |
| Blake2sFast | netcoreapp2.0 | RyuJit | X86 | Default | 37.72 ms | 0.2314 ms | 0.2165 ms | 0.53 | 0.00 |
| Blake2sFast | netcoreapp2.1 | RyuJit | X86 | Default | 15.99 ms | 0.1289 ms | 0.1205 ms | 0.22 | 0.00 |

There's a small but consistent performance regression that shows up between .NET Core 1.1 and 2.0 on 64-bit and persists in 2.1 when running the scalar version of the code. On 32-bit, the scalar Blake2s implementation got way faster between 1.1 and 2.0, while the Blake2b implementation got way slower.

I'll see if I can create a simpler repro for those regressions. But either way, the intrinsics support is a huge win.

Big thanks to everyone working on this. I'm planning on writing an accelerated JPEG codec if nobody beats me to it and will report back with what I learn from that.

tannergooding commented 6 years ago

Thanks for the numbers @saucecontrol, great work!

Would you happen to also have metrics for the native implementation for comparison? There are still a lot of known minor optimizations/tweaks/work needed for the HWIntrinsics in the JIT, so it would be interesting to determine where we are right now (in comparison).

Would you also mind sharing if you found anything particularly good or particularly painful (etc) about the porting experience, the current API shape, etc?

saucecontrol commented 6 years ago

Oh yeah, I compiled the native code into DLLs and included the PInvoke code in my benchmark app. I meant to include those numbers too. Everything's really close.

| Method | Platform | Mean | Error | StdDev |
|--------|----------|-----:|------:|-------:|
| Blake2bSseNative | X64 | 11.29 ms | 0.0725 ms | 0.0643 ms |
| Blake2bFast | X64 | 11.89 ms | 0.1166 ms | 0.1091 ms |
| Blake2bSseNative | X86 | 13.41 ms | 0.0668 ms | 0.0625 ms |
| Blake2bFast | X86 | 16.30 ms | 0.1278 ms | 0.1195 ms |
| Blake2sSseNative | X64 | 16.12 ms | 0.1365 ms | 0.1277 ms |
| Blake2sFast | X64 | 15.80 ms | 0.1297 ms | 0.1213 ms |
| Blake2sSseNative | X86 | 16.24 ms | 0.0841 ms | 0.0786 ms |
| Blake2sFast | X86 | 16.03 ms | 0.0916 ms | 0.0857 ms |

Those are the timings for hashing 10MiB of data, BTW.

As far as the dev experience goes, there's a definite learning curve, but once I found the XML comments that include the Intel intrinsic mappings, like here, it got easier. Those didn't show up in IntelliSense, but I'm not sure whether that's a problem with the tooling, the NuGet package, or something else.

And some things are a little clumsy, particularly things like using `_mm_blend_epi16` to blend uint values. I ended up writing a lot of helper methods to `StaticCast<U, T>()` things back and forth. But really, once those were in place and I had a pattern down, it went smoothly.
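The uint-blend workaround described above can be sketched with reinterpret-cast helpers. This is a hypothetical helper (not from the benchmark code) written against the shipped API shape, where the `Vector128.As*` extension methods play the role that `StaticCast` did in the preview-era API; each uint lane simply maps to a pair of ushort lanes in the blend mask:

```csharp
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

public static class BlendHelper
{
    // Blend 32-bit lanes using the 16-bit blend instruction (_mm_blend_epi16)
    // by doubling up the mask bits: bits 2i and 2i+1 of mask16 select the
    // low and high halves of uint lane i. The As* casts are free reinterprets,
    // so this still emits a single pblendw.
    public static Vector128<uint> BlendUInt32(
        Vector128<uint> a, Vector128<uint> b, byte mask16)
    {
        return Sse41.Blend(a.AsUInt16(), b.AsUInt16(), mask16).AsUInt32();
    }
}
```

For example, `mask16 = 0b0000_1111` takes the low two uint lanes from `b` and the high two from `a`.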

Just having access to the intrinsics at all is such a big thing, I can't complain too much about the usability. Keep up the good work. I'm looking forward to seeing what kind of damage I can do once AVX2 is finished :)

tannergooding commented 6 years ago

@saucecontrol, thanks for taking the time to do this!

> Everything's really close.

It's really great to see the numbers are so close (basically within the error/stddev range)

> Those didn't show up in IntelliSense, but I'm not sure whether that's a problem with the tooling, the NuGet package, or something else

@eerhardt, is this a packing issue with the project?

> And some things are a little clumsy, particularly with things like using `_mm_blend_epi16` to blend uint values.

Great feedback. This is made significantly easier in native land since the integer types are all combined into one type (`__m128i`).

> Just having access to the intrinsics at all is such a big thing, I can't complain too much about the usability.

Feel free to log issues and tag @CarolEidt, @eerhardt, myself and @fiigii on anything else you think might improve the experience or any issues you find. The API is still in preview which means we still have the opportunity to fix/improve things in the surface area.

saucecontrol commented 6 years ago

> It's really great to see the numbers are so close (basically within the error/stddev range)

Absolutely. The C# version has some optimizations that aren't done in the native code, so it's not a 100% even comparison, but it's really encouraging that they're so close already, especially if you already have some plans for things that can be improved.

> This is made significantly easier in native land since the integer types are all combined into one type (`__m128i`).

Yeah, for sure. I think for the masked shuffle-type instructions in particular, it's common to use them with an element size larger than the instruction defines since you can just pair or quad up the mask bits. Having the same vector type for all of those makes it seamless in native. Like I said, it wasn't a huge deal once I got some helper methods written to cast the arguments and then cast the result back, but it's a lot of .NET code for what ends up emitting a single instruction. And the codegen can be a little finicky with combinations of nested inline methods, so in some cases I had to write out the verbose version every time to avoid a performance hit.
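As an illustration of that pairing trick, a 32-bit rotate of each 64-bit lane, which is a move the Blake2b round function relies on, can reuse the 32-bit shuffle instruction by swapping the two uint halves of every ulong lane. This is a sketch with an illustrative helper name, not code from the port:

```csharp
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

public static class ShuffleHelpers
{
    // Rotate each 64-bit lane right by 32 bits using the 32-bit shuffle
    // (pshufd): control 0b10_11_00_01 swaps uint lanes 0<->1 and 2<->3,
    // i.e. the two halves of each ulong. The As* casts are free
    // reinterprets, so this is a single instruction.
    public static Vector128<ulong> RotateRight32(Vector128<ulong> v)
        => Sse2.Shuffle(v.AsUInt32(), 0b10_11_00_01).AsUInt64();
}
```

Without the reinterpret casts, each such call site needs the same boilerplate spelled out by hand, which is exactly the verbosity described above.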

> The API is still in preview which means we still have the opportunity to fix/improve things in the surface area.

That's great to know, thanks. Aside from the minor inconvenience around the explicit casting, I couldn't be happier with the way it's going. Once I got used to the method naming convention, it really started to grow on me, and I like the way everything is separated by instruction set (SSE/SSE2 overlap weirdness aside).

eerhardt commented 6 years ago

> Those didn't show up in IntelliSense, but I'm not sure whether that's a problem with the tooling, the NuGet package, or something else

> @eerhardt, is this a packing issue with the project?

We aren't packaging any XML doc comments for these APIs. I don't think anyone has written the official doc comments yet. While the mapping to C++ functions is useful, we need better descriptions in the official doc comments.

JulieLeeMSFT commented 4 years ago

Adding @kunalspathak as an owner since this is related to perf tests.