4creators opened this issue 6 years ago
@4creators - thanks for creating this issue. I would love to see some performance tests.
I've ported the SSE4.1 implementation of the Blake2 hashing algorithms from here using the new intrinsics support. The results are impressive, particularly on 32-bit.
Here's some benchmark results comparing the SSE version with the scalar version across a bunch of JIT versions. Benchmark code is here.
My setup
BenchmarkDotNet=v0.10.14, OS=Windows 10.0.17134
Intel Xeon CPU E3-1505M v6 3.00GHz, 1 CPU, 8 logical and 4 physical cores
Frequency=2929692 Hz, Resolution=341.3328 ns, Timer=TSC
[Host] : .NET Framework 4.7.1 (CLR 4.0.30319.42000), 64bit RyuJIT-v4.7.3101.0
net46 : .NET Framework 4.7.1 (CLR 4.0.30319.42000), 64bit LegacyJIT/clrjit-v4.7.3101.0;compatjit-v4.7.3101.0
netcoreapp1.1 : .NET Core 1.1.8 (CoreCLR 4.6.26328.01, CoreFX 4.6.24705.01), 64bit RyuJIT
netcoreapp2.0 : .NET Core 2.0.7 (CoreCLR 4.6.26328.01, CoreFX 4.6.26403.03), 64bit RyuJIT
netcoreapp2.1 : .NET Core 2.1.0-rc1 (CoreCLR 4.6.26426.02, CoreFX 4.6.26426.04), 64bit RyuJIT
And the results. The netcoreapp2.1 runs here are the SSE code.
Method | Job | Jit | Platform | IsBaseline | Mean | Error | StdDev | Scaled | ScaledSD |
---|---|---|---|---|---|---|---|---|---|
Blake2bFast | net46 | LegacyJit | X64 | Default | 29.32 ms | 0.1556 ms | 0.1379 ms | 2.08 | 0.01 |
Blake2bFast | netcoreapp1.1 | RyuJit | X64 | True | 14.11 ms | 0.0630 ms | 0.0589 ms | 1.00 | 0.00 |
Blake2bFast | netcoreapp2.0 | RyuJit | X64 | Default | 14.71 ms | 0.0887 ms | 0.0830 ms | 1.04 | 0.01 |
Blake2bFast | netcoreapp2.1 | RyuJit | X64 | Default | 11.86 ms | 0.0676 ms | 0.0632 ms | 0.84 | 0.01 |
Blake2bFast | net46 | LegacyJit | X86 | Default | 93.65 ms | 0.4501 ms | 0.3990 ms | 1.00 | 0.01 |
Blake2bFast | netcoreapp1.1 | RyuJit | X86 | True | 93.50 ms | 0.4214 ms | 0.3736 ms | 1.00 | 0.00 |
Blake2bFast | netcoreapp2.0 | RyuJit | X86 | Default | 159.46 ms | 0.9692 ms | 0.8592 ms | 1.71 | 0.01 |
Blake2bFast | netcoreapp2.1 | RyuJit | X86 | Default | 16.24 ms | 0.0809 ms | 0.0757 ms | 0.17 | 0.00 |
Blake2sFast | net46 | LegacyJit | X64 | Default | 54.65 ms | 0.2645 ms | 0.2474 ms | 2.47 | 0.02 |
Blake2sFast | netcoreapp1.1 | RyuJit | X64 | True | 22.13 ms | 0.2169 ms | 0.2029 ms | 1.00 | 0.00 |
Blake2sFast | netcoreapp2.0 | RyuJit | X64 | Default | 23.12 ms | 0.3318 ms | 0.3104 ms | 1.04 | 0.02 |
Blake2sFast | netcoreapp2.1 | RyuJit | X64 | Default | 15.62 ms | 0.1362 ms | 0.1274 ms | 0.71 | 0.01 |
Blake2sFast | net46 | LegacyJit | X86 | Default | 71.30 ms | 0.1913 ms | 0.1597 ms | 0.99 | 0.00 |
Blake2sFast | netcoreapp1.1 | RyuJit | X86 | True | 71.72 ms | 0.2880 ms | 0.2694 ms | 1.00 | 0.00 |
Blake2sFast | netcoreapp2.0 | RyuJit | X86 | Default | 37.72 ms | 0.2314 ms | 0.2165 ms | 0.53 | 0.00 |
Blake2sFast | netcoreapp2.1 | RyuJit | X86 | Default | 15.99 ms | 0.1289 ms | 0.1205 ms | 0.22 | 0.00 |
There's a small but consistent performance regression between .NET Core 1.1 and 2.0 on 64-bit, and it persists in 2.1 when running the scalar version of the code. On 32-bit, the scalar Blake2s implementation got way faster between 1.1 and 2.0 while the Blake2b implementation got way slower.
I'll see if I can create a simpler repro for those regressions. But either way, the intrinsics support is a huge win.
Big thanks to everyone working on this. I'm planning on writing an accelerated JPEG codec if nobody beats me to it and will report back with what I learn from that.
Thanks for the numbers @saucecontrol, great work!
Would you happen to also have metrics for the native implementation for comparison? There are still a lot of known minor optimizations/tweaks/work needed for the HWIntrinsics in the JIT, so it would be interesting to determine where we are right now (in comparison).
Would you also mind sharing if you found anything particularly good or particularly painful (etc) about the porting experience, the current API shape, etc?
Oh yeah, I compiled the native code into DLLs and included the PInvoke code in my benchmark app. I meant to include those numbers too. Everything's really close.
Method | Platform | Mean | Error | StdDev |
---|---|---|---|---|
Blake2bSseNative | X64 | 11.29 ms | 0.0725 ms | 0.0643 ms |
Blake2bFast | X64 | 11.89 ms | 0.1166 ms | 0.1091 ms |
Blake2bSseNative | X86 | 13.41 ms | 0.0668 ms | 0.0625 ms |
Blake2bFast | X86 | 16.30 ms | 0.1278 ms | 0.1195 ms |
Blake2sSseNative | X64 | 16.12 ms | 0.1365 ms | 0.1277 ms |
Blake2sFast | X64 | 15.80 ms | 0.1297 ms | 0.1213 ms |
Blake2sSseNative | X86 | 16.24 ms | 0.0841 ms | 0.0786 ms |
Blake2sFast | X86 | 16.03 ms | 0.0916 ms | 0.0857 ms |
Those are the timings for hashing 10MiB of data, BTW.
As far as the dev experience goes, there's a definite learning curve, but once I found the XML comments that include the Intel intrinsic mappings like here, it got easier. Those didn't show up in IntelliSense, but I'm not sure whether that's a problem with the tooling or the NuGet package or something else.
And some things are a little clumsy, particularly with things like using `_mm_blend_epi16` to blend `uint` values. I ended up writing a lot of helper methods to `StaticCast<U, T>()` things back and forth. But really, once those were in place and I had a pattern down, it went smoothly.
Just having access to the intrinsics at all is such a big thing that I can't complain too much about the usability. Keep up the good work. I'm looking forward to seeing what kind of damage I can do once AVX2 is finished :)
@saucecontrol, Thanks for taking the time to do this!
> Everything's really close.
It's really great to see the numbers are so close (basically within the error/stddev range)
> Those didn't show up in IntelliSense, but I'm not sure whether that's a problem with the tooling or the NuGet package or something else
@eerhardt, is this a packing issue with the project?
> And some things are a little clumsy, particularly with things like using `_mm_blend_epi16` to blend `uint` values.
Great feedback. This is made significantly easier in native land since the integer types are all combined into one type (__m128i).
> Just having access to the intrinsics at all is such a big thing that I can't complain too much about the usability.
Feel free to log issues and tag @CarolEidt, @eerhardt, myself and @fiigii on anything else you think might improve the experience or any issues you find. The API is still in preview which means we still have the opportunity to fix/improve things in the surface area.
> It's really great to see the numbers are so close (basically within the error/stddev range)
Absolutely. The C# version has some optimizations that aren't done in the native code, so it's not a 100% even comparison, but it's really encouraging that they're so close already, especially if you already have some plans for things that can be improved.
> This is made significantly easier in native land since the integer types are all combined into one type (`__m128i`).
Yeah, for sure. I think for the masked shuffle-type instructions in particular, it's common to use them with an element size larger than the instruction defines since you can just pair or quad up the mask bits. Having the same vector type for all of those makes it seamless in native. Like I said, it wasn't a huge deal once I got some helper methods written to cast the arguments and then cast the result back, but it's a lot of .NET code for what ends up emitting a single instruction. And the codegen can be a little finicky with combinations of nested inline methods, so in some cases I had to write out the verbose version every time to avoid a performance hit.
> The API is still in preview which means we still have the opportunity to fix/improve things in the surface area.
That's great to know, thanks. Aside from the minor inconvenience around the explicit casting, I couldn't be happier with the way it's going. Once I got used to the method naming convention, it really started to grow on me, and I like the way everything is separated by instruction set (SSE/SSE2 overlap weirdness aside).
> Those didn't show up in IntelliSense, but I'm not sure whether that's a problem with the tooling or the NuGet package or something else
> @eerhardt, is this a packing issue with the project?
We aren't packaging any XML doc comments for these APIs. I don't think anyone has written the official doc comments yet. While the mapping to C++ functions is useful, we need better descriptions in the official doc comments.
Adding @kunalspathak as an owner since it is related to perf tests.
The HW intrinsics project is approaching a point where it will be possible to write real-life code using the available ISAs. This should allow us to validate the intrinsics implementation using complex functionality like cryptographic, media, and scientific algorithms and applications.
Possible candidates include: the SHA-3 Keccak algorithm, JPEG 2000 / JPEG codecs, FFT, some regex algorithms, and many others.
This issue could be used for discussion on performance tests and their implementation.
@CarolEidt @fiigii @mikedn @sdmaclea @tannergooding @AndyAyersMS
category:testing theme:hardware-intrinsics skill-level:intermediate cost:medium