Tagging subscribers to this area: @tannergooding. Notify danmosemsft if you want to be subscribed.
I can give this a shot
These ARM64 intrinsics are missing and are needed to make the code "worthwhile":
Tracked by #32512: Done!
UnsignedSaturatingSubtract - vqsubq_u8 - UQSUB
Tracked by #33398:
UnsignedShiftRight - vshrq_n_u8 - USHR
UnsignedShiftLeft - vshlq_n_u8 - USHL
Tracked by #1277: Done!
TableLookup - vqtbl4q_u8 - TBL
TableLookupExtension - vqtbx4q_u8 - TBX
These helper methods could be accelerated to help it and are tracked in #33496:
Vector128.Create(byte)
These intrinsics could be introduced to maximise perf (a sketch of the LD4-style de-interleave semantics follows this list):
Load 3 Vector128<T> - vld3q_u8 - LD3
Store 3 Vector128<T> - vst3q_u8 - ST3
Load 4 Vector128<T> - vld4q_u8 - LD4
Store 4 Vector128<T> - vst4q_u8 - ST4
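For anyone unfamiliar with these, here is a minimal scalar sketch of the de-interleave that LD4 (vld4q_u8) performs; the helper name is made up and this only illustrates the semantics, not a proposed API. LD4 reads 64 consecutive bytes and splits them into four 16-byte vectors; ST3/ST4 are the interleaving inverse on store.

using System;

// Hypothetical helper, for illustration only: what vld4q_u8 (LD4) does,
// written as scalar C#. Byte 0 goes to v0[0], byte 1 to v1[0], byte 2 to
// v2[0], byte 3 to v3[0], byte 4 to v0[1], and so on.
static void DeinterleaveLikeLd4(ReadOnlySpan<byte> src,
    Span<byte> v0, Span<byte> v1, Span<byte> v2, Span<byte> v3)
{
    for (int i = 0; i < 16; i++)
    {
        v0[i] = src[4 * i + 0];
        v1[i] = src[4 * i + 1];
        v2[i] = src[4 * i + 2];
        v3[i] = src[4 * i + 3];
    }
}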
https://github.com/john-h-k/runtime/commit/adeea3662d12c2cc06a241736c34cb6e4f22f446
Draft commit; this is most of what can be done until the necessary intrinsics are added.
Thanks @john-h-k, as per our conversation on Discord, I'll ping you once the necessary intrinsics are added so you can make further progress.
@tannergooding @john-h-k The shift intrinsics went in with https://github.com/dotnet/runtime/pull/36830. There have been many improvements to System.Runtime.Intrinsics optimization in https://github.com/dotnet/runtime/issues/33496. Is this work now unblocked?
Yes, this should be unblocked now.
@john-h-k Are you planning to try and finish this for .NET 5, now that it's unblocked?
If not, we'll have 3-4 people free to help tackle the remaining issues from https://github.com/dotnet/runtime/issues/33308 on Monday.
CC @jeffhandley
Yes, I should be. I just need to benchmark the two possible implementations I have locally, either on my old Pi 2B or by getting a friend to run them on a newer device.
@tannergooding @jeffhandley What's the plan for this for .NET 5?
@tannergooding is planning to carry this one over the finish line ahead of RC1.
We could not get this completed in time for 5.0.0; moved to 6.0.0.
@a74nh - I am assigning this to you. Feel free to reassign to someone who will work on this.
Edit: For some reason, I am not able to assign this.
Base64 relies heavily on shuffling, so shuffle intrinsics are a prerequisite for this issue. Is shuffling an xplat intrinsic now? At least in .NET 7 Preview 2 I can't find it.
> Is shuffling an xplat intrinsic now? At least in .NET 7 Preview 2 I can't find it.
I'm actively working on the support locally. Hoping to get it done soon but needing to balance that work against the Generic Math work.
> I'm actively working on the support locally.
> @a74nh - I am assigning this to you. Feel free to reassign to someone who will work on this.
> Edit: For some reason, I am not able to assign this.
I can't assign this either. It should go to @SwapnilGaikwad instead of me.
> Base64 relies heavily on shuffling, so shuffle intrinsics are a prerequisite for this issue. Is shuffling an xplat intrinsic now? At least in .NET 7 Preview 2 I can't find it.
In https://github.com/dotnet/runtime/pull/67192 I introduced a temp helper for hex encoding until https://github.com/dotnet/runtime/issues/63331 lands 🙂
IMO it's better to wait for https://github.com/dotnet/runtime/issues/63331. Otherwise the code has to be touched again to revert the helper change.
I'm also going to update https://github.com/gfoidl/Base64, where fuzz tests are run by a scheduled CI trigger, so any fault should get caught early enough for .NET 7 RTM.
Looks like Shuffle has landed in https://github.com/dotnet/runtime/pull/68559, so this can get started? Note, as called out in #67339, a lot of Base64Encode benchmarks like System.Buffers.Text.Tests.Base64Tests.Base64Encode(NumberOfBytes: 1000) are up to 16 times slower. So we should fix this sooner to make sure it lands in .NET 7.
@a74nh, @SwapnilGaikwad
Noting that only the Shuffle(x, mask) API has landed and it still has a few quirks, but it should be generally usable, especially with constant inputs. The Shuffle(x, y, mask) API is going to need https://github.com/dotnet/runtime/pull/68874 (which needs a couple bits of PR feedback handled and a final review).
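For reference, a minimal usage sketch of the Shuffle(x, mask) API that has landed, using a constant index vector; the values are illustrative and not taken from the Base64 code. An index outside 0..15 produces 0 in that element.

using System.Runtime.Intrinsics;

Vector128<byte> source = Vector128.Create(
    (byte)0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15);

// Constant indices: reverse the byte order of the source vector.
Vector128<byte> indices = Vector128.Create(
    (byte)15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0);

Vector128<byte> reversed = Vector128.Shuffle(source, indices);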
We are looking at this....
With a recent HEAD, I took the Ssse3 C# code and (mostly) updated it to use Vector128. On Arm64, most things compiled down to Neon code ... but Vector128.Shuffle() generated a function call:
Annotated IR dump of part of the function:
Store str
IN0024: 0000B0 str q0, [fp,#144] // [V17 loc10]
Load str
IN0025: 0000B4 ldr q0, [fp,#144] // [V17 loc10]
Vector128.ShiftRightLogical(str.AsInt32(), 4).AsSByte(),
IN0026: 0000B8 ushr v0.4s, v0.4s, #4
Load mask2F
IN0027: 0000BC ldr q1, [fp,#224] // [V11 loc4]
Vector128<sbyte> hiNibbles = Vector128.BitwiseAnd
IN0028: 0000C0 and v0.16b, v0.16b, v1.16b
Store back to hiNibbles
IN0029: 0000C4 str q0, [fp,#128] // [V18 loc11]
Load str (again!)
IN002a: 0000C8 ldr q0, [fp,#144] // [V17 loc10]
Load mask2F (again!)
IN002b: 0000CC ldr q1, [fp,#224] // [V11 loc4]
Vector128<sbyte> loNibbles = Vector128.BitwiseAnd(str, mask2F);
IN002c: 0000D0 and v0.16b, v0.16b, v1.16b
Store loNibbles
IN002d: 0000D4 str q0, [fp,#112] // [V19 loc12]
Load lutHi
IN002e: 0000D8 ldr q0, [fp,#288] // [V07 loc0]
Load hiNibbles
IN002f: 0000DC ldr q1, [fp,#128] // [V18 loc11]
Function call to shuffle
IN0030: 0000E0 movz x0, #0xb7c8
IN0031: 0000E4 movk x0, #0x7e74 LSL #16
IN0032: 0000E8 movk x0, #0xffff LSL #32
IN0033: 0000EC ldr x0, [x0]
IN0034: 0000F0 blr x0
IN0035: 0000F4 str q0, [fp,#96] // [V20 loc13]
Load lutLo
IN0036: 0000F8 ldr q0, [fp,#272] // [V08 loc1]
Load loNibbles
IN0037: 0000FC ldr q1, [fp,#112] // [V19 loc12]
Function call to shuffle
IN0038: 000100 movz x0, #0xb7c8
IN0039: 000104 movk x0, #0x7e74 LSL #16
IN003a: 000108 movk x0, #0xffff LSL #32
IN003b: 00010C ldr x0, [x0]
IN003c: 000110 blr x0
IN003d: 000114 str q0, [fp,#80] // [V21 loc14]
The above dump was using public static Vector128 ...
> On Arm64, most things compiled down to Neon code ... but Vector128.Shuffle() generated a function call
Can you share the managed code that was being used here, preferably including the Shuffle call and how its arguments are being passed in?
This code was made by taking Ssse3Decode(), replacing Sse2 with Vector128, then fixing up the function calls. Quite possible there's an obvious mistake on my end.
// The JIT won't hoist these "constants", so help it
Vector128<sbyte> lutHi = Vector128.Create(
0x10, 0x10, 0x01, 0x02,
0x04, 0x08, 0x04, 0x08,
0x10, 0x10, 0x10, 0x10,
0x10, 0x10, 0x10, 0x10);
Vector128<sbyte> lutLo = Vector128.Create(
0x15, 0x11, 0x11, 0x11,
0x11, 0x11, 0x11, 0x11,
0x11, 0x11, 0x13, 0x1A,
0x1B, 0x1B, 0x1B, 0x1A);
Vector128<sbyte> lutShift = Vector128.Create(
0, 16, 19, 4,
-65, -65, -71, -71,
0, 0, 0, 0,
0, 0, 0, 0);
Vector128<sbyte> packBytesMask = Vector128.Create(
2, 1, 0, 6,
5, 4, 10, 9,
8, 14, 13, 12,
-1, -1, -1, -1);
Vector128<sbyte> mask2F = Vector128.Create((sbyte)'/');
Vector128<sbyte> mergeConstant0 = Vector128.Create(0x01400140).AsSByte();
Vector128<short> mergeConstant1 = Vector128.Create(0x00011000).AsInt16();
Vector128<sbyte> zero = Vector128<sbyte>.Zero;
byte* src = srcBytes;
byte* dest = destBytes;
Vector128<sbyte> str = Vector128.LoadUnsafe(ref *src).AsSByte();
// lookup
Vector128<sbyte> hiNibbles = Vector128.BitwiseAnd(Vector128.ShiftRightLogical(str.AsInt32(), 4).AsSByte(), mask2F);
Vector128<sbyte> loNibbles = Vector128.BitwiseAnd(str, mask2F);
Vector128<sbyte> hi = Vector128.Shuffle(lutHi, hiNibbles);
Vector128<sbyte> lo = Vector128.Shuffle(lutLo, loNibbles);
Although the lack of an Arm64 version of Shuffle isn't surprising, given there is no obvious matching instruction.
From the top of Base64Decoder.cs:
// AVX2 version based on https://github.com/aklomp/base64/tree/e516d769a2a432c08404f1981e73b431566057be/lib/arch/avx2
// SSSE3 version based on https://github.com/aklomp/base64/tree/e516d769a2a432c08404f1981e73b431566057be/lib/arch/ssse3
The plan is to base the Arm64 version on the Neon version from the same repo: https://github.com/aklomp/base64/blob/e516d769a2a432c08404f1981e73b431566057be/lib/arch/neon64/dec_loop.c
Due to this being based on table lookups instead of shuffles, there is likely to be no sharing between the X86 and Arm64 versions.
> there is likely to be no sharing between the X86 and Arm64 versions.
And that should be fine.
In order to reuse the aklomp algorithm, some API functions are still missing (the LD3/ST3/LD4/ST4 loads/stores and the multi-register TBL/TBX lookups listed above).
These all require sequential registers, which AIUI is quite a big task to implement.
Can we still do the routine without them?
@a74nh - Just FYI, we have hardware intrinsics APIs exposed under the System.Runtime.Intrinsics.Arm.AdvSimd namespace, in case we can use some of them to optimize until we add the missing APIs.
https://source.dot.net/#System.Private.CoreLib/AdvSimd.PlatformNotSupported.cs,f54d7472d19d4e2b
> TableLookup - vqtbl4q_u8 - TBL
> TableLookupExtension - vqtbx4q_u8 - TBX
Can the byte-versions be used instead here? AdvSimd.VectorTableLookup and AdvSimd.Arm64.VectorTableLookup. I see the x64 code uses byte indices.
> Can the byte-versions be used instead here? AdvSimd.VectorTableLookup and AdvSimd.Arm64.VectorTableLookup. I see the x64 code uses byte indices.
Those are the single-register versions, which we can use, but we'll need 4 of them. So, doable but not optimal.
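To make the "4 of them" option concrete, here is a hedged sketch (the helper name is made up; this is not the code from the PR) of emulating a 64-entry, 4-register lookup - what vqtbl4q_u8/TBL with four table registers gives - using the existing single-register AdvSimd.Arm64.VectorTableLookup. Each single-register TBL returns 0 for out-of-range indices, so the partial results can simply be OR-ed together.

using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.Arm;

static Vector128<byte> TableLookup64(
    Vector128<byte> t0, Vector128<byte> t1,
    Vector128<byte> t2, Vector128<byte> t3,
    Vector128<byte> indices)
{
    Vector128<byte> sixteen = Vector128.Create((byte)16);

    // Indices 0..15 hit t0; everything else is out of range and yields 0.
    Vector128<byte> r0 = AdvSimd.Arm64.VectorTableLookup(t0, indices);

    // Subtracting 16 brings the next 16 indices into range; indices that were
    // already below 16 wrap around to large values and again yield 0.
    indices = AdvSimd.Subtract(indices, sixteen);
    Vector128<byte> r1 = AdvSimd.Arm64.VectorTableLookup(t1, indices);

    indices = AdvSimd.Subtract(indices, sixteen);
    Vector128<byte> r2 = AdvSimd.Arm64.VectorTableLookup(t2, indices);

    indices = AdvSimd.Subtract(indices, sixteen);
    Vector128<byte> r3 = AdvSimd.Arm64.VectorTableLookup(t3, indices);

    return AdvSimd.Or(AdvSimd.Or(r0, r1), AdvSimd.Or(r2, r3));
}

The TBX form (vqtbx4q_u8) would additionally need a compare-and-select against the fallback vector for indices >= 64, which is part of why this emulation ends up slower than a real TBL/TBX over sequential registers.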
An implementation of DecodeFromUtf8 without using LD4/ST3/TBX4 is here: https://github.com/dotnet/runtime/pull/70336
Currently getting a 3x slowdown using it. If we can't make it faster, then it's probably best to wait for LD4/ST3/TBX4.
This item tracks the conversion of the System.Buffers code to use Arm64 intrinsics.
Update for .NET 7: Now that we have cross-platform intrinsics APIs (#49397), these optimizations should be completed using those APIs instead of adding ARM64-specific intrinsics code paths. The effort could optionally include first measuring the performance of these methods with the ARM64-specific intrinsics in place, and then measuring again with the cross-platform intrinsics.
Example use of the new cross-platform intrinsics: #63722
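For illustration, the general shape such a cross-platform path takes (a generic sketch assuming .NET 7 APIs, not the Base64 implementation; the helper is made up):

using System;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;

static bool AllAscii(ReadOnlySpan<byte> data)
{
    int i = 0;

    // The same Vector128 code path serves x64 and Arm64; fall back to the
    // scalar loop when 128-bit vectors are not hardware accelerated.
    if (Vector128.IsHardwareAccelerated && data.Length >= Vector128<byte>.Count)
    {
        ref byte start = ref MemoryMarshal.GetReference(data);
        Vector128<byte> highBit = Vector128.Create((byte)0x80);

        for (; i <= data.Length - Vector128<byte>.Count; i += Vector128<byte>.Count)
        {
            Vector128<byte> v = Vector128.LoadUnsafe(ref start, (nuint)i);
            if ((v & highBit) != Vector128<byte>.Zero)
                return false;
        }
    }

    for (; i < data.Length; i++)
    {
        if (data[i] >= 0x80)
            return false;
    }

    return true;
}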
Related: #33308