Optimize System.Buffers for arm64 using cross-platform intrinsics

BruceForstall commented 4 years ago

This item tracks the conversion of the System.Buffers class to use arm64 intrinsics.

[x] System.Buffers.Text.Base64.DecodeFromUtf8()
[ ] System.Buffers.Text.Base64.EncodeToUtf8()

Update for .NET 7 Now that we have cross-platform intrinsics APIs (#49397), these optimizations should be completed using those APIs instead of adding ARM64-specific intrinsics code paths. The effort could optionally include first measuring performance of these methods with the ARM64-specific intrinsics in place, and then measuring the performance of these methods with the cross-platform intrinsics.

Example use of the new cross-platform intrinsics: #63722

Related: #33308

ghost commented 4 years ago

Tagging subscribers to this area: @tannergooding Notify danmosemsft if you want to be subscribed.

john-h-k commented 4 years ago

I can give this a shot

john-h-k commented 4 years ago

These ARM64 intrinsics are missing to make the code "worthwhile":

~~Tracked by #32512:~~ Done!

UnsignedSaturatingSubtract - vqsubq_u8 - UQSUB

Tracked by #33398:

UnsignedShiftRight - vshrq_n_u8 - USHR
UnsignedShiftLeft - vshlq_n_u8 - USHL

~~Tracked by #1277:~~ Done!

TableLookup - vqtbl4q_u8 - TBL
TableLookupExtension - vqtbx4q_u8 - TBX

These helper methods could be accelerated to help it and are tracked in #33496:

Vector128.Create(byte)

These intrinsics could be introduced to maximise perf:

Load 3 Vector128<T> - vld3q_u8 - LD3
Store 3 Vector128<T> - vst3q_u8 - ST3
Load 4 Vector128<T> - vld4q_u8 - LD4
Store 4 Vector128<T> - vst4q_u8 - ST4

john-h-k commented 4 years ago

https://github.com/john-h-k/runtime/commit/adeea3662d12c2cc06a241736c34cb6e4f22f446

Draft commit, most that can be done until necessary intrinsics are added

tannergooding commented 4 years ago

Thanks @john-h-k, as per our conversation on Discord, I'll ping you once the necessary intrinsics are added so you can make further progress.

BruceForstall commented 4 years ago

@tannergooding @john-h-k The shift intrinsics went in with https://github.com/dotnet/runtime/pull/36830. There have been many improvements to System.Runtime.Intrinsics optimization in https://github.com/dotnet/runtime/issues/33496. Is this work now unblocked?

tannergooding commented 4 years ago

Yes, this should be unblocked now.

BruceForstall commented 4 years ago

@john-h-k Are you planning to try and finish this for .NET 5, now that it's unblocked?

tannergooding commented 4 years ago

If not, we'll have 3-4 people free to help tackle the remaining issues from https://github.com/dotnet/runtime/issues/33308 on Monday.

CC. @jeffhandley

john-h-k commented 4 years ago

Yes, I should be. I just need to benchmark the 2 possible impls I've got locally, either on my old Pi2b or I can get a friend to do it on newer device

BruceForstall commented 4 years ago

@tannergooding @jeffhandley What's the plan for this for .NET 5?

jeffhandley commented 4 years ago

@tannergooding is planning to carry this one over the finish line ahead of RC1.

jeffhandley commented 4 years ago

We could not get this completed in time for 5.0.0; moved to 6.0.0.

kunalspathak commented 2 years ago

@a74nh - I am assigning this to you. Feel free to reassign to someone who will work on this.

Edit: For some reason, I am not able to assign this.

gfoidl commented 2 years ago

Base64 relies heavily on shuffling, so shuffle intrinsics are a prerequisite for this issue. Is shuffling a xplat-intrinsics now? At least on .NET 7 Preview 2 I can't find it.

tannergooding commented 2 years ago

Is shuffling a xplat-intrinsics now? At least on .NET 7 Preview 2 I can't find it.

I'm actively working on the support locally. Hoping to get it done soon but needing to balance that work against the Generic Math work.

kunalspathak commented 2 years ago

I'm actively working on the support locally.

https://github.com/dotnet/runtime/issues/63331

a74nh commented 2 years ago

@a74nh - I am assigning this to you. Feel free to reassign to someone who will work on this.

Edit: For some reason, I am not able to assign this.

I can't assign this either. It should go to @SwapnilGaikwad instead of me.

EgorBo commented 2 years ago

Base64 relies heavily on shuffling, so shuffle intrinsics are a prerequisite for this issue. Is shuffling a xplat-intrinsics now? At least on .NET 7 Preview 2 I can't find it.

in https://github.com/dotnet/runtime/pull/67192 for hex encoding I introduced a temp helper till https://github.com/dotnet/runtime/issues/63331 is landed 🙂

gfoidl commented 2 years ago

IMO it's better to wait for https://github.com/dotnet/runtime/issues/63331 Otherwise the code has to be touched again to revert the helper-change.

I'm also going to update https://github.com/gfoidl/Base64, and there by a scheduled CI-trigger fuzz-tests are run, so any fault should get caught early enough for .NET 7 RTM.

kunalspathak commented 2 years ago

Looks like Shuffle has landed in https://github.com/dotnet/runtime/pull/68559 so this can get started? Note, as called out in #67339 - lot of Base64Encode benchmarks like System.Buffers.Text.Tests.Base64Tests.Base64Encode(NumberOfBytes: 1000) are up to 16 times slower. So we should fix this sooner to make sure it lands in .NET 7.

@a74nh , @SwapnilGaikwad

tannergooding commented 2 years ago

Noting that only the Shuffle(x, mask) API has landed and it still has a few quirks, but should be generally usable especially with constant inputs.

The Shuffle(x, y, mask) API is going to need https://github.com/dotnet/runtime/pull/68874 (which needs a couple bits of PR feedback handled and a final review).

a74nh commented 2 years ago

We are looking at this....

With a recent HEAD, took the ss2 C# code and (mostly) updated it to use Vector128. On Arm64, most things generated into neon code ... but for Vector128.Shuffle() it generated to a function call:

Annotated IR dump of part of the function:

store str
IN0024: 0000B0                    str     q0, [fp,#144] // [V17 loc10]

Load str
IN0025: 0000B4                    ldr     q0, [fp,#144] // [V17 loc10]

Vector128.ShiftRightLogical(str.AsInt32(), 4).AsSByte(),
IN0026: 0000B8                    ushr    v0.4s, v0.4s, #4

Load mask2F
IN0027: 0000BC                    ldr     q1, [fp,#224] // [V11 loc4]

Vector128<sbyte> hiNibbles = Vector128.BitwiseAnd
IN0028: 0000C0                    and     v0.16b, v0.16b, v1.16b

Store back to hiNibbles
IN0029: 0000C4                    str     q0, [fp,#128] // [V18 loc11]

Load str (again!)
IN002a: 0000C8                    ldr     q0, [fp,#144] // [V17 loc10]

Load mask2F (again!)
IN002b: 0000CC                    ldr     q1, [fp,#224] // [V11 loc4]

Vector128<sbyte> loNibbles = Vector128.BitwiseAnd(str, mask2F);
IN002c: 0000D0                    and     v0.16b, v0.16b, v1.16b

Store loNibbles
IN002d: 0000D4                    str     q0, [fp,#112] // [V19 loc12]

Load lutHi
IN002e: 0000D8                    ldr     q0, [fp,#288] // [V07 loc0]

Load hiNibbles
IN002f: 0000DC                    ldr     q1, [fp,#128] // [V18 loc11]

Function call to shuffle
IN0030: 0000E0                    movz    x0, #0xb7c8
IN0031: 0000E4                    movk    x0, #0x7e74 LSL #16
IN0032: 0000E8                    movk    x0, #0xffff LSL #32
IN0033: 0000EC                    ldr     x0, [x0]
IN0034: 0000F0                    blr     x0
IN0035: 0000F4                    str     q0, [fp,#96]  // [V20 loc13]

Load lutLo
IN0036: 0000F8                    ldr     q0, [fp,#272] // [V08 loc1]

Load loNibbles
IN0037: 0000FC                    ldr     q1, [fp,#112] // [V19 loc12]

Function call to shuffle
IN0038: 000100                    movz    x0, #0xb7c8
IN0039: 000104                    movk    x0, #0x7e74 LSL #16
IN003a: 000108                    movk    x0, #0xffff LSL #32
IN003b: 00010C                    ldr     x0, [x0]
IN003c: 000110                    blr     x0
IN003d: 000114                    str     q0, [fp,#80]  // [V21 loc14]

a74nh commented 2 years ago

The above dump was using public static Vector128 Shuffle(Vector128 vector, Vector128 indices)

tannergooding commented 2 years ago

On Arm64, most things generated into neon code ... but for Vector128.Shuffle() it generated to a function call

Can you share the managed code that was being used here, preferably including the Shuffle call and how its arguments are being passed in?

a74nh commented 2 years ago

This code was made by taking Ssse3Decode(), replacing Sse2 with Vector128, then fixing up the function calls. Quite possible there's an obvious mistake on my end.

        // The JIT won't hoist these "constants", so help it
        Vector128<sbyte> lutHi = Vector128.Create(
            0x10, 0x10, 0x01, 0x02,
            0x04, 0x08, 0x04, 0x08,
            0x10, 0x10, 0x10, 0x10,
            0x10, 0x10, 0x10, 0x10);

        Vector128<sbyte> lutLo = Vector128.Create(
            0x15, 0x11, 0x11, 0x11,
            0x11, 0x11, 0x11, 0x11,
            0x11, 0x11, 0x13, 0x1A,
            0x1B, 0x1B, 0x1B, 0x1A);

        Vector128<sbyte> lutShift = Vector128.Create(
            0, 16, 19, 4,
            -65, -65, -71, -71,
            0, 0, 0, 0,
            0, 0, 0, 0);

        Vector128<sbyte> packBytesMask = Vector128.Create(
            2, 1, 0, 6,
            5, 4, 10, 9,
            8, 14, 13, 12,
            -1, -1, -1, -1);

        Vector128<sbyte> mask2F = Vector128.Create((sbyte)'/');
        Vector128<sbyte> mergeConstant0 = Vector128.Create(0x01400140).AsSByte();
        Vector128<short> mergeConstant1 = Vector128.Create(0x00011000).AsInt16();
        Vector128<sbyte> zero = Vector128<sbyte>.Zero;

        byte* src =  srcBytes;
        byte* dest = destBytes;

            Vector128<sbyte> str = Vector128.LoadUnsafe(ref *src).AsSByte();

            // lookup
            Vector128<sbyte> hiNibbles = Vector128.BitwiseAnd(Vector128.ShiftRightLogical(str.AsInt32(), 4).AsSByte(), mask2F);
            Vector128<sbyte> loNibbles = Vector128.BitwiseAnd(str, mask2F);
            Vector128<sbyte> hi = Vector128.Shuffle(lutHi, hiNibbles);
            Vector128<sbyte> lo = Vector128.Shuffle(lutLo, loNibbles);

a74nh commented 2 years ago

Although - Lack of an Arm64 version of shuffle isn't surprising given there is not obvious match to the instruction

a74nh commented 2 years ago

From the top of Base64Decoder.cs:

// AVX2 version based on https://github.com/aklomp/base64/tree/e516d769a2a432c08404f1981e73b431566057be/lib/arch/avx2 // SSSE3 version based on https://github.com/aklomp/base64/tree/e516d769a2a432c08404f1981e73b431566057be/lib/arch/ssse3

Plan is to base the Arm64 version on the Neon version from the same repo: https://github.com/aklomp/base64/blob/e516d769a2a432c08404f1981e73b431566057be/lib/arch/neon64/dec_loop.c

Due to this being based on table looks instead of shuffles, there is likely to be no sharing between the X86 and Arm64 version.

kunalspathak commented 2 years ago

there is likely to be no sharing between the X86 and Arm64 version.

And that should be fine.

a74nh commented 2 years ago

In order to reuse the aklomp algorithm, there are still missing API functions:

TableLookup - vqtbl4q_u8 - TBL
TableLookupExtension - vqtbx4q_u8 - TBX
- Implementation proposed in https://github.com/dotnet/runtime/issues/1277
Store 3 Vector128 - vst3q_u8 - ST3
Load 4 Vector128 - vld4q_u8 - LD4

These all require sequential registers - which AIUI is quite a big task to implement.

Can we still do the routine without them?

The tbl4q instructions can be broken down to 4 tbx instructions + some extra indexing.
the LD4 and ST3 could be replaced by multiple LD and ST instructions. But would then require manual interleaving. I think that can be done using zip/unzp.
Or, it's possible may be able to remove the interleaving requirement from the algorithm.
These changes would make the function non-optimal, but it should still be faster than the sequential version.

kunalspathak commented 2 years ago

@a74nh - Just FYI, we have hardware intrinsics APIs exposed under System.Runtime.Intrinsics.Arm.AdvSimd namespace, in case we can use some of them to optimize and until we add the missing APIs.

https://source.dot.net/#System.Private.CoreLib/AdvSimd.PlatformNotSupported.cs,f54d7472d19d4e2b

EgorBo commented 2 years ago

TableLookup - vqtbl4q_u8 - TBL TableLookupExtension - vqtbx4q_u8 - TBX

can the byte-versions be used instead here? AdvSimd.VectorTableLookup and AdvSimd.Arm64.VectorTableLookup

I see the x64 code uses byte indices

a74nh commented 2 years ago

can the byte-versions be used instead here? AdvSimd.VectorTableLookup and AdvSimd.Arm64.VectorTableLookup

I see the x64 code uses byte indices

Those are the single register version - which we can use but we'll need to use 4 of them. So, doable but not optimal.

a74nh commented 2 years ago

An implementation of DecodeFromUtf8 without using LD4/ST3/TBX4 is here: https://github.com/dotnet/runtime/pull/70336

Currently getting 3x slowdown using it. If can't make it faster, then probably best waiting for LD4/ST3/TBX4

kunalspathak commented 2 years ago

Closed by https://github.com/dotnet/runtime/pull/70654

dotnet / runtime

Optimize System.Buffers for arm64 using cross-platform intrinsics #35033