dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License

HW Intrinsics CG unnecessary `vmovaps` when calling `Avx2.GatherMaskVector256` #12945

Closed nietras closed 6 months ago

nietras commented 5 years ago

Playing with Intrinsics.X86 on .NET Core 3.0 Preview 6, I am implementing a simple LUT (lookup table) in AVX2 using gather to see how well it performs. E.g. in scalar code:

for (int col = 0; col < cols; col++)
{
    dstPtr[col] = lut[srcPtr[col]];
}
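
(For context, a hedged sketch of the LUT setup both versions assume. lut is the byte lookup table; intLutPtr, used by the vectorized version below, is assumed to be an int-widened copy of it, since the dword gather fetches 4 bytes per lane:)

// Hypothetical setup matching the names used in this issue: widen the
// 256-entry byte LUT to int so the AVX2 dword gather can index it.
var intLut = new int[256];
for (int i = 0; i < intLut.Length; i++) { intLut[i] = lut[i]; }
fixed (int* intLutPtr = intLut)
{
    // ... gather-based loop below ...
}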

And the Avx2 vectorized version:

if (Avx2.IsSupported)
{
    var mask = new Vector256<int>();
    mask = Avx2.CompareEqual(mask, mask);
    for (; (col + Vector128<byte>.Count) <= cols; col += Vector128<byte>.Count)
    {
        var srcVec128Byte = Unsafe.ReadUnaligned<Vector128<byte>>(srcRowPtr + col);

        var srcVec256Short = Avx2.ConvertToVector256Int16(srcVec128Byte);

        var srcVec128Short0 = srcVec256Short.GetLower();
        var srcVec128Short1 = srcVec256Short.GetUpper();
        var srcVec256Int0 = Avx2.ConvertToVector256Int32(srcVec128Short0);
        var srcVec256Int1 = Avx2.ConvertToVector256Int32(srcVec128Short1);

        // https://software.intel.com/sites/landingpage/IntrinsicsGuide/#expand=6017&text=_mm256_mask_i32gather_epi32
        // Below generates an unnecessary vmovaps for each gather:
        //vmovaps ymm3,ymm0
        //vmovaps ymm4,ymm2
        //vpgatherdd ymm4,dword ptr [rbx+ymm2*4],ymm3
        var gathered256Int0 = Avx2.GatherMaskVector256(srcVec256Int0, intLutPtr, srcVec256Int0, mask, 4);
        var gathered256Int1 = Avx2.GatherMaskVector256(srcVec256Int1, intLutPtr, srcVec256Int1, mask, 4);

        var packed256Short = Avx2.PackUnsignedSaturate(gathered256Int0, gathered256Int1);
        var permuted256Short = Avx2.Permute4x64(packed256Short.AsUInt64(), 0xD8).AsInt16();

        // Need https://software.intel.com/sites/landingpage/IntrinsicsGuide/#expand=6017,2978,2033&text=_mm256_cvtusepi16_epi8
        // but not available
        var gathered256Byte = Avx2.PackUnsignedSaturate(permuted256Short, permuted256Short);
        var permuted256Byte = Avx2.Permute4x64(gathered256Byte.AsUInt64(), 0xD8).AsByte();

        var dstVec = permuted256Byte.GetLower();

        Unsafe.WriteUnaligned(dstRowPtr + col, dstVec);
    }
}

This generates the following assembly:

                           int col = 0;
                           ^^^^^^^^^^^^
M01_L11:
       xor     r8d,r8d
                               var mask = new Vector256<int>();
                               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
       vxorps  ymm0,ymm0,ymm0
                               mask = Avx2.CompareEqual(mask, mask);
                               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
       vpcmpeqd ymm0,ymm0,ymm0
       jmp     M01_L13
M01_L12:
       movsxd  r9,r8d
       shl     r9,1
       vmovupd ymm1,ymmword ptr [rsi+r9]
                                   var srcVec128Short0 = srcVec256Short.GetLower();
                                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
       vmovaps ymm2,ymm1
                                   var srcVec128Short1 = srcVec256Short.GetUpper();
                                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
       vextracti128 xmm1,ymm1,1
                                   var srcVec256Int0 = Avx2.ConvertToVector256Int32(srcVec128Short0);
                                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
       vpmovzxwd ymm2,xmm2
                                   var srcVec256Int1 = Avx2.ConvertToVector256Int32(srcVec128Short1);
                                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
       vpmovzxwd ymm1,xmm1
                                   var gathered256Int0 = Avx2.GatherMaskVector256(mask, intLutPtr, srcVec256Int0, mask, 4);
                                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
       vmovaps ymm3,ymm0
       vmovaps ymm4,ymm0
       vpgatherdd ymm4,dword ptr [rbx+ymm2*4],ymm3
                                   var gathered256Int1 = Avx2.GatherMaskVector256(mask, intLutPtr, srcVec256Int1, mask, 4);
                                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
       vmovaps ymm2,ymm0
       vmovaps ymm3,ymm0
       vpgatherdd ymm3,dword ptr [rbx+ymm1*4],ymm2
                                   var packed256Short = Avx2.PackUnsignedSaturate(gathered256Int0, gathered256Int1);
                                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
       vpackusdw ymm1,ymm4,ymm3
                                   var permuted256Short = Avx2.Permute4x64(packed256Short.AsUInt64(), 0xD8).AsUInt16();
                                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
       vpermq  ymm1,ymm1,0D8h
                                   Unsafe.WriteUnaligned(dstRowPtr + col, permuted256Short);
                                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
       vmovupd ymmword ptr [rax+r9],ymm1
       add     r8d,10h

The vmovaps seem unnecessary.

       vmovaps ymm2,ymm0
       vmovaps ymm3,ymm0

I do get a 2x speedup on Coffee Lake from this; I just noticed the extra vmovaps.

CC: @tannergooding

category:cq theme:register-allocator skill-level:expert cost:medium

tannergooding commented 5 years ago

CC. @CarolEidt as well.

mikedn commented 5 years ago

It looks to me that those moves are required because gather instructions update the mask operand (the last one).
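
(For reference, a rough scalar model of what the gather instructions do architecturally; the method and parameter names here are illustrative only:)

// Scalar sketch of VPGATHERDD dst, [base + index*scale], mask.
// Lanes whose mask sign bit is clear keep dst's previous value, and the
// instruction zeroes the whole mask operand when it completes, so a mask
// that is still live afterwards must be copied into a temp first.
static unsafe void GatherDwords(int* dst, byte* basePtr, int* index, int* mask, int scale)
{
    for (int i = 0; i < 8; i++)
    {
        if (mask[i] < 0) // sign bit set: gather this lane
            dst[i] = *(int*)(basePtr + (long)index[i] * scale);
        mask[i] = 0;     // the mask operand is destroyed (zeroed)
    }
}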

nietras commented 5 years ago

@mikedn ah yes, it zeroes the mask. Can I avoid the moves by doing the compare-equal just after each gather then? Or before, of course.

mikedn commented 5 years ago

can I avoid the moves by doing the compare equal just after each gather then? Or before of course.

I don't see what you could gain from that. Moves are cheap; compares are not.

tannergooding commented 5 years ago

What might be improvable is that we currently always copy mask into a temporary register: https://github.com/dotnet/coreclr/blob/master/src/jit/hwintrinsiccodegenxarch.cpp#L2025

We could probably elide that move in the case where mask is last use and therefore it doesn't matter if it is trashed.

The other move can already be elided when targetReg == op1Reg.

nietras commented 5 years ago

don't see what you could gain from that. moves are cheap, compares are not.

@mikedn sure, good point. But I'm still not sure why there are two moves? (Note: I made a mistake copying the assembly; it doesn't exactly match the code in the top comment.)

vmovaps ymm3, ymm0
vmovaps ymm4, ymm2
vpgatherdd ymm4, dword ptr[rbx + ymm2 * 4],ymm3

ymm3 is the mask and ymm2 is the source; why the vmovaps ymm4, ymm2 when ymm4 will be overwritten anyway? Perhaps I am missing something... 😅

Here is the correct asm (before, I was passing mask in two places):

                     int col = 0;
                     ^^^^^^^^^^^^
M01_L11:
 xor     ecx,ecx
                         var mask = new Vector256<int>();
                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 vxorps  ymm0,ymm0,ymm0
                         mask = Avx2.CompareEqual(mask, mask);
                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 vpcmpeqd ymm0,ymm0,ymm0
 jmp     M01_L13
M01_L12:
 movsxd  r8,ecx
 vmovupd xmm1,xmmword ptr [rsi+r8]
                             var srcVec256Short = Avx2.ConvertToVector256Int16(srcVec128Byte);
                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 vpmovzxbw ymm1,xmm1
                             var srcVec128Short0 = srcVec256Short.GetLower();
                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 vmovaps ymm2,ymm1
                             var srcVec128Short1 = srcVec256Short.GetUpper();
                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 vextracti128 xmm1,ymm1,1
                             var srcVec256Int0 = Avx2.ConvertToVector256Int32(srcVec128Short0);
                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 vpmovsxwd ymm2,xmm2
                             var srcVec256Int1 = Avx2.ConvertToVector256Int32(srcVec128Short1);
                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 vpmovsxwd ymm1,xmm1
                             var gathered256Int0 = Avx2.GatherMaskVector256(srcVec256Int0, intLutPtr, srcVec256Int0, mask, 4);
                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 vmovaps ymm3,ymm0
 vmovaps ymm4,ymm2
 vpgatherdd ymm4,dword ptr [rbx+ymm2*4],ymm3
                             var gathered256Int1 = Avx2.GatherMaskVector256(srcVec256Int1, intLutPtr, srcVec256Int1, mask, 4);
                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 vmovaps ymm2,ymm0
 vmovaps ymm3,ymm1
 vpgatherdd ymm3,dword ptr [rbx+ymm1*4],ymm2
                             var packed256Short = Avx2.PackUnsignedSaturate(gathered256Int0, gathered256Int1);
                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 vpackusdw ymm1,ymm4,ymm3
                             var permuted256Short = Avx2.Permute4x64(packed256Short.AsUInt64(), 0xD8).AsInt16();
                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 vpermq  ymm1,ymm1,0D8h
                             var gathered256Byte = Avx2.PackUnsignedSaturate(permuted256Short, permuted256Short);
                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 vpackuswb ymm1,ymm1,ymm1
                             var permuted256Byte = Avx2.Permute4x64(gathered256Byte.AsUInt64(), 0xD8).AsByte();
                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 vpermq  ymm1,ymm1,0D8h
                             Unsafe.WriteUnaligned(dstRowPtr + col, dstVec);
                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 movsxd  r8,ecx
 vmovupd xmmword ptr [rax+r8],xmm1
 add     ecx,10h

nietras commented 5 years ago

might be improvable is that we currently always copy mask into a temporary register

@tannergooding sorry, this is in a loop, so the mask copy makes sense even on the last gather in the body, doesn't it?

other move can already be elided when targetReg == op1Reg

Can you expand on this? The below does not help:

srcVec256Int0 = Avx2.GatherMaskVector256(srcVec256Int0, intLutPtr, srcVec256Int0, mask, 4);

tannergooding commented 5 years ago

It depends on what the register allocator decides.

Basically the codegen for the 5 operand overload has 3 steps:

1. It copies mask into a temporary register; currently this always happens, since the instruction trashes the mask operand.
2. It conditionally emits a movaps to ensure that the target register has the correct state for the instruction (this only happens if targetReg and op1Reg (the source parameter) aren't the same).
3. It emits the gather instruction itself.

The first step could be elided if we knew that mask didn't need to be preserved (it is lastUse of that value). The second step will be elided if the register allocator decides that targetReg and op1Reg can be the same (@CarolEidt would need to comment on if there is something better we can do here).
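
(Mapping these steps onto the asm above: vmovaps ymm3,ymm0 is step 1, the mask copy; vmovaps ymm4,ymm2 is step 2, copying op1 into targetReg; and the vpgatherdd itself is step 3, which also zeroes ymm3.)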

nietras commented 5 years ago

It conditionally emits a movaps to ensure that the target register has the correct state for the instruction (this only happens if targetReg and op1Reg (the source parameter) aren't the same)

@tannergooding thanks for the explanation. Just to be sure I understand, so the second vmovaps should be possible to get rid of? Yet when targetReg == op1Reg this generates extra vmovaps, e.g.:

                                   srcVec256Int0 = Avx2.GatherMaskVector256(srcVec256Int0, intLutPtr, srcVec256Int0, mask, 4);
                                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
       vmovaps ymm3,ymm0
       vmovaps ymm4,ymm2
       vpgatherdd ymm4,dword ptr [rbx+ymm2*4],ymm3
       vmovaps ymm2,ymm4
                                   srcVec256Int1 = Avx2.GatherMaskVector256(srcVec256Int1, intLutPtr, srcVec256Int1, mask, 4);
                                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
       vmovaps ymm3,ymm0
       vmovaps ymm4,ymm1
       vpgatherdd ymm4,dword ptr [rbx+ymm1*4],ymm3
       vmovaps ymm1,ymm4

The problem of course is perhaps that the same register is not only used for op1Reg but also for op3Reg?

nietras commented 5 years ago

problem of course is perhaps that the same register is not only used for op1Reg but also for op3Reg?

From the link to the code gen code, this doesn't seem to be an issue, if I understand it correctly.

nietras commented 5 years ago

the codegen for the 5 operand overload

@tannergooding right, I am perhaps using the wrong overload here, with the mask being all ones in this initial code. I wanted to see the code gen with mask support, though.

Using the 3 operand overload, this falls back to vpcmpeqd as I would have assumed:

     var gathered256Int0 = Avx2.GatherVector256(intLutPtr, srcVec256Int0, 4);
     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 vpcmpeqd ymm2,ymm2,ymm2
 vpgatherdd ymm3,dword ptr [rbx+ymm1*4],ymm2
     var gathered256Int1 = Avx2.GatherVector256(intLutPtr, srcVec256Int1, 4);
     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 vpcmpeqd ymm1,ymm1,ymm1
 vpgatherdd ymm2,dword ptr [rbx+ymm0*4],ymm1

and no extra vmovaps.

tannergooding commented 5 years ago

so the second vmovaps should be possible to get rid of?

Yes, should be possible. And some trivial (but not at all real-world) samples show that it will be (namely, I can see it elided in some non-optimized code).

CarolEidt commented 5 years ago

The first step could be elided if we knew that mask didn't need to be preserved (it is lastUse of that value)

It would be nice to be able to preference the internal temp register to the incoming mask value, but the liveness model of the register allocator doesn't allow that. It models the internal registers as being defined prior to the end of the live range of the incoming values. There may be a better way to approach it, but I can't think of it off the top of my head. Even if it's a last use, the code generator would then have to ensure that it doesn't conflict with the target.

The second step will be elided if the register allocator decides that targetReg and op1Reg can be the same

In this case, we should be able to preference targetReg to op1Reg. This means that this line: https://github.com/dotnet/coreclr/blob/master/src/jit/lsraxarch.cpp#L2651 and this line: https://github.com/dotnet/coreclr/blob/master/src/jit/lsraxarch.cpp#L2667 would have to do something like this:

    if (op1->isContained())
    {
        srcCount += BuildOperandUses(op1);
    }
    else
    {
        tgtPrefUse = BuildUse(op1);
        srcCount++;
    }
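
(Presumably, with tgtPrefUse set on the op1 use like this, LSRA would try to give op1 and the node's definition the same register, letting codegen skip the movaps from op1Reg into targetReg.)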

CarolEidt commented 5 years ago

An alternative approach to dealing with the mask register would be to define it as a second target, and to preference it to the incoming mask. That said, multi-reg instructions are still somewhat problematic in the JIT, and it may require special handling because it will generally (always) be an unused value.

nietras commented 5 years ago

@tannergooding @CarolEidt feel free to close this issue; only one of the two movaps was extra, the other was my bad, and overall they have little perf consequence in my case. I just noticed them and thought I'd ask. 😃

tannergooding commented 5 years ago

I was planning on keeping it open since there are some potential improvements to be made here, even if minor.

It can always be marked up for grabs and someone interested could experiment.

nietras commented 5 years ago

I love the new intrinsics btw, they work great 👍

CarolEidt commented 5 years ago

Thanks @nietras !

I agree with @tannergooding that it's worth keeping this open to capture the preferencing improvement opportunities.

nietras commented 5 years ago

worth keeping

Ok, I could fairly quickly create a BenchmarkDotNet benchmark with the code in question, so let me know if this is needed.

nietras commented 5 years ago

@tannergooding a quick question: is VPMOVZXBD not available? In fact, is the whole family of zero-extending moves not available?

tannergooding commented 5 years ago

Vector128<int> Sse41.ConvertToVector128Int32(Vector128<byte>) emits PMOVZXBD xmm, xmm. Use Vector128<int> Sse41.ConvertToVector128Int32(byte*) for the overload that deals with addresses.

tannergooding commented 5 years ago

(or the same but Avx2.ConvertToVector256Int32 for the overloads that deal with ymm)
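
For example, a minimal sketch contrasting the two overload shapes; srcVec128Byte and srcBytePtr are placeholder locals:

// register form: zero-extends the low 8 bytes of an xmm into 8 dword lanes (VPMOVZXBD ymm, xmm)
Vector256<int> widenedFromVector = Avx2.ConvertToVector256Int32(srcVec128Byte);
// address form: the 8-byte load is folded into the widening move
Vector256<int> widenedFromMemory = Avx2.ConvertToVector256Int32(srcBytePtr);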

CarolEidt commented 5 years ago

FWIW when I can't remember where an intrinsic is declared and/or what it's called, I just grep for the instruction name under the coreclr\src\System.Private.CoreLib\shared\System\Runtime\Intrinsics directory.

nietras commented 5 years ago

Avx2.ConvertToVector256Int32 for the overloads that deal with ymm

@tannergooding thanks!

just grep for the instruction name

@CarolEidt that's actually what I kind of tried, by going to definition for Avx2 and searching there. It doesn't always work though, since not all overloads have the necessary <summary>; e.g. I cannot find _mm256_cvtepu8_epi32. It would help discoverability if all overloads had a summary. I understand that is probably a lot of work, though.

Ideally, the summary would contain both the "closest" native intrinsic name (if any) and the raw instruction name, like they do now. E.g. below, Vector256<int> ConvertToVector256Int32(Vector128<short> value) has a summary, but the others do not, so when searching in the go-to-definition file you can't find exactly what you are looking for; and since I was unsure about the naming, I apparently got a little lost. Generally, the names are very good: they make sense, and sometimes the overloads are better than what the native intrinsics provide, I think. 👍

        //
        // Summary:
        //     __m256i _mm256_cvtepi16_epi32 (__m128i a) VPMOVSXWD ymm, xmm/m128
        public static Vector256<int> ConvertToVector256Int32(Vector128<short> value);
        public static Vector256<int> ConvertToVector256Int32(Vector128<byte> value);
        public static Vector256<int> ConvertToVector256Int32(byte* address);

tannergooding commented 5 years ago

@nietras, they all should have the native intrinsic name and the corresponding native instruction as a minimum.

For example, _mm256_cvtepu8_epi32 is right here: https://source.dot.net/#System.Private.CoreLib/shared/System/Runtime/Intrinsics/X86/Avx2.PlatformNotSupported.cs,690

tannergooding commented 5 years ago

(the equivalent doc comment is also in Avx2.cs, but it doesn't appear to be indexed on source.dot.net right now)

nietras commented 5 years ago

equivalent doc comment is also in Avx2

@tannergooding weird, have you tried go to definition with .NET Core 3.0 Preview 6? It doesn't show up, as my copy-paste from the metadata shows...

tannergooding commented 5 years ago

have you tried to go to definition with .NET Core 3.0 Preview 6

I imagine that has something to do with the reference assemblies and intellisense documentation being out of sync.

CC. @carlossanlop

nietras commented 5 years ago

@tannergooding I am getting a weird NullReferenceException in the following code:

        [TestMethod]
        public unsafe void NullReferenceException()
        {
            var ptr = stackalloc byte[32 * Vector256<int>.Count];
            var vec = Avx2.ConvertToVector256Int32(ptr + 0 * Vector256<int>.Count); // throws
        }

but the following code does not throw:

        [TestMethod]
        public unsafe void NotNullReferenceException()
        {
            var ptr = stackalloc byte[32 * Vector256<int>.Count];
            var ptr0 = ptr + 0 * Vector256<int>.Count;
            var vec = Avx2.ConvertToVector256Int32(ptr0);
        }

saucecontrol commented 5 years ago

@nietras that issue is fixed in master https://github.com/dotnet/coreclr/pull/25135

dotnet-policy-service[bot] commented 7 months ago

Due to lack of recent activity, this issue has been marked as a candidate for backlog cleanup. It will be closed if no further activity occurs within 14 more days. Any new comment (by anyone, not necessarily the author) will undo this process.

This process is part of our issue cleanup automation.

dotnet-policy-service[bot] commented 6 months ago

This issue will now be closed since it had been marked no-recent-activity but received no further activity in the past 14 days. It is still possible to reopen or comment on the issue, but please note that the issue will be locked if it remains inactive for another 30 days.