CC. @CarolEidt as well.
It looks to me that those moves are required because gather instructions update the mask operand (the last one).
@mikedn ah yes, it zeroes the mask. Can I avoid the moves by doing the compare equal just after each gather then? Or before, of course.
Can I avoid the moves by doing the compare equal just after each gather then? Or before, of course.
I don't see what you could gain from that. Moves are cheap, compares are not.
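To make the trade-off concrete, here is a minimal sketch of the two alternatives (`lutPtr`, `idx0` and `idx1` are hypothetical stand-ins, not code from this issue):

using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

static unsafe void TwoGathers(int* lutPtr, Vector256<int> idx0, Vector256<int> idx1)
{
    // Reuse one all-ones mask: vpgatherdd zeroes its mask operand, so the
    // JIT copies the mask (vmovaps) before each gather to keep it alive.
    var mask = Avx2.CompareEqual(Vector256<int>.Zero, Vector256<int>.Zero);
    var a0 = Avx2.GatherMaskVector256(idx0, lutPtr, idx0, mask, 4);
    var a1 = Avx2.GatherMaskVector256(idx1, lutPtr, idx1, mask, 4);

    // Regenerate the mask before each gather instead: this swaps each cheap
    // vmovaps for a vpcmpeqd, which is no improvement.
    var m0 = Avx2.CompareEqual(Vector256<int>.Zero, Vector256<int>.Zero);
    var b0 = Avx2.GatherMaskVector256(idx0, lutPtr, idx0, m0, 4);
    var m1 = Avx2.CompareEqual(Vector256<int>.Zero, Vector256<int>.Zero);
    var b1 = Avx2.GatherMaskVector256(idx1, lutPtr, idx1, m1, 4);
}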
What might be improvable is that we currently always copy `mask` into a temporary register: https://github.com/dotnet/coreclr/blob/master/src/jit/hwintrinsiccodegenxarch.cpp#L2025
We could probably elide that move in the case where `mask` is last use, and it therefore doesn't matter if it is trashed.
The other move can already be elided when `targetReg == op1Reg`.
don't see what you could gain from that. Moves are cheap, compares are not.
@mikedn sure, good point. But I'm still not sure why there are two moves? (Note: I made a mistake copying the assembly; it doesn't match the code in the top comment exactly.)
vmovaps ymm3, ymm0
vmovaps ymm4, ymm2
vpgatherdd ymm4, dword ptr [rbx + ymm2 * 4], ymm3

`ymm3` is the mask, `ymm2` is the source; why the `vmovaps ymm4, ymm2` when `ymm4` will be overwritten? Perhaps I am missing something... 😅
Here is the correct asm (before it was using `mask` in two places):
int col = 0;
^^^^^^^^^^^^
M01_L11:
xor ecx,ecx
var mask = new Vector256<int>();
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
vxorps ymm0,ymm0,ymm0
mask = Avx2.CompareEqual(mask, mask);
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
vpcmpeqd ymm0,ymm0,ymm0
jmp M01_L13
M01_L12:
movsxd r8,ecx
vmovupd xmm1,xmmword ptr [rsi+r8]
var srcVec256Short = Avx2.ConvertToVector256Int16(srcVec128Byte);
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
vpmovzxbw ymm1,xmm1
var srcVec128Short0 = srcVec256Short.GetLower();
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
vmovaps ymm2,ymm1
var srcVec128Short1 = srcVec256Short.GetUpper();
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
vextracti128 xmm1,ymm1,1
var srcVec256Int0 = Avx2.ConvertToVector256Int32(srcVec128Short0);
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
vpmovsxwd ymm2,xmm2
var srcVec256Int1 = Avx2.ConvertToVector256Int32(srcVec128Short1);
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
vpmovsxwd ymm1,xmm1
var gathered256Int0 = Avx2.GatherMaskVector256(srcVec256Int0, intLutPtr, srcVec256Int0, mask, 4);
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
vmovaps ymm3,ymm0
vmovaps ymm4,ymm2
vpgatherdd ymm4,dword ptr [rbx+ymm2*4],ymm3
var gathered256Int1 = Avx2.GatherMaskVector256(srcVec256Int1, intLutPtr, srcVec256Int1, mask, 4);
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
vmovaps ymm2,ymm0
vmovaps ymm3,ymm1
vpgatherdd ymm3,dword ptr [rbx+ymm1*4],ymm2
var packed256Short = Avx2.PackUnsignedSaturate(gathered256Int0, gathered256Int1);
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
vpackusdw ymm1,ymm4,ymm3
var permuted256Short = Avx2.Permute4x64(packed256Short.AsUInt64(), 0xD8).AsInt16();
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
vpermq ymm1,ymm1,0D8h
var gathered256Byte = Avx2.PackUnsignedSaturate(permuted256Short, permuted256Short);
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
vpackuswb ymm1,ymm1,ymm1
var permuted256Byte = Avx2.Permute4x64(gathered256Byte.AsUInt64(), 0xD8).AsByte();
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
vpermq ymm1,ymm1,0D8h
Unsafe.WriteUnaligned(dstRowPtr + col, dstVec);
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
movsxd r8,ecx
vmovupd xmmword ptr [rax+r8],xmm1
add ecx,10h
might be improvable is that we currently always copy mask into a temporary register
@tannergooding sorry, this is in a loop, so the move makes sense on the last gather too, doesn't it?
other move can already be elided when targetReg == op1Reg
Can you expand on this? The code below does not help:
srcVec256Int0 = Avx2.GatherMaskVector256(srcVec256Int0, intLutPtr, srcVec256Int0, mask, 4);
It depends on what the register allocator decides.
Basically the codegen for the 5 operand overload has 3 steps:
1. It emits a `movaps` to preserve `op4Reg` (the `mask` parameter): https://github.com/dotnet/coreclr/blob/master/src/jit/hwintrinsiccodegenxarch.cpp#L2025
2. It conditionally emits a `movaps` to ensure that the target register has the correct state for the instruction (this only happens if `targetReg` and `op1Reg` (the `source` parameter) aren't the same): https://github.com/dotnet/coreclr/blob/master/src/jit/hwintrinsiccodegenxarch.cpp#L2030
3. It emits the gather instruction itself.

In the asm above, step 1 is `vmovaps ymm3,ymm0` (copying the mask), step 2 is `vmovaps ymm4,ymm2` (setting up the target), and step 3 is the `vpgatherdd`.

The first step could be elided if we knew that `mask` didn't need to be preserved (it is `lastUse` of that value).
The second step will be elided if the register allocator decides that `targetReg` and `op1Reg` can be the same (@CarolEidt would need to comment on whether there is something better we can do here).
It conditionally emits a movaps to ensure that the target register has the correct state for the instruction (this only happens if targetReg and op1Reg (the source parameter) aren't the same)
@tannergooding thanks for the explanation. Just to be sure I understand: the second `vmovaps` should be possible to get rid of? Yet when `targetReg == op1Reg` this generates an extra `vmovaps`, e.g.:
srcVec256Int0 = Avx2.GatherMaskVector256(srcVec256Int0, intLutPtr, srcVec256Int0, mask, 4);
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
vmovaps ymm3,ymm0
vmovaps ymm4,ymm2
vpgatherdd ymm4,dword ptr [rbx+ymm2*4],ymm3
vmovaps ymm2,ymm4
srcVec256Int1 = Avx2.GatherMaskVector256(srcVec256Int1, intLutPtr, srcVec256Int1, mask, 4);
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
vmovaps ymm3,ymm0
vmovaps ymm4,ymm1
vpgatherdd ymm4,dword ptr [rbx+ymm1*4],ymm3
vmovaps ymm1,ymm4
the problem of course is perhaps that the same register is used for both `op1Reg` and `op3Reg`?
problem of course is perhaps that the same register is used for both op1Reg and op3Reg?
From the link to the code gen code, this doesn't seem to be an issue, if I understand it correctly.
the codegen for the 5 operand overload
@tannergooding right, I am perhaps using the wrong overload here, with the mask being all ones for this initial code. I wanted to see the codegen with mask support, though.
Using the 3 operand overload this falls back to `vpcmpeqd` as I would have assumed:
var gathered256Int0 = Avx2.GatherVector256(intLutPtr, srcVec256Int0, 4);
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
vpcmpeqd ymm2,ymm2,ymm2
vpgatherdd ymm3,dword ptr [rbx+ymm1*4],ymm2
var gathered256Int1 = Avx2.GatherVector256(intLutPtr, srcVec256Int1, 4);
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
vpcmpeqd ymm1,ymm1,ymm1
vpgatherdd ymm2,dword ptr [rbx+ymm0*4],ymm1
and no extra `vmovaps`.
so the second vmovaps should be possible to get rid of?
Yes, should be possible. And some trivial (but not at all real-world) samples show that it will be (namely, I can see it elided in some non-optimized code).
The first step could be elided if we knew that mask didn't need to be preserved (it is lastUse of that value)
It would be nice to be able to preference the internal temp register to the incoming mask value, but the liveness model of the register allocator doesn't allow that. It models the internal registers as being defined prior to the end of the live range of the incoming values. There may be a better way to approach it, but I can't think of it off the top of my head. Even if it's a last use, the code generator would then have to ensure that it doesn't conflict with the target.
The second step will be elided if the register allocator decides that targetReg and op1Reg can be the same
In this case, we should be able to preference `targetReg` to `op1Reg`. This means that this line: https://github.com/dotnet/coreclr/blob/master/src/jit/lsraxarch.cpp#L2651 and this line: https://github.com/dotnet/coreclr/blob/master/src/jit/lsraxarch.cpp#L2667 would have to do something like this:
if (op1->isContained())
{
    srcCount += BuildOperandUses(op1);
}
else
{
    tgtPrefUse = BuildUse(op1);
    srcCount++;
}
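Setting `tgtPrefUse` here tells the register allocator to prefer assigning the definition the same register as the `op1` use, so that `targetReg == op1Reg` ends up holding and codegen can skip the second `movaps`.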
An alternative approach to dealing with the mask register would be to define it as a second target, and to preference it to the incoming mask. That said, multi-reg instructions are still somewhat problematic in the JIT, and it may require special handling because it will generally (always) be an unused value.
@tannergooding @CarolEidt feel free to close this issue; only one of the two `movaps` was extra, the other was my bad, and overall they have little perf consequence for my case. Just noticed them, and thought I'd ask. 😃
I was planning on keeping it open since there are some potential improvements to be made here, even if minor.
It can always be marked up for grabs and someone interested could experiment.
I love the new intrinsics btw works great 👍
Thanks @nietras !
I agree with @tannergooding that it's worth keeping this open to capture the preferencing improvement opportunities.
worth keeping
Ok, I could rather quickly create a BenchmarkDotNet benchmark with the code in question, so let me know if this is needed.
@tannergooding a quick question: is `VPMOVZXBD` not available? In fact, are the zero-extending moves not available at all?
`Vector128<int> Sse41.ConvertToVector128Int32(Vector128<byte>)` emits `PMOVZXBD xmm, xmm`. Use `Vector128<int> Sse41.ConvertToVector128Int32(byte*)` for the overload that deals with addresses.
(Or the same but `Avx2.ConvertToVector256Int32` for the overloads that deal with `ymm`.)
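For example, a minimal usage sketch (the names `p` and `src` are assumptions):

using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

static unsafe void ZeroExtendLoads(byte* p, Vector128<byte> src)
{
    // PMOVZXBD xmm, xmm: zero-extends the low 4 bytes of a register.
    Vector128<int> fromReg = Sse41.ConvertToVector128Int32(src);
    // PMOVZXBD xmm, m32: zero-extends 4 bytes read straight from memory.
    Vector128<int> fromMem = Sse41.ConvertToVector128Int32(p);
    // VPMOVZXBD ymm, m64: the ymm-width equivalent on Avx2.
    Vector256<int> wide = Avx2.ConvertToVector256Int32(p);
}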
FWIW when I can't remember where an intrinsic is declared and/or what it's called, I just grep for the instruction name under `coreclr\src\System.Private.CoreLib\shared\System\Runtime\Intrinsics`.
Avx2.ConvertToVector256Int32 for the overloads that deal with ymm
@tannergooding thanks!
just grep for the instruction name
@CarolEidt that's actually what I kind of tried, by going to definition for `Avx2` and searching there. It doesn't always work, though, since not all overloads have the necessary `<summary>`; e.g. I cannot find `_mm256_cvtepu8_epi32`. It would help with discoverability if all overloads had a summary. I understand that that is probably a lot of work, though.
Ideally, the summary would contain both the "closest" intrinsic name (if any) and the raw instruction name, like they have now. E.g. below, `Vector256<int> ConvertToVector256Int32(Vector128<short> value)` has a summary, but the others do not, so when searching in the go-to-definition file you can't find exactly what you are looking for. And since I was unsure about naming, I apparently got a little lost. Generally, the names are very good. They make sense, and sometimes the overloads are better than what the native intrinsics provide, I think. 👍
//
// Summary:
// __m256i _mm256_cvtepi16_epi32 (__m128i a) VPMOVSXWD ymm, xmm/m128
public static Vector256<int> ConvertToVector256Int32(Vector128<short> value);
public static Vector256<int> ConvertToVector256Int32(Vector128<byte> value);
public static Vector256<int> ConvertToVector256Int32(byte* address);
@nietras, they all should have the native intrinsic name and the corresponding native instruction as a minimum.
For example, `_mm_cvtepu8_epi32` is right here: https://source.dot.net/#System.Private.CoreLib/shared/System/Runtime/Intrinsics/X86/Avx2.PlatformNotSupported.cs,690
(the equivalent doc comment is also in `Avx2.cs`, but it doesn't appear to be indexed on source.dot.net right now)
equivalent doc comment is also in Avx2
@tannergooding weird, have you tried to go to definition with .NET Core 3.0 Preview 6? It doesn't show up, as my copy-paste from the metadata shows...
have you tried to go to definition with .NET Core 3.0 Preview 6
I imagine that has something to do with the reference assemblies and intellisense documentation being out of sync.
CC. @carlossanlop
@tannergooding I am getting a weird `NullReferenceException` in the following code:
[TestMethod]
public unsafe void NullReferenceException()
{
    var ptr = stackalloc byte[32 * Vector256<int>.Count];
    var vec = Avx2.ConvertToVector256Int32(ptr + 0 * Vector256<int>.Count); // throws
}
but the following code does not throw:
[TestMethod]
public unsafe void NotNullReferenceException()
{
    var ptr = stackalloc byte[32 * Vector256<int>.Count];
    var ptr0 = ptr + 0 * Vector256<int>.Count;
    var vec = Avx2.ConvertToVector256Int32(ptr0);
}
@nietras that issue is fixed in master https://github.com/dotnet/coreclr/pull/25135
Due to lack of recent activity, this issue has been marked as a candidate for backlog cleanup. It will be closed if no further activity occurs within 14 more days. Any new comment (by anyone, not necessarily the author) will undo this process.
This process is part of our issue cleanup automation.
This issue will now be closed since it had been marked no-recent-activity but received no further activity in the past 14 days. It is still possible to reopen or comment on the issue, but please note that the issue will be locked if it remains inactive for another 30 days.
Playing with Intrinsics.X86 on .NET Core 3.0 Preview 6. I am doing a simple LUT in AVX2 using gather to see how well this performs. E.g. in normal code:
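A minimal sketch of such a scalar LUT loop (the names `src`, `dst`, `lut` and `count` are assumptions, not the exact code from this issue):

static unsafe void ApplyLut(byte* src, byte* dst, byte* lut, int count)
{
    // One table lookup per byte.
    for (int i = 0; i < count; i++)
    {
        dst[i] = lut[src[i]];
    }
}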
And the Avx2 vectorized version:
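A hedged reconstruction from the annotated disassembly earlier in the thread (the loop scaffolding, parameter names and the int-widened LUT are assumptions):

using System.Runtime.CompilerServices;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

static unsafe void ApplyLutAvx2(byte* src, byte* dst, int* intLutPtr, int count)
{
    var mask = new Vector256<int>();
    mask = Avx2.CompareEqual(mask, mask); // all ones
    for (int col = 0; col <= count - Vector128<byte>.Count; col += Vector128<byte>.Count)
    {
        var srcVec128Byte = Unsafe.ReadUnaligned<Vector128<byte>>(src + col);
        var srcVec256Short = Avx2.ConvertToVector256Int16(srcVec128Byte); // vpmovzxbw
        var srcVec256Int0 = Avx2.ConvertToVector256Int32(srcVec256Short.GetLower()); // vpmovsxwd
        var srcVec256Int1 = Avx2.ConvertToVector256Int32(srcVec256Short.GetUpper());
        // Gather 8 + 8 ints from the LUT; the same vector serves as source and index.
        var gathered256Int0 = Avx2.GatherMaskVector256(srcVec256Int0, intLutPtr, srcVec256Int0, mask, 4);
        var gathered256Int1 = Avx2.GatherMaskVector256(srcVec256Int1, intLutPtr, srcVec256Int1, mask, 4);
        // Narrow the 16 gathered ints back down to 16 bytes, fixing up lane order.
        var packed256Short = Avx2.PackUnsignedSaturate(gathered256Int0, gathered256Int1);
        var permuted256Short = Avx2.Permute4x64(packed256Short.AsUInt64(), 0xD8).AsInt16();
        var gathered256Byte = Avx2.PackUnsignedSaturate(permuted256Short, permuted256Short);
        var permuted256Byte = Avx2.Permute4x64(gathered256Byte.AsUInt64(), 0xD8).AsByte();
        Unsafe.WriteUnaligned(dst + col, permuted256Byte.GetLower());
    }
}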
This generates the following assembly (see the corrected listing in the comments above). The `vmovaps` seem unnecessary. I do get a speedup of 2x on Coffee Lake for this; I just noticed the extra `vmovaps`. CC: @tannergooding
category:cq theme:register-allocator skill-level:expert cost:medium