LGTM
Dear Sir,
Welcome to upstreamistan. This must have been a long journey there and back. This is a valuable and helpful, albeit massive, change request and I will see that it is merged in due time. I hope you stay for the dessert.
My tentative plan for a Big Merge Event is the following:
Now comes the list of annoyances.
dub.selections.json
emmintrin.d
=> _mm_setr_epi64x, well I'm in my open browser on the Guide, and it says it takes __m64 aka long1 (to do like in C++ compilers), not long. If the Guide has stupid names we keep them.
=> #BONUS tag in comment, else Intel Intrinsics Guide semantics is expected. Though sure, in this case it doesn't really hurt.
=> Not usable in all supported compilers, which may be as ancient as ldc-1.24.0; please create a function, sorry.
types.d
=> LGTM
internals.d
=> LGTM
package.d
=> LGTM
vpopcntdqintrin.d
=> _mm256_popcnt should have #BONUS tag, otherwise LGTM, nice use of SAD
avxintrin.d
=> some functions can be pure @safe even, otherwise LGTM
avx2intrin.d
=> _mm256_bslli_epi128 crashes in dub test --compiler ldc2 -a x86_64 -b unittest-release-inst -f
=> The scope (exit) _mm_clflush(mem_addr); line is wrong: streaming loads are more loose than regular loads. Fences are needed when you use streaming load/store, not when not using them (see the sketch after this list). Think about it.
=> // PERF arm64 or // PERF DMD: passive-aggressive comments for the maintainer
I hope you stay and keep bringing stuff up.
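To make the fence point concrete, a minimal sketch (my own illustration, using only intrinsics the library already ships: _mm_stream_si128 and _mm_sfence). The fence belongs with code that actually performs streaming stores, because those are weakly ordered; a regular load path needs neither a clflush nor a fence.

import inteli.emmintrin; // _mm_stream_si128, __m128i
import inteli.xmmintrin; // _mm_sfence

// dst must be 16-byte aligned
void publishNonTemporal(int* dst, __m128i value)
{
    _mm_stream_si128(cast(__m128i*) dst, value); // non-temporal store: bypasses the cache, weakly ordered
    _mm_sfence();                                // order the streaming store before any later "data ready" write
}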
LGTM
I would appreciate it if you don't comment on commits you didn't review, @HeronErin. This literally deletes 6 intrinsics, so it cannot be merged as is.
Will address the issues today probably. Almost everything I've done with emmintrin was really bad and presumably I was under a dark wizard's mind control when I wrote it.
May or may not have erased previous 28 commits with a force push but fixes have been implemented...
I believe the test failures on LDC release were due to some weird shenanigans with inline assembly, so I had to comment that out. Unfortunate, but at some point I'll have to look into why that was happening, as it also happened for the _mm256_popcnt unittest but no others.
Edit
Weird shenanigans may have been VEX instructions wanting me to have the return symbol as the destination, which makes sense in hindsight. Has been fixed alongside a few new changes.
When a test fails with optimization and there is assembly, it usually means the assembly was actually wrong and doesn't preserve registers correctly. In many, many cases there is an inline IR, a builtin, or a sequence of code to avoid the assembly. And yes, I'm not sure it even works for all x86 targets / combinations of flags.
I avoid writing D's agnostic inline assembly, but if you're aware of a case in which something like

cast(__m256i) __asm!(long4)("vpermq $2, $1, $0", "=v,v,n", a, IMM8);

won't generate properly on LDC with AVX2, then I'll sink some hours into finding a higher-level way to do it, presumably with shufflevector. But personally I'm unaware of such a case, and since it's LLVM I would imagine it should always generate properly.
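For what it's worth, here is roughly what the higher-level route could look like for this particular permute, as a sketch only: it assumes LDC's ldc.simd.shufflevector and the library's long4/__m256i aliases from inteli.types, and permute4x64 is just an illustrative name, not an existing function.

version (LDC)
{
    import ldc.simd : shufflevector;
    import inteli.types; // long4, __m256i

    // Same element selection as vpermq with immediate IMM8, expressed as LLVM IR
    // through shufflevector instead of inline asm.
    __m256i permute4x64(ubyte IMM8)(__m256i a)
    {
        long4 v = cast(long4) a;
        long4 r = shufflevector!(long4,  IMM8       & 3,
                                        (IMM8 >> 2) & 3,
                                        (IMM8 >> 4) & 3,
                                        (IMM8 >> 6) & 3)(v, v);
        return cast(__m256i) r;
    }
}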
The problem with the failing unittests is fixed, and I'm guessing it was because optimizations were leading to the first operand being contaminated.
Yes, saw the inline asm changing! It will probably be ok.
Your avx2intrin.d changes have disappeared from the PR, is this intentional?
Commit history was wiped because I force pushed to master but these changes are in effect:
Add flags for LDC AVX512 and files avx512intrin.d, vpopcntdqintrin.d, vnniintrin.d
Make _mm256_setr_m128* and _mm256_set1_epi64x pure
Add _mm256_shuffle_epi8
Add _mm256_bslli_epi128
Add _mm256_bsrli_epi128
Add _mm256_slli_epi128
Add _mm256_srli_epi128
Add _mm_sllv_epi32
Add _mm_sllv_epi64
Add _mm_srlv_epi32
Add _mm_srlv_epi64
Add _mm256_shuffle_epi32
Add _mm256_shufflehi_epi16
Add _mm256_shufflelo_epi16
Add _mm256_popcnt_epi32
Add _mm256_popcnt_epi64
Add _mm256_popcnt (bonus)
Add _mm256_permute4x64_epi64
Add _mm_dpbusd_epi32
Add _mm_stream_load_si128nt (bonus)
Add _mm256_stream_load_si256nt (bonus)
Add _mm256_cvtepi32lo_epi16 (bonus)
Add _mm_dpbusds_epi32
Add _mm_adds_epi32 (bonus)
I've also made emmintrin pure when possible (some AVX512 stuff could be marked a little more strictly).
Ah yes my bad. I'm working on something else and will review/merge in the coming week, please hold on.
OK, this is merge day. This will be merged piece by piece on master; it's easier to review and change that way. Hence this PR will not get pulled as is, but the content should be about the same.
EDIT: I'm sorry this stuff makes me angry
// NOTE Why is this not const(**) like _mm256_stream_load_si256?
In this case, Intel has gone and changed the signature to void* since we implemented that, so we're also going for void*, even though it should be const(void)*.
They also added _mm_load_si64 to fix _mm_loadu_epi64's weird signature.
/// #BONUS
__m128i _mm_adds_epi32(__m128i a, __m128i b) pure
{
    // PERF: ARM64 should use 2x vqadd_s32
    static if (LDC_with_saturated_intrinsics)
        return cast(__m128i) inteli_llvm_adds!int4(cast(int4)a, cast(int4)b);
    else
    {
        __m128i int_max = _mm_set1_epi32(0x7FFFFFFF);
        __m128i res = _mm_add_epi32(a, b);
        __m128i sign_bit = _mm_srli_epi32(a, 31);
        __m128i sign_xor = _mm_xor_si128(a, b);
        __m128i overflow = _mm_andnot_si128(sign_xor, _mm_xor_si128(a, res));
        __m128i saturated = _mm_add_epi32(int_max, sign_bit);
        return cast(__m128i) _mm_blendv_ps(cast(__m128)res, // No CT check here
                                           cast(__m128)saturated,
                                           cast(__m128)overflow);
    }
}
Note: you can use any intrinsics you want, provided that you use the same instruction set or an earlier one to implement later intrinsics. Because intel-intrinsics guarantees that each intrinsic is as fast as possible whatever the arch and flags, this makes a directed graph of optimal intrinsics. In this case, you can just use _mm_blendv_ps without concern about whether SSE4.1 is there or not (mostly, because sometimes there isn't a simple match either, and inlining needs to be there). All intrinsics are literally always available.
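As a quick sanity check of the saturating behaviour, in the same style as the library's existing unittests (my own illustration; the expected values follow from the semantics described above, not from code in the PR):

unittest
{
    __m128i x = _mm_set1_epi32(int.max);
    __m128i y = _mm_set1_epi32(1);
    __m128i s = _mm_adds_epi32(x, y);
    assert(s.array[0] == int.max); // saturated, instead of wrapping to int.min
}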
Opened #145 to keep track of all remaining review and merging; it's very detailed work, as you've seen.
// PERF This is almost definitely not the best way to do this.
// Don't quote me on this but I'm pretty sure that there isn't a need to add extra
// code for obvious things like CNT == 8 zeroing half of each lane or whatever because
// shuffle should be able to complete fast enough that whatever optimizations will likely
// lead to negligible performance benefit.
This is a static if.
auto hi = _mm_slli_si128!CNT(_mm256_extractf128_si256!0(a));
auto lo = _mm_slli_si128!CNT(_mm256_extractf128_si256!1(a));
return _mm256_setr_m128i(hi, lo);
Beware the double inversion here:
_mm256_setr_m128i first takes the low lane, then the high lane
hi is extracted as the low lane and lo as the high lane, which is why it works anyway
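A tiny unittest-style check of that ordering (my own illustration of what the prose above says, using the existing _mm256_setr_m128i / _mm256_extractf128_si256 API):

unittest
{
    __m128i lo = _mm_set1_epi32(1);
    __m128i hi = _mm_set1_epi32(2);
    __m256i v  = _mm256_setr_m128i(lo, hi);              // setr: low lane first, then high lane
    assert(_mm256_extractf128_si256!0(v).array[0] == 1); // !0 extracts the low lane back
    assert(_mm256_extractf128_si256!1(v).array[0] == 2); // !1 extracts the high lane
}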
When you don't know how an intrinsic should be implemented in LDC, you can look at: https://github.com/llvm/llvm-project/blob/main/clang/lib/Headers/avx2intrin.h
For example here, _mm256_bslli_epi128 is using a builtin named __builtin_ia32_pslldqi256_byteshift which we do not have in D. However, it's sometimes possible to find its trace in LLVM with this file: https://github.com/ldc-developers/llvm-project/blob/ldc-release/18.x/llvm/include/llvm/IR/IntrinsicsX86.td (in which case it would be available with a pragma intrinsic). Here there is nothing, so the instruction is probably available with shufflevector in LDC and a builtin in GDC.
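For reference, when an LLVM intrinsic does exist in IntrinsicsX86.td, the pragma route looks roughly like the sketch below. pshufb is used purely to illustrate the pattern (pslldq itself has no LLVM intrinsic), the declared name is arbitrary, and the byte32 alias is assumed to come from inteli.types.

version (LDC)
{
    import inteli.types; // byte32 (assumed alias)

    // Maps the declaration directly onto the LLVM intrinsic of that name.
    pragma(LDC_intrinsic, "llvm.x86.avx2.pshuf.b")
    byte32 llvm_pshufb_256(byte32, byte32) pure @safe;
}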
The reason I did the static if is because I'd rather have better control over the operations so I can fine-tune optimization, and it also makes it simpler to add AVX512 optimizations later, not because I was worried about being unable to access an intrinsic. It's just not always best to do broad operations when you could modularize based on hardware/flags instead.
Yeah, I figured shufflevector could probably be used almost everywhere I used inline asm, which could probably simplify things and make outputs more reliable. I didn't implement it because I figured it was simpler to just use assembly, as it should always output fine given the proper flags, and I try to make sure the slow path is optimal. I should have implemented GDC builtins though.
Absolutely. Yes, I think the pros and cons are:
Pros of LLVM asm:
- -O0 support, very nice
- shufflevector and anything making new LLVM IR takes about 50ms more at compile time; in that case it's okay since not instantiated. This is a debug-build concern only.
Cons of LLVM asm:
- shufflevector works for any arch

That one is interesting.
The _mm_srlv_xxx and _mm_sllv_xxx are wrong, because the instruction and intrinsics have a defined semantic for shifts larger than or equal to the bitness. So you could shift by, say, 78 bits.
However, when implemented:
__m128i _mm_sllv_epi32(__m128i a, __m128i b) pure @trusted
{
    static if (GDC_with_AVX2 || LDC_with_AVX2)
        return cast(__m128i) __builtin_ia32_psllv4si(cast(byte16)a, cast(byte16)b);
    else
    {
        return _mm_setr_epi32(
            a[0] << b[0],
            a[1] << b[1],
            a[2] << b[2],
            a[3] << b[3]
        );
    }
}
it uses the << operator, which is UB when the shift is > 31, and indeed the results will differ on x86 vs ARM. So we have to make this one slower to imitate the instruction semantics.
And indeed look at: https://github.com/simd-everywhere/simde/blob/master/simde/x86/avx2.h#L5009
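For illustration, a scalar fallback that imitates the instruction semantics could look like the sketch below. This is my own code, not the library's; it keeps the _mm_setr_epi32 structure of the snippet above and just guards the per-element count.

import inteli.emmintrin; // __m128i, _mm_setr_epi32

__m128i sllv_epi32_scalar(__m128i a, __m128i b) pure @trusted
{
    // VPSLLVD semantics: a per-element count of 32 or more yields 0, never UB.
    int e0 = (cast(uint)b.array[0] > 31) ? 0 : (a.array[0] << b.array[0]);
    int e1 = (cast(uint)b.array[1] > 31) ? 0 : (a.array[1] << b.array[1]);
    int e2 = (cast(uint)b.array[2] > 31) ? 0 : (a.array[2] << b.array[2]);
    int e3 = (cast(uint)b.array[3] > 31) ? 0 : (a.array[3] << b.array[3]);
    return _mm_setr_epi32(e0, e1, e2, e3);
}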
Done.
This list of changes doesn't factor in what may have been added by other people; for example, _mm256_blendv_epi8 was added upstream but I had already implemented it. I did try to take from upstream rather than myself when there are conflicts, but this list doesn't account for that, nor is it entirely exhaustive of all my changes.
avx512intrin.d
vpopcntdqintrin.d
_mm256_setr_m128* and _mm256_set1_epi64x pure
_mm256_shuffle_epi8
_mm256_blendv_epi8
_mm256_bslli_epi128
_mm256_bsrli_epi128
_mm256_slli_epi128
_mm256_srli_epi128
_mm_maskload_epi64
_mm256_maskload_epi32
_mm256_maskload_epi64
_mm_sllv_epi32
_mm_sllv_epi64
_mm_srlv_epi32
_mm_srlv_epi64
_mm256_stream_load_si256 (implements clflush for correctness if the intrinsic doesn't exist)
_mm256_shuffle_epi32
_mm256_shufflehi_epi16
_mm256_shufflelo_epi16
_mm256_popcnt_epi32
_mm256_popcnt_epi64
_mm256_popcnt (pseudo-intrinsic)