AuburnSounds / intel-intrinsics

The Dlang SIMD library
https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html#techs=MMX,SSE,SSE2,SSE3,SSSE3,SSE4_1
Boost Software License 1.0

would you rather have infinite gold coins but you have to dig a treasure chest in a public park every day you've made a transaction with gold, or infinite silver coins but you can only drink wine and bathe in olive oil like the Romans #144

Closed cetio closed 1 month ago

cetio commented 1 month ago

This list of changes doesn't factor in what may have been added by other people; for example, _mm256_blendv_epi8 was added upstream but I had already implemented it. I did try to take from upstream rather than from myself when there were conflicts, but this list doesn't account for that, nor is it entirely exhaustive of all my changes.

  1. Add flags for LDC AVX512 and files avx512intrin.d vpopcntdqintrin.d
  2. Make _mm256_setr_m128* and _mm256_set1_epi64x pure
  3. Add _mm256_shuffle_epi8
  4. Add _mm256_blendv_epi8
  5. Add _mm256_bslli_epi128
  6. Add _mm256_bsrli_epi128
  7. Add _mm256_slli_epi128
  8. Add _mm256_srli_epi128
  9. Add _mm_maskload_epi64
  10. Add _mm256_maskload_epi32
  11. Add _mm256_maskload_epi64
  12. Add _mm_sllv_epi32
  13. Add _mm_sllv_epi64
  14. Add _mm_srlv_epi32
  15. Add _mm_srlv_epi64
  16. Add _mm256_stream_load_si256 (implements clflush for correctness if the intrinsic doesn't exist)
  17. Add _mm256_shuffle_epi32
  18. Add _mm256_shufflehi_epi16
  19. Add _mm256_shufflelo_epi16
  20. Add _mm256_popcnt_epi32
  21. Add _mm256_popcnt_epi64
  22. Add _mm256_popcnt (pseudo-intrinsic)
HeronErin commented 1 month ago

LGTM

p0nce commented 1 month ago

Dear Sir,

Welcome to upstreamistan. This must have been a long journey there and back. This is a valuable and helpful, albeit massive, change request and I will see that it is merged in due time. I hope you stay for the dessert.

My tentative plan for a Big Merge Event is the following:

Now comes the list of annoyances.

emmintrin.d

types.d => LGTM
internals.d => LGTM
package.d => LGTM
vpopcntdqintrin.d => _mm256_popcnt should have #BONUS tag, otherwise LGTM, nice use of SAD
avxintrin.d => some functions can be pure @safe even, otherwise LGTM

avx2intrin.d

I hope you stay and keep bringing stuff up.

p0nce commented 1 month ago

LGTM

I would appreciate it if you don't comment on commits you didn't review, @HeronErin. This literally deletes 6 intrinsics, so it cannot be merged as is.

cetio commented 1 month ago

Will address the issues today probably. Almost everything I've done with emmintrin was really bad and presumably I was under a dark wizard's mind control when I wrote it.

cetio commented 1 month ago

I may or may not have erased the previous 28 commits with a force push, but the fixes have been implemented...

I believe the test failures on LDC release were due to some weird shenanigans with inline assembly, so I had to comment that out; unfortunate, but at some point I'll have to look into why that was happening, as it also happened for the _mm256_popcnt unittest but none of the others.

Edit

The weird shenanigans may have been VEX instructions wanting the return symbol as the destination, which makes sense in hindsight. This has been fixed alongside a few new changes.

p0nce commented 1 month ago

When a test fails with optimization and there is assembly, it usually means the assembly was actually wrong and doesn't preserve registers correctly. In many, many cases there is an inline IR, a builtin, or a sequence of code that avoids the assembly. And yes, I'm not sure it even works for all x86 targets / combinations of flags.
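
For reference, the inline IR route with LDC looks roughly like this (a sketch using ldc.llvmasm, not code from this PR; the function name is made up):

import core.simd;
import ldc.llvmasm : __ir;

// Adds two int4 vectors through LLVM inline IR instead of machine-level asm,
// so the optimizer can still inline and register-allocate around it.
int4 add_int4_via_ir(int4 a, int4 b)
{
    enum code = `
        %r = add <4 x i32> %0, %1
        ret <4 x i32> %r`;
    return __ir!(code, int4)(a, b);
}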

cetio commented 1 month ago

I avoid writing D's agnostic inline assembly, but if you're aware of a case in which something like

cast(__m256i) __asm!(long4)(
    "vpermq $2, $1, $0",
    "=v,v,n", a, IMM8);

won't generate properly on LDC with AVX2, then I'll sink some hours into finding a higher-level way to do it, presumably with shufflevector. But personally I'm unaware of such a case, and since it's LLVM I would imagine it should always generate properly.
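
Something like this is roughly what I'd expect the shufflevector version to look like (just a sketch with ldc.simd.shufflevector; the helper name is made up):

import core.simd;
import ldc.simd : shufflevector;

// Permutes the four 64-bit lanes of a according to the 2-bit fields of IMM8,
// letting LLVM pick the instruction (vpermq with AVX2) instead of hand-written asm.
long4 permute4x64(ubyte IMM8)(long4 a)
{
    return shufflevector!(long4,
        (IMM8 >> 0) & 3,
        (IMM8 >> 2) & 3,
        (IMM8 >> 4) & 3,
        (IMM8 >> 6) & 3)(a, a);
}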

The problem with the unittests failing is fixed; I'm guessing it was because optimizations were leading to the first operand being contaminated.

p0nce commented 1 month ago

Yes, I saw the inline asm changing! It will probably be ok.

p0nce commented 1 month ago

Your avx2intrin.d changes have disappeared from the PR, is this intentional?

cetio commented 1 month ago

The commit history was wiped because I force-pushed to master, but these changes are in effect:

  1. Add flags for LDC AVX512 and files avx512intrin.d vpopcntdqintrin.d vnniintrin.d
  2. Make _mm256_setr_m128* and _mm256_set1_epi64x pure
  3. Add _mm256_shuffle_epi8
  4. Add _mm256_bslli_epi128
  5. Add _mm256_bsrli_epi128
  6. Add _mm256_slli_epi128
  7. Add _mm256_srli_epi128
  8. Add _mm_sllv_epi32
  9. Add _mm_sllv_epi64
  10. Add _mm_srlv_epi32
  11. Add _mm_srlv_epi64
  12. Add _mm256_shuffle_epi32
  13. Add _mm256_shufflehi_epi16
  14. Add _mm256_shufflelo_epi16
  15. Add _mm256_popcnt_epi32
  16. Add _mm256_popcnt_epi64
  17. Add _mm256_popcnt (bonus)
  18. Add _mm256_permute4x64_epi64
  19. Add _mm_dpbusd_epi32
  20. Add _mm_stream_load_si128nt (bonus)
  21. Add _mm256_stream_load_si256nt (bonus)
  22. Add _mm256_cvtepi32lo_epi16 (bonus)
  23. Add _mm_dpbusds_epi32
  24. Add _mm_adds_epi32 (bonus)

I've also:

  1. Done a little optimization
  2. Added #BONUS like you said for sequences that aren't actual intrinsics
  3. Reverted anything that isn't on this list, like with emmintrin
  4. Added some attributes like pure when possible (some AVX512 stuff could be marked a little more strictly)
  5. Made sure changes from upstream aren't lost (as far as I'm aware)
  6. Added more comments about performance and whatnot for clarity
  7. Miscellaneous fixes, like obviously the unittest failures on release LDC

p0nce commented 1 month ago

Ah yes my bad. I'm working on something else and will review/merge in the coming week, please hold on.

p0nce commented 1 month ago

OK, this is merge day. This will be merged piece by piece on master; it's easier to review and change that way. Hence this PR will not get pulled as is, but the content should be about the same.

EDIT: I'm sorry, this stuff makes me angry.

p0nce commented 1 month ago

// NOTE Why is this not const(**) like _mm256_stream_load_si256?

In this case, Intel has gone and changed the signature to void* since we implemented that, so we're also going with void*, even though it should be const(void)*. They also added _mm_load_si64 to fix _mm_loadu_epi64's weird signature.

p0nce commented 1 month ago
/// #BONUS
__m128i _mm_adds_epi32(__m128i a, __m128i b) pure
{
    // PERF: ARM64 should use 2x vqadd_s32
    static if (LDC_with_saturated_intrinsics)
        return cast(__m128i)inteli_llvm_adds!int4(cast(int4)a, cast(int4)b);
    else
    {
        __m128i int_max = _mm_set1_epi32(0x7FFFFFFF);
        __m128i res = _mm_add_epi32(a, b);
        __m128i sign_bit = _mm_srli_epi32(a, 31);
        __m128i sign_xor  = _mm_xor_si128(a, b);
        __m128i overflow = _mm_andnot_si128(sign_xor, _mm_xor_si128(a, res));
        __m128i saturated = _mm_add_epi32(int_max, sign_bit);
        return cast(__m128i) _mm_blendv_ps(cast(__m128)res,  // No CT check here
            cast(__m128)saturated, 
            cast(__m128)overflow);
    }
}

Note: you can use any intrinsics you want, provided that you use same-instruction-set or earlier intrinsics to implement later ones. Because intel-intrinsics guarantees that each intrinsic is as fast as possible whatever the arch and flags, this makes a directed graph of optimal intrinsics. In this case, you can just use _mm_blendv_ps without concern about whether SSE4.1 is there or not (mostly, because sometimes there isn't a simple match either, and inlining needs to be there). All intrinsics are literally always available.
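
For what it's worth, a saturation check in the usual unittest style would look something like this (a sketch, values picked to hit both saturation ends):

unittest
{
    __m128i A = _mm_setr_epi32(int.max, int.min, 1, -2);
    __m128i B = _mm_setr_epi32(1, -1, 2, -3);
    int4 R = cast(int4) _mm_adds_epi32(A, B);
    int[4] correct = [int.max, int.min, 3, -5]; // both ends saturate
    assert(R.array == correct);
}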

p0nce commented 1 month ago

Opened #145 to keep track of all remaining review and merging; it's very detailed work, as you've seen.

p0nce commented 1 month ago

// PERF This is almost definitely not the best way to do this.
// Don't quote me on this but I'm pretty sure that there isn't a need to add extra
// code for obvious things like CNT == 8 zeroing half of each lane or whatever because
// shuffle should be able to complete fast enough that whatever optimizations will likely
// lead to negligible performance benefit.

This is a static if.
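
That is, since CNT is a compile-time template parameter, the trivial constants can get their own branch. Roughly like this (a sketch, assuming the per-lane byte shift with the usual template-parameter style for the immediate; the _sketch name is made up):

__m256i _mm256_bslli_epi128_sketch(ubyte CNT)(__m256i a)
{
    static if (CNT == 0)
        return a;                          // nothing to shift
    else static if (CNT > 15)
        return _mm256_setzero_si256();     // every byte leaves its 128-bit lane
    else
        return _mm256_bslli_epi128!CNT(a); // general path as reviewed in this PR
}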

p0nce commented 1 month ago
auto hi = _mm_slli_si128!CNT(_mm256_extractf128_si256!0(a));
auto lo = _mm_slli_si128!CNT(_mm256_extractf128_si256!1(a));
return _mm256_setr_m128i(hi, lo);

Beware double inversion here:
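the low lane (extract index 0) lands in a variable named hi, and _mm256_setr_m128i then takes the low lane first, so the two swaps cancel out and the result is correct, but the naming is confusing. A clearer spelling would be something like (sketch):

__m128i lo = _mm_slli_si128!CNT(_mm256_extractf128_si256!0(a)); // low 128 bits
__m128i hi = _mm_slli_si128!CNT(_mm256_extractf128_si256!1(a)); // high 128 bits
return _mm256_setr_m128i(lo, hi); // setr: first argument becomes the low lane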

p0nce commented 1 month ago

When you don't know how an intrinsic should be implemented in LDC, you can look at: https://github.com/llvm/llvm-project/blob/main/clang/lib/Headers/avx2intrin.h

For example, here _mm256_bslli_epi128 is using a builtin named __builtin_ia32_pslldqi256_byteshift, which we do not have in D. However, it's sometimes possible to find its trace in LLVM in this file: https://github.com/ldc-developers/llvm-project/blob/ldc-release/18.x/llvm/include/llvm/IR/IntrinsicsX86.td (in which case it would be available with a pragma intrinsic). In this case there is nothing there, so the instruction is probably available with shufflevector in LDC and a builtin in GDC.
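
When the intrinsic does appear in the .td file, the LDC binding looks roughly like this (a sketch; llvm.x86.avx2.psllv.d is the VPSLLVD entry, and the builtin name matches GDC's):

// LDC-only binding of an LLVM intrinsic listed in IntrinsicsX86.td.
version (LDC)
{
    pragma(LDC_intrinsic, "llvm.x86.avx2.psllv.d")
    int4 __builtin_ia32_psllv4si(int4, int4) pure @safe;
}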

cetio commented 1 month ago
/// #BONUS
__m128i _mm_adds_epi32(__m128i a, __m128i b) pure
{
    // PERF: ARM64 should use 2x vqadd_s32
    static if (LDC_with_saturated_intrinsics)
        return cast(__m128i)inteli_llvm_adds!int4(cast(int4)a, cast(int4)b);
    else
    {
        __m128i int_max = _mm_set1_epi32(0x7FFFFFFF);
        __m128i res = _mm_add_epi32(a, b);
        __m128i sign_bit = _mm_srli_epi32(a, 31);
        __m128i sign_xor  = _mm_xor_si128(a, b);
        __m128i overflow = _mm_andnot_si128(sign_xor, _mm_xor_si128(a, res));
        __m128i saturated = _mm_add_epi32(int_max, sign_bit);
        return cast(__m128i) _mm_blendv_ps(cast(__m128)res,  // No CT check here
            cast(__m128)saturated, 
            cast(__m128)overflow);
    }
}

Note: you can use any intrinsics you want, provided that you use same-instruction-set or earlier intrinsics to implement later ones. Because intel-intrinsics guarantees that each intrinsic is as fast as possible whatever the arch and flags, this makes a directed graph of optimal intrinsics. In this case, you can just use _mm_blendv_ps without concern about whether SSE4.1 is there or not (mostly, because sometimes there isn't a simple match either, and inlining needs to be there). All intrinsics are literally always available.

The reason I did the static if is that I'd rather have finer control over the operations so I can fine-tune optimization, and it also makes it simpler to add AVX512 optimizations later, not because I was worried about being unable to access an intrinsic. It's just not always best to do broad operations when you could specialize based on hardware/flags instead.

cetio commented 1 month ago

Yeah, I figured shufflevector could probably be used almost everywhere I used inline asm, which would probably simplify things and make the output more reliable. I didn't implement it because I figured it was simpler to just use assembly, as it should always output fine given the proper flags, and I try to make sure the slow path is optimal. I should have implemented GDC builtins, though.

p0nce commented 1 month ago

Absolutely. Yes, I think the pros and cons are:

pros of LLVM asm:

cons of LLVM asm:

p0nce commented 1 month ago

That one is interesting.

The _mm_srlv_xxx and _mm_sllv_xxx are wrong because the instruction and intrinsics have a defined semantic for shifts larger than or equal to the bitness.

So you could shift by, say, 78 bits.

However, when implemented as:

__m128i _mm_sllv_epi32(__m128i a, __m128i b) pure @trusted
{
    static if (GDC_with_AVX2 || LDC_with_AVX2)
        return cast(__m128i)__builtin_ia32_psllv4si(cast(byte16)a, cast(byte16)b);
    else
    {
        return _mm_setr_epi32(
            a[0] << b[0],
            a[1] << b[1],
            a[2] << b[2],
            a[3] << b[3]
        );
    }
}

it uses the << operator, which is UB when the shift is > 31. And indeed the results will differ between x86 and ARM. So we have to make this one slower to imitate the instruction semantics.

And indeed look at: https://github.com/simd-everywhere/simde/blob/master/simde/x86/avx2.h#L5009
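
A scalar fallback imitating the instruction semantics would have to guard the count explicitly, roughly like this (a sketch; the _ref name is made up):

__m128i _mm_sllv_epi32_ref(__m128i a, __m128i b) pure @trusted
{
    int4 r = void;
    foreach (i; 0 .. 4)
    {
        uint count = cast(uint) b.array[i];
        // VPSLLVD zeroes the element when the count is 32 or more,
        // whereas D's << is undefined for shifts >= 32.
        r.ptr[i] = (count > 31) ? 0 : (a.array[i] << count);
    }
    return cast(__m128i) r;
}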

p0nce commented 1 month ago

Done.