Speed up slow ARM intrinsics

p0nce commented 4 years ago

[x] _mm_hsub_ps
[x] _mm_avg_epu16
[x] _mm_avg_epu8
[x] _mm_cvtpd_epi32
[x] _mm_cvtps_epi32
[x] _mm_madd_epi16
[x] _mm_movemask_epi8
[x] _mm_maskmoveu_si128
[x] _mm_sll_epi32
[x] _mm_sll_epi16
[x] _mm_slli_epi32
[x] _mm_slli_epi64
[x] _mm_sra_epi16
[x] _mm_srl_epi16
[x] _mm_srli_epi16
[x] _mm_srli_epi32
[x] _mm_srli_epi64
[x] _mm_srai_epi16
[x] _mm_srai_epi32
[x] _mm_slli_epi16
[x] _mm_hadd_ps => https://github.com/ldc-developers/ldc/issues/3577
[x] also 8 functions tagged with "#ARM32" (EDIT: top speed not available for ARM32, unless someone has a use for it)

Preliminary support:

[x] _mm_setcsr (we chose not to support FP exception masks and FP exception flags)
[x] _mm_getcsr (we chose not to support FP exception masks and FP exception flags)
[x] convertFloatToInt32UsingMXCSR
[x] convertFloatToInt64UsingMXCSR
[x] convertDoubleToInt32UsingMXCSR
[x] convertDoubleToInt64UsingMXCSR
[x] fix rounding mode forwarding through AArch64 FPCR
[x] look at SSE2 intrinsics for subpar performance, and use the tag #ARM to mark them for rewrite

p0nce commented 4 years ago

One of the first hurdles is emulating MXCSR. What does simde do?

EDIT: simde calls fesetround, a C stdlib call. That is nice because it will be thread-safe (there is a future LLVM IR intrinsic for that). Can't really emulate this without TLS. sse2neon is incorrect: it always rounds to nearest instead of looking at the current rounding mode. iris has not implemented this.
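As a rough illustration of the fesetround approach, here is a minimal D sketch that goes through the C standard library bindings in druntime (core.stdc.fenv and core.stdc.math). The function name convertWithRoundingMode is hypothetical, and this is not how simde or intel-intrinsics actually implement it. It relies on the C floating-point environment being per-thread, and it assumes the compiler does not constant-fold lrintf under a different rounding mode.

```d
import core.stdc.fenv : fegetround, fesetround,
    FE_DOWNWARD, FE_TONEAREST, FE_TOWARDZERO, FE_UPWARD;
import core.stdc.math : lrintf;

/// Hypothetical helper: converts `value` to int under the given C rounding
/// mode (one of the FE_* constants), then restores the mode the calling
/// thread had before. Out-of-range inputs are not handled.
int convertWithRoundingMode(float value, int mode)
{
    immutable saved = fegetround();   // remember the caller's rounding mode
    fesetround(mode);                 // affects only the current thread
    scope (exit) fesetround(saved);   // restore on every exit path
    return cast(int) lrintf(value);   // lrintf honours the current mode
}
```

With this helper, 2.5f should convert to 2 under FE_TONEAREST (ties go to even) and to 3 under FE_UPWARD.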

p0nce commented 4 years ago

Better way => https://developer.arm.com/documentation/dui0068/b/vector-floating-point-programming/vfp-system-registers/fpscr--the-floating-point-status-and-control-register

p0nce commented 4 years ago

This explains how to link/run an arm64 executable on Apple Silicon => https://github.com/ldc-developers/ldc/issues/3559#issuecomment-691315188

p0nce commented 4 years ago

Decidedly different semantics:

When using NEON:

denormalized numbers are flushed to zero (EDIT: only in 32-bit)
only default NaNs are supported (EDIT only in 32-bit ???)
the Round to Nearest rounding mode is selected (no "current rounding mode" rounding instruction in A64)
untrapped exception handling is selected for all floating-point exceptions (EDIT only in 32-bit ???)

p0nce commented 4 years ago

Three ways to convert float to int using "round to zero". However, it's more difficult to use the current MXCSR rounding mode.

import core.simd;
alias __m128 = float4;

public import ldc.llvmasm: __asm;

import ldc.llvmasm;
alias LDCInlineIR = __ir_pure;

// A version of inline IR with prefix/suffix didn't exist before LDC 1.13
alias LDCInlineIREx = __irEx_pure; 

int convertFloatToInt32UsingMXCSR(float value)
{
    return cast(int)value;
}

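// Same truncation, expressed as LLVM IR: fptosi rounds toward zero.
int convertFloatToInt32UsingMXCSR2(float value)
{
    enum ir = `
        %r = fptosi float %0 to i32
        ret i32 %r`;

    return LDCInlineIR!(ir, int, float)(value);
}

// AArch64 inline asm: fcvtzs also rounds toward zero.
// Note: w0 and s0 are overwritten without being declared as clobbers.
int convertFloatToInt32UsingMXCSR3(float value)
{
    __asm!void(`ldr s0, $0
                fcvtzs w0,s0
                str w0, $0
                `, "m", &value);
    return *cast(int*)&value;
}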
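For contrast with the three truncating versions above, here is a minimal portable sketch of what honouring the MXCSR rounding-control bits (bits 13 and 14) could look like in plain D. The names MM_ROUND_*, roundNearestEven and convertFloatToInt32WithMode are hypothetical; the bit values mirror the _MM_ROUND_* constants from the Intel headers. This is software dispatch only, not the approach the library takes, and it ignores NaN and out-of-range inputs.

```d
import core.stdc.math : ceilf, floorf;

// Rounding-control field of MXCSR (same values as the _MM_ROUND_* constants).
enum uint MM_ROUND_MASK        = 0x6000;
enum uint MM_ROUND_NEAREST     = 0x0000;
enum uint MM_ROUND_DOWN        = 0x2000;
enum uint MM_ROUND_UP          = 0x4000;
enum uint MM_ROUND_TOWARD_ZERO = 0x6000;

/// Hypothetical helper: round to nearest, ties to even.
int roundNearestEven(float x)
{
    immutable float f = floorf(x);
    immutable float diff = x - f;          // in [0, 1)
    immutable int i = cast(int) f;
    if (diff > 0.5f) return i + 1;
    if (diff < 0.5f) return i;
    return (i & 1) ? i + 1 : i;            // exact tie: pick the even neighbour
}

/// Hypothetical helper: converts according to the rounding bits of `mxcsr`.
int convertFloatToInt32WithMode(float x, uint mxcsr)
{
    switch (mxcsr & MM_ROUND_MASK)
    {
        case MM_ROUND_DOWN:        return cast(int) floorf(x);
        case MM_ROUND_UP:          return cast(int) ceilf(x);
        case MM_ROUND_TOWARD_ZERO: return cast(int) x;   // plain truncation
        default:                   return roundNearestEven(x);
    }
}
```

On AArch64 the same dispatch could presumably select between fcvtns, fcvtms, fcvtps and fcvtzs instead of calling libm, but that variant is not shown here.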

p0nce commented 3 years ago

Found the solution to lack of ARM intrinsics in the mir-ion source code:

    pragma(LDC_intrinsic, "llvm.aarch64.neon.addp.v16i8")
            __vector(ubyte[16]) __builtin_vpadd_u32(__vector(ubyte[16]), __vector(ubyte[16]));
    }

    version (GNU)
    {
        import gcc.builtins: __builtin_vpadd_u32;
    }
}

version (ARM)
{
    version (LDC)
    {
        pragma(LDC_intrinsic, "llvm.arm.neon.vpaddlu.v8i16.v16i8")
            __vector(ushort[8]) __builtin_vpaddlq_u8(__vector(ubyte[16]));
        pragma(LDC_intrinsic, "llvm.arm.neon.vpaddlu.v4i32.v8i16")
            __vector(uint[4]) __builtin_vpaddlq_u16(__vector(ushort[8]));
        pragma(LDC_intrinsic, "llvm.arm.neon.vpaddlu.v2i64.v4i32")
            __vector(ulong[2]) __builtin_vpaddlq_u32(__vector(uint[4]));
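The pattern in that excerpt can be sketched as a small self-contained module: declare the LLVM intrinsic with pragma(LDC_intrinsic, ...) when building with LDC for AArch64, and fall back to scalar code elsewhere. The names neonPairwiseAdd, pairwiseAdd and UseNeonAddp are hypothetical; only the intrinsic string comes from the excerpt. The fallback assumes the usual ADDP semantics (adjacent pairs of the first operand fill the low half of the result, pairs of the second fill the high half), and the intrinsic path has not been tested here.

```d
import core.simd;

version (LDC)
{
    version (AArch64)
    {
        // Same intrinsic string as in the mir-ion excerpt, hypothetical name.
        pragma(LDC_intrinsic, "llvm.aarch64.neon.addp.v16i8")
            ubyte16 neonPairwiseAdd(ubyte16, ubyte16);
        version = UseNeonAddp;
    }
}

/// Pairwise add of adjacent bytes, wrapping modulo 256.
ubyte16 pairwiseAdd(ubyte16 a, ubyte16 b)
{
    version (UseNeonAddp)
    {
        return neonPairwiseAdd(a, b);
    }
    else
    {
        // Scalar fallback for other compilers and targets.
        ubyte16 r;
        foreach (i; 0 .. 8)
        {
            r.array[i]     = cast(ubyte)(a.array[2 * i] + a.array[2 * i + 1]);
            r.array[8 + i] = cast(ubyte)(b.array[2 * i] + b.array[2 * i + 1]);
        }
        return r;
    }
}
```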

p0nce commented 3 years ago

We accept that intrinsics might be slower on ARM32, until someone actually needs ARM32 in their context.

AuburnSounds / intel-intrinsics

Speed up slow ARM intrinsics #45