AuburnSounds / intel-intrinsics

The Dlang SIMD library
https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html#techs=MMX,SSE,SSE2,SSE3,SSSE3,SSE4_1
Boost Software License 1.0

Speed up slow ARM intrinsics #45

Closed. p0nce closed this issue 3 years ago

p0nce commented 4 years ago

Preliminary support:

p0nce commented 4 years ago

Similar project: https://github.com/simd-everywhere/simde

p0nce commented 4 years ago

One of the first hurdles is emulating MXCSR. What does simde do?

EDIT: simde calls fesetround, a C stdlib call, which is nice because it will be thread-safe (there is a future LLVM IR intrinsic for that). Can't really emulate this without TLS. sse2neon is incorrect: it always rounds to nearest instead of looking at the current rounding mode. iris has not implemented this.
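
To illustrate the fesetround approach, here is a minimal D sketch. It is not taken from simde or from this library. It assumes the C floating-point environment is reachable through druntime's core.stdc.fenv and core.stdc.math, and convertWithRounding is a hypothetical helper name. lrintf converts according to the current rounding mode, which the C standard keeps per thread.

```d
import core.stdc.fenv : fegetround, fesetround,
    FE_TONEAREST, FE_DOWNWARD, FE_UPWARD, FE_TOWARDZERO;
import core.stdc.math : lrintf;

/// Hypothetical helper: converts `value` to int under the given C rounding
/// mode, then restores whatever mode was active before the call.
int convertWithRounding(float value, int mode)
{
    immutable saved = fegetround();
    fesetround(mode);
    scope (exit) fesetround(saved);

    // lrintf honors the current rounding mode, unlike cast(int).
    return cast(int) lrintf(value);
}
```

An optimizing compiler may fold such calls while assuming the default rounding mode, so treat this as an illustration of the semantics rather than a drop-in implementation.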

p0nce commented 4 years ago

Better way => https://developer.arm.com/documentation/dui0068/b/vector-floating-point-programming/vfp-system-registers/fpscr--the-floating-point-status-and-control-register
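
If the rounding mode is kept in the ARM register directly, the MXCSR rounding-control bits have to be translated, because the two encodings differ. A minimal sketch of that translation, assuming the documented layouts (MXCSR RC in bits 14:13, ARM RMode in bits 23:22); the constant and function names are hypothetical and not part of this library:

```d
// MXCSR rounding control (RC): 0 = nearest, 1 = down, 2 = up, 3 = toward zero.
enum uint MXCSR_ROUND_MASK  = 0x6000;
enum uint MXCSR_ROUND_SHIFT = 13;

// ARM FPSCR/FPCR RMode: 0 = nearest, 1 = toward +inf, 2 = toward -inf, 3 = toward zero.
enum uint FPSCR_RMODE_SHIFT = 22;

/// Hypothetical helper: returns the RMode bits (already shifted into place)
/// that correspond to the rounding mode encoded in an MXCSR value.
uint mxcsrToFpscrRounding(uint mxcsr) pure nothrow @nogc @safe
{
    // "down" and "up" are swapped between the two encodings.
    static immutable uint[4] rcToRMode = [0, 2, 1, 3];
    immutable rc = (mxcsr & MXCSR_ROUND_MASK) >> MXCSR_ROUND_SHIFT;
    return rcToRMode[rc] << FPSCR_RMODE_SHIFT;
}
```

Only the rounding field is handled here; exception masks and flush-to-zero bits would need their own mapping.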

p0nce commented 4 years ago

This explains how to link and run an arm64 executable on Apple Silicon => https://github.com/ldc-developers/ldc/issues/3559#issuecomment-691315188

p0nce commented 4 years ago

Decidedly different semantics:

When using NEON:

p0nce commented 4 years ago

Three ways to convert float to int using "round to zero". However, it's more difficult to honor the current MXCSR rounding mode.

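import core.simd;
alias __m128 = float4;

public import ldc.llvmasm: __asm;

import ldc.llvmasm;
alias LDCInlineIR = __ir_pure;

// A version of inline IR with prefix/suffix didn't exist before LDC 1.13
alias LDCInlineIREx = __irEx_pure;

// Note: despite the "UsingMXCSR" names, all three functions below truncate
// (round toward zero); none of them reads the current rounding mode.

// 1. Plain D cast: float-to-int casts truncate.
int convertFloatToInt32UsingMXCSR(float value)
{
    return cast(int)value;
}

// 2. LLVM inline IR: fptosi also rounds toward zero.
int convertFloatToInt32UsingMXCSR2(float value)
{
    enum ir = `
        %r = fptosi float %0 to i32
        ret i32 %r`;

    return LDCInlineIR!(ir, int, float)(value);
}

// 3. AArch64 inline asm: fcvtzs converts with round toward zero.
// The result is stored back over `value` in memory and reinterpreted as int.
int convertFloatToInt32UsingMXCSR3(float value)
{
    __asm!void(`ldr s0, $0
                fcvtzs w0, s0
                str w0, $0
                `, "m", &value);
    return *cast(int*)&value;
}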

p0nce commented 3 years ago

Found the solution to the lack of ARM intrinsics in the mir-ion source code: declare the LLVM intrinsics directly with pragma(LDC_intrinsic).

        pragma(LDC_intrinsic, "llvm.aarch64.neon.addp.v16i8")
            __vector(ubyte[16]) __builtin_vpadd_u32(__vector(ubyte[16]), __vector(ubyte[16]));
    }

    version (GNU)
    {
        import gcc.builtins: __builtin_vpadd_u32;
    }
}

version (ARM)
{
    version (LDC)
    {
        pragma(LDC_intrinsic, "llvm.arm.neon.vpaddlu.v8i16.v16i8")
            __vector(ushort[8]) __builtin_vpaddlq_u8(__vector(ubyte[16]));
        pragma(LDC_intrinsic, "llvm.arm.neon.vpaddlu.v4i32.v8i16")
            __vector(uint[4]) __builtin_vpaddlq_u16(__vector(ushort[8]));
        pragma(LDC_intrinsic, "llvm.arm.neon.vpaddlu.v2i64.v4i32")
            __vector(ulong[2]) __builtin_vpaddlq_u32(__vector(uint[4]));

p0nce commented 3 years ago

We accept that intrinsics might be slower on ARM32, until someone actually needs ARM32 in their context.