Closed p0nce closed 3 years ago
Similar project: https://github.com/simd-everywhere/simde
One of the first hurdles is emulating MXCSR. What does simde do?
EDIT:
- simde calls fesetround, a C stdlib call. Which is nice because it will be thread-safe (there is a future LLVM IR intrinsic for that). Can't really emulate this without TLS.
- sse2neon is incorrect: it always rounds to nearest instead of honoring the current rounding mode.
- iris has not implemented this.
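For reference, the fesetround approach simde takes can be sketched in plain D via the C stdlib's fenv interface. This is only a sketch; the helper name `cvtWithCurrentRounding` is hypothetical. The rounding mode set by fesetround is per-thread in practice, which is why TLS comes up above.

```d
import core.stdc.fenv : fegetround, fesetround, FE_TONEAREST, FE_TOWARDZERO;
import core.stdc.math : lrintf;

// lrintf honors the current (per-thread) rounding mode, which mirrors
// what cvtps2dq does with MXCSR on x86.
int cvtWithCurrentRounding(float x)
{
    return cast(int) lrintf(x);
}

void main()
{
    immutable saved = fegetround();
    fesetround(FE_TOWARDZERO);
    assert(cvtWithCurrentRounding(1.7f) == 1);
    fesetround(FE_TONEAREST);
    assert(cvtWithCurrentRounding(1.7f) == 2);
    fesetround(saved); // restore the caller's rounding mode
}
```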
This explains how to link/run an arm64 executable on Apple Silicon => https://github.com/ldc-developers/ldc/issues/3559#issuecomment-691315188
Decidedly different semantics when using NEON: there are three ways to do float-to-int conversion with "round to zero", but honoring the current MXCSR rounding mode is more difficult.
```d
import core.simd;
alias __m128 = float4;

public import ldc.llvmasm: __asm;
import ldc.llvmasm;
alias LDCInlineIR = __ir_pure;
// A version of inline IR with prefix/suffix didn't exist before LDC 1.13
alias LDCInlineIREx = __irEx_pure;

// Plain D cast: lowers to a truncating conversion (round toward zero).
int convertFloatToInt32UsingMXCSR(float value)
{
    return cast(int)value;
}

// Same conversion written directly as LLVM IR: fptosi also truncates.
int convertFloatToInt32UsingMXCSR2(float value)
{
    enum ir = `
        %r = fptosi float %0 to i32
        ret i32 %r`;
    return LDCInlineIR!(ir, int, float)(value);
}

// Inline AArch64 assembly: fcvtzs is "convert to signed, round toward zero".
// The result is stored over the input and reinterpreted as an int.
int convertFloatToInt32UsingMXCSR3(float value)
{
    __asm!void(`ldr s0, $0
                fcvtzs w0, s0
                str w0, $0
               `, "m", &value);
    return *cast(int*)&value;
}
```
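All three variants share truncation (round-toward-zero) semantics; a quick plain-D sanity check of that behavior, with no LDC extensions needed:

```d
void main()
{
    // fptosi / fcvtzs truncate toward zero, like D's cast(int)
    assert(cast(int)  1.9f ==  1);
    assert(cast(int) -1.9f == -1);
    assert(cast(int)  0.5f ==  0);
}
```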
Found the solution to the lack of ARM intrinsics in the mir-ion source code:

```d
version (AArch64)
{
    version (LDC)
    {
        pragma(LDC_intrinsic, "llvm.aarch64.neon.addp.v16i8")
        __vector(ubyte[16]) __builtin_vpadd_u32(__vector(ubyte[16]), __vector(ubyte[16]));
    }
    version (GNU)
    {
        import gcc.builtins: __builtin_vpadd_u32;
    }
}
version (ARM)
{
    version (LDC)
    {
        pragma(LDC_intrinsic, "llvm.arm.neon.vpaddlu.v8i16.v16i8")
        __vector(ushort[8]) __builtin_vpaddlq_u8(__vector(ubyte[16]));
        pragma(LDC_intrinsic, "llvm.arm.neon.vpaddlu.v4i32.v8i16")
        __vector(uint[4]) __builtin_vpaddlq_u16(__vector(ushort[8]));
        pragma(LDC_intrinsic, "llvm.arm.neon.vpaddlu.v2i64.v4i32")
        __vector(ulong[2]) __builtin_vpaddlq_u32(__vector(uint[4]));
    }
}
```
We accept that ARM32 intrinsics might be slower, until someone actually needs ARM32 in their context.
_mm_hsub_ps
_mm_avg_epu16
_mm_avg_epu8
_mm_cvtpd_epi32
_mm_cvtps_epi32
_mm_madd_epi16
_mm_movemask_epi8
_mm_maskmoveu_si128
_mm_sll_epi32
_mm_sll_epi16
_mm_slli_epi32
_mm_slli_epi64
_mm_sra_epi16
_mm_srl_epi16
_mm_srli_epi16
_mm_srli_epi32
_mm_srli_epi64
_mm_srai_epi16
_mm_srai_epi32
_mm_slli_epi16
_mm_hadd_ps
=> https://github.com/ldc-developers/ldc/issues/3577

Preliminary support:
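Until the NEON pairwise-add path is wired up for these, a portable scalar fallback is possible for some of them. A sketch for _mm_movemask_epi8 (the helper name is hypothetical): it gathers the top bit of each of the 16 bytes into a 16-bit mask, which is exactly what the x86 instruction produces.

```d
// Scalar fallback for _mm_movemask_epi8: collect the sign bit of
// each byte i into bit i of the result.
int movemaskEpi8Scalar(const ubyte[16] v)
{
    int mask = 0;
    foreach (i; 0 .. 16)
        mask |= (v[i] >> 7) << i;
    return mask;
}

void main()
{
    ubyte[16] v = 0;
    v[0]  = 0x80;
    v[15] = 0xFF;
    assert(movemaskEpi8Scalar(v) == 0x8001);
}
```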