DLTcollab / sse2neon

sse2neon

A C/C++ header file that converts Intel SSE intrinsics to Arm/Aarch64 NEON intrinsics.

Introduction

sse2neon translates Intel SSE (Streaming SIMD Extensions) intrinsics to Arm NEON, shortening the time needed to get a working Arm program that can then be profiled to identify hot paths in the code. The header file sse2neon.h contains several of the functions provided by Intel intrinsic headers such as <xmmintrin.h>, implemented with NEON-based counterparts that aim to reproduce the exact semantics of the intrinsics.
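As a rough sketch of what this looks like in practice, the snippet below keeps the SSE calls unchanged and only switches the header. It assumes sse2neon.h is on the include path when building for Arm; the architecture check and the function name add4 are illustrative choices, not part of sse2neon.

```c
/* Pick the native SSE header on x86, and sse2neon.h elsewhere.
 * Assumption: sse2neon.h is available on the include path for Arm builds. */
#if defined(__x86_64__) || defined(__i386__) || defined(_M_X64) || defined(_M_IX86)
#include <xmmintrin.h>
#else
#include "sse2neon.h"
#endif

/* Hypothetical helper: element-wise sum of two arrays of four floats.
 * The same SSE intrinsics are used on both architectures. */
void add4(const float *a, const float *b, float *out)
{
    __m128 va = _mm_loadu_ps(a);             /* unaligned load of 4 floats */
    __m128 vb = _mm_loadu_ps(b);
    _mm_storeu_ps(out, _mm_add_ps(va, vb));  /* add lanes, store result */
}
```

Calling add4 with {1, 2, 3, 4} and {10, 20, 30, 40} fills out with the four lane-wise sums.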

Mapping and Coverage

Header file       Extension
<mmintrin.h>      MMX
<xmmintrin.h>     SSE
<emmintrin.h>     SSE2
<pmmintrin.h>     SSE3
<tmmintrin.h>     SSSE3
<smmintrin.h>     SSE4.1
<nmmintrin.h>     SSE4.2
<wmmintrin.h>     AES

sse2neon aims to support the SSE, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2 and AES extensions.

sse2neon aims to deliver NEON equivalents for all widely used SSE intrinsics. Some SSE intrinsics map directly to a single NEON intrinsic. Others lack a 1:1 mapping, so their equivalents are built from several NEON intrinsics.

For example, the SSE intrinsic _mm_loadu_si128 has a direct NEON mapping (vld1q_s32), but the SSE intrinsic _mm_maddubs_epi16 has to be implemented with 13+ NEON instructions.
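To see why _mm_maddubs_epi16 cannot map to a single instruction, it helps to spell out what it computes. The scalar reference below is only a model of the intrinsic's documented behavior: unsigned bytes from the first operand are multiplied by signed bytes from the second, and adjacent products are summed with signed 16-bit saturation. The names maddubs_epi16_ref and saturate_i16 are hypothetical, and this is not how sse2neon implements the intrinsic.

```c
#include <stdint.h>

/* Clamp a 32-bit value into the signed 16-bit range. */
int16_t saturate_i16(int32_t v)
{
    if (v > INT16_MAX) return INT16_MAX;
    if (v < INT16_MIN) return INT16_MIN;
    return (int16_t) v;
}

/* Scalar model of _mm_maddubs_epi16 (hypothetical helper).
 * a: 16 unsigned bytes, b: 16 signed bytes, out: 8 signed 16-bit lanes. */
void maddubs_epi16_ref(const uint8_t a[16], const int8_t b[16], int16_t out[8])
{
    for (int i = 0; i < 8; i++) {
        /* Mixed-sign widening multiplies of one adjacent pair of bytes. */
        int32_t even = (int32_t) a[2 * i] * (int32_t) b[2 * i];
        int32_t odd = (int32_t) a[2 * i + 1] * (int32_t) b[2 * i + 1];
        /* Pairwise add, then saturate to 16 bits. */
        out[i] = saturate_i16(even + odd);
    }
}
```

The mix of unsigned and signed inputs, the widening multiply, the pairwise add, and the saturation are separate steps, which is why a vector version needs several instructions.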

Floating-point compatibility

Some conversions require several NEON intrinsics, which may produce inconsistent results compared to their SSE counterparts due to differences in the arithmetic rules of IEEE-754.

Take a possible conversion of _mm_rsqrt_ps as an example:

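__m128 _mm_rsqrt_ps(__m128 in)
{
    // Initial estimate of 1 / sqrt(in) for each lane.
    float32x4_t out = vrsqrteq_f32(vreinterpretq_f32_m128(in));

    // One refinement step: out = out * (3 - (in * out) * out) / 2.
    out = vmulq_f32(
        out, vrsqrtsq_f32(vmulq_f32(vreinterpretq_f32_m128(in), out), out));

    return vreinterpretq_m128_f32(out);
}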

This _mm_rsqrt_ps conversion produces NaN if a source value is 0.0: the reciprocal square root estimate of 0.0 is INF, and vmulq_f32 then computes INF * 0.0. In contrast, the SSE counterpart produces INF if a source value is 0.0. As a result, additional handling is needed to keep the conversion consistent with its SSE counterpart.
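The failure can be reproduced without NEON hardware. The scalar sketch below models one lane of the conversion above, using the formula (3 - a * b) / 2 for the refinement step and taking the initial estimate as a parameter. It ignores any special-case handling inside the real NEON instructions, and rsqrt_refine_guarded is a hypothetical illustration of one possible guard, not the treatment sse2neon actually applies.

```c
#include <math.h>

/* Scalar model of the refinement step: (3 - a * b) / 2. */
float rsqrt_step(float a, float b)
{
    return (3.0f - a * b) * 0.5f;
}

/* One lane of the conversion: refine `estimate` of 1 / sqrt(x). */
float rsqrt_refine(float x, float estimate)
{
    /* For x = 0.0 and estimate = INF, x * estimate is already NaN. */
    return estimate * rsqrt_step(x * estimate, estimate);
}

/* Hypothetical guard: if refinement turned a non-NaN estimate into NaN,
 * fall back to the raw estimate (INF for x = 0.0). */
float rsqrt_refine_guarded(float x, float estimate)
{
    float refined = rsqrt_refine(x, estimate);
    return (isnan(refined) && !isnan(estimate)) ? estimate : refined;
}
```

With x = 0.0f and an estimate of INFINITY, rsqrt_refine returns NaN while rsqrt_refine_guarded returns INF, matching the SSE result described above. The guard costs an extra comparison and select per call, which is the kind of overhead such treatments trade for consistency.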

Requirement

Developers are advised to utilize sse2neon.h with GCC version 10 or higher, or Clang version 11 or higher. While sse2neon.h might be compatible with earlier versions, certain vector operation errors have been identified in those versions. For further details, refer to the discussion in issue #622.

Usage

Compile-time Configurations

Although floating-point operations in NEON use the IEEE single-precision format, NEON does not fully comply with the IEEE standard when inputs or results are denormal or NaN values. This trade-off minimizes power consumption and maximizes performance. To balance correctness and performance, sse2neon recognizes the following compile-time configurations:

These configurations are turned off by default. Define the corresponding macro(s) as 1 before including sse2neon.h if you need the precise implementations.
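One way a NaN-related difference can show up is in minimum and maximum operations. As an illustration only (the text above does not spell out which intrinsics are affected), the scalar sketch below contrasts a minimum that returns its second operand whenever the comparison fails, which is how the SSE minimum behaves with NaN inputs, with a minimum that propagates NaN. Both function names are hypothetical.

```c
#include <math.h>

/* SSE-style minimum: when a < b is false, including when either
 * input is NaN, the second operand is returned. */
float min_sse_style(float a, float b)
{
    return a < b ? a : b;
}

/* NaN-propagating minimum: any NaN input yields NaN. */
float min_nan_propagating(float a, float b)
{
    if (isnan(a) || isnan(b))
        return NAN;
    return a < b ? a : b;
}
```

For the inputs (NaN, 1.0) the first function returns 1.0 and the second returns NaN. Matching the SSE result on hardware that behaves like the second function takes extra compare and select work, which is why a precise variant can be slower than the default.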

Run Built-in Test Suite

sse2neon provides a unified interface for developing test cases. These test cases are located in the tests directory, and the input data is specified at runtime. Run the test cases with the following command:

$ make check

To run the checks with specific features enabled, assign them to the FEATURE variable. If none is assigned, the command behaves the same as plain make check. The following command enables the crypto and crc features in the tests.

$ make FEATURE=crypto+crc check

To run the checks for a particular CPU, set the FPU mode, and so on, assign the desired options to the ARCH_CFLAGS variable. If none is assigned, the command behaves the same as plain make check. For instance, to run tests on Cortex-A53 with the ARM VFPv4 extension and NEON enabled:

$ make ARCH_CFLAGS="-mcpu=cortex-a53 -mfpu=neon-vfpv4" check

Running tests on hosts other than ARM platform

To run tests on a host that is not an ARM platform, specify a GNU toolchain for cross compilation with the CROSS_COMPILE variable. QEMU must be installed in advance.

For ARMv8-A running in 64-bit mode, type:

$ make CROSS_COMPILE=aarch64-linux-gnu- check # ARMv8-A

For ARMv7-A, type:

$ make CROSS_COMPILE=arm-linux-gnueabihf- check # ARMv7-A

For ARMv8-A running in 32-bit mode (A32 instruction set), type:

$ make \
  CROSS_COMPILE=arm-linux-gnueabihf- \
  ARCH_CFLAGS="-mcpu=cortex-a32 -mfpu=neon-fp-armv8" \
  check 

See Test Suite for SSE2NEON for details.

Optimization

The SSE2NEON project is designed with performance-sensitive scenarios in mind. As a result, compiler optimization levels (e.g. -O1, -O2) can lead to misbehavior under specific circumstances. For example, frequent changes to the rounding mode or repeated calls to _MM_SET_DENORMALS_ZERO_MODE() may introduce unintended behavior.

Enforcing no optimizations for specific intrinsics could solve these boundary cases but may negatively impact general performance. Therefore, we have decided to prioritize performance and shift the responsibility for handling such edge cases to developers.

It is important to be aware of these potential pitfalls when enabling optimizations and ensure that your code accounts for these scenarios if necessary.

Adoptions

Here is a partial list of open source projects that have adopted sse2neon for Arm/Aarch64 support.

Licensing

sse2neon is freely redistributable under the MIT License.