agenium-scale / nsimd

Agenium Scale vectorization library for CPUs and GPUs
MIT License

How to use SSE2 128-bit pack in my project? #75

Closed luoming17 closed 3 years ago

luoming17 commented 3 years ago

Hey! I'm using this repository to optimize my code, but I don't know how to use 128-bit registers in my program. I'm compiling and running on Linux, and my CPU is an "Intel(R) Xeon(R) Gold 6161 CPU @ 2.20GHz", which I'm sure supports SSE2. Here are the steps I took.

First I run this command to generate the source files.

python3 egg/hatch.py -Af

My CMakeLists.txt looks like this.

set(NSIMD_INCLUDE_DIRS ${CMAKE_CURRENT_SOURCE_DIR}/nsimd/include)
ExternalProject_Add(nsimd
    SOURCE_DIR "${CMAKE_CURRENT_SOURCE_DIR}/nsimd"
    BINARY_DIR "${CMAKE_BINARY_DIR}/third_party/nsimd"
    CMAKE_CACHE_ARGS "-DCMAKE_POSITION_INDEPENDENT_CODE:BOOL=true"
    CMAKE_ARGS "-DCMAKE_INSTALL_PREFIX=${CMAKE_BINARY_DIR}/External/ -DSIMD=SSE2 -DSIMD_OPTIONALS=FMA"
)

My project code is like this.

#include <nsimd/nsimd-all.hpp>

...some code...

using BaseType = int8_t;
using PackType = nsimd::pack<BaseType>;
uint64_t packLen = nsimd::len(PackType());
std::cout << packLen << "\n";

The output is "8". Does this mean the pack only supports 64 bits of data? How can I get a pack that supports 128 bits or more? Thank you.

gquintin commented 3 years ago

Hi luoming17,

For performance reasons, a lot of NSIMD code is not compiled into the .so file. So when you compile your own code with NSIMD you must specify which SIMD extension to use. The CMake SIMD variable is only used when testing the library itself and does not affect the compilation of your own code.

Moreover, because of #26, NSIMD does not try to guess the SIMD extension on its own. So in the CMake file responsible for compiling your own code you should add something like add_compile_options(-DSSE2 -DFMA -msse2 -mfma).
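For example, in the CMakeLists.txt of your own project, something along these lines (a sketch only; adapt the flags to the extensions your CPU actually supports):

```cmake
# The -DSSE2/-DFMA definitions make NSIMD's headers select the SSE2/FMA
# code paths; -msse2/-mfma let the compiler emit those instructions.
add_compile_options(-DSSE2 -DFMA -msse2 -mfma)
include_directories(${NSIMD_INCLUDE_DIRS})
```

With these flags, nsimd::pack<int8_t> should map to a full 128-bit SSE2 register.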

luoming17 commented 3 years ago

Hi gquintin,

Thank you for your answer. I added this compile option and it works well. But when testing the load2a function, I found that it behaves differently for int8_t and int16_t.

I have some aligned raw data generated by this code.

for (int i = 0; i < bufLen; i++) {
    buffer[i] = i % 100;
}

My test code is as follows.

template<typename BaseType>
static void testFn(char* originBuf, char* destBuf) {
    using PackType = nsimd::packx2<BaseType>;
    auto pack = nsimd::load2a<PackType>(reinterpret_cast<BaseType*>(originBuf));
    nsimd::storea(reinterpret_cast<BaseType*>(destBuf), pack.v0);
}

int main(void) {
    // some code initializing buffer and destBuf
    testFn<int8_t>(buffer, destBuf);
    dump_int8(destBuf); // 0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28... 
    testFn<int16_t>(buffer, destBuf);
    dump_int8(destBuf); // 0, 1, 4, 5, 8, 9, 12, 13, 32, 33, 36, 37, 40, 41, 44, 45, 16, 17, 20, 21, 24, 25, 28, 29,
                        // 48, 49, 52, 53, 56, 57, 60, 61, 64, 65, 68, 69, 72, 73, 76, 77, 96, 97...
}

The result for int8_t met my expectation, but the one for int16_t didn't.

// my expectation
// 0, 1, 4, 5, 8, 9, 12, 13, 16, 17, 20, 21, 24, 25, 28, 29, 32, 33, 36, 37, 40, 41, 44, 45, 48, 49,
// 52, 53, 56, 57, 60, 61, 64, 65, 68, 69, 72, 73, 76, 77, 80, 81...

I think load3a and load4a behave the same way. Does nsimd have any API that can make the results meet my expectations?