agenium-scale / nsimd

Agenium Scale vectorization library for CPUs and GPUs
MIT License
325 stars 28 forks source link

Usage of SVE pack with vector #66

Closed NK-Nikunj closed 4 years ago

NK-Nikunj commented 4 years ago

While trying the following code:

#include <nsimd/nsimd-all.hpp>
#include <vector>

int main()
{
    std::vector<nsimd::pack<float> > vect;

    return 0;
}

I run into a compilation error:

gupta2@juawei-a27:~/codes/test_codes$ armclang++ -DNSIMD_SVE -march=armv8-a+sve -ftree-vectorize -I$HOME/install/arm/nsimd/include -L/$HOME/install/arm/nsimd/lib sve.cpp
In file included from sve.cpp:2:
In file included from /opt/ohpc/pub/ARM/opt/arm/gcc-8.2.0_Generic-AArch64_RHEL-7_aarch64-linux/lib/gcc/aarch64-linux-gnu/8.2.0/../../../../include/c++/8.2.0/vector:64:
/opt/ohpc/pub/ARM/opt/arm/gcc-8.2.0_Generic-AArch64_RHEL-7_aarch64-linux/lib/gcc/aarch64-linux-gnu/8.2.0/../../../../include/c++/8.2.0/bits/stl_vector.h:286:35: error: arithmetic on a pointer to an incomplete type
      'nsimd::pack<float, 1, nsimd::sve>'
                      _M_impl._M_end_of_storage - _M_impl._M_start);
                      ~~~~~~~~~~~~~~~~~~~~~~~~~ ^
/opt/ohpc/pub/ARM/opt/arm/gcc-8.2.0_Generic-AArch64_RHEL-7_aarch64-linux/lib/gcc/aarch64-linux-gnu/8.2.0/../../../../include/c++/8.2.0/bits/stl_vector.h:391:7: note: in instantiation of member function 'std::_Vector_base<nsimd::pack<float,
      1, nsimd::sve>, std::allocator<nsimd::pack<float, 1, nsimd::sve> > >::~_Vector_base' requested here
      vector()
      ^
sve.cpp:6:38: note: in instantiation of member function 'std::vector<nsimd::pack<float, 1, nsimd::sve>, std::allocator<nsimd::pack<float, 1, nsimd::sve> > >::vector' requested here
    std::vector<nsimd::pack<float> > vect;
                                     ^
In file included from sve.cpp:2:
In file included from /opt/ohpc/pub/ARM/opt/arm/gcc-8.2.0_Generic-AArch64_RHEL-7_aarch64-linux/lib/gcc/aarch64-linux-gnu/8.2.0/../../../../include/c++/8.2.0/vector:62:
/opt/ohpc/pub/ARM/opt/arm/gcc-8.2.0_Generic-AArch64_RHEL-7_aarch64-linux/lib/gcc/aarch64-linux-gnu/8.2.0/../../../../include/c++/8.2.0/bits/stl_construct.h:136:25: error: incomplete type '_Value_type'
      (aka 'nsimd::pack<float, 1, nsimd::sve>') used in type trait expression
      std::_Destroy_aux<__has_trivial_destructor(_Value_type)>::
                        ^
/opt/ohpc/pub/ARM/opt/arm/gcc-8.2.0_Generic-AArch64_RHEL-7_aarch64-linux/lib/gcc/aarch64-linux-gnu/8.2.0/../../../../include/c++/8.2.0/bits/stl_construct.h:206:7: note: in instantiation of function template specialization
      'std::_Destroy<nsimd::pack<float, 1, nsimd::sve> *>' requested here
      _Destroy(__first, __last);
      ^
/opt/ohpc/pub/ARM/opt/arm/gcc-8.2.0_Generic-AArch64_RHEL-7_aarch64-linux/lib/gcc/aarch64-linux-gnu/8.2.0/../../../../include/c++/8.2.0/bits/stl_vector.h:567:7: note: in instantiation of function template specialization
      'std::_Destroy<nsimd::pack<float, 1, nsimd::sve> *, nsimd::pack<float, 1, nsimd::sve> >' requested here
        std::_Destroy(this->_M_impl._M_start, this->_M_impl._M_finish,
             ^
sve.cpp:6:38: note: in instantiation of member function 'std::vector<nsimd::pack<float, 1, nsimd::sve>, std::allocator<nsimd::pack<float, 1, nsimd::sve> > >::~vector' requested here
    std::vector<nsimd::pack<float> > vect;
                                     ^
2 errors generated.

How do I work with SVE vector packs?

gquintin commented 4 years ago

Hi NK-Nikunj,

This is the expected behavior with ARM's compiler. As explained in ARM's documentation, SVE types have no size at compile time. Therefore you cannot have such types as a member of structs or classes. ARM introduced a new kind of struct called __sizeless_struct which is able to have SVE types as members but which has the same limitations as SVE types: they cannot be members of standard structs or classes. Note that when compiling for SVE with ARM's compiler you are compiling a superset of C++ which adds __sizeless_struct with its limitations. NSIMD's pack is a __sizeless_struct for SVE and therefore cannot be member of any struct or class and therefore cannot be used with STL containers (std::vector, std::map, std::tuple...).

The situation is different with GCC 10 (trunk when writing this). GCC uses another approach: it is mandatory to specifiy on command line SVE size with the -msve-vector-bits switch. This makes SVE type have a size and behave like standard SIMD types. You can therefore use them in standard structs or classes and in any STL container. Thus your snippet of code should compile fine with GCC 10.

I advice you not to use SIMD vectors in this manner for several reasons:

NK-Nikunj commented 4 years ago

You should seperate concerns, namely data storage and optimizations: you should use a std::vector everywhere and use nsimd::pack's only in computations kernels.

@gquintin will it not be inefficient if I start loading std::vector<float> to nsimd::pack<float> in my computational kernel?

gquintin commented 4 years ago

The loading of data is necessary in both situations. For any T an std::vector<T> lives in RAM, there is no exception if T == __m128 or T == __m256. The loading of data takes place in the overloading of operator[] of std::vector.

#include <immintrin.h>
#include <vector>

void add_m128(std::vector<__m128> &dst,
              std::vector<__m128> const &a,
              std::vector<__m128> const &b,
              int n) {
    for (int i = 0; i < n; i += 4) {
        dst[i] = a[i] + b[i];
    }
}

void add_float(std::vector<float> &dst,
               std::vector<float> const &a,
               std::vector<float> const &b, int n) {
    float *pdst = dst.data();
    const float *pa = a.data();
    const float *pb = b.data();
    for (int i = 0; i < n; i += 4) {
        _mm_store_ps(&pdst[i], _mm_load_ps(&pa[i]) +
                               _mm_load_ps(&pb[i]));
    }
}

The assembly for both loops is exactly the same. For add_m128:

.L3:
        movaps  xmm0, XMMWORD PTR [r9+rax]
        addps   xmm0, XMMWORD PTR [r8+rax]
        movaps  XMMWORD PTR [rsi+rax], xmm0
        add     rax, 64
        cmp     rdx, rax
        jne     .L3

For add_float:

.L8:
        movaps  xmm0, XMMWORD PTR [rsi+rax*4]
        addps   xmm0, XMMWORD PTR [rdi+rax*4]
        movaps  XMMWORD PTR [rdx+rax*4], xmm0
        add     rax, 4
        cmp     ecx, eax
        jg      .L8

In both cases you will have two loads and on stores from/to memory. Moreover as you can see in my code I never use C++ constructs in performance critical code as the compiler does not knwo how to properly optimize code. As an exemple consider the following:

void add_float_cxx(std::vector<float> &dst,
                   std::vector<float> const &a,
                   std::vector<float> const &b, int n) {
    for (int i = 0; i < n; i += 4) {
        _mm_store_ps(&dst[i], _mm_load_ps(&a[i]) +
                              _mm_load_ps(&b[i]));
    }
}

which produces the following assembly:

.L12:
        mov     r10, QWORD PTR [rsi]
        mov     r9, QWORD PTR [rdx]
        mov     r8, QWORD PTR [rdi]
        movaps  xmm0, XMMWORD PTR [r10+rax]
        addps   xmm0, XMMWORD PTR [r9+rax]
        movaps  XMMWORD PTR [r8+rax], xmm0
        add     rax, 16
        cmp     rcx, rax
        jne     .L12

which is way worse than the ones above. So as a conclusion: in performance critical kernels do not write C++ write only C.