Closed: NK-Nikunj closed this issue 4 years ago
Hi NK-Nikunj,

This is the expected behavior with ARM's compiler. As explained in ARM's documentation, SVE types have no size at compile time, so you cannot use them as members of structs or classes. ARM introduced a new kind of struct, __sizeless_struct, which can have SVE types as members but which has the same limitation as SVE types: it cannot be a member of a standard struct or class. Note that when compiling for SVE with ARM's compiler you are compiling a superset of C++ which adds __sizeless_struct with its limitations. NSIMD's pack is a __sizeless_struct for SVE and therefore cannot be a member of any struct or class, and therefore cannot be used with STL containers (std::vector, std::map, std::tuple...).

The situation is different with GCC 10 (trunk at the time of writing). GCC uses another approach: you must specify the SVE size on the command line with the -msve-vector-bits switch. This gives SVE types a size and makes them behave like standard SIMD types. You can therefore use them in standard structs or classes and in any STL container, so your snippet of code should compile fine with GCC 10.
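A minimal sketch of the GCC 10 approach (this targets AArch64, so it needs an SVE-capable cross-compiler; the typedef and struct names here are made up for illustration):

```cpp
// Compile with something like (GCC >= 10):
//   g++ -march=armv8-a+sve -msve-vector-bits=512 ...
#include <arm_sve.h>

// Give the sizeless svfloat32_t a fixed 512-bit size, as described in
// the ARM C Language Extensions for SVE:
typedef svfloat32_t fixed_f32
    __attribute__((arm_sve_vector_bits(512)));

// The fixed-size type now behaves like a standard SIMD type and can be
// a struct member or a std::vector element:
struct accumulator {
  fixed_f32 sum;
};
```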
I advise you not to use SIMD vectors in this manner. You should separate concerns, namely data storage and optimizations: use std::vector<float> everywhere and use nsimd::pack's only in computation kernels.
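A minimal sketch of this separation, using raw SSE intrinsics in place of nsimd::pack (the function name add_kernel is mine; unaligned loads are used because std::vector makes no 16-byte alignment guarantee):

```cpp
#include <immintrin.h>
#include <vector>

// Storage stays in std::vector<float>; the SIMD type only appears
// inside the kernel. Raw pointers are hoisted out of the loop.
void add_kernel(std::vector<float> &dst,
                std::vector<float> const &a,
                std::vector<float> const &b) {
  float *pdst = dst.data();
  const float *pa = a.data();
  const float *pb = b.data();
  int n = (int)dst.size();
  int i = 0;
  for (; i + 4 <= n; i += 4) {
    _mm_storeu_ps(pdst + i, _mm_add_ps(_mm_loadu_ps(pa + i),
                                       _mm_loadu_ps(pb + i)));
  }
  for (; i < n; i++) { // scalar tail for sizes not divisible by 4
    pdst[i] = pa[i] + pb[i];
  }
}
```

Callers never see a SIMD type; only the kernel does.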
@gquintin will it not be inefficient if I start loading std::vector<float> to nsimd::pack<float> in my computational kernel?
The loading of data is necessary in both situations. For any T, an std::vector<T> lives in RAM; there is no exception when T == __m128 or T == __m256. The loading of data simply takes place through the overloaded operator[] of std::vector.
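A small sketch to convince yourself of this (the helper element_stride is hypothetical): consecutive elements of a std::vector<T> sit contiguously in RAM whether T is float or __m128.

```cpp
#include <immintrin.h>
#include <cstddef>
#include <vector>

// Byte distance between two consecutive elements of a std::vector<T>:
// operator[] only computes an address into one contiguous heap buffer,
// and the actual load happens when the element is read.
template <typename T>
std::ptrdiff_t element_stride(std::vector<T> const &v) {
  return (char const *)&v[1] - (char const *)&v[0];
}
```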
#include <immintrin.h>
#include <vector>

void add_m128(std::vector<__m128> &dst,
              std::vector<__m128> const &a,
              std::vector<__m128> const &b,
              int n) {
  for (int i = 0; i < n; i += 4) {
    dst[i] = a[i] + b[i];
  }
}

void add_float(std::vector<float> &dst,
               std::vector<float> const &a,
               std::vector<float> const &b, int n) {
  float *pdst = dst.data();
  const float *pa = a.data();
  const float *pb = b.data();
  for (int i = 0; i < n; i += 4) {
    _mm_store_ps(&pdst[i], _mm_load_ps(&pa[i]) +
                           _mm_load_ps(&pb[i]));
  }
}
The assembly for both loops is exactly the same. For add_m128:
.L3:
movaps xmm0, XMMWORD PTR [r9+rax]
addps xmm0, XMMWORD PTR [r8+rax]
movaps XMMWORD PTR [rsi+rax], xmm0
add rax, 64
cmp rdx, rax
jne .L3
For add_float:
.L8:
movaps xmm0, XMMWORD PTR [rsi+rax*4]
addps xmm0, XMMWORD PTR [rdi+rax*4]
movaps XMMWORD PTR [rdx+rax*4], xmm0
add rax, 4
cmp ecx, eax
jg .L8
In both cases you will have two loads and one store from/to memory. Moreover, as you can see in my code, I never use C++ constructs in performance-critical code, as the compiler does not know how to properly optimize it. As an example, consider the following:
void add_float_cxx(std::vector<float> &dst,
                   std::vector<float> const &a,
                   std::vector<float> const &b, int n) {
  for (int i = 0; i < n; i += 4) {
    _mm_store_ps(&dst[i], _mm_load_ps(&a[i]) +
                          _mm_load_ps(&b[i]));
  }
}
which produces the following assembly:
.L12:
mov r10, QWORD PTR [rsi]
mov r9, QWORD PTR [rdx]
mov r8, QWORD PTR [rdi]
movaps xmm0, XMMWORD PTR [r10+rax]
addps xmm0, XMMWORD PTR [r9+rax]
movaps XMMWORD PTR [r8+rax], xmm0
add rax, 16
cmp rcx, rax
jne .L12
which is way worse than the loops above: the compiler reloads the three data pointers on every iteration, since the store through dst[i] might alias them. So, as a conclusion: in performance-critical kernels, do not write C++, write only C.
Original issue: while trying the following code, I ran into a compilation error. How do I work with SVE vector packs?