Closed wheatman closed 8 months ago
It looks like your data
vector stores a number of elements — 20 — that is not divisible by 8, the number of elements in an AVX512 register. I'm guessing that causes your AVX512 codes to read invalid memory. Although the code might get lucky most of the time in seeing only 0 values in those invalid locations, the Cilk reducer mechanism does memory allocations internally, which might result in those AVX512 operations reading non-zero values from time to time.
I need to set things up on a machine with AVX512 instructions to confirm the issue, unless you can check it first. You might try running ASan on your code to see if it detects a memory issue.
The Cilksan segfault might be related to the same issue, if my guess is correct. I'll take a look at that.
The data size should be a power of 2
std::vector<double> make_data(size_t logN) {
std::vector<double> data(1UL << logN);
for (size_t i = 0; i < (1UL << logN); i++) {
data[i] = i;
}
return data;
}
Ah, I missed that. Thanks for pointing it out.
Also, I just tried running with asan and got no error in a run with bad results
$ opencilk16/build/bin/clang++ -fopencilk -march=native -std=c++20 -fsanitize=address del.cpp
$ ./a.out
3.84307e+17
3.84307e+17
3.7586e+179
3.7586e+179
So I managed to simplify it a bit, switched from doubles to uint64_t , and got rid of a square. Now both approaches that use vectorized reducers clearly give the wrong answer and do so consistently.
Here is the updated code
#include <cstdlib>
#include <cilk/cilk.h>
#include <cilk/opadd_reducer.h>
#include <immintrin.h>
#include <iostream>
#include <vector>
void zero_vec_512(void *view) {
*((__m512i *)view) = _mm512_setzero_si512(); // Initialize sum as zero
}
void add_vec_512(void *left, void *right) {
*((__m512i *)left) =
_mm512_add_epi64(*((__m512i *)left), *((__m512i *)right));
}
void zero_vec_256(void *view) {
*((__m256i *)view) = _mm256_setzero_si256(); // Initialize sum as zero
}
void add_vec_256(void *left, void *right) {
*((__m256i *)left) =
_mm256_add_epi64(*((__m256i *)left), *((__m256i *)right));
}
std::vector<uint64_t> make_data(size_t logN) {
std::vector<uint64_t> data(1UL << logN);
for (size_t i = 0; i < (1UL << logN); i++) {
data[i] = i;
}
return data;
}
uint64_t sum_serial(const auto &data) {
uint64_t sum(0);
for (std::size_t i = 0; i < data.size(); ++i) {
sum += data[i];
}
return sum;
}
uint64_t sum_parallel(const auto &data) {
cilk::opadd_reducer<uint64_t> sum(0);
cilk_for(std::size_t i = 0; i < data.size(); ++i) { sum += data[i]; }
return sum;
}
uint64_t sum256(const auto &data) {
__m256i cilk_reducer(zero_vec_256, add_vec_256) sum_vec;
cilk_for(std::size_t i = 0; i < data.size(); i += 4) {
__m256i data_vec = _mm256_loadu_si256((__m256i *)(data.data() + i));
// data_vec = _mm256_mullo_epi64(data_vec, data_vec);
sum_vec += data_vec;
}
uint64_t sum_array[4];
_mm256_storeu_epi64(sum_array, sum_vec); // Store the results back to an array
uint64_t sum = sum_array[0] + sum_array[1] + sum_array[2] + sum_array[3];
return sum;
}
uint64_t sum512_parallel(const auto &data) {
__m512i cilk_reducer(zero_vec_512, add_vec_512) sum_vec;
cilk_for(std::size_t i = 0; i < data.size(); i += 8) {
__m512i data_vec = _mm512_loadu_si512(data.data() + i);
// data_vec = _mm512_mullo_epi64(data_vec, data_vec);
sum_vec += data_vec;
}
return _mm512_reduce_add_epi64(sum_vec);
}
uint64_t sum512_serial(const auto &data) {
__m512i sum_vec = _mm512_setzero_si512();
for (std::size_t i = 0; i < data.size(); i += 8) {
__m512i data_vec = _mm512_loadu_si512(data.data() + i);
// data_vec = _mm512_mullo_epi64(data_vec, data_vec);
sum_vec += data_vec;
}
return _mm512_reduce_add_epi64(sum_vec);
}
int main(int argc, char *argv[]) {
auto data = make_data(20);
std::cout << "data size = " << data.size() << "\n";
std::cout << "serial no vectorization: " << sum_serial(data) << "\n";
std::cout << "parallel no vectorization: " << sum_parallel(data) << "\n";
std::cout << "parallel 256: " << sum256(data) << "\n";
std::cout << "serial 512: " << sum512_serial(data) << "\n";
std::cout << "parallel 512: " << sum512_parallel(data) << "\n";
}
output
data size = 1048576
serial no vectorization: 549755289600
parallel no vectorization: 549755289600
parallel 256: 281640026961247
serial 512: 549755289600
parallel 512: 516246975623928
You need to initialize your reducer variables, just like you would initialize them in the serial projection. For example:
uint64_t sum512_parallel(const auto &data) {
- __m512i cilk_reducer(zero_vec_512, add_vec_512) sum_vec;
+ __m512i cilk_reducer(zero_vec_512, add_vec_512) sum_vec = _mm512_setzero_si512();
cilk_for(std::size_t i = 0; i < data.size(); i += 8) {
__m512i data_vec = _mm512_loadu_si512(data.data() + i);
// data_vec = _mm512_mullo_epi64(data_vec, data_vec);
sum_vec += data_vec;
}
return _mm512_reduce_add_epi64(sum_vec);
}
Ah thank you, my mistake, I thought that was handled with the identity function.
Describe the bug
Trying to use a vector reducer and get non deterministic results only with avx512 vectors, not 256 bit ones
Expected behavior
deterministic results in summing a vector
OpenCilk version
clang version 16.0.6 (https://github.com/OpenCilk/opencilk-project df80b7d1b218109ee7427c0ef498848d9a63c118)
System information
VERSION="20.04.1 LTS (Focal Fossa)" c5.xlargs from aws
Steps to reproduce (include relevant output)
code
build with
opencilk16/build/bin/clang++ -fopencilk -march=native -std=c++20 del.cpp
It should output
but sometimes either the 3rd or 4th row is equal to 3.7586e+179
I know that is one of the serial ones, but I was never able to reproduce running just the serial one, so I think its something else. When trying to run with cilksan I got segfaults