Open wheatman opened 2 years ago
I'm starting to look into this issue, but at the moment I only have access to an ARM64 system, and the Parlay scheduler doesn't work correctly there. (Often it hangs, sometimes it returns the wrong sum, and at one point it crashed with a failed assertion.) I'll take another look once I get access to an Intel box.
I confirmed the behavior on x86. The time at log size 20 is greater with the Cilk version and is quite variable. Also, using cilk_for instead of a recursive spawn makes the performance on small sizes much worse.
On my FreeBSD ARM system the parlay version tends to hang.
I managed to run some tests on an Intel machine. I'm still trying to diagnose the issue, but here are some more notes of things I've found so far:
@VoxSciurorum For the cilk_for test you ran, how were you computing the sum? Were you using a reducer?
So I tried to generate a bunch of data to see which cases matter.
I added two more versions to the above code: one using a cilk_for with a new reducer, and another serial version. Those two versions are below.
I first tried generating data for all log sizes between 20 and 30 for 101 trials and was seeing a much smaller difference than before, so I hypothesized that there was some sort of start-up cost. So I also added the data for 11 trials.
The data can be found here https://docs.google.com/spreadsheets/d/1F5V629TlNribEs7ipiBOCNFu-eek0fn9oyb3M9FTO74/edit?usp=sharing
Each sheet says what configuration it was run in and how many trials were run.
On each sheet I have the time in microseconds for each size for all four different sums, and I present the 10th, 50th, and 90th percentiles.
I also show the speedup each one got over the serial code, as well as the slowdown relative to the fastest version.
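The percentile reporting described above could be produced with a harness along these lines (a sketch only; the original timing harness is not shown in the thread, and the function name and percentile choices here are mine):

```cpp
#include <algorithm>
#include <chrono>
#include <cstdint>
#include <functional>
#include <vector>

// Run `f` for `trials` iterations and return the {10th, 50th, 90th}
// percentile runtimes in microseconds.
std::vector<double> time_percentiles(const std::function<void()> &f, int trials) {
  std::vector<double> us;
  us.reserve(trials);
  for (int t = 0; t < trials; ++t) {
    auto start = std::chrono::steady_clock::now();
    f();
    auto stop = std::chrono::steady_clock::now();
    us.push_back(std::chrono::duration<double, std::micro>(stop - start).count());
  }
  std::sort(us.begin(), us.end());
  // Nearest-rank percentile over the sorted samples.
  auto pct = [&](double p) { return us[(size_t)(p * (trials - 1))]; };
  return {pct(0.10), pct(0.50), pct(0.90)};
}
```

Running each size for 101 trials and reporting percentiles rather than the mean, as done in the spreadsheet, keeps one slow outlier (e.g. a first-run warm-up) from dominating the result.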
What I see from the data is that when running only 11 trials, parlay seems to have the most wins in any configuration with NUMA or hyperthreading.
This trend is much less pronounced when running 101 trials, but the standard configuration on all the cores is still the worst for Cilk, and no NUMA, no hyperthreads is the best.
Let me know if anything about my experiment doesn't make sense, or if it would be helpful for me to run anything else.
#include <cilk/cilk.h>
#include <cstdint>
#include <vector>

// Identity and reduce callbacks for the uint64_t sum reducer.
void zero(void *v) { *(uint64_t *)v = 0; }
void plus(void *l, void *r) { *(uint64_t *)l += *(uint64_t *)r; }

// Parallel sum using cilk_for with a reducer hyperobject.
uint64_t sum_cilk_for(const std::vector<uint64_t> &data) {
  uint64_t cilk_reducer(zero, plus) sum = 0;
  cilk_for (uint64_t i = 0; i < data.size(); ++i)
    sum += data[i];
  return sum;
}

// Baseline serial sum.
uint64_t sum_serial(const std::vector<uint64_t> &data) {
  uint64_t sum = 0;
  for (uint64_t i = 0; i < data.size(); ++i)
    sum += data[i];
  return sum;
}
In general, Cilk spawning has less overhead than parlaylib task creation but Cilk scheduling has higher overhead than parlaylib scheduling. This puts Cilk at a disadvantage when very small amounts of work are stolen.
We would normally recommend the cilk_for with reducer approach you tried. The compiler tries to pick a good threshold at which to switch to serial code instead of divide-and-conquer; this is what you did manually with your 100 iteration threshold. As you noticed, the performance of the reducer version is unimpressive. This is because of an unrelated problem: the loop vectorizer does not work well with reducers, and the inner loop uses scalar instructions where the serial version uses SIMD. I will open a separate issue to track this problem.
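One possible way to sidestep the vectorizer limitation in the meantime (a sketch of mine, not something proposed in the thread: the function names and the 4096 block size are illustrative, and the __cilk guard for serial elision is an assumption) is to accumulate each block in a plain local variable, which the vectorizer handles normally, and touch the reducer only once per block:

```cpp
#include <cstdint>
#include <vector>

// Serial elision so the sketch also builds without -fopencilk.
#if defined(__cilk)
#include <cilk/cilk.h>
#else
#define cilk_for for
#define cilk_reducer(...)
#endif

void zero_u64(void *v) { *(uint64_t *)v = 0; }
void plus_u64(void *l, void *r) { *(uint64_t *)l += *(uint64_t *)r; }

// Blocked reduction: the inner serial loop never touches the reducer,
// so it remains eligible for SIMD; the reducer view is updated once
// per block instead of once per element.
uint64_t sum_blocked(const std::vector<uint64_t> &data) {
  const uint64_t n = data.size();
  const uint64_t block = 4096;  // illustrative grain size
  uint64_t cilk_reducer(zero_u64, plus_u64) sum = 0;
  cilk_for (uint64_t b = 0; b < n; b += block) {
    uint64_t end = b + block < n ? b + block : n;
    uint64_t local = 0;
    for (uint64_t i = b; i < end; ++i)
      local += data[i];
    sum += local;
  }
  return sum;
}
```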
Describe the bug
I noticed I was getting slower than expected performance in a fairly simple parallel sum and compared it to equivalent code using parlaylib's scheduler. I made a minimal example which shows that, for a fairly simple parallel sum, the parlaylib version was about 5x faster for small cases. The difference decreases as the total size increases.
For a vector of length 2^20, Cilk took a median of 600 microseconds to sum up the elements, while parlaylib only took about 120 microseconds.
By length 2^25, parlaylib takes about 2,000 microseconds vs. Cilk's 3,500.
By 2^30, parlaylib is taking about 78,000 microseconds vs. Cilk's 80,000 microseconds.
Expected behavior
I would expect fairly similar performance in what I expected to be a fairly compute- or memory-bound program.
OpenCilk version
clang version 14.0.6 (https://github.com/OpenCilk/opencilk-project fc90ded2b090672f84c58d12d8d85cd999eb6c1a)
System information
NAME="Ubuntu" VERSION="20.04.1 LTS (Focal Fossa)"
2 Intel(R) Xeon(R) Platinum 8375C CPU @ 2.90GHz
Steps to reproduce (include relevant output)
The code I used is below; you will need to clone the parlaylib repo and put it in the same directory.
compiled with
clang++ -Wall -Wextra -Ofast -std=c++20 -Wno-deprecated-declarations -g -gdwarf-4 -march=native -fopencilk -lpthread -o run main.cpp