OpenCilk / opencilk-project

Monorepo for the OpenCilk compiler. Forked from llvm/llvm-project and based on Tapir/LLVM.

performance overhead of cilk_spawn vs parlay::par_do #135

Open wheatman opened 2 years ago

wheatman commented 2 years ago

Describe the bug

I noticed I was getting slower-than-expected performance in a fairly simple parallel sum, so I compared it against equivalent code using parlaylib's scheduler. I made a minimal example which shows that, for a fairly simple parallel sum, the parlaylib version is about 5x faster for small cases. The difference decreases as the total size increases.

For a vector of length 2^20, Cilk took a median of 600 microseconds to sum up the elements, while parlaylib only took about 120 microseconds.

By length 2^25, parlaylib takes about 2,000 microseconds vs Cilk's 3,500.

By 2^30, parlaylib is taking about 78,000 microseconds vs Cilk's 80,000.

Expected behavior

I would expect fairly similar performance in what should be a compute- or memory-bound program.

OpenCilk version

clang version 14.0.6 (https://github.com/OpenCilk/opencilk-project fc90ded2b090672f84c58d12d8d85cd999eb6c1a)

System information

NAME="Ubuntu" VERSION="20.04.1 LTS (Focal Fossa)"

2 Intel(R) Xeon(R) Platinum 8375C CPU @ 2.90GHz

Steps to reproduce (include relevant output)

The code I used is below; you will need to clone the parlaylib repo and put it in the same directory.

Compiled with: clang++ -Wall -Wextra -Ofast -std=c++20 -Wno-deprecated-declarations -g -gdwarf-4 -march=native -fopencilk -lpthread -o run main.cpp


#include <algorithm>
#include <cstdint>
#include <cstdio>
#include <cstdlib> // for std::strtol

#include <string>
#include <sys/time.h>
#include <vector>

#include "parlaylib/include/parlay/parallel.h"

#include <cilk/cilk.h>

// Wall-clock time in microseconds (via gettimeofday).
static inline uint64_t get_usecs() {
  struct timeval st {};
  gettimeofday(&st, nullptr);
  return st.tv_sec * 1000000 + st.tv_usec;
}

// NB: despite the name, this fills the vector with sequential values rather than random data.
template <class T> std::vector<T> create_random_data(size_t n) {
  std::vector<T> v(n);
  cilk_for(uint64_t i = 0; i < n; i++) { v[i] = i; }
  return v;
}

template <class T>
T sum_serial(const std::vector<T> &data, size_t start, size_t end) {
  T total = 0;
  for (uint64_t i = start; i < end; i++) {
    total += data[i];
  }
  return total;
}

template <class T>
T sum_cilk(const std::vector<T> &data, size_t start, size_t end) {
  if (end - start < 100) {
    return sum_serial(data, start, end);
  }
  T sum_a;
  cilk_spawn sum_a = sum_cilk(data, start, start + (end - start) / 2);
  T sum_b = sum_cilk(data, start + (end - start) / 2, end);
  cilk_sync;
  return sum_a + sum_b;
}

template <class T>
T sum_parlaylib(const std::vector<T> &data, size_t start, size_t end) {
  if (end - start < 100) {
    return sum_serial(data, start, end);
  }
  T sum_a;
  T sum_b;
  parlay::par_do(
      [&]() { sum_a = sum_parlaylib(data, start, start + (end - start) / 2); },
      [&]() { sum_b = sum_parlaylib(data, start + (end - start) / 2, end); });
  return sum_a + sum_b;
}

int main(int argc, char *argv[]) {
  if (argc != 3) {
    printf(
        "call with the log_2 of the size to test, and the scheduler to use\n");
    return 1;
  }
  size_t power_of_2 = std::strtol(argv[1], nullptr, 10);
  auto data = create_random_data<uint64_t>(1UL << power_of_2);

  std::vector<uint64_t> times;
  uint64_t iters = 11;
  for (uint64_t i = 0; i < iters; i++) {
    if (std::string("cilk") == argv[2]) {
      uint64_t start = get_usecs();
      uint64_t total = sum_cilk(data, 0, data.size());
      uint64_t end = get_usecs();
      times.push_back(end - start);
      printf("the total was %lu, took %lu\n", total, end - start);
    }
    if (std::string("parlay") == argv[2]) {
      uint64_t start = get_usecs();
      uint64_t total = sum_parlaylib(data, 0, data.size());
      uint64_t end = get_usecs();
      times.push_back(end - start);
      printf("the total was %lu, took %lu\n", total, end - start);
    }
  }
  std::sort(times.begin(), times.end());
  printf("the median time was %lu\n", times[iters / 2]);
}
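
For reference, an invocation looks like this (the binary name run comes from the compile line above; the first argument is the log_2 of the vector length and the second selects the scheduler):

./run 20 cilk
./run 20 parlay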
neboat commented 2 years ago

I'm starting to look into this issue. But at the moment I've only got access to an ARM64 system, and the Parlay scheduler doesn't work correctly there. (Oftentimes it will hang, sometimes it returns the wrong sum, and at one point it crashed with a failed assertion.) I'll take another look once I get access to an Intel box.

VoxSciurorum commented 2 years ago

I confirmed the behavior on x86. Time with log size 20 is greater with the Cilk version and is quite variable. Also, using cilk_for instead of recursive spawning makes the performance on small sizes much worse.

On my FreeBSD ARM system the parlay version tends to hang.

neboat commented 2 years ago

I managed to run some tests on an Intel machine. I'm still trying to diagnose the issue, but here are some more notes on what I've found so far:

@VoxSciurorum For the cilk_for test you ran, how were you computing the sum? Were you using a reducer?

wheatman commented 2 years ago

So I tried to generate a bunch of data to see which cases matter.

I added two more versions to the above code: one using a cilk_for with a new reducer, and another serial version.

Those two versions are below.

I first tried generating data for all sizes between 20 and 30 for 101 trials and was getting a much smaller difference than I was seeing before, so I hypothesized that there was some sort of start-up cost. So I also added the data for 11 trials.

The data can be found here https://docs.google.com/spreadsheets/d/1F5V629TlNribEs7ipiBOCNFu-eek0fn9oyb3M9FTO74/edit?usp=sharing

Each sheet says what configuration it was run in and how many trials were run.

On each sheet I have the time in microseconds for each size for all 4 different sums, and I present the 10th, 50th, and 90th percentiles.

I also show the speedup each one got over the serial code, as well as the slowdown relative to the fastest version.

What I see from the data is that when running only 11 trials, parlay seems to have the most wins in any configuration with NUMA or hyperthreading.

This trend is much less pronounced when running 101 trials, but the standard configuration on all the cores is still the worst for Cilk, and no NUMA/no hyperthreads is the best.

Let me know if anything about my experiment doesn't make sense, or if it would be helpful for me to run anything else.

// Identity and reduction callbacks for the reducer below.
void zero(void *v) { *(uint64_t *)v = 0; }

void plus(void *l, void *r) { *(uint64_t *)l += *(uint64_t *)r; }

uint64_t sum_cilk_for(const std::vector<uint64_t> &data) {
  uint64_t cilk_reducer(zero, plus) sum = 0;
  cilk_for(uint64_t i = 0; i < data.size(); ++i) sum += data[i];
  return sum;
}

uint64_t sum_serial(const std::vector<uint64_t> &data) {
  uint64_t sum = 0;
  for (uint64_t i = 0; i < data.size(); ++i)
    sum += data[i];
  return sum;
}
VoxSciurorum commented 1 year ago

In general, Cilk spawning has less overhead than parlaylib task creation, but Cilk scheduling has higher overhead than parlaylib scheduling. This puts Cilk at a disadvantage when very small amounts of work are stolen.

We would normally recommend the cilk_for-with-reducer approach you tried. The compiler tries to pick a good threshold at which to switch to serial code instead of divide-and-conquer; this is what you did manually with your 100-iteration threshold. As you noticed, the performance of the reducer version is unimpressive. This is because of an unrelated problem: the loop vectorizer does not work well with reducers, so the inner loop uses scalar instructions where the serial version uses SIMD. I will open a separate issue to track this problem.
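
For illustration, one way to sidestep the vectorization problem until that issue is fixed is to strip-mine the loop: the serial inner loop can then use SIMD, and the reducer is updated only once per block. This is a sketch, assuming it is compiled alongside the code above (it reuses the zero and plus callbacks); the block size of 2048 is an arbitrary illustrative choice, not a tuned value.

// Sketch: strip-mined cilk_for sum. The plain local accumulator lets the
// inner loop vectorize; the reducer sees one update per block.
uint64_t sum_cilk_for_blocked(const std::vector<uint64_t> &data) {
  uint64_t cilk_reducer(zero, plus) sum = 0;
  constexpr size_t block = 2048; // arbitrary illustrative grain size
  cilk_for(size_t b = 0; b < data.size(); b += block) {
    size_t end = std::min(b + block, data.size());
    uint64_t partial = 0; // plain local; eligible for SIMD
    for (size_t i = b; i < end; ++i)
      partial += data[i];
    sum += partial; // one reducer update per block
  }
  return sum;
}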