MPLLang / mpl

The MaPLe compiler for efficient and scalable parallel functional programming

Clear candidates (suspects) in parallel: entanglement management perf improvement (and other fixes) #168

Closed: shwestrick closed this 1 year ago

shwestrick commented 1 year ago

Integrate with the scheduler to clear suspects in parallel. This patch subsumes #167 but I will keep the other open for now, for potential discussion.

Performance improvements. Big! On 72 procs, ms-queue is now at ~30x speedup over sequential (old: 19x), and linden-pq is now at ~36x speedup over sequential (old: ~30x)

Algorithm description. The algorithm in this patch is straightforward. Clearing suspects is a bucketed filter: some elements are eliminated, and some survive; each of the survivors is moved into one of $D$ buckets (where $D$ is the depth of the heap whose suspects are being cleared). To parallelize this, we (1) break up the input, (2) run multiple sequential filters in parallel, and then finally (3) merge the results.
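To make the three phases concrete, here is a minimal sketch (illustrative SML using MPL's `ForkJoin`, not the runtime's actual C code) under simplifying assumptions: suspects live in a plain array, a hypothetical `keep` predicate decides survival, `bucket` gives each survivor's depth, and `numSegments` controls the fan-out.

```sml
fun bucketedFilter (d: int)               (* number of buckets (heap depth) *)
                   (keep: int -> bool)    (* does this suspect survive? *)
                   (bucket: int -> int)   (* survivor's bucket, in [0, d) *)
                   (numSegments: int)
                   (input: int array) : int list array =
  let
    val n = Array.length input
    (* Phase 1: split the input into numSegments segments (implicit here,
       via index arithmetic; the real code first converts the chunk list
       into an array). *)
    val segSize = Int.max (1, (n + numSegments - 1) div numSegments)
    (* Phase 2: one sequential filter per segment, run in parallel; each
       segment fills its own private array of d buckets. *)
    val locals =
      Array.tabulate (numSegments, fn _ => Array.array (d, []: int list))
    fun doSegment s =
      let
        val buckets = Array.sub (locals, s)
        val lo = s * segSize
        val hi = Int.min (n, lo + segSize)
        fun step i =
          let val x = Array.sub (input, i)
          in
            if keep x then
              let val b = bucket x
              in Array.update (buckets, b, x :: Array.sub (buckets, b)) end
            else ()
          end
        fun loop i = if i >= hi then () else (step i; loop (i + 1))
      in
        loop lo
      end
    val () = ForkJoin.parfor 1 (0, numSegments) doSegment
  in
    (* Phase 3: sequentially merge the per-segment buckets. *)
    Array.tabulate (d, fn b =>
      List.concat (List.tabulate (numSegments, fn s =>
        Array.sub (Array.sub (locals, s), b))))
  end
```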

TODO: algorithmic improvements? This algorithm has three phases. The middle phase is highly parallel, but both the first and last phases are sequential. In the first phase, the algorithm converts the input list (of chunks) into an array (of chunks). In the last phase, it sequentially merges bucketed outputs. Probably neither of these is a performance bottleneck at the moment. But in the future, there are opportunities for more parallelism. To parallelize the first phase, we could maintain all suspect sets as balanced trees, to enable fast balanced splitting. To parallelize the last phase, we could merge results with a D&C reduction.
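For the last phase, a hedged sketch of what the proposed D&C merge could look like, reusing the per-segment bucket arrays from the sketch above (here plain list append `@` stands in for the runtime's O(1) chunk concatenation; assumes at least one segment):

```sml
(* Divide-and-conquer reduction over the per-segment bucket arrays,
   replacing the sequential merge of phase 3 with a balanced tree of
   pairwise combines. *)
fun dcMerge (locals: int list array array) : int list array =
  let
    val d = Array.length (Array.sub (locals, 0))
    (* combine two bucket arrays, bucket by bucket *)
    fun combine (a, b) =
      Array.tabulate (d, fn i => Array.sub (a, i) @ Array.sub (b, i))
    fun reduce (lo, hi) =
      if hi - lo <= 1 then
        Array.sub (locals, lo)
      else
        let
          val mid = lo + (hi - lo) div 2
          val (l, r) =
            ForkJoin.par (fn () => reduce (lo, mid),
                          fn () => reduce (mid, hi))
        in
          combine (l, r)
        end
  in
    reduce (0, Array.length locals)
  end
```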

TODO: implementation optimizations? This particular implementation could be further optimized in a few ways.

  1. Don't rely on malloc/free for intermediate data. Use the block allocator directly instead, to avoid the malloc/free bottleneck.
  2. Free old chunks in parallel. Currently we free these chunks sequentially in ES_deleteClearSet. But we could instead free them in ES_processClearSetGrain, by eagerly freeing each processed chunk we see along the way.
umutacar commented 1 year ago

why are we using a list of chunks? why not an array of chunks (possibly with doubling to grow)?

shwestrick commented 1 year ago

Originally we chose list-of-chunks for simplicity. O(1) for both insertion and concatenation is really nice. We use both operations extensively... I don't think it would be easy to rework things to avoid concatenation.
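For illustration, a minimal sketch of such a list-of-chunks (the runtime's real version is C; `chunkCapacity` and the option-array representation here are just for the sketch): fixed-capacity chunks linked head-to-tail, giving O(1) push and O(1) destructive concatenation.

```sml
structure ChunkList =
struct
  val chunkCapacity = 64

  datatype 'a chunk =
    Chunk of {data: 'a option array, fill: int ref, next: 'a chunk option ref}

  (* a non-empty linked list of chunks, with head and tail pointers *)
  type 'a t = {head: 'a chunk, tail: 'a chunk ref}

  fun newChunk () =
    Chunk {data = Array.array (chunkCapacity, NONE), fill = ref 0, next = ref NONE}

  fun new () : 'a t =
    let val c = newChunk () in {head = c, tail = ref c} end

  (* O(1) worst-case: write into the tail chunk, or start a fresh chunk *)
  fun push ({tail, ...}: 'a t) (x: 'a) =
    let val Chunk {data, fill, next} = !tail
    in
      if !fill < chunkCapacity then
        (Array.update (data, !fill, SOME x); fill := !fill + 1)
      else
        let val c = newChunk ()
            val Chunk {data = d, fill = f, ...} = c
        in
          next := SOME c;
          tail := c;
          Array.update (d, 0, SOME x);
          f := 1
        end
    end

  (* O(1), destructive: link a's last chunk to b's first chunk.
     Neither input should be used independently afterwards. *)
  fun concat (a: 'a t, b: 'a t) : 'a t =
    let val Chunk {next, ...} = !(#tail a)
    in
      next := SOME (#head b);
      {head = #head a, tail = ref (!(#tail b))}
    end
end
```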

Also, the amortization for doubling wouldn't play nice with span.

The ideal data structure would be non-amortized, with:

- O(1) insert
- O(1) concatenate

IIRC there is a "bag" data structure that has these guarantees. I can't remember it off the top of my head at the moment. If we're willing to pay log-cost for concatenate, then there are lots of good options.

umutacar commented 1 year ago

every concat corresponds to a join right?

shwestrick commented 1 year ago

Essentially yes. Both logical joins and true joins.

umutacar commented 1 year ago

cool, if so, would the following work:

- base case: create an array
- at joins/concats: simply link up the arrays with a pointer, or possibly with a "node" that has them as its children

the length of such a thing should be bounded by span...
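A sketch of this idea (illustrative SML): arrays at the leaves, and each join/concat allocates one interior node, so concatenation is O(1) but the tree is unbalanced.

```sml
(* Arrays at the leaves; a join/concat just allocates a node. O(1) concat,
   but the depth grows with the number of joins along a path, which is the
   span concern raised in the next comment. *)
datatype 'a arrayTree =
    Leaf of 'a array
  | Node of 'a arrayTree * 'a arrayTree

fun concat (a: 'a arrayTree, b: 'a arrayTree) = Node (a, b)  (* O(1) *)
```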

shwestrick commented 1 year ago

Hmm, I'll have to think about this one. Seems like it could increase the overall span to $O(S^2)$ by paying $O(S)$ span per join.

A bound like $O(S \log C)$, where $C$ is the maximum number of candidates, might be better, and we can get this bound with a balanced tree.
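Spelling out the comparison, assuming a computation whose path passes through $O(S)$ joins, each paying the stated per-join cost:

$$\underbrace{O(S)}_{\text{joins}} \times \underbrace{O(S)}_{\text{per join}} = O(S^2) \quad \text{(pointer-linked)} \qquad \text{vs.} \qquad O(S) \times O(\log C) = O(S \log C) \quad \text{(balanced tree)}$$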

umutacar commented 1 year ago

ok let's discuss in person

shwestrick commented 1 year ago

After discussion today:

Altogether, this patch implements three fixes / performance improvements:

The results are Very Very Good™️

Merging now 🎉