Closed shwestrick closed 1 year ago
why are we using list of chunks? why not array of chunks (possibly with doubling to grow)?
Originally we chose list-of-chunks for simplicity. O(1) for both insertion and concatenate is really nice. We use both operations extensively... I don't think it would be easy to rework things to avoid concatenate.
Also, the amortization for doubling wouldn't play nice with span.
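To make the span concern concrete, here is a small Python sketch (illustration only, not MPL code) of a doubling dynamic array: pushes are O(1) *amortized*, but any individual push that triggers a resize copies all existing elements, and in a fork-join cost model that entire O(n) copy lands on a single path.

```python
# Sketch (not MPL code): a doubling dynamic array, instrumented to show
# that although pushes are O(1) amortized, a push that triggers a resize
# copies every existing element. That one-shot O(n) cost sits on a single
# path of the computation, which is why amortized bounds don't translate
# into good span bounds.

class DoublingArray:
    def __init__(self):
        self.data = [None]      # capacity 1
        self.size = 0

    def push(self, x):
        copies = 0
        if self.size == len(self.data):
            # resize: copy every existing element into a doubled buffer
            new_data = [None] * (2 * len(self.data))
            for i in range(self.size):
                new_data[i] = self.data[i]
                copies += 1
            self.data = new_data
        self.data[self.size] = x
        self.size += 1
        return copies            # cost of *this* push

arr = DoublingArray()
worst = 0
total = 0
for i in range(1024):
    c = arr.push(i)
    worst = max(worst, c)
    total += c

# total/n stays O(1) (amortized), but the worst single push copied n/2 elements
print(worst, total)   # 512 1023
```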
The ideal data structure would be non-amortized with:

- O(1) insertion
- O(1) concatenate
- O(log n) split (or O(1), of course 😄)

IIRC there is a "bag" data structure that has these guarantees. I can't remember it off the top of my head at the moment. If we're willing to pay log-cost for concatenate, then there are lots of good options.
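For reference, here is a minimal Python sketch (hypothetical, not the actual MPL structure) of the list-of-chunks representation with non-amortized O(1) insertion and O(1) concatenate: each node holds a fixed-capacity chunk, and the list keeps head and tail pointers so appending a fresh chunk and splicing two lists are both constant time.

```python
# Sketch (hypothetical): a "list of chunks" with non-amortized O(1)
# insert and O(1) concatenate. Each node holds a fixed-capacity chunk;
# head/tail pointers make appending a fresh chunk and splicing two
# lists both O(1).

CHUNK_CAP = 4

class ChunkNode:
    def __init__(self):
        self.items = []          # at most CHUNK_CAP elements
        self.next = None

class ChunkList:
    def __init__(self):
        self.head = self.tail = ChunkNode()

    def insert(self, x):
        if len(self.tail.items) == CHUNK_CAP:
            node = ChunkNode()   # O(1): start a fresh chunk
            self.tail.next = node
            self.tail = node
        self.tail.items.append(x)

    def concat(self, other):
        # O(1): splice the other list onto our tail
        self.tail.next = other.head
        self.tail = other.tail

    def to_list(self):           # O(n) traversal, for inspection only
        out, node = [], self.head
        while node is not None:
            out.extend(node.items)
            node = node.next
        return out

a, b = ChunkList(), ChunkList()
for i in range(6):
    a.insert(i)
for i in range(6, 10):
    b.insert(i)
a.concat(b)
print(a.to_list())   # all ten elements, in insertion order
```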
every concat corresponds to a join right?
Essentially yes. Both logical joins and true joins.
cool, if so, would the following work:

- base case: create an array
- at joins/concats: simply link up the arrays with a pointer, or possibly with a "node" that has them as its children
the length of such a thing should be bounded by span...
Hmm, I'll have to think about this one. Seems like it could increase the overall span to $O(S^2)$ by paying $O(S)$ span per join.
A bound like $O(S \log C)$, where $C$ is the maximum number of candidates, might be better, and we can get this bound with a balanced tree.
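The join-node proposal above can be sketched as follows (hypothetical Python, not MPL code): concatenation just allocates a parent node over the two operands, so concat is O(1), but the tree's depth — and hence traversal/split cost — grows with the number of joins, which is the $O(S)$ worst case under discussion.

```python
# Sketch of the join-node proposal: O(1) concat by linking, at the cost
# of tree depth that grows with the number of joins. A balanced tree
# (O(log n) concat, bounded depth) avoids this worst case.

class Leaf:
    def __init__(self, items):
        self.items = items       # base case: a plain array

class Join:
    def __init__(self, left, right):
        self.left, self.right = left, right

def concat(a, b):
    return Join(a, b)            # O(1): just link them up with a node

def depth(t):
    if isinstance(t, Leaf):
        return 1
    return 1 + max(depth(t.left), depth(t.right))

def to_list(t):
    if isinstance(t, Leaf):
        return list(t.items)
    return to_list(t.left) + to_list(t.right)

# A chain of left-leaning joins: depth grows linearly with the number
# of concats, illustrating the worst case.
t = Leaf([0])
for i in range(1, 8):
    t = concat(t, Leaf([i]))
print(depth(t), to_list(t))   # depth 8 over 8 leaves
```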
ok let's discuss in person
After discussion today:
The results are Very Very Good™️

Merging now 🎉

Altogether, this patch implements three fixes / performance improvements:

- Fix an `#if` vs `#ifdef` typo.
- Integrate with the scheduler to clear suspects in parallel. This patch subsumes #167 but I will keep the other open for now, for potential discussion.
- Performance improvements. Big! On 72 procs, `ms-queue` is now at ~30x speedup over sequential (old: 19x), and `linden-pq` is now at ~36x speedup over sequential (old: ~30x).

**Algorithm description.** The algorithm in this patch is straightforward. Clearing suspects is a bucketed filter: some elements are eliminated, and some survive; each of the survivors is moved into one of $D$ buckets (where $D$ is the depth of the heap whose suspects are being cleared). To parallelize this, we (1) break up the input, (2) run multiple sequential filters in parallel, and then finally (3) merge the results.
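The three phases of the bucketed filter can be sketched in Python (a simplified stand-in, not the MPL runtime; `survives` and `bucket_of` below are hypothetical placeholders for the real suspect-clearing predicate and depth computation):

```python
# Simplified sketch of the three-phase parallel bucketed filter.
# Phase 2 runs the per-block sequential filters in parallel via a
# thread pool; phases 1 and 3 are sequential, as noted in the patch.

from concurrent.futures import ThreadPoolExecutor

D = 4                                    # number of buckets (heap depth)

def survives(x):
    return x % 2 == 0                    # stand-in predicate

def bucket_of(x):
    return x % D                         # stand-in bucket assignment

def filter_block(block):
    # One sequential bucketed filter: D output buckets per block.
    buckets = [[] for _ in range(D)]
    for x in block:
        if survives(x):
            buckets[bucket_of(x)].append(x)
    return buckets

def clear_suspects(elements, num_blocks=4):
    # Phase 1 (sequential): break up the input.
    n = len(elements)
    step = (n + num_blocks - 1) // num_blocks
    blocks = [elements[i:i + step] for i in range(0, n, step)]
    # Phase 2 (parallel): run the sequential filters.
    with ThreadPoolExecutor() as pool:
        results = list(pool.map(filter_block, blocks))
    # Phase 3 (sequential): merge the per-block buckets.
    merged = [[] for _ in range(D)]
    for buckets in results:
        for d in range(D):
            merged[d].extend(buckets[d])
    return merged

print(clear_suspects(list(range(16))))
```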
**TODO: algorithmic improvements?** This algorithm has three phases. The middle phase is highly parallel, but both the first and last phases are sequential. In the first phase, the algorithm converts the input list (of chunks) into an array (of chunks). In the last phase, it sequentially merges the bucketed outputs. Probably neither of these is a performance bottleneck at the moment. But in the future, there are opportunities for more parallelism. To parallelize the first phase, we could maintain all suspect sets as balanced trees, to enable fast balanced splitting. To parallelize the last phase, we could merge results with a D&C reduction.
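The D&C reduction suggested for the merge phase might look like this (hypothetical sketch): combine the per-block bucket outputs pairwise instead of folding them in one at a time, giving $O(\log k)$ reduction depth over $k$ block results, with the two recursive halves free to run in parallel.

```python
# Sketch of a D&C reduction for the merge phase (hypothetical):
# combine per-block bucketed results pairwise rather than sequentially,
# for O(log k) reduction depth over k block results.

def combine(a, b):
    # Merge two bucketed results: concatenate bucket-by-bucket.
    return [x + y for x, y in zip(a, b)]

def reduce_dc(results):
    if len(results) == 1:
        return results[0]
    mid = len(results) // 2
    left = reduce_dc(results[:mid])    # these two recursive calls are
    right = reduce_dc(results[mid:])   # independent: run them in parallel
    return combine(left, right)

# Four block results, two buckets each:
results = [[[0], [1]], [[2], [3]], [[4], [5]], [[6], [7]]]
print(reduce_dc(results))   # [[0, 2, 4, 6], [1, 3, 5, 7]]
```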
**TODO: implementation optimizations?** This particular implementation could be further optimized in a few ways. For example, chunks are currently freed all at once, at the end, in `ES_deleteClearSet`. But we could instead free them in `ES_processClearSetGrain`, by eagerly freeing each processed chunk we see along the way.
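The eager-freeing idea can be sketched as follows (hypothetical Python, not the C runtime; `free_chunk` stands in for handing a chunk back to the allocator): the processing loop releases each chunk as soon as it has been visited, instead of walking the whole clear-set again at the end.

```python
# Sketch of eager chunk freeing (hypothetical): release each chunk to
# the allocator as soon as it has been processed, rather than freeing
# the entire clear-set in a separate final pass.

class Chunk:
    def __init__(self, items, nxt=None):
        self.items, self.next = items, nxt

freed = []                        # stand-in for the allocator's free list

def free_chunk(c):
    freed.append(c)

def process_eagerly(head, visit):
    node = head
    while node is not None:
        for x in node.items:
            visit(x)
        nxt = node.next
        free_chunk(node)          # eager: free as soon as it's processed
        node = nxt

seen = []
chunks = Chunk([1, 2], Chunk([3, 4], Chunk([5], None)))
process_eagerly(chunks, seen.append)
print(seen, len(freed))   # [1, 2, 3, 4, 5] 3
```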