TimelyDataflow / differential-dataflow

An implementation of differential dataflow using timely dataflow on Rust.
MIT License
2.53k stars 182 forks source link

`Builder::with_capacity` enrichment #428

Closed frankmcsherry closed 9 months ago

frankmcsherry commented 9 months ago

This PR does a few things brought to light by the intended RHH batch, around providing meaningful capacity estimates to the builder. We were previously .. not .. providing any meaningful capacity estimates, using Builder::new() which provides a capacity of 0.

This has been modified, at least in the Batcher implementations, although there are two additional locations where we build batches without a batcher; we'll want to think more carefully there (upsert.rs and reduce.rs).

The Builder::with_capacity method was extended to take key, value, and update capacities separately, but in some cases certain arguments are ignored.

cc: @antiguru

frankmcsherry commented 9 months ago

@antiguru: the main change here is that the MergeBatchers each do a bit more compute, and in exchange pre-allocate right-sized builders. These builders are perhaps a bit oversized for ord implementations, but they should be spot-on for ord_neu implementations (except for the singleton update trick; sigh). This may have MZ consequences we should investigate, but one of them could be that without mis-sized builders we needn't repeatedly double allocations, and may end up with a lower memory envelope overall.

frankmcsherry commented 9 months ago

The implementations/rhh module contains a prototype Robin Hood Hashing based batch, which is a surprisingly small change to the standard batch we use (just some more spacing, and a bit of fiddly logic all over the place). For now the module is public but there are no exported traces, and no one should use it yet, other than for testing which I am doing now!