TimelyDataflow / differential-dataflow

An implementation of differential dataflow using timely dataflow on Rust.
MIT License
2.53k stars 182 forks source link

Encode singleton repetition update cleverly #421

Closed frankmcsherry closed 9 months ago

frankmcsherry commented 9 months ago

This PR introduces a clever optimization for representing in a batch any (Key, Val) pairs with a singleton (Time, Diff) that happens to exactly match the previous (Time, Diff). This happens often in e.g. snapshot batches, where often the time is "zero" (or equivalent modulo frontiers) and the diff is "one".

Conventionally for each (Key, Val) we record an offset that indicates the upper bound of the corresponding (Time, Diff) entries, starting from the previous offset: the upper bound of the prior keyvals list of timediffs. Conventionally, each recorded offset must be strictly greater than the offset that precedes it, because we simply don't record absent updates.

We take advantage of this to signal a repeated timediff by repeating the offset. Essentially, if two updates (lower, upper) are equal, the range that should be used is instead (lower - 1, upper), picking up the single previous timediff. We need to be careful to read out the right data, also to encode the data correctly, and also to report the total number of updates correctly (it is no longer updates.len()).

When running

cargo run --release --example bfs -- 100000000 200000000 1000 0 potato

this change improves the limiting memory use (at the end of the execution, with all data in batches) from ~2.75GB to ~1.50GB. This program uses u32 keys and values, which means that the times and diffs are substantial fat to cut off.