TimelyDataflow / differential-dataflow

An implementation of differential dataflow using timely dataflow on Rust.
MIT License
2.56k stars 183 forks source link

Consolidate output diffs before producing tuples #489

Open frankmcsherry opened 4 months ago

frankmcsherry commented 4 months ago

The Arranged::as_collection method converts an arrangement to a stream of its underlying updates. It does this without any thought for consolidation as it does, under the historically accurate but no longer correct take that the full arrangement detail is required. When one uses a TraceFrontier wrapper, which advances contained timestamps, many of these updates may now consolidate. We should do the work here, at the first moment, rather than downstream. The incremental cost means additional work for folks who don't need it, but the harm done to folks using TraceFrontier can be much larger.

ggevay commented 4 months ago

In the common case where the arrangement is not a retain history index, will this consolidation still do something?

Also, will this increase the memory spikes of initial hydrations where the arrangement is not a retain history index?

frankmcsherry commented 4 months ago

In the common case where the arrangement is not a retain history index, will this consolidation still do something?

Probably not. For an alternate approach, https://github.com/TimelyDataflow/differential-dataflow/pull/490 attaches the logic to TraceFrontier, so that folks who are not importing traces will not experience any overhead.

Also, will this increase the memory spikes of initial hydrations where the arrangement is not a retain history index?

I hope not. The buffer size is going to be proportional to the longest history of any one (key, val) pair, and .. we could dial that down with a ChangeBatch style structure, which consolidates as it goes (if the history is one that consolidates down, it would be even less consequential). Though, getting ChangeBatch out of timely and into differential would be a bunch of typing.