TimelyDataflow / differential-dataflow

An implementation of differential dataflow using timely dataflow on Rust.
MIT License
2.6k stars 183 forks source link

Implement `distinct` for arrangements by key and value. #208

Open frankmcsherry opened 5 years ago

frankmcsherry commented 5 years ago

At the moment distinct operates on arrangements only by key, with an empty value. It seems like we should be able to implement distinct on arrangements by key and value, where the distribution is by key but there are attendant values each of which we want to occur at most once. The existing reduce code is not sufficient, but the permitted cursor navigation seems appropriate to drive another implementation to the history for pairs (key, val) just as if they were the key themselves.

The potential benefit from this optimization would be more available sharing for Datalog style computations which often use distinct, and then join their results in subsequent rounds of derivation. They relatively rarely use a by-self arrangement for the join, and instead have some key in mind that we could also use for the distinct.

cc: @ryzhyk

frankmcsherry commented 5 years ago

More generally, if we are currently performing a reduction by (key, val) but would prefer either the input or output arrangement to be by key, we could choose that arrangement instead (but the choice would apply to both input and output arrangements). The observation is just that a partitioning by key co-locates all (key, val) pairs as well, and the arrangement places their contents in close proximity.

There is the caveat that if there are other fields, then we would want the arrangement to be of (key, (val1, val2) in order to have access to the pairs (key, val1). Otherwise, the val2 fields could dislocate otherwise identical val1 entries.