Open frankmcsherry opened 5 years ago
More generally, if we are currently performing a reduction by (key, val)
but would prefer either the input or output arrangement to be by key
, we could choose that arrangement instead (but the choice would apply to both input and output arrangements). The observation is just that a partitioning by key
co-locates all (key, val)
pairs as well, and the arrangement places their contents in close proximity.
There is the caveat that if there are other fields, then we would want the arrangement to be of (key, (val1, val2)
in order to have access to the pairs (key, val1)
. Otherwise, the val2
fields could dislocate otherwise identical val1
entries.
At the moment
distinct
operates on arrangements only by key, with an empty value. It seems like we should be able to implementdistinct
on arrangements by key and value, where the distribution is by key but there are attendant values each of which we want to occur at most once. The existingreduce
code is not sufficient, but the permitted cursor navigation seems appropriate to drive another implementation to the history for pairs(key, val)
just as if they were the key themselves.The potential benefit from this optimization would be more available sharing for Datalog style computations which often use
distinct
, and then join their results in subsequent rounds of derivation. They relatively rarely use a by-self arrangement for the join, and instead have some key in mind that we could also use for the distinct.cc: @ryzhyk