TimelyDataflow / differential-dataflow

An implementation of differential dataflow using timely dataflow in Rust.

Relax requirements of `Diff` #84

Open · frankmcsherry opened this issue 6 years ago

frankmcsherry commented 6 years ago

The Diff trait constrains the types of "differences" that differential dataflow updates can use. It currently corresponds to an Abelian group: something that can be added, has an inverse, and whose addition is associative and commutative.

This property is great in that operators consuming such differences as inputs can produce such differences as outputs. In general, operators like group need the flexibility to subtract, and we don't have great information about the order of accumulation (we could try harder here, but there are some semantic issues iirc).

At the same time, some operators do not need to negate records: join, for example, as well as operators like map, filter, and concat. Such operators could use a less restrictive requirement; in particular, they may not require inversion.

Removing the requirement of inversion would allow differential dataflow to express monotonic computations more clearly, in their type: if a record may only be added and never removed, this can be reflected in the difference type (), which simply doesn't have an inverse (nor a zero, perhaps?). If a maximization or minimization can only move in one direction, we need not support full retractions, and with that type information the consumer can know never to expect one.

Both inputs and outputs can reflect this type information, so we could know whether a collection is monotonic at any point, and specialize the implementation where appropriate.

It seems that many of the internal accumulation strategies could still apply even without inverses, as the rules of accumulation merge results without negation (reflecting an opportunity we are currently ignoring). We may want to understand whether we require a Zero associated value, which is what allows us to discard fully ineffective updates, but many monotonic quantities simply can't reach zero anyhow (i.e., no retraction).
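
A rough sketch of how the requirement might be split, with hypothetical trait names rather than the crate's current API: one trait for differences that can only be combined (enough for map, filter, concat, join), and a subtrait for those that can also be negated (needed by group). A zero is deliberately left out, per the question above.

```rust
/// Differences that can be combined but not necessarily negated.
/// (Hypothetical trait, sketched for illustration only.)
pub trait Combine: Clone + 'static {
    fn combine(&self, other: &Self) -> Self;
}

/// Differences that can also be negated: required by operators such as
/// group, which may need to retract previously produced output.
pub trait Invert: Combine {
    fn negate(&self) -> Self;
}

/// A monotonic "insert-only" difference: no inverse, and no zero either.
impl Combine for () {
    fn combine(&self, _other: &Self) -> Self {}
}

/// The usual signed count remains fully invertible.
impl Combine for isize {
    fn combine(&self, other: &Self) -> Self { self + other }
}
impl Invert for isize {
    fn negate(&self) -> Self { -self }
}
```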

frankmcsherry commented 6 years ago

Lightly related to this, we could consider how we want to represent traits like Add, which is currently just Rust's Add trait. However, there are several reasons why this might suck.

We already have a notion of "zero", which is the absence of an update. There are some situations where we might not want to encode zero as a value, mostly where we would have a zero-sized type if we did not. Consider

  1. The two-element group { 0, 1 }, where each "update" flips the bit, allowing us to track the parity of elements and probably do a fine job representing set-valued things. If we need an i8 here to encode the data, that is fundamentally different from having a (), which we could do. However, if we need to be able to represent 0 as an element, we are in trouble.

  2. Similarly, we could imagine an append-only unary encoding, where the update () just means "one more of this". There isn't a zero value, and we wouldn't want to have to write one down.

Both of these suggest that using Rust's Add trait might not be great, because it is obliged to produce exactly one element of the corresponding type, where "exactly one" is limiting in two ways: it cannot be none, and it cannot be a few.

The first example above suggests that if add returned an Option<()> we could use a ZST for the type. The second example above suggests that if add returned an Iterator<Item=()> we could use a ZST without "losing" updates. Together, they kind of suggest that if we returned an iterator over non-zero updates, we could use smaller types in these cases.
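
As a sketch (hypothetical trait and type names, not a concrete API proposal), an addition that returns an iterator of non-zero updates covers both cases: the parity group becomes a ZST whose sum can be "nothing", and the unary encoding's sum can be "both updates".

```rust
/// Hypothetical replacement for std::ops::Add: addition may yield zero,
/// one, or several resulting updates.
pub trait AddUpdates: Sized {
    type Output: IntoIterator<Item = Self>;
    fn add_updates(self, other: Self) -> Self::Output;
}

/// The two-element group { 0, 1 } as a ZST: the value `Parity` is "1",
/// and "0" is simply the absence of an update.
#[derive(Clone, Copy, Debug)]
pub struct Parity;

impl AddUpdates for Parity {
    type Output = Option<Parity>;        // zero or one update out
    fn add_updates(self, _other: Parity) -> Option<Parity> {
        None                             // 1 + 1 = 0: nothing to record
    }
}

/// The append-only unary encoding: each update means "one more of this".
/// There is no zero and no way to merge two units, so addition yields both.
#[derive(Clone, Copy, Debug)]
pub struct One;

impl AddUpdates for One {
    type Output = [One; 2];              // always two updates out
    fn add_updates(self, other: One) -> [One; 2] {
        [self, other]
    }
}
```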

Such a change could also have positive implications for other small types: if we use i8 for updates, then instead of overflowing we could "add" 50i8, 51i8, and 52i8 into the sequence 127i8 and 26i8, essentially doing arbitrary-precision arithmetic across multiple updates. I don't know that this is a good idea, but it could now be up to the implementor to decide.
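
Continuing the sketch above (same hypothetical AddUpdates trait), an i8 implementation could spill overflow into a second update rather than wrapping, so that 50 + 51 + 52 ends up represented as 127 and 26:

```rust
impl AddUpdates for i8 {
    type Output = Vec<i8>;
    fn add_updates(self, other: i8) -> Vec<i8> {
        match self.checked_add(other) {
            Some(0) => vec![],        // cancelled outright: no update at all
            Some(sum) => vec![sum],   // the common, single-update case
            None => {
                // Overflow: split the true sum across two updates, e.g.
                // 101i8 + 52i8 becomes [127, 26] (which sum to 153).
                let wide = self as i16 + other as i16;
                if wide > 0 {
                    vec![i8::MAX, (wide - i8::MAX as i16) as i8]
                } else {
                    vec![i8::MIN, (wide - i8::MIN as i16) as i8]
                }
            }
        }
    }
}
```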

A downside of this approach is that it could introduce more branches all over the place, even for folks who want to use the standard isize sort of differences. I think if we are careful to implement the defaults with e.g. something that always produces Some(x + y) then LLVM can optimize out the branching (it seems to for other cases), but there will probably need to be corresponding complexity in the trace management logic, and elsewhere we might have assumed that only one diff exists for each time.

frankmcsherry commented 6 years ago

Note: the hope of general use of ZSTs seems complicated by e.g. tuples, where we would want (T1: Diff, T2: Diff) to implement Diff, but it becomes complicated to represent e.g. (0, +5) if 0 does not have a representation in the type. We could consider a richer trait system with Zeroable or some such, and tuples could implement Diff for (T1: Diff + Zeroable, T2: Diff + Zeroable).
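
A sketch of what that richer arrangement might look like (Zeroable is the name floated above; the rest is hypothetical): a separate trait for difference types that can represent zero as a value, which is what a componentwise tuple implementation needs in order to express something like (0, +5).

```rust
/// Difference types that can represent zero as a value; the ZSTs above
/// would not implement this, and so would be excluded from tuple diffs.
pub trait Zeroable {
    fn zero() -> Self;
    fn is_zero(&self) -> bool;
}

impl Zeroable for isize {
    fn zero() -> Self { 0 }
    fn is_zero(&self) -> bool { *self == 0 }
}

/// Pairs combine componentwise, and either component may need to be zero
/// while the other is not, hence the Zeroable bound on both.
impl<T1: Zeroable, T2: Zeroable> Zeroable for (T1, T2) {
    fn zero() -> Self { (T1::zero(), T2::zero()) }
    fn is_zero(&self) -> bool { self.0.is_zero() && self.1.is_zero() }
}
```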

frankmcsherry commented 6 years ago

Other potential issues (who knew math could be so complicated?): operators like join need some distributive property, and this can be awkward as well. Consider a ZST for { 0, 1 } joining against any other group:

1 * x + 1 * x = (1 + 1) * x = 0

is, I guess, what should happen. But we have a hard time distinguishing this from

1 * x + 1 * y + 1 * (x-y)

which probably should accumulate to 1 * (x + x). So confusing, maybe not worth understanding.
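
For what it's worth, a small numeric illustration of the ambiguity, modelling the { 0, 1 } diff as a bool (true = 1) and the other collection's diff as an isize; accumulating the parity side first gives a different answer than distributing the multiplication over the individual updates:

```rust
fn main() {
    let x: isize = 3;
    let y: isize = 5;

    // Two matches on the same record, each with parity diff 1, against a
    // diff of x. Accumulating the parity side first: (1 + 1) = 0, so 0 * x.
    let accumulate_first = ((true ^ true) as isize) * x;
    // Distributing instead: 1*x + 1*x = 2x.
    let distribute_first = x + x;
    assert_ne!(accumulate_first, distribute_first); // 0 versus 6

    // Three matches with diffs x, y, and (x - y), each seen with parity 1.
    // The parity side accumulates to 1, but the underlying sum is 2x, and
    // nothing in the accumulated parity tells us which case we are in.
    assert_eq!(x + y + (x - y), 2 * x);
}
```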