TimelyDataflow / differential-dataflow

An implementation of differential dataflow using timely dataflow on Rust.
MIT License
2.56k stars 183 forks source link

Introduce Time/Diff GATs #502

Closed frankmcsherry closed 4 months ago

frankmcsherry commented 4 months ago

The PR slowly introduces GATs for Time and Diff. It largely reproduces work of @antiguru, but in an incremental fashion that helps me at least understand what is going on. The incremental steps are:

  1. The first commit generalizes the Layout containers to provide separate storage for times and diffs, which allows those containers to speak clearly about associated types for those containers. The GATs are immediately converted to owned data, so the transient allocations are still high (increased, even), but the storage opportunities are increased by allowing the containers to store their data in a way that doesn't need to present as &Time and &Diff.
  2. The second and third commits introduce DiffGat<'_> and TimeGat<'_> respectively, which are exactly the GATs offered by the underlying containers, but surfaced upwards through the APIs provided by cursors. Concretely this means that map_times() communicates GATs rather than references, and users of the function can avoid allocation if they do not require it. These commits did not try especially hard to prevent immediately calling into_owned() and re-invoking transient allocations.
  3. The fourth commit tries harder to eliminate calls to into_owned(), which are the moments we convert from GATs to owned types, and where we might perform needless allocation. The commit generalizes Semigroup to act on references, and e.g. the constraint that Tr::Diff : for<'a> Semigroup<Tr::DiffGat<'a>> to ensure that we can accumulate the GAT types into the owned types directly, without needing to first take ownership. Similar changes for times, though mostly around streamlining the idiom of time.join(meet) for a GAT time (take ownership, then join_assign).

There is work to continue to progressively nibble away at the reliance on the owned types, though unlike keys and values they are unlike to go away: both time and diff need to be mutated (advancing to frontiers and consolidation, respectively). This seems likely to force some owned representation, as one cannot simply rely on the existing read-only data in the backing storage.

frankmcsherry commented 4 months ago

The current state of the PR, to my understanding is:

  1. The first commit allows containers with GATs to store Time and Diff, but does not otherwise change very much. Folks still get to see &Time and &Diff, because step one is to call .into_owned() on the GATs. This introduces some new allocations, and it feels especially bad in cases where we were not previously going to allocate (e.g. using a &Time) and we should look at each of these. But importantly, this first step opens up more efficient storage at the cost of ephemeral allocations (improving RAM by spending CPU) and may already be worth landing.
  2. The second and third commits move the information about those GATs up the stack, through fn map_times returning GATs rather than references, which means the storage layer doesn't need to call .into_owned() any more. In several cases, we still just call .into_owned(), but we can start to incrementally nibble at uses of GATs that are force into owned versions and replace them with uses of GATs instead. I started this and got tripped up in some places, and so went with a minimal version that unless it is quite sure, just calls into_owned() and moves on.

I think the main remaining question is how much .into_owned() minimization must we do before landing the PR, or one like it. The into_owned() calls are a regression for someone storing String data, say, but are almost certainly a win for anyone who wants a FlatContainer or Columnation back-end. In the limit, it would be great to have our internals have Cow<'a, Diff> everywhere, to only allocate when we need to accumulate, and generally to minimize ephemeral allocations as well. But for next steps, we should probably go over the most egregious .into_owned() calls, especially when ownership was not previously required, and taking stock of them.