Closed frankmcsherry closed 4 months ago
The current state of the PR, to my understanding is:
Time
and Diff
, but does not otherwise change very much. Folks still get to see &Time
and &Diff
, because step one is to call .into_owned()
on the GATs. This introduces some new allocations, and it feels especially bad in cases where we were not previously going to allocate (e.g. using a &Time
) and we should look at each of these. But importantly, this first step opens up more efficient storage at the cost of ephemeral allocations (improving RAM by spending CPU) and may already be worth landing.fn map_times
returning GATs rather than references, which means the storage layer doesn't need to call .into_owned()
any more. In several cases, we still just call .into_owned()
, but we can start to incrementally nibble at uses of GATs that are force into owned versions and replace them with uses of GATs instead. I started this and got tripped up in some places, and so went with a minimal version that unless it is quite sure, just calls into_owned()
and moves on.I think the main remaining question is how much .into_owned()
minimization must we do before landing the PR, or one like it. The into_owned()
calls are a regression for someone storing String
data, say, but are almost certainly a win for anyone who wants a FlatContainer
or Columnation
back-end. In the limit, it would be great to have our internals have Cow<'a, Diff>
everywhere, to only allocate when we need to accumulate, and generally to minimize ephemeral allocations as well. But for next steps, we should probably go over the most egregious .into_owned()
calls, especially when ownership was not previously required, and taking stock of them.
The PR slowly introduces GATs for
Time
andDiff
. It largely reproduces work of @antiguru, but in an incremental fashion that helps me at least understand what is going on. The incremental steps are:Layout
containers to provide separate storage for times and diffs, which allows those containers to speak clearly about associated types for those containers. The GATs are immediately converted to owned data, so the transient allocations are still high (increased, even), but the storage opportunities are increased by allowing the containers to store their data in a way that doesn't need to present as&Time
and&Diff
.DiffGat<'_>
andTimeGat<'_>
respectively, which are exactly the GATs offered by the underlying containers, but surfaced upwards through the APIs provided by cursors. Concretely this means thatmap_times()
communicates GATs rather than references, and users of the function can avoid allocation if they do not require it. These commits did not try especially hard to prevent immediately callinginto_owned()
and re-invoking transient allocations.into_owned()
, which are the moments we convert from GATs to owned types, and where we might perform needless allocation. The commit generalizesSemigroup
to act on references, and e.g. the constraint thatTr::Diff : for<'a> Semigroup<Tr::DiffGat<'a>>
to ensure that we can accumulate the GAT types into the owned types directly, without needing to first take ownership. Similar changes for times, though mostly around streamlining the idiom oftime.join(meet)
for a GATtime
(take ownership, thenjoin_assign
).There is work to continue to progressively nibble away at the reliance on the owned types, though unlike keys and values they are unlike to go away: both time and diff need to be mutated (advancing to frontiers and consolidation, respectively). This seems likely to force some owned representation, as one cannot simply rely on the existing read-only data in the backing storage.