TimelyDataflow / differential-dataflow

An implementation of differential dataflow using timely dataflow in Rust.
MIT License

Does all data have to be in memory? #423

Closed · oli-w closed this issue 9 months ago

oli-w commented 9 months ago

Hi 👋, I have a general question about Differential Dataflow. One of its uses is building incrementally maintained views over some data using the various operators (as Materialize does). If I understand correctly, this requires all of the data to be loaded into memory, assuming that a query wanting to SELECT some of the view's rows could come in at any time. Is that correct?
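
For concreteness, the kind of incrementally maintained view I mean can be sketched with DD's public API (the specific dataflow below is just illustrative): a per-key count that DD revises as updates arrive, with the operator state held in arrangements.

```rust
// A per-key count maintained incrementally: differential dataflow keeps
// operator state (arrangements) so it can revise the output on updates
// instead of recomputing the view from scratch.
use differential_dataflow::input::InputSession;
use differential_dataflow::operators::Count;

fn main() {
    timely::execute_from_args(std::env::args(), |worker| {
        // Updates are (key, value) pairs at u64 times with isize diffs.
        let mut input = InputSession::<u64, (String, String), isize>::new();

        let probe = worker.dataflow(|scope| {
            input
                .to_collection(scope)
                .map(|(key, _value)| key)
                .count() // the "view": number of rows per key
                .inspect(|x| println!("view update: {:?}", x))
                .probe()
        });

        // Feed some rows, then advance time; DD emits only the changes.
        input.insert(("alice".to_string(), "row-1".to_string()));
        input.insert(("alice".to_string(), "row-2".to_string()));
        input.advance_to(1);
        input.flush();
        worker.step_while(|| probe.less_than(input.time()));
    })
    .expect("timely computation failed");
}
```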

I'm trying to understand how something like Materialize would be able to scale to huge amounts of data (e.g. hundreds of gigabytes) that won't fit into RAM except on the beefiest of machines, whereas typical databases mostly store their indexes on disk. I assume the only real option is to spin up lots of compute nodes with the data spread between them, each acting as a DD worker, orchestrated by Kubernetes or similar?
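
The scale-out setup I have in mind would look roughly like this (the binary and hostfile names are made up): the same dataflow binary runs on every machine, with the number of workers and processes set by timely's command-line flags, so each worker holds only a shard of the arrangement state.

```rust
// The same binary scales out by adding workers and processes; timely's
// communication layer exchanges data between them, so each worker holds
// only its shard of the arrangement state.
//
//   # 4 worker threads in one process:
//   ./my_dataflow -w 4
//
//   # two processes (e.g. two machines), 4 workers each; hosts.txt lists
//   # one "host:port" line per process (file name is hypothetical):
//   ./my_dataflow -w 4 -n 2 -p 0 -h hosts.txt    # on machine 0
//   ./my_dataflow -w 4 -n 2 -p 1 -h hosts.txt    # on machine 1
fn main() {
    timely::execute_from_args(std::env::args(), |worker| {
        println!("worker {} of {}", worker.index(), worker.peers());
        // ... build the same dataflow on every worker here ...
    })
    .expect("timely computation failed");
}
```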

The only alternatives I can think of are some kind of trick where DD moves things between memory and disk, or having DD call out to the source tables dynamically depending on the query. I imagine both would be detrimental to performance or impractical.

frankmcsherry commented 9 months ago

Hello!

To a first approximation, yes: vanilla DD as you experience it probably wants things in memory. However, it supports an abstracted Storage type to back its arrangements, and this doesn't have to be "Rust Vec allocations"; it can be other forms of storage that allow spilling to disk, or beyond.
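
To make the substitution point concrete, here is a rough sketch using the default in-memory arrangement: `arrange_by_key` picks DD's default trace, while `arrange_core` (mentioned further down this thread) is the generic entry point where a disk-backed trace implementation (hypothetical here, not shown) would be supplied instead.

```rust
// `arrange_by_key` builds DD's default in-memory arrangement (a spine of
// immutable batches). `arrange_core` is the generic entry point where an
// alternative trace implementation, e.g. one whose batches spill to disk,
// could be plugged in instead (such an implementation is hypothetical here).
use differential_dataflow::input::InputSession;
use differential_dataflow::operators::arrange::ArrangeByKey;

fn main() {
    timely::execute_from_args(std::env::args(), |worker| {
        let mut input = InputSession::<u64, (u64, String), isize>::new();

        let probe = worker.dataflow(|scope| {
            // Default: in-memory storage backs this arrangement.
            let arranged = input.to_collection(scope).arrange_by_key();

            // Downstream operators read the arrangement like any collection.
            arranged
                .as_collection(|key, val| (*key, val.clone()))
                .inspect(|x| println!("arranged update: {:?}", x))
                .probe()
        });

        input.insert((0, "hello".to_string()));
        input.advance_to(1);
        input.flush();
        worker.step_while(|| probe.less_than(input.time()));
    })
    .expect("timely computation failed");
}
```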

As an example, here's a Materialize PR that introduces disk-backed arrangements by providing a storage container that knows how to move data to and from disk (using @antiguru's lgalloc crate, behind columnation).

oli-w commented 9 months ago

Oh fantastic! Thank you so much.

oli-w commented 2 months ago

Hi again @frankmcsherry, I've dug in more to better understand the lower levels like arrangements and traces. I can see how a TraceReader could be implemented with a custom Storage implementation like you suggest, one that stores the data on disk, e.g. with each TraceReader::Batch being a reference to a separate file.

I was wondering, though, whether it would be possible to read those batches back into the program on the next start, instead of replaying all the data. From what I've gathered, the approach in Materialize using lgalloc stores the data in files to ease memory pressure, but these files are only temporary and all state is re-created from the sources on the next start-up ("hydration").

I looked at how Collection::arrange_core constructs a new TraceAgent, which also yields a TraceWriter. I was thinking that if I had access to the pre-computed set of batches, I could insert those batches into the TraceWriter to seed the initial data, and then advance the timestamps to perform subsequent incremental updates. I got this working, but wrestled with some of the assumptions about time advancement. Please let me know if this is feasible, or if the idea will fall over.

I had a read of https://github.com/TimelyDataflow/differential-dataflow/issues/350, which already covers this topic and where replaying the data is the recommended approach. I'm hoping that bringing in traces from outside the program is an option. Thanks for all your help so far.
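
For completeness, here is a rough sketch of the replay approach from #350 as I understand it, assuming updates were persisted somewhere as (data, time, diff) triples along with the frontier they were compacted to; `load_persisted_updates` and its storage format are made up. Seeding a TraceWriter with pre-built batches would skip this step, but as noted above it has to respect the trace's assumptions about timestamp advancement.

```rust
// Replay on restart: feed previously persisted (data, time, diff) updates
// back through an input, then continue from the persisted frontier.
// `load_persisted_updates` and its storage format are hypothetical; in a
// multi-worker setup each worker would load only its own shard.
use differential_dataflow::input::InputSession;

fn load_persisted_updates() -> (Vec<((String, String), u64, isize)>, u64) {
    // Hypothetical: read consolidated updates plus the frontier at which
    // they were persisted (e.g. from files written on the last shutdown).
    (vec![(("alice".to_string(), "row-1".to_string()), 0, 1)], 1)
}

fn main() {
    timely::execute_from_args(std::env::args(), |worker| {
        let mut input = InputSession::<u64, (String, String), isize>::new();

        let probe = worker.dataflow(|scope| {
            input
                .to_collection(scope)
                .inspect(|x| println!("replayed: {:?}", x))
                .probe()
        });

        // Re-introduce persisted updates at their (compacted) times, then
        // advance to where the previous run left off.
        let (updates, frontier) = load_persisted_updates();
        for (data, time, diff) in updates {
            input.update_at(data, time, diff);
        }
        input.advance_to(frontier);
        input.flush();
        worker.step_while(|| probe.less_than(input.time()));

        // New updates now arrive at times >= `frontier` and are applied
        // incrementally as usual.
    })
    .expect("timely computation failed");
}
```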