brurucy / pydbsp

This library provides an implementation of the DBSP language for incremental streaming computations.
90 stars 1 forks source link

Question: What makes DBSP less expressive than Differential Dataflow #6

Open theSherwood opened 4 weeks ago

theSherwood commented 4 weeks ago

Hi. Thanks for this great repo. I've played around with making a toy implementation of some of Differential Dataflow and then I became aware of the DBSP papers. The DBSP approach seems more systematic, but I don't have the background to be able to effectively categorize the differences.

DBSP is differential dataflow's less expressive successor.

Could you explain why this is the case? I am looking to try to compare and contrast the two approaches.

brurucy commented 2 weeks ago

Hello there @theSherwood.

Thank you so much for checking the library out.

Your issue got buried, sorry to take so long to answer!

Anyhow,

Could you explain why this is the case?

Sure.

Differential Dataflow (DD) is a programming language, referred to as a "Framework" because in its current form it is embedded in Rust, that extends/is implemented with Timely Dataflow (TD).

TD is a language for computation distribution. Think of it as a toolkit to efficiently "distribute" any computation.

DD on the other hand is a high-level language for writing expressive (can recur) incremental programs, where the "incrementality" of the program is made transparent to the user.

DD's computation graphs are "distributed" with TD. Because of this, DD inherits certain aspects of TD, such as its timestamp/event time sorcery.

DBSP however does not contain any timestamp/event time sorcery. "Time" in DBSP is the time of arrival of some "event", and all "events" are assumed to arrive on time.

This is a big difference with respect to DD, where each "event" may arrive whenever (even in the past).

In short, DBSP is a simplification of DD with most of what has been simplified being TD's inherited semantics.

theSherwood commented 2 weeks ago

@brurucy Thanks for that explanation.

I remember the timestamp sorcery. It seemed like it was making time an N-dimensional space, which was sort of a mismatch for my purposes. And having to manage time in user code was not very ergonomic. I seem to remember the user has to signal when there will be no more events before a certain frontier or something like that so that compaction can take place and certain operators can advance.

Interesting that DBSP does away with all the event time stuff. That seems like it would really simplify implementations. How does it handle iteration/recursion without the timestamps? I seem to remember DD's timestamps being an important part of how that worked, but maybe I've got that wrong.

What I would quite like is an approach in which time is tree-like. It makes time more expensive in practice than what you get with DD's N-dimensional spaces and is suited to a different purpose. Rather than DD's time being useful for coordinating different services, tree-like time would be useful for a git-like/collaborative approach to state. So I'm trying to evaluate whether DD or DBSP is the better starting point.