MaterializeInc / materialize

The data warehouse for operational workloads.
https://materialize.com

Concurrent dataflow submission #318

Open frankmcsherry opened 4 years ago

frankmcsherry commented 4 years ago

At the moment, DataflowCommand::CreateDataflows can contain multiple dataflow definitions, though this is currently used only for installing multiple sources in the same instruction.

It is reasonable to want to support multiple dataflows in the same atomic instruction, so that users can avoid the unwanted interleavings that could result from using the same names (because, afaik, randoms can just drop and re-install each other's views). In this case, it is reasonable to think of the semantics as the sequential installation of each of the views.
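As a minimal sketch of the sequential reading (table and view names here are invented for illustration):

-- Both definitions are submitted in one atomic instruction. Sequential
-- semantics: order_totals may reference orders_usd, which is installed
-- just before it, and no concurrent client can drop or re-install
-- orders_usd in between.
CREATE MATERIALIZED VIEW orders_usd AS
    SELECT o.id, o.amount * fx.rate AS usd
    FROM orders o, fx
    WHERE o.currency = fx.currency;

CREATE MATERIALIZED VIEW order_totals AS
    SELECT sum(usd) AS total FROM orders_usd;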

It is less reasonable, but way more awesome, to adopt an extended semantics in which all views are immediately available for use, and their resulting contents are then the result of repeatedly applying the supplied rules. The only case in which this differs is when one view definition has a Get for another view that is not defined strictly before it (so, either itself, or a subsequent view). In this case we would effect recursive computation.
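To illustrate (hypothetical syntax and names; nothing like this is implemented yet), a view whose definition contains a Get for itself would compute a fixed point, here the transitive closure of an edges collection:

-- Submitted as a single dataflow under the extended semantics.
-- reach references itself, so its contents are the fixed point of
-- repeatedly applying the rule: every edge reaches, and reach
-- extends along edges.
CREATE MATERIALIZED VIEW reach AS
    SELECT src, dst FROM edges
    UNION
    SELECT r.src, e.dst
    FROM reach r, edges e
    WHERE r.dst = e.src;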

The extended semantics strictly generalize the sequential semantics (in which view definitions should not reference views that do not exist, nor overwrite an existing view without an explicit DROP). At the same time, I figured someone might get cranky if I just went and added recursive queries to our execution environment without telling anyone.

cc @jamii

benesch commented 4 years ago

I would be the opposite of cranky if you added that! Although I do think that there are perhaps two actionable items here that could be tracked separately: DDL transactions and recursive queries. For DDL transactions we’d need to go one step further and bundle together arbitrary sequences of creates and drops.
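For concreteness, a sketch of such a DDL transaction (hypothetical syntax; names invented), bundling a drop and a create atomically:

BEGIN;
DROP VIEW stale_stats;
CREATE MATERIALIZED VIEW stats AS SELECT count(*) AS n FROM foo;
COMMIT;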

frankmcsherry commented 4 years ago

In further discussion, an important semantic difference was observed!

Creating multiple dataflows should result in several independent dataflows that can each be dropped independently. The necessary concept for recursion would instead be a single dataflow that publishes multiple views, all of which are fate-shared with respect to dropping.
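Concretely (names invented; the drop behavior for the recursive case is speculative):

-- Independent dataflows: each view can be dropped on its own.
CREATE MATERIALIZED VIEW evens AS SELECT n FROM numbers WHERE n % 2 = 0;
CREATE MATERIALIZED VIEW odds AS SELECT n FROM numbers WHERE n % 2 = 1;
DROP VIEW evens;  -- odds keeps running

-- Fate-shared: two mutually recursive views would live in ONE dataflow,
-- so dropping either would tear down the dataflow that computes both.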

benesch commented 3 years ago

I'm going to merge #3794 into this issue, because they are essentially the same. More design work is required, but the concrete design that's been kicking around for a while is something like this:

BEGIN;
CREATE SOURCE foo FROM kafka;
CREATE MATERIALIZED VIEW bar AS SELECT .. FROM foo;
CREATE MATERIALIZED VIEW quux AS SELECT .. FROM foo;
COMMIT;

A simpler idea that's been proposed is something like:

CREATE SOURCE foo FROM kafka;
CREATE VIEW bar AS SELECT .. FROM foo;
CREATE VIEW quux AS SELECT .. FROM foo;
CREATE INDEXES
    DEFAULT INDEX ON bar,
    DEFAULT INDEX ON quux;

where that CREATE INDEXES command atomically creates the necessary dataflows but does not require the complications of supporting SQL's interactive transactions.