MaterializeInc / materialize

The Cloud Operational Data Store: use SQL to transform, deliver, and act on fast-changing data.
https://materialize.com
Other
5.72k stars 466 forks source link

Support for static data collections #21479

Closed teskje closed 5 months ago

teskje commented 1 year ago

Feature request

We have users that want to import static data sets into Materialize and then join those against other non-static collections. Today they can do that by creating a table and COPYing their static data into it. But if you do this Materialize is not made aware of the fact that the imported data is static, so any dataflows built on top of it need to provision for the possibility that it might change.

If we had a way to make it possible for users to mark a collection of data as static, dataflows that build upon that collection could be rendered way more efficiently. For example:

teskje commented 1 year ago

I am not sure if there is a clear best way how to implement this. A relatively easy implementation would be to have a SEAL TABLE command that causes the frontier of the underlying persist collection to be advanced to the empty frontier and disallows further writes to that table. This would allow compute to drop input arrangements in joins, but I think it wouldn't enable the use of monotonic operator variants.

So perhaps we need a new SQL- and compute-level construct to fully support this feature.

aalexandrov commented 1 year ago

At the runtime (and metadata management) layer, does this entail anything else beyond advancing the upper of the sealed collection to the empty frontier? Are the since and upper values the mechanism by which the optimizer will figure out that a GlobalId represents a data source that will never emit diffs for more than one timestamp?

teskje commented 1 year ago

I don't think the optimizer can derive monotonicity based on the upper frontier being empty. That's why I mentioned above that a SEAL TABLE command would probably not let us use the monotonic operators.

With an empty frontier, the optimizer would just see that it's reading a collection with (since, upper) being (some_time, []). Between some_time and the empty frontier there could still be arbitrary insertions and retractions, so that's not enough to know that the data is monotonic.

sjwiesman commented 1 year ago

Strawman syntax.

ALTER TABLE table_name DISABLE INSERT, UPDATE, DELETE

Two advantages of this syntax.

  1. It can also support marking a table as monotonic

    ALTER TABLE table_name DISABLE UPDATE, DELETE
  2. It could in the future be extended to allow making something mutable or non-monotonic again. Doing so would require checking that the table is not currently being referenced, but that should be the same logic as checking if something can be dropped without issuing a cascade.

ALTER TABLE table_name ENABLE INSERT, UPDATE, DELETE
jkosh44 commented 1 year ago

Can constant views be used here? Something like:

CREATE VIEW v AS VALUES (1,2,3), (4,5,6);

Or do those not scale well if the dataset is too large?

teskje commented 1 year ago

Can constant views be used here?

Maybe! I assume that approach would run against our SQL query size limit pretty quickly?

jkosh44 commented 1 year ago

I assume that approach would run against our SQL query size limit pretty quickly?

Yeah, the whole thing SQL query would have to fit into 1MB, which doesn't allow for very big views.

jkosh44 commented 1 year ago

Here's another (potentially bad) proposal. We could extend the COPY syntax to allow creating a new view from the COPY result. Something like COPY VIEW static_view (a, b, c) FROM ... that creates a new view named static_view that contains the copied data.

teskje commented 1 year ago

Interesting! In this case, how would the further flow look? Would we put a huge create_sql in the catalog, then put that into the optimizer, then ship a huge dataflow plan to compute to render? I think that would still break down in all kinds of places if the static data is a couple GB in size (which I'm not sure is the case for the user workloads that motivated this issue).

It seems like if we want to support arbitrary sizes we'd have to get the data into persist somehow, so compute can read the collection as usual. So a table mostly does the right thing, except that compute would need to be taught that this type of persist collection is static.

vmarcos commented 1 year ago

Another possibility would be to create a source type that reads from a URI taken to be a static file, only accepts ENVELOPE NONE, and emits the empty frontier after processing the input. Perhaps that would allow for some reuse in supporting different FORMAT specifications for the input data? Extending a table also sounds pretty reasonable, though.

jkosh44 commented 1 year ago

Interesting! In this case, how would the further flow look? Would we put a huge create_sql in the catalog, then put that into the optimizer, then ship a huge dataflow plan to compute to render? I think that would still break down in all kinds of places if the static data is a couple GB in size

Ah yeah, good point. I agree, that would fall over pretty quickly.

benesch commented 10 months ago

cc @frankmcsherry

There's something appealing about @sjwiesman's proposal:

ALTER TABLE table_name DISABLE INSERT, UPDATE, DELETE;
ALTER TABLE table_name DISABLE UPDATE, DELETE;
ALTER TABLE table_name ENABLE INSERT, UPDATE, DELETE;

There's also something appealing about the simplicity of:

SEAL TABLE table_name;
ggevay commented 5 months ago

Closing as completed, because this is mostly covered by REFRESH AT CREATION materialized views.

But note that we are not yet realizing all the possible performance advantages of sealed collections. This is tracked in #23179 and #26571.

Another thing in this discussion which we don't cover yet is unsealing a sealed collection. This sometimes comes up as an idea, but it's hard to implement. I've now created an issue for it (#26572), but this is probably a ways out.