Closed teskje closed 5 months ago
I am not sure if there is a clear best way how to implement this. A relatively easy implementation would be to have a SEAL TABLE
command that causes the frontier of the underlying persist collection to be advanced to the empty frontier and disallows further writes to that table. This would allow compute to drop input arrangements in joins, but I think it wouldn't enable the use of monotonic operator variants.
So perhaps we need a new SQL- and compute-level construct to fully support this feature.
At the runtime (and metadata management) layer, does this entail anything else beyond advancing the upper
of the sealed collection to the empty frontier? Are the since
and upper
values the mechanism by which the optimizer will figure out that a GlobalId
represents a data source that will never emit diffs for more than one timestamp?
I don't think the optimizer can derive monotonicity based on the upper
frontier being empty. That's why I mentioned above that a SEAL TABLE
command would probably not let us use the monotonic operators.
With an empty frontier, the optimizer would just see that it's reading a collection with (since, upper)
being (some_time, [])
. Between some_time
and the empty frontier there could still be arbitrary insertions and retractions, so that's not enough to know that the data is monotonic.
Strawman syntax.
ALTER TABLE table_name DISABLE INSERT, UPDATE, DELETE
Two advantages of this syntax.
It can also support marking a table as monotonic
ALTER TABLE table_name DISABLE UPDATE, DELETE
It could in the future be extended to allow making something mutable or non-monotonic again. Doing so would require checking that the table is not currently being referenced, but that should be the same logic as checking if something can be dropped without issuing a cascade.
ALTER TABLE table_name ENABLE INSERT, UPDATE, DELETE
Can constant views be used here? Something like:
CREATE VIEW v AS VALUES (1,2,3), (4,5,6);
Or do those not scale well if the dataset is too large?
Can constant views be used here?
Maybe! I assume that approach would run against our SQL query size limit pretty quickly?
I assume that approach would run against our SQL query size limit pretty quickly?
Yeah, the whole thing SQL query would have to fit into 1MB, which doesn't allow for very big views.
Here's another (potentially bad) proposal. We could extend the COPY
syntax to allow creating a new view from the COPY
result. Something like COPY VIEW static_view (a, b, c) FROM ...
that creates a new view named static_view
that contains the copied data.
Interesting! In this case, how would the further flow look? Would we put a huge create_sql
in the catalog, then put that into the optimizer, then ship a huge dataflow plan to compute to render? I think that would still break down in all kinds of places if the static data is a couple GB in size (which I'm not sure is the case for the user workloads that motivated this issue).
It seems like if we want to support arbitrary sizes we'd have to get the data into persist somehow, so compute can read the collection as usual. So a table mostly does the right thing, except that compute would need to be taught that this type of persist collection is static.
Another possibility would be to create a source type that reads from a URI taken to be a static file, only accepts ENVELOPE NONE
, and emits the empty frontier after processing the input. Perhaps that would allow for some reuse in supporting different FORMAT
specifications for the input data? Extending a table also sounds pretty reasonable, though.
Interesting! In this case, how would the further flow look? Would we put a huge create_sql in the catalog, then put that into the optimizer, then ship a huge dataflow plan to compute to render? I think that would still break down in all kinds of places if the static data is a couple GB in size
Ah yeah, good point. I agree, that would fall over pretty quickly.
cc @frankmcsherry
There's something appealing about @sjwiesman's proposal:
ALTER TABLE table_name DISABLE INSERT, UPDATE, DELETE;
ALTER TABLE table_name DISABLE UPDATE, DELETE;
ALTER TABLE table_name ENABLE INSERT, UPDATE, DELETE;
There's also something appealing about the simplicity of:
SEAL TABLE table_name;
Closing as completed, because this is mostly covered by REFRESH AT CREATION
materialized views.
But note that we are not yet realizing all the possible performance advantages of sealed collections. This is tracked in #23179 and #26571.
Another thing in this discussion which we don't cover yet is unsealing a sealed collection. This sometimes comes up as an idea, but it's hard to implement. I've now created an issue for it (#26572), but this is probably a ways out.
Feature request
We have users that want to import static data sets into Materialize and then join those against other non-static collections. Today they can do that by creating a table and
COPY
ing their static data into it. But if you do this Materialize is not made aware of the fact that the imported data is static, so any dataflows built on top of it need to provision for the possibility that it might change.If we had a way to make it possible for users to mark a collection of data as static, dataflows that build upon that collection could be rendered way more efficiently. For example: