Closed: JanKaul closed this issue 5 months ago.
MV concept discussion: https://docs.google.com/document/d/1QAuy-meSZ6Oy37iPym8sV_n7R2yKZOHunVR-ZWhhZ6Q/
@JanKaul I think it would be great to get this out to the DEV mailing list to get more attention and input from people
Thanks for writing this up, @JanKaul! It's a good idea to specify how to maintain metadata for materialized views.
I think that the approach, to associate a view with some table that stores the materialized version, is a good design choice. And the metadata you have is a great start as well, although I think we can simplify or improve a couple of things.
First, I think we want to avoid keeping much state information in complex table properties. Those aren't designed for that purpose and make the table a bit difficult to use. What I recommend instead is reusing the existing snapshot metadata structure to store what you need as snapshot properties. This approach has some really nice features in addition to being a bit simpler.
Each materialized view version is going to be stored in a snapshot, so I think it makes sense to take your idea of a "refresh" and simply store that metadata in snapshot properties. Then we don't need a "current" refresh ID, we can just reuse the current snapshot. Similarly, we wouldn't need a new ID, we could just use the snapshot ID, and the sequence number is automatically associated.
The metadata in snapshot properties would be very similar, but much smaller:

| v1 | Snapshot property | Description |
|---|---|---|
| required | view_version_id | Version ID of the view that was materialized |
| required | table.\<identifier\> | Table UUID and snapshot ID for the table identified by \<identifier\> |
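As a concrete illustration, a refresh could record something like the following in the snapshot summary. The key names follow the table above, but the value encoding (table UUID and snapshot ID packed into one string) is an assumption made for this sketch, not part of any finalized spec.

```python
# Hypothetical snapshot summary written by a refresh.
refresh_summary = {
    "view_version_id": "3",
    # one entry per upstream table; the value encoding is assumed
    "table.db.base_tab": "uuid=9c12d441-03fe-4693-9a96-0a1a6b3e8a1e,snapshot-id=6893872198488195000",
}

def parse_table_ref(value):
    # split "uuid=...,snapshot-id=..." back into its parts
    parts = dict(item.split("=", 1) for item in value.split(","))
    return parts["uuid"], int(parts["snapshot-id"])

uuid, snapshot_id = parse_table_ref(refresh_summary["table.db.base_tab"])
```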
In the table, I've also cut out a few of the base table properties, like `type`, and just rely on everything being an Iceberg table upstream. We may not want to do this, but it makes everything simpler.

The nice thing about keeping upstream table UUIDs and snapshot IDs in the snapshot metadata is that it allows us to roll back the state of the view along with the upstream tables. For example, if we have an hourly job that produces bad data and an agg MV based on it, it is possible to roll back both the table and the MV to the matching state. We can also do incremental refresh based on the closest materialized snapshot, not just the latest.
I think we would still want some MV metadata in table properties:
| v1 | Table property | Description |
|---|---|---|
| required | materialized_view_format_version | The MV spec version used |
| required | view_identifier | Identifier for the view that is materialized |
| optional | refresh_strategy | `full` or `incremental`, default: `full` |
We may want additional metadata as well, like a UUID to ensure we have the right view. I don't think we have a UUID in the view spec yet, but we could add one.
I also moved the refresh strategy from the view to the MV table. I think we want to keep as much config on the table as possible, if it may differ between views. I could imagine a case where you might keep both incremental and full materialized versions or might want to have different partitioning specs for materialization, in which case you'd want that set on the table. I think the only thing I'd add to the view itself is the identifier for a materialized table.
The last thing I think we may want to change is to add a section of the proposal for view invalidation. Your property to allow stale data is on the right track, but we can actually detect when a table has not been updated in a way that affects the view query in a lot of cases. For example, if you plan the final query and get input splits for the tables in the view, you can check whether the input is based on a snapshot newer than the MV's base snapshot. If it isn't, then it is safe to use the materialized version. This is a little tricky since you have to account for whether files matching the final filter were deleted, but it should be entirely a metadata operation. I think it would be great to document this as part of a spec.
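A coarse version of this invalidation check, comparing only the upstream snapshots recorded at refresh time against the current ones, could look like the sketch below. The finer split-level analysis described above (checking whether matching files were actually added or deleted) would go beyond this; all names and shapes here are hypothetical.

```python
def is_materialization_fresh(base_snapshots, current_snapshots):
    """base_snapshots: upstream snapshot ids recorded when the MV was refreshed,
    keyed by table UUID. current_snapshots: the upstream tables' current ids."""
    for table_uuid, recorded_id in base_snapshots.items():
        # any upstream table that advanced (or vanished) may invalidate the MV
        if current_snapshots.get(table_uuid) != recorded_id:
            return False
    return True
```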
Thank you @rdblue for your detailed comments. I really like your idea about storing the refresh metadata in the snapshot. It really simplifies the design. I have a couple of questions to better understand your proposal.
To make sure I understood you correctly: you prefer "Design 2: Table + attached common view"?
How exactly would you store the metadata in the snapshot? Would you store the metadata as properties of the `summary` field? If the metadata is supposed to be stored as snapshot properties directly, a new `format-version` has to be introduced.
How are multiple upstream table references stored in the metadata? Is the second entry (`table.<identifier>`) a list of table UUIDs and snapshot IDs? Because generally the materialized view definition can reference multiple upstream tables.
Regarding the `type` field of the `base_table` record: I thought it might make sense to allow future extensions. But you are right that only supporting "iceberg-metastore" tables would further simplify the design.
Thanks for the detailed proposal! Trying to catch up with the conversation here. (btw it would probably be more organized to move this to a google doc or a PR that updates the spec so we can have different threads of discussions, instead of nesting conversations here)
We may want additional metadata as well, like a UUID to ensure we have the right view. I don't think we have a UUID in the view spec yet, but we could add one.
+1, should be added right away so it could be referenced by the MV table
I think we want to keep as much config on the table as possible, if it may differ between views. I could imagine a case where you might keep both incremental and full materialized versions
Why do we really need to have a specific field for refresh method? To me full refresh is always a limitation when incremental cannot be performed. Is there any specific use case where a full refresh is preferred when incremental refresh could be done?
Here is the link to the google doc: https://docs.google.com/document/d/1QAuy-meSZ6Oy37iPym8sV_n7R2yKZOHunVR-ZWhhZ6Q/
Thanks, added some comments there to kick start discussion
As @jackye1995 said, the first thing we need to do is to decide on a general design. I have therefore simplified the section on the design comparison in the Google doc. Please feel free to add any points to the pros and cons.
I think refresh strategy is up to the engine to decide, and it could very well depend on some runtime factors/stats. So same view could leverage different refresh methods over time.
I think there could be a hybrid design between Design 1 and Design 2, where some of the MV props (e.g., `allow-stale-data`) and the pointer to the storage table are kept in the common view (as is the case now in Design 1), while underlying table snapshot information is stored in storage table snapshot properties as @rdblue suggested (as is the case now in Design 2). Reasoning for the pointer direction (view to table or table to view) is discussed in this thread in the design doc.

`allow-stale-data` could be enhanced to `allow-stale-data-within-ms`.
while underlying table snapshot information are stored in storage table snapshot properties
+1, I think we are on the same page on this.
In my view it's still design 1. To me the difference between design 1 and 2 is what you read as the entry point: is it the view, or is it the storage table? So to me it's always the view. Maybe that's the confusion point that I totally missed: what you have been calling design 2 is just storing some information in the storage table.
Information about refresh definitely makes sense to be stored in the storage table corresponding to each snapshot.
So, just to push the progress forward, I think we have some kind of loose consensus that:
Is that the right understanding? @rdblue @JanKaul @wmoustafa
Agreed. Reverse pointer to the view will be hard to maintain, so I am inclined to not having it.
I would say each view version could optionally map to a new storage table (say the view evolved in a backward incompatible way). Not sure if there is a strong use case for multiple tables for the same view version.
Not sure if there is a strong use case for multiple tables for the same view version.
I am thinking about the case where based on the predicate operating on the view, we can choose intelligently what storage table to use.
I am thinking about the case where based on the predicate operating on the view, we can choose intelligently what storage table to use.
I think this is potentially a generic use case for all tables, not only MV storage tables. Generically speaking, a table (MV or not), identified by a UUID, could have multiple storage layouts, and execution engines can choose the best storage layout. MV still points to a single UUID.
Generically speaking, a table (MV or not), identified by a UUID, could have multiple storage layouts, and execution engines can choose the best storage layout.
That's correct, but technically it could be argued that we can model everything as a view, backed by multiple storage tables, and we don't need to add more complexity into the table spec to support multiple storage layouts.
I don't know if it would work or too crazy, just to throw the idea out that I just came up with:
We could potentially make MV a representation in view spec, in parallel to the SQL representation. So for a materialized view, it could have 2 representations:
```json
[
  {
    "type" : "sql",
    "sql" : "SELECT \"count\"(*) my_cnt\nFROM\n  base_tab\n",
    "dialect" : "spark",
    "schema-id" : 2,
    "default-catalog" : "iceberg",
    "default-namespace" : [ "anorwood" ]
  },
  {
    "type" : "materialized",
    "namespace" : "some_namespace",
    "table" : "some_storage_table"
  }
]
```
By doing so, you could support a case where 1 view does not really have a SQL representation, but just a few different table layouts:
```json
[
  {
    "type" : "materialized",
    "namespace" : "some_namespace",
    "table" : "some_storage_table_partitioned_by_col1"
  },
  {
    "type" : "materialized",
    "namespace" : "some_namespace",
    "table" : "some_storage_table_partitioned_by_col2"
  }
]
```
which satisfies the use case described by Walaa.
To clarify, I was saying that multiple representations are outside the scope of MVs, and could be part of the standard table spec. Not sure if the proposal above is along the same lines (mentioning since I see `materialized` references above).
I am referring to the view spec, using the example here: https://iceberg.apache.org/view-spec/#appendix-a-an-example
So in design 1, where we say we want to have a pointer from view to storage table, the pointer is a new type of representation, `materialized`.
We already support multiple representations as of today in the view spec, but today it only can have multiple SQL representations.
By adding this new type, a view can be backed by multiple storage tables, by adding more `materialized` representations.
The use case you described, about a table having multiple storage layouts, could also be modeled as a view with multiple `materialized` representations and no SQL representation at all.

It's just a thought, not fully fleshed out.
Great idea! I think this is a very clean solution to store the metadata. However, I would rename the type to `sql_materialized` in case there are other representations in the future.
@JanKaul if you agree with the summarized consensus we have mostly reached there, for the sake of moving the progress of the discussion forward, could you update the Google doc with the design, and describe the spec changes based on the discussions and suggestions here and combine with your original idea? So that we can start to review the spec contents 💪
Yes, I agree with the proposed design 1. I'm not entirely sure what @rdblue prefers.
I will update the Google doc accordingly.
The next question for me is where and how to store the refresh (freshness) information. There are currently two proposed solutions:
I will create a section in the Google doc that summarizes these options.
React with the following if you prefer one of the solutions: :tada: for solution 1 :rocket: for solution 2
I would +1 on storing in snapshot summary, because:
I also agree with Ryan's suggestion about the information to store, although I think storing the source tables referenced might be a bit hard, since it requires statement analysis of the view.
@wmoustafa @JanKaul
I had some brief discussion about this topic in the community sync this Wednesday, and looking at this thread, I think there is already a consensus about the MV spec. Can we proceed to the actual spec change? Do you have the bandwidth to do that? If not I am also interested in making that change.
+1. We can proceed.
Thank you @jackye1995 for your initiative. I have created a new google doc to discuss the actual specification for materialized views.
Another point I would like to discuss is whether the storage table should be registered in the catalog. I brought this up before, but I think people haven't really voted for a solution. Generally, the storage table can either be registered in the catalog or not. Most proposals in this thread suggested registering both the view and the storage table in the catalog. I would like to propose to only register the view in the catalog. My reasoning is that a materialized view is one logical entity and should therefore also appear in the catalog as one entity. This is also the behavior of most RDBMS and cloud data warehouses like Snowflake.
| | Strategy 1: register storage table in catalog | Strategy 2: don't register storage table in catalog |
|---|---|---|
| Description | Metadata locations for the view and storage table are stored in the catalog. Since both locations are tracked, each transaction is automatically atomic. | Only the location of the main view is stored in the catalog. A reference to the storage table metadata location is stored in the main view. An atomic update of this reference is the basis for making atomic changes to the storage table metadata file. |
| Pros | Storage table can be addressed in the catalog; simple implementation | Only one entry for the materialized view |
| Cons | 2 entities appear in the catalog | Requires the definition of a commit procedure |

React with the following if you prefer one of the strategies: :tada: for strategy 1, :rocket: for strategy 2
Agree it makes more sense to store as one logical entity. Also I think to register with two names, we would have to assign the view and the table different names. The consequences of that are not clear.
I think it has to be a catalog-specific decision, because as you said Snowflake has that hidden, but for example Trino view on Hive/Glue has the storage table exposed. So I am not sure if it needs to be a part of the spec or not.
I think these are different types of catalogs; typically we distinguish them as business data catalogs and technical data catalogs. The former is more oriented toward end users, whereas the latter is more oriented toward technical users who would like to see all the hidden details.
From an implementation perspective, why does approach 2 need a "definition of commit procedure" but approach 1 doesn't? I think it's just a matter of hiding the storage table or not, but there will always be 2 objects behind the scenes.
Btw, @nastra is already working on a proposal for multi-table transactions, so I think that problem will be solved there and we don't need to worry about it in the MV spec.
It would be great if we could make it a catalog-specific decision. But for that the metadata has to be designed to enable both strategies.
I think the question is: what do we use as a unique identifier for the storage table in the representation of the common view?
One approach is to use table_name, namespace and catalog as the unique identifier. But this only works if the storage table is registered in the catalog. Furthermore, without being registered in the catalog, an atomic swap of the storage table metadata file cannot be guaranteed.

Another approach would be to use the storage table metadata file location as the unique identifier. This makes using the catalog a bit more awkward because the tables have to be filtered by the metadata file location. But this approach wouldn't require the storage table to be registered in the catalog. And it would enable atomic transactions on the storage table by making atomic changes to the storage table metadata file location. This is what I meant by commit procedure.
I think using the user-facing table namespace as an identifier is catalog specific, so I do not think it can make it into the spec. Maybe we should answer:
I think this boils down to engines awareness of views being connected to storage tables, and navigating to the table scan APIs from a view entry point. I think using the table storage location (or table UUID) in the view still serves this requirement.
because the tables have to be filtered for the metadata file location.
Could you clarify this a bit more?
When you want to use the metadata file location to retrieve the storage table from a catalog, you would need to search all entries for one where the location field of the metadata matches the provided location. Not all catalogs might be capable of such an operation.
I see. I think this goes back to my previous comment that the engine catalog APIs will have to be able to resolve views as tables. They will simply navigate to the storage table location from the view as an entry point. Engine catalog APIs may have to evolve to do that, along with view resolution implementation. @jackye1995 do you agree this is the workflow? @JanKaul, does this make filtering the catalog by table location not required?
I'm not entirely sure if I understand you correctly. I would expect a query engine catalog to proceed according to the following procedure:

1. Find the view metadata according to namespace and view name.
2. If the query engine doesn't support materialized views, execute the view query.
3. If the query engine supports materialized views, use the storage table pointer (table_name + namespace, metadata location, or UUID) to access the storage table metadata. If the storage table metadata location is used, retrieving the metadata is easy. If namespace + table_name is used, the metadata has to be fetched from the Iceberg or query engine catalog.
4. Use the storage table metadata to execute the TableScan operator.

Ideally, one would implement some kind of query engine catalog interface for the Iceberg catalog, so that the Iceberg catalog can be used directly.
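The procedure above can be sketched as follows, with a plain dict standing in for the catalog and a caller-supplied loader for the storage table metadata; all names and shapes are hypothetical.

```python
def resolve_for_scan(catalog, identifier, supports_mv, load_table_metadata):
    view = catalog[identifier]                      # 1. find the view metadata
    pointer = view.get("storage-table-pointer")
    if pointer is None or not supports_mv:
        return ("execute-view-sql", view["sql"])    # 2. fall back to the view query
    metadata = load_table_metadata(pointer)         # 3. resolve the storage table
    return ("table-scan", metadata)                 # 4. hand off to the TableScan
```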
That is correct @JanKaul.
> Strategy 1 (register storage table in catalog). Pros: storage table can be addressed in the catalog; simple implementation. Cons: 2 entities appear in the catalog.
>
> Strategy 2 (don't register storage table in catalog). Pros: only one entry for the materialized view. Cons: requires the definition of a commit procedure.
>
> React with :tada: for strategy 1, :rocket: for strategy 2.
@rdblue do you have any preference for which unique identifier to use for referencing the storage table?
I have an idea how we could define a general specification for the "storage table pointer" that would allow to decide on a case by case basis whether to register the storage table in the catalog or not.
We could use a storage table pointer of the form `protocol://identifier`. For example:

- `catalog://mycatalog.public.table1` for a storage table that is registered in a catalog
- `s3://bucket/home/iceberg/warehouse/nyc/taxis/metadata/00000-8a62c37d-4573-4021-952a-c0baef7d21d0.metadata.json` for referencing the storage table by metadata location

I do not think the Iceberg metadata should "point back" to the catalog IDs (first option). What is the disadvantage of just going with the location of the storage table (last option, but without the protocol part)?
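For illustration, the two pointer shapes discussed here could be distinguished like this (a sketch; the `catalog://` scheme is only a hypothetical proposal from this thread):

```python
from urllib.parse import urlparse

def parse_storage_pointer(pointer):
    parsed = urlparse(pointer)
    if parsed.scheme == "catalog":
        # catalog://mycatalog.public.table1 -> catalog, namespace, table name
        return ("catalog-identifier", parsed.netloc.split("."))
    # anything else is treated as a direct metadata file location
    return ("metadata-location", pointer)
```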
Any update on this?
I think we almost have a consensus on the Materialized View Spec. I brought up one issue with the pointer to the storage table that is holding up progress at the moment. The issue is whether to use the table identifier (catalog, namespace, name) or the table metadata location, and therefore whether to register the storage table in the catalog.

I've come to realize that using the table identifier is the only solution that works for all Iceberg catalogs, because using the table metadata location doesn't work with the REST catalog.

So in my opinion we can settle this issue and use the table identifier as the pointer to the storage table. We would then need to store the storage table in the catalog.
With this we should have a consensus on the Spec. I will update the documents and then everyone can have another look.
I've updated the issue description and the google doc (https://docs.google.com/document/d/1UnhldHhe3Grz8JBngwXPA6ZZord1xMedY5ukEhZYF-A/edit?usp=sharing).
I would love to get your feedback. @liurenjie1024 @jackye1995 @wmoustafa @rdblue
@wmoustafa I hope you are okay with using the table identifier as the storage table pointer instead of using the metadata location. But I don't see a way to use the metadata location with the REST catalog. And ultimately we require a robust solution that works across all catalogs.
Could you clarify how it does not work with the catalog API? At the end of the day, the catalog API takes a table identifier and returns an object. That can be a table object, a virtual view object, or a hybrid object (having characteristics of both table and virtual view objects). In the case of a materialized view, the catalog should be able to take a view name, i.e., identifier, and return that hybrid object. So both the virtual view and the storage table are addressed by the view name. That said, I am still not clear why the Iceberg spec needs to be aware of the catalog-specific table identifier. Table identifiers are not part of the Iceberg table spec, so I do not think they can make it into the materialized view spec. (Table identifier as in `catalog.db.table` should not be confused with table UUID; the latter is part of the V2 spec, which should be fine to use.)
You are right, I was confused because the REST catalog stores the metadata internally and I thought this would include the storage table. But it would be possible to store the metadata for the view in the REST catalog and still store the metadata for the storage table in a `metadata.json` file. This way the metadata location could still be used as a pointer to the storage table. Thanks for thinking this through again.
This leaves us again with the question of what to use for the storage table pointer.
This leaves us again with the question of what to use for the storage table pointer.
I think table UUID or location is fine. Same should apply when referencing the base tables and their respective snapshot IDs. Maybe for base tables, UUID makes more sense.
Thanks for your input. The discussion has moved to the Google doc (https://docs.google.com/document/d/1UnhldHhe3Grz8JBngwXPA6ZZord1xMedY5ukEhZYF-A/edit?usp=sharing). It would be great if you could add your comment there. Then it will have a higher visibility.
@jackye1995 it would be great if you could have a look at the google doc.
Hi @JanKaul . Thanks for putting this together. I went through the detailed discussion and left my votes. I see there is general consensus to the "Open Questions" in the design docs, which are:
Is that correct? Then given that, as it takes a bit of time to parse the discussion to see the actual changes, I went ahead and summarized the additions we are making to the current metadata spec below (taken from your doc).
| v1 | v2 | Field Name | Description |
|---|---|---|---|
| optional | optional | materialization | An optional materialization struct. If the value is null the entity is a common view, otherwise it is a materialized view |

Materialization Struct

| v1 | Field Name | Description |
|---|---|---|
| required | format-version | An integer version number for the materialized view format. Currently, this must be 1. Implementations must throw an exception if the materialized view's version is higher than the supported version. |
| required | storage-table | Table metadata location |
| v1 | Field Name | Description |
|---|---|---|
| optional | refresh-version-id | Version id of the materialized view when the refresh operation was performed. |
| optional | source-tables | A list of source-table records. |

Source Table Struct

| v1 | Field Name | Description |
|---|---|---|
| required | identifier | Identifier of the table as defined in the SQL expression. |
| required | snapshot-id | Snapshot id of the source table when the last refresh operation was performed. |
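Putting the two structs together, a lineage record in a storage table snapshot might serialize like this (a sketch; field names follow the tables above, the values are invented):

```python
import json

# Hypothetical lineage record for a storage table snapshot.
lineage = {
    "refresh-version-id": 2,          # view version that was materialized
    "source-tables": [
        {
            "identifier": "db.base_tab",
            "snapshot-id": 6893872198488195000,
        }
    ],
}

serialized = json.dumps(lineage, sort_keys=True)
```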
Let me know if that looks right.
My 2c on this are:
If there is general consensus on the direction, it'd be great to move on to the actual spec PR change and discuss specifics there, as it seems this proposal has been sitting a while. I can also help with that, if needed. Thanks.
Hi @szehon-ho, thanks for trying to move the process of reaching consensus along. To be honest, I don't know how the community normally reaches consensus on these kinds of topics. But I still have the feeling that we are lacking the feedback from some key stakeholders. Until now I was waiting to get more feedback but it seems like creating the PRs is the right thing to move the proposal forward.
In any case it might be good to bring this up at a community sync.
I would argue for 2 PRs, one for the changes to the view metadata and one for the changes to the table metadata.
Great, this all makes sense to me except following points:
- Question: Most people seem to be for Option 1
As we clarified in the google doc, it seems (at least among the local thread) we are moving to Option 2: https://docs.google.com/document/d/1UnhldHhe3Grz8JBngwXPA6ZZord1xMedY5ukEhZYF-A/edit?pli=1&disco=AAABFByy0mU
- As can be seen in the last point of https://docs.google.com/document/d/1UnhldHhe3Grz8JBngwXPA6ZZord1xMedY5ukEhZYF-A/edit?pli=1#heading=h.m5kli4l5q7ui, we agreed to leave the refresh strategy to the query engine. If available, an incremental refresh is always preferable to a full refresh.
Did we consider making this property configurable? We should at least store which refresh strategy was used in the snapshot, for debugging. We can discuss on the Google doc.
So I'm open to either: we can make a PR, or fill out the Specification Draft part of the doc. The main barrier right now to more folks reviewing is that the doc is just a list of options and long threads, which makes it really hard to see where the consensus is; it took me quite some time to follow. Having a central section with the concrete additions we are proposing will greatly streamline the review experience. Feel free to take the summarized spec changes in my previous comment as a starting point.
Let me know if you need help with it. Thanks
Sorry, for Question 3 I mixed up Options 1 & 2. You are right there is consensus on 2.
Great, it seems like all open questions have been answered. I will update the Specification Draft accordingly. Additionally we can create the PRs to make the spec changes. If you are interested you can create the PRs otherwise I will do it tomorrow.
Iceberg Materialized View Spec
Materialized View Spec google doc: https://docs.google.com/document/d/1UnhldHhe3Grz8JBngwXPA6ZZord1xMedY5ukEhZYF-A/edit?usp=sharing
Background and Motivation
A materialized view is a database entity that stores a query definition as a logical table. The query is precomputed and the resulting data is served when the materialized view is queried. The cost of query execution is pushed to the precomputation step and is amortized over the query executions.
Goal
A common metadata format for materialized views enabling materialized views to be created, read and updated by different query engines.
Overview
The metadata of a MV (materialized view) consists of a query definition, a pointer to the precomputed data and lineage information to check whether the MV is outdated. Iceberg materialized views are realized as a combination of an Iceberg view with an underlying Iceberg table that stores the precomputed data. The Iceberg view is referred to as the “common view”, while the table is labeled the “storage table”. The query definition of the materialized view is stored in the common view. The precomputed data is stored in the storage table. The storage table can have the states “fresh”, “outdated” or “invalid”. The lineage information is composed of data about the so-called “source tables”, which are the tables referenced in the query definition of the materialized view.
Operations on Materialized Views
Creation of materialized views
When a materialized view is created, an Iceberg view with the query definition and an Iceberg table to store the precomputed data are created. The storage table pointer of the view is set to the storage table.
Determining the storage table state
To determine the storage table state, the current-version-id of the view is compared to the refresh-version-id in the lineage struct of the current storage table snapshot. If the versions don't match, the storage table state is invalid. If the versions match, the snapshot ids of the last refresh operation, which are stored in the lineage information, are compared to the current snapshot ids of the source tables. If the snapshot ids match for all source tables, the storage table state is fresh; otherwise it is outdated.
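The state check described above can be sketched directly. The shapes are hypothetical: `lineage` follows the lineage record from this draft, and `current_snapshots` maps source table identifiers to their current snapshot ids.

```python
def storage_table_state(view_current_version_id, lineage, current_snapshots):
    # view definition changed since the last refresh -> precomputed data is invalid
    if lineage["refresh-version-id"] != view_current_version_id:
        return "invalid"
    # any source table that advanced since the refresh makes the data outdated
    for source in lineage["source-tables"]:
        if current_snapshots.get(source["identifier"]) != source["snapshot-id"]:
            return "outdated"
    return "fresh"
```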
Refresh of materialized views
The exact mechanism for triggering a refresh is left up to individual systems. It might be triggered in a variety of ways. Some examples:
The mechanism to perform the refresh operation uses the storage table state, which is determined by the procedure described above, to decide which action to take.
Query of materialized views
As a first step the storage table state is determined by following the procedure described above.
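How an engine might act on that state is sketched below, including the `materialization.data.allow-stale` property from this draft; the action names are hypothetical.

```python
def plan_query(state, allow_stale=False):
    if state == "fresh":
        return "scan-storage-table"
    if state == "outdated" and allow_stale:
        return "scan-storage-table"      # stale data explicitly permitted
    # outdated without allow-stale, or invalid: refresh first, or execute
    # the view's query definition directly
    return "refresh-or-execute-view-query"
```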
Commit Procedure
The catalog provides the basis for making atomic changes to the view metadata. Readers use the version of the view that was current when they loaded the view metadata and are not affected by changes until they refresh and pick up the new metadata.
Writers distinguish between changing the view or the storage table state.
Writers changing the view state perform an “update view” operation using the catalog, optimistically assuming the view hasn't changed since the metadata was loaded.

Writers changing the storage table state perform the commit in two steps. First, the writer optimistically creates a new table metadata file, assuming that the storage-table-pointer of the view will not be changed before the writer's commit, and updates the storage-table-pointer to the new location of the table metadata. Second, the new view metadata gets committed to the catalog, requiring that the current storage-table-pointer in the catalog is equal to the previous storage-table-pointer. The commit is only successful when the second step succeeds.
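The second step amounts to a compare-and-swap on the view's storage-table-pointer. In the sketch below a dict stands in for the catalog, whose real implementation must make the check-and-update atomic; all names are hypothetical.

```python
class CommitConflict(Exception):
    pass

def commit_storage_update(catalog, view_name, expected_pointer, new_pointer):
    view = catalog[view_name]
    # the catalog must perform this compare-and-swap atomically
    if view["storage-table-pointer"] != expected_pointer:
        raise CommitConflict("storage table changed since metadata was loaded")
    view["storage-table-pointer"] = new_pointer
```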
Specification (Draft)
Terms
Metadata
The metadata for a materialized view can be considered an extension of the view metadata. To account for possible precomputed data, a materialization field is added to the view spec. The materialization field stores the “storage table pointer”, which references the location of the metadata.json file of the storage table. The following table shows the view metadata according to the v1 view spec compared to this proposal.
- `view-uuid`
- `format-version`
- `location`
- `schemas`
- `current-version-id`
- `versions` (each version identified by its `version-id`)
- `version-log` (an entry with the `version-id` for every change to `current-version-id`)
- `properties`
- `materialization` (added by this proposal)
Notes:
Storage Table Metadata
The lineage information that is required to determine whether the storage table contains fresh data is stored in the form of a lineage record in each storage table snapshot. The following table shows the snapshot metadata according to the v2 table spec compared to this proposal.
- `snapshot-id`
- `parent-snapshot-id`
- `sequence-number`
- `timestamp-ms`
- `manifest-list`
- `summary` (including `operation`, see below)
- `schema-id`
- `lineage` (added by this proposal)
Lineage record
The lineage record has the following fields:
- `refresh-version-id`
- `source-tables`: a list of `source-table` records.

Source table record
The source table record has the following fields:
- `uuid`
- `identifier`
- `snapshot-id`
Full Identifier record

The full identifier record has the following fields:

- `catalog`
- `namespace`
- `table-name`
View Properties
Additional view properties that allow the configuration of the materialized view:
- `materialization.data.allow-stale` (default: `false`)
Query engine metadata
To enable query engines to write metadata for logging and debugging, the query engine can write the following fields to the snapshot summary of the storage table:
- `materialization-refresh-strategy`