
[Proposal] Iceberg Materialized View Spec #6420

Closed JanKaul closed 5 months ago

JanKaul commented 1 year ago

Iceberg Materialized View Spec

Materialized View Spec google doc: https://docs.google.com/document/d/1UnhldHhe3Grz8JBngwXPA6ZZord1xMedY5ukEhZYF-A/edit?usp=sharing

Background and Motivation

A materialized view is a database entity that stores a query definition as a logical table. The query is precomputed and the resulting data is served when the materialized view is queried. The cost of query execution is thereby shifted to the precomputation step and amortized across subsequent queries.

Goal

A common metadata format for materialized views that enables them to be created, read, and updated by different query engines.

Overview

The metadata of a MV (materialized view) consists of a query definition, a pointer to the precomputed data, and lineage information used to check whether the MV is outdated. Iceberg materialized views are realized as the combination of an Iceberg view with an underlying Iceberg table that stores the precomputed data. The Iceberg view is referred to as the “common view”, while the table is labeled the “storage table”. The query definition of the materialized view is stored in the common view; the precomputed data is stored in the storage table. The storage table can be in one of three states: “fresh”, “outdated”, or “invalid”. The lineage information is composed of data about the so-called “source tables”, the tables referenced in the query definition of the materialized view.

classDiagram
    View --> Table : storage table pointer
    class View{
        +query definition
        +storage table pointer
    }
    class Table{
        +precomputed data
        +lineage information
    }

Operations on Materialized Views

Creation of materialized views

When a materialized view is created, an Iceberg view with the query definition and an Iceberg table to store the precomputed data are created. The storage table pointer of the view is then set to point to the storage table.
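As a rough illustration, creation could look like the following sketch. The `catalog` client and every method name (`create_table`, `create_view`, `set_materialization`) are hypothetical stand-ins, not part of any Iceberg API; the proposal only prescribes the resulting metadata.

```python
# Sketch of MV creation, assuming a hypothetical catalog client.
def create_materialized_view(catalog, identifier, query_sql, schema):
    # Create the storage table that will hold the precomputed data.
    storage_table = catalog.create_table(
        identifier + "__storage",  # naming scheme is illustrative only
        schema=schema,
    )
    # Create the common view holding the query definition, then point its
    # materialization field at the storage table's metadata.json location.
    view = catalog.create_view(identifier, sql=query_sql)
    view.set_materialization(storage_table.metadata_location)
    return view
```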

Determining the storage table state

To determine the storage table state, the current-version-id of the view is compared to the refresh-version-id in the lineage struct of the current storage table snapshot. If the versions don’t match, the storage table state is invalid. If the versions match, the snapshot IDs recorded by the last refresh operation, which are stored in the lineage information, are compared to the current snapshot IDs of the source tables. If the snapshot IDs match for all source tables, the storage table state is fresh; otherwise it is outdated.
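A minimal sketch of this check, assuming hypothetical accessors on the view, snapshot lineage, and catalog objects (none of these names come from the Iceberg API):

```python
# Returns "fresh", "outdated", or "invalid" per the procedure above.
def storage_table_state(view, storage_table, catalog):
    lineage = storage_table.current_snapshot().lineage

    # A version mismatch means the view definition changed after the last
    # refresh, so the precomputed data no longer matches the query.
    if lineage.refresh_version_id != view.current_version_id:
        return "invalid"

    # Compare the snapshot IDs recorded at refresh time against the current
    # snapshot IDs of all source tables.
    for source in lineage.source_tables:
        table = catalog.load_table(source.identifier)
        if table.current_snapshot().snapshot_id != source.snapshot_id:
            return "outdated"
    return "fresh"
```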

Refresh of materialized views

The exact mechanism for triggering a refresh is left up to individual systems; it might be triggered in a variety of ways.

The mechanism to perform the refresh operation uses the storage table state, which is determined by the procedure described above, to decide which action to take.

Query of materialized views

As a first step the storage table state is determined by following the procedure described above.
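Combined with the materialization.data.allow-stale property defined below, query handling could then proceed along these lines. This is a sketch reusing the hypothetical storage_table_state helper from above; the engine object and its methods are likewise illustrative.

```python
def query_materialized_view(view, storage_table, catalog, engine):
    state = storage_table_state(view, storage_table, catalog)

    if state == "invalid":
        # The precomputed data belongs to an older view version; fall back
        # to executing the query definition itself.
        return engine.execute(view.sql)

    if state == "outdated":
        allow_stale = view.properties.get(
            "materialization.data.allow-stale", "false") == "true"
        if not allow_stale:
            # Refresh before answering, as the property default requires.
            engine.refresh(view, storage_table)

    # Fresh, freshly refreshed, or stale-but-allowed: serve the stored data.
    return engine.scan(storage_table)
```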

Commit Procedure

The catalog provides the basis for making atomic changes to the view metadata. Readers use the version of the view that was current when they loaded the view metadata and are not affected by changes until they refresh and pick up the new metadata.

Writers distinguish between changing the view state and changing the storage table state.

Writers changing the view state perform an “update view” operation using the catalog, optimistically assuming the view hasn’t changed since the metadata was loaded.

Writers changing the storage table state perform the commit in two steps. First, the writer optimistically creates a new table metadata file, assuming that the storage-table-pointer of the view will not be changed before the writer’s commit, and updates the storage-table-pointer in the view metadata to the new table metadata location. Second, the new view metadata is committed to the catalog, requiring that the storage-table-pointer currently stored in the catalog equals the previous storage-table-pointer. The commit is only successful if the second step succeeds.
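A sketch of this two-step commit, assuming the catalog exposes an atomic compare-and-swap on the view's storage-table-pointer; the function and method names (including the write_table_metadata helper) are hypothetical:

```python
def commit_storage_table_change(catalog, view_identifier, new_snapshot):
    view = catalog.load_view(view_identifier)
    old_pointer = view.materialization  # current storage-table-pointer

    # Step 1: optimistically write a new storage table metadata file that
    # includes the new snapshot; write_table_metadata stands in for whatever
    # produces the new metadata.json and returns its location.
    new_pointer = write_table_metadata(old_pointer, new_snapshot)

    # Step 2: commit new view metadata whose pointer is swapped to the new
    # location; the catalog must reject the commit if the pointer it stores
    # no longer equals old_pointer.
    ok = catalog.update_view(
        view_identifier,
        expected_materialization=old_pointer,
        new_materialization=new_pointer,
    )
    if not ok:
        raise RuntimeError("storage-table-pointer changed concurrently; retry")
```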

Specification (Draft)

Terms

Metadata

The metadata for a materialized view can be considered an extension of the view metadata. To account for possible precomputed data, a materialization field is added to the view spec. The materialization field stores the “storage table pointer”, which references the location of the metadata.json file of the storage table. The following table shows the view metadata according to the v1 view spec compared to this proposal.

| v1 | materialized view | Field name | Description |
|---|---|---|---|
| required | required | view-uuid | A UUID that identifies the view, generated when the view is created. Implementations must throw an exception if a view's UUID does not match the expected UUID after refreshing metadata |
| required | required | format-version | An integer version number for the view format; must be 1 |
| required | required | location | The view's base location; used to create metadata file locations |
| required | required | schemas | A list of known schemas |
| required | required | current-version-id | ID of the current version of the view (version-id) |
| required | required | versions | A list of known versions of the view |
| required | required | version-log | A list of version log entries with the timestamp and version-id for every change to current-version-id |
| optional | required | properties | A string to string map of view properties |
| | optional | materialization | The storage table pointer [1]; used to retrieve the precomputed data from the storage table. If the value is null the entity is a common view, otherwise it is a materialized view |

Notes:

  1. Storage table pointer: The metadata of the storage table is stored in a metadata.json file and must not be stored in a catalog. The storage table pointer contains the location of the metadata.json file.

Storage Table Metadata

The lineage information that is required to determine whether the storage table contains fresh data is stored as a lineage record in each storage table snapshot. The following table shows the snapshot metadata according to the v2 table spec compared to this proposal.

| v2 | materialized view | Field | Description |
|---|---|---|---|
| required | required | snapshot-id | A unique long ID |
| optional | optional | parent-snapshot-id | The snapshot ID of the snapshot's parent. Omitted for any snapshot with no parent |
| required | required | sequence-number | A monotonically increasing long that tracks the order of changes to a table |
| required | required | timestamp-ms | A timestamp when the snapshot was created, used for garbage collection and table inspection |
| required | required | manifest-list | The location of a manifest list for this snapshot that tracks manifest files with additional metadata |
| required | required | summary | A string map that summarizes the snapshot changes, including operation (see below) |
| optional | optional | schema-id | ID of the table's current schema when the snapshot was created |
| | optional | lineage | A lineage record containing freshness information for materialized views. Optional field that only applies if the table is a storage table for a materialized view |

Lineage record

The lineage record has the following fields:

| v1 | Field Name | Description |
|---|---|---|
| required | refresh-version-id | Version ID of the materialized view when the refresh operation was performed |
| required | source-tables | A list of source-table records |

Source table record

The source table record has the following fields:

| v1 | Field Name | Description |
|---|---|---|
| required | uuid | UUID of the source table |
| required | identifier | A full identifier record containing catalog, namespace and table-name fields |
| required | snapshot-id | Snapshot ID of the source table when the refresh operation was performed |

Full Identifier record

The full identifier record has the following fields:

| v1 | Field Name | Description |
|---|---|---|
| required | catalog | Catalog of the source table |
| required | namespace | Namespace of the source table |
| required | table-name | Name of the source table |
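Putting the three records together, a snapshot's lineage information could look like the following illustrative example. All values are made up; only the field names follow the tables above.

```python
lineage = {
    "refresh-version-id": 3,  # view version at the time of the refresh
    "source-tables": [
        {
            "uuid": "9c12ff44-...",  # made-up, truncated source table UUID
            "identifier": {
                "catalog": "prod",
                "namespace": "db",
                "table-name": "base_tab",
            },
            # snapshot of base_tab that the refresh operation read from
            "snapshot-id": 6563638575805274863,
        },
    ],
}
```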

View Properties

Additional view properties that allow the configuration of the materialized view:

| Property | Default | Description |
|---|---|---|
| materialization.data.allow-stale | false | Boolean that defines the query engine's behavior in case the source tables indicate the precomputed data isn't fresh. If set to false, a refresh operation has to be performed before the query results are returned. If set to true, the data in the storage table is returned without performing a refresh operation |

Query engine metadata

To enable query engines to write metadata for logging and debugging, the query engine can write the following fields to the snapshot summary of the storage table:

| Property | Description |
|---|---|
| materialization-refresh-strategy | Which refresh strategy was used to perform the refresh operation. Can be either “FULL” or “INCREMENTAL” |

JanKaul commented 1 year ago

MV concept discussion: https://docs.google.com/document/d/1QAuy-meSZ6Oy37iPym8sV_n7R2yKZOHunVR-ZWhhZ6Q/

nastra commented 1 year ago

@JanKaul I think it would be great to get this out to the DEV mailing list to get more attention and input from people

rdblue commented 1 year ago

Thanks for writing this up, @JanKaul! It's a good idea to specify how to maintain metadata for materialized views.

I think that the approach, to associate a view with some table that stores the materialized version, is a good design choice. And the metadata you have is a great start as well, although I think we can simplify or improve a couple of things.

First, I think we want to avoid keeping much state information in complex table properties. Those aren't designed for the purpose and make the table a bit difficult to use. What I recommend instead is reusing the existing snapshot metadata structure to store what you need as snapshot properties. This approach has some really nice features in addition to being a bit simpler.

Each materialized view version is going to be stored in a snapshot, so I think it makes sense to take your idea of a "refresh" and simply store that metadata in snapshot properties. Then we don't need a "current" refresh ID, we can just reuse the current snapshot. Similarly, we wouldn't need a new ID, we could just use the snapshot ID, and the sequence number is automatically associated.

The metadata in snapshot properties would be very similar, but much smaller:

| v1 | Snapshot property | Description |
|---|---|---|
| required | view_version_id | version ID of the view that was materialized |
| required | table.<identifier> | table UUID and snapshot ID for the table identified by `<identifier>` that was read |

In the table, I've also cut out a few of the base table properties...
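For illustration, a snapshot summary under this scheme might carry entries like the following. This is one possible encoding with made-up identifiers; the comment above does not fix a value format.

```python
summary = {
    "view_version_id": "3",
    # one entry per upstream table read by the refresh, keyed by identifier
    "table.db.base_tab": "9c12ff44-... : 6563638575805274863",  # UUID : snapshot ID
}
```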

The nice thing about keeping upstream table UUIDs and snapshot IDs in the snapshot metadata is that it allows us to roll back the state of the view along with the upstream tables. For example, if we have an hourly job that produces bad data and an agg MV based on it, it is possible to roll back both the table and the MV to the matching state. We can also do incremental refresh based on the closest materialized snapshot, not just the latest.

I think we would still want some MV metadata in table properties:

| v1 | Table property | Description |
|---|---|---|
| required | materialized_view_format_version | The MV spec version used |
| required | view_identifier | Identifier for the view that is materialized |
| optional | refresh_strategy | full or incremental, default: full |

We may want additional metadata as well, like a UUID to ensure we have the right view. I don't think we have a UUID in the view spec yet, but we could add one.

I also moved the refresh strategy from the view to the MV table. I think we want to keep as much config on the table as possible, if it may differ between views. I could imagine a case where you might keep both incremental and full materialized versions or might want to have different partitioning specs for materialization, in which case you'd want that set on the table. I think the only thing I'd add to the view itself is the identifier for a materialized table.

The last thing I think we may want to change is to add a section of the proposal for view invalidation. Your property to allow stale data is on the right track, but we can actually detect when a table has not been updated in a way that affects the view query in a lot of cases. For example, if you plan the final query and get input splits for the tables in the view, you can check whether the input is based on a snapshot newer than the MV's base snapshot. If it isn't, then it is safe to use the materialized version. This is a little tricky since you have to account for whether files matching the final filter were deleted, but it should be entirely a metadata operation. I think it would be great to document this as part of a spec.
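A very rough sketch of that metadata-only check, assuming a hypothetical incremental-scan API over the changes committed after the materialized base snapshot (not an existing Iceberg interface):

```python
# True if no data or delete files matching the view's filter were committed
# to the source table after the snapshot the MV was built from.
def materialization_still_usable(source_table, base_snapshot_id, view_filter):
    changes = (source_table
               .incremental_scan(from_snapshot_exclusive=base_snapshot_id)
               .filter(view_filter))
    # Deleted files are the tricky case mentioned above: a delete touching
    # rows that match the filter also invalidates the materialization.
    return not changes.has_added_files() and not changes.has_deleted_files()
```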

JanKaul commented 1 year ago

Thank you @rdblue for your detailed comments. I really like your idea about storing the refresh metadata in the snapshot. It really simplifies the design. I have a couple of questions to better understand your proposal.

  1. To make sure I understood you correctly: you prefer "Design 2: Table + attached common view"?

  2. How exactly would you store the metadata in the snapshot? Would you store the metadata as properties of the summary field? If the metadata is supposed to be stored as snapshot properties directly, a new `format-version` has to be introduced.

  3. How are multiple upstream table references stored in the metadata? Is the second entry (table.<identifier>) a list of table UUIDs and snapshot IDs? Because generally the materialized view definition can reference multiple upstream tables.

Regarding the type field of the base_table record: I thought it might make sense to allow future extensions. But you are right that only supporting "iceberg-metastore" tables would allow us to further simplify the design.

jackye1995 commented 1 year ago

Thanks for the detailed proposal! Trying to catch up with the conversation here. (btw it would probably be more organized to move this to a google doc or a PR that updates the spec so we can have different threads of discussions, instead of nesting conversations here)

We may want additional metadata as well, like a UUID to ensure we have the right view. I don't think we have a UUID in the view spec yet, but we could add one.

+1, should be added right away so it could be referenced by the MV table

I think we want to keep as much config on the table as possible, if it may differ between views. I could imagine a case where you might keep both incremental and full materialized versions

Why do we really need to have a specific field for refresh method? To me full refresh is always a limitation when incremental cannot be performed. Is there any specific use case where a full refresh is preferred when incremental refresh could be done?

JanKaul commented 1 year ago

Here is the link to the google doc: https://docs.google.com/document/d/1QAuy-meSZ6Oy37iPym8sV_n7R2yKZOHunVR-ZWhhZ6Q/

jackye1995 commented 1 year ago

Thanks, added some comments there to kick start discussion

JanKaul commented 1 year ago

As @jackye1995 said, the first thing we need to do is decide on a general design. I therefore simplified the section on the design comparison in the Google doc. Please feel free to add any points to the pros and cons.

wmoustafa commented 1 year ago

I think refresh strategy is up to the engine to decide, and it could very well depend on some runtime factors/stats. So same view could leverage different refresh methods over time.

I think there could be a hybrid design between Design 1 and Design 2, where some of the MV props (e.g., allow-stale-data) and the pointer to the storage table are kept in the common view (as is the case now in Design 1), while the underlying table snapshot information is stored in storage table snapshot properties as @rdblue suggested (as is the case now in Design 2). The reasoning for the pointer direction (view to table or table to view) is discussed in this thread in the design doc.

allow-stale-data could be enhanced to allow-stale-data-within-ms.

jackye1995 commented 1 year ago

while underlying table snapshot information are stored in storage table snapshot properties

+1, I think we are on the same page on this.

In my view it's still design 1. To me the difference between design 1 and 2 is what you read as the entry point: is it the view, or is it the storage table? To me it's always the view. Maybe the point of confusion I totally missed is that what you have been calling design 2 is just storing some information in the storage table.

Information about refresh definitely makes sense to be stored in the storage table corresponding to each snapshot.

jackye1995 commented 1 year ago

So just want to push the progress forward, I think we have some kind of loose consensus that:

  1. view + storage table is likely the general approach to go
  2. view stores pointer to the storage table, or potentially multiple storage tables
  3. storage table should store the information about each refresh as table snapshot properties, some reverse pointer to the view is also good to have

Is that the right understanding? @rdblue @JanKaul @wmoustafa

wmoustafa commented 1 year ago

Agreed. A reverse pointer to the view will be hard to maintain, so I am inclined not to have it.

I would say each view version could optionally map to a new storage table (say the view evolved in a backward incompatible way). Not sure if there is a strong use case for multiple tables for the same view version.

jackye1995 commented 1 year ago

Not sure if there is a strong use case for multiple tables for the same view version.

I am thinking about the case where based on the predicate operating on the view, we can choose intelligently what storage table to use.

wmoustafa commented 1 year ago

I am thinking about the case where based on the predicate operating on the view, we can choose intelligently what storage table to use.

I think this is potentially a generic use case for all tables, not only MV storage tables. Generically speaking, a table (MV or not), identified by a UUID, could have multiple storage layouts, and execution engines can choose the best storage layout. MV still points to a single UUID.

jackye1995 commented 1 year ago

Generically speaking, a table (MV or not), identified by a UUID, could have multiple storage layouts, and execution engines can choose the best storage layout.

That's correct, but technically it could be argued that we can model everything as a view, backed by multiple storage tables, and we don't need to add more complexity into the table spec to support multiple storage layouts.

jackye1995 commented 1 year ago

I don't know if it would work or too crazy, just to throw the idea out that I just came up with:

We could potentially make MV a representation in view spec, in parallel to the SQL representation. So for a materialized view, it could have 2 representations:

[
  {
    "type" : "sql",
    "sql" : "SELECT \"count\"(*) my_cnt\nFROM\n  base_tab\n",
    "dialect" : "spark",
    "schema-id" : 2,
    "default-catalog" : "iceberg",
    "default-namespace" : [ "anorwood" ]
  },
  {
    "type" : "materialized",
    "namespace" : "some_namespace",
    "table" : "some_storage_table"
  }
]

By doing so, you could support a case where 1 view does not really have a SQL representation, but just a few different table layouts:

[
  {
    "type" : "materialized",
    "namespace" : "some_namespace",
    "table" : "some_storage_table_partitioned_by_col1"
  },
  {
    "type" : "materialized",
    "namespace" : "some_namespace",
    "table" : "some_storage_table_partitioned_by_col2"
  }
]

which satisfies the use case described by Walaa.

wmoustafa commented 1 year ago

To clarify, I was saying that multiple representations are outside the scope of MVs, and could be part of standard table spec. Not sure if the proposal above is along the same lines (mentioning since I see materialized references above).

jackye1995 commented 1 year ago

I am referring to the view spec, using the example here: https://iceberg.apache.org/view-spec/#appendix-a-an-example

So in design 1 where we say we want to have a pointer from view to storage table, the pointer is a new type of representation materialized.

We already support multiple representations as of today in the view spec, but today it only can have multiple SQL representations.

By adding this new type, it means a view can be backed by multiple storage tables, by adding more materialized representations.

The use case you mentioned, a table having multiple storage layouts, could also be modeled as a view with multiple materialized representations and no SQL representation at all.

It's just a thought, not fully fleshed out.

JanKaul commented 1 year ago

> We could potentially make MV a representation in view spec, in parallel to the SQL representation. So for a materialized view, it could have 2 representations: […] which satisfies the use case described by Walaa.

Great idea! I think this is a very clean solution for storing the metadata. However, I would rename the type to sql_materialized in case there are other representations in the future.

jackye1995 commented 1 year ago

@JanKaul if you agree with the summarized consensus we have mostly reached there, for the sake of moving the progress of the discussion forward, could you update the Google doc with the design, and describe the spec changes based on the discussions and suggestions here and combine with your original idea? So that we can start to review the spec contents 💪

JanKaul commented 1 year ago

Yes, I agree with the proposed design 1. I'm not entirely sure what @rdblue prefers.

I will update the Google doc accordingly.

The next question for me is where and how to store the refresh (freshness) information. There are currently two proposed solutions:

  1. Store refresh information in the summary field of a storage table snapshot
  2. Store refresh information in storage table properties

I will create a section in the Google doc that summarizes these options.

React with the following if you prefer one of the solutions: :tada: for solution 1 :rocket: for solution 2

jackye1995 commented 1 year ago

I would +1 on storing in snapshot summary, because:

  1. a snapshot corresponds very well to an MV refresh; there is a 1:1 relationship between them.
  2. table properties are not versioned the way snapshots are; you cannot easily access the previous table properties of a storage table

I also agree with Ryan's suggestion about the information to store, although I think storing the source tables referenced might be a bit hard, since it requires statement analysis of the view.

jackye1995 commented 1 year ago

@wmoustafa @JanKaul

I had some brief discussion about this topic in the community sync this Wednesday, and looking at this thread, I think there is already a consensus about the MV spec. Can we proceed to the actual spec change? Do you have the bandwidth to do that? If not I am also interested in making that change.

wmoustafa commented 1 year ago

+1. We can proceed.

JanKaul commented 1 year ago

Thank you @jackye1995 for your initiative. I have created a new google doc to discuss the actual specification for materialized views.

JanKaul commented 1 year ago

Another point I would like to discuss is whether the storage table should be registered in the catalog. I brought this up before, but I think people haven't really voted for a solution. Generally, the storage table can either be registered in the catalog or not. Most proposals in this thread suggested registering both the view and the storage table in the catalog. I would like to propose to only register the view in the catalog. My reasoning is that a materialized view is one logical entity and should therefore also appear in the catalog as one entity. This is also the behavior of most RDBMSs and of cloud data warehouses like Snowflake.

| | Strategy 1: register storage table in catalog | Strategy 2: don't register storage table in catalog |
|---|---|---|
| Description | Metadata locations for the view and storage table are stored in the catalog. Since both locations are tracked, each transaction is automatically atomic. | Only the location of the main view is stored in the catalog. A reference to the storage table metadata location is stored in the main view. An atomic update of this reference is the basis for making atomic changes to the storage table metadata file. |
| Pros | Storage table can be addressed in the catalog; simple implementation | Only one entry for the materialized view |
| Cons | 2 entities appear in the catalog | Requires the definition of a commit procedure |

React with the following if you prefer one of the strategies: :tada: for strategy 1 :rocket: for strategy 2

wmoustafa commented 1 year ago

Agree it makes more sense to store it as one logical entity. Also, to register both entities we would have to assign the view and the table different names, and the consequences of that are not clear.

jackye1995 commented 1 year ago

I think it has to be a catalog-specific decision because, as you said, Snowflake keeps the storage table hidden, but, for example, a Trino view on Hive/Glue has the storage table exposed. So I am not sure whether it needs to be part of the spec or not.

I think these are different types of catalogs; typically we distinguish them as business data catalogs and technical data catalogs. The former is oriented toward end users, whereas the latter is oriented toward technical users who want to see all the hidden details.

From an implementation perspective, why does approach 2 need a "definition of commit procedure" but not approach 1? I think it's just a matter of hiding the storage table or not, but there will always be 2 objects behind the scenes.

Btw, @nastra is already working on a proposal for multi-table transactions, so I think that problem will be solved there and we don't need to worry about it in the MV spec.

JanKaul commented 1 year ago

It would be great if we could make it a catalog-specific decision. But for that the metadata has to be designed to enable both strategies.

I think the question is: what do we use as a unique identifier for the storage table in the representation of the common view?

One approach is to use table_name, namespace and catalog as the unique identifier. But this only works if the storage table is registered in the catalog. Furthermore, without being registered in the catalog, an atomic swap of the storage table metadata file cannot be guaranteed.

Another approach would be to use the storage table metadata file location as the unique identifier. This makes using the catalog a bit more awkward because the tables have to be filtered by metadata file location. But this approach wouldn't require the storage table to be registered in the catalog, and it would enable atomic transactions on the storage table by making atomic changes to the storage table metadata file location. This is what I meant by "commit procedure".

wmoustafa commented 1 year ago

I think using the user-facing table namespace as an identifier is catalog specific, so I do not think it can make it into the spec. Maybe we should answer:

I think this boils down to engines' awareness of views being connected to storage tables, and to navigating to the table scan APIs from a view entry point. I think using the table storage location (or table UUID) in the view still serves this requirement.

because the tables have to be filtered for the metadata file location.

Could you clarify this a bit more?

JanKaul commented 1 year ago

When you want to use the metadata file location to retrieve the storage table from a catalog you would need to search all entries for one where the location field of the metadata matches the provided location. Not all catalogs might be capable of such an operation.

wmoustafa commented 1 year ago

I see. I think this goes back to my previous comment that the engine catalog APIs will have to be able to resolve views as tables. They will simply navigate to the storage table location from the view as an entry point. Engine catalog APIs may have to evolve to do that, along with view resolution implementation. @jackye1995 do you agree this is the workflow? @JanKaul, does this make filtering the catalog by table location not required?

JanKaul commented 1 year ago

I'm not entirely sure if I understand you correctly. I would expect a query engine catalog to proceed according to the following procedure:

  1. Find view metadata: look up the view metadata according to namespace and view name.

  2. Read view metadata: if the query engine doesn't support materialized views, execute the view query. If the query engine supports materialized views, use the storage table pointer (table_name + namespace, metadata location, or UUID) to access the storage table metadata. If the storage table metadata location is used, retrieving the metadata is easy; if namespace + table_name is used, the metadata has to be fetched from the Iceberg or query engine catalog.

  3. Read storage table metadata: use the storage table metadata to execute the TableScan operator.

Ideally, one would implement some kind of query engine catalog interface for the iceberg catalog. So that the iceberg catalog can be used directly.
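A sketch of this resolution flow, where the catalog interface, the read_metadata_file helper, and the engine hooks are all hypothetical stand-ins:

```python
def resolve_and_scan(catalog, namespace, name, engine):
    # 1. Find and read the view metadata.
    view = catalog.load_view(namespace, name)

    # 2. Engines without MV support just execute the query definition.
    if view.materialization is None or not engine.supports_materialized_views():
        return engine.execute(view.sql)

    # 2b. Follow the storage table pointer (here: a metadata file location).
    table_metadata = read_metadata_file(view.materialization)

    # 3. Execute a TableScan against the storage table's precomputed data.
    return engine.table_scan(table_metadata)
```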

wmoustafa commented 1 year ago

That is correct @JanKaul.

JanKaul commented 1 year ago

> Another point I would like to discuss is whether the storage table should be registered in the catalog. […] React with the following if you prefer one of the strategies: tada for strategy 1 rocket for strategy 2

@rdblue do you have any preference for which unique identifier to use for referencing the storage table?

JanKaul commented 1 year ago

I have an idea for how we could define a general specification for the "storage table pointer" that would allow deciding on a case-by-case basis whether to register the storage table in the catalog or not.

We could use a storage table pointer of the form:

protocol :// identifier

For example:

wmoustafa commented 1 year ago

I do not think the Iceberg metadata should "point back" to the catalog IDs (first option). What is the disadvantage of just going with the location of the storage table (last option, but without the protocol part)?

liurenjie1024 commented 1 year ago

Any update on this?

JanKaul commented 1 year ago

I think we almost have a consensus about the Materialized View Spec. I brought up one issue with the pointer to the storage table that is holding up the progress at the moment. The issue is whether to use the table identifier (catalog, namespace, name) or the table metadata location and therefore whether to register the storage table in the catalog.

I've come to realize that using the table identifier is the only solution that works for all Iceberg catalogs, because using the table metadata location doesn't work with the REST catalog.

So in my opinion we can settle this issue and use the table identifier as the pointer to the storage table. We would then need to register the storage table in the catalog.

With this we should have a consensus on the Spec. I will update the documents and then everyone can have another look.

JanKaul commented 1 year ago

I've updated the issue description and the google doc (https://docs.google.com/document/d/1UnhldHhe3Grz8JBngwXPA6ZZord1xMedY5ukEhZYF-A/edit?usp=sharing).

I would love to get your feedback. @liurenjie1024 @jackye1995 @wmoustafa @rdblue

JanKaul commented 1 year ago

@wmoustafa I hope you are okay with using the table identifier as the storage table pointer instead of using the metadata location. But I don't see a way to use the metadata location with the REST catalog. And ultimately we require a robust solution that works across all catalogs.

wmoustafa commented 11 months ago

Could you clarify how it does not work with the catalog API? At the end of the day, the catalog API takes a table identifier and returns an object. That can be a table object, a virtual view object, or a hybrid object (having characteristics of both table and virtual view objects). In the case of a materialized view, the catalog should be able to take a view name, i.e., an identifier, and return that hybrid object. So both the virtual view and the storage table are addressed by the view name. That said, I am still not clear why the Iceberg spec needs to be aware of the catalog-specific table identifier. Table identifiers are not part of the Iceberg table spec, so I do not think they can make it into the materialized view spec. (A table identifier as in catalog.db.table should not be confused with the table UUID -- the latter is part of the v2 spec, which should be fine to use.)

JanKaul commented 11 months ago

You are right, I was confused because the REST catalog stores the metadata internally and I thought this would include the storage table. But it would be possible to store the metadata for the view in the REST catalog and still store the metadata for the storage table in a metadata.json file. This way the metadata-location could still be used as a pointer to the storage table. Thanks for thinking this through again.

This leaves us again with the question of what to use for the storage table pointer.

wmoustafa commented 11 months ago

This leaves us again with the question of what to use for the storage table pointer.

I think table UUID or location is fine. Same should apply when referencing the base tables and their respective snapshot IDs. Maybe for base tables, UUID makes more sense.

JanKaul commented 11 months ago

Thanks for your input. The discussion has moved to the Google doc (https://docs.google.com/document/d/1UnhldHhe3Grz8JBngwXPA6ZZord1xMedY5ukEhZYF-A/edit?usp=sharing). It would be great if you could add your comment there. Then it will have a higher visibility.

JanKaul commented 11 months ago

@jackye1995 it would be great if you could have a look at the google doc.

szehon-ho commented 9 months ago

Hi @JanKaul. Thanks for putting this together. I went through the detailed discussion and left my votes. I see there is general consensus on the "Open Questions" in the design doc, which are:

  1. The pointer to the storage table should be stored as an optional field in the view metadata (option 1)
  2. Lineage information should be stored as additional fields in the summary of the storage table (option 2)
  3. Only the view (and not storage table) should be registered in catalog (option 2)

Is that correct? Given that it takes a bit of time to parse the discussion to see the actual changes, I went ahead and summarized below the additions we are making to the current metadata spec (taken from your doc).

View Metadata

| v1 | v2 | Field Name | Description |
|---|---|---|---|
| | optional | materialization | An optional materialization struct. If the value is null the entity is a common view, otherwise it is a materialized view |

Materialization Struct

| v1 | Field Name | Description |
|---|---|---|
| required | format-version | An integer version number for the materialized view format. Currently, this must be 1. Implementations must throw an exception if the materialized view's version is higher than the supported version |
| required | storage-table | Table metadata location |

Snapshot

| v1 | Field Name | Description |
|---|---|---|
| optional | refresh-version-id | Version ID of the materialized view when the refresh operation was performed |
| optional | source-tables | A list of source-table records |

Source Table Struct

| v1 | Field Name | Description |
|---|---|---|
| required | identifier | Identifier of the table as defined in the SQL expression |
| required | snapshot-id | Snapshot ID of the source table when the last refresh operation was performed |

Let me know if that looks right.

My 2c on this are:

  1. The materialization struct having its own format version seems like overkill to me; maybe we can just flatten it and make the materialization directly the storage-table pointer itself?
  2. Agree with @jackye1995 on the comment above: https://github.com/apache/iceberg/issues/6420#issuecomment-1398572156. I feel having the list of source-tables is a bit difficult. But I suppose it's an optional field, and engines can populate it if the values are readily available (assuming the source tables are Iceberg tables).
  3. How about the 'refresh-strategy'? (I didn't see it in the Google doc.) I feel it can go in the current 'properties' field of the view metadata. iiuc, @rdblue also suggested putting it in the table properties of the storage table along with other fields like materialized_view_format_version and view_identifier, which sounds fine too.

If there is general consensus on the direction, it'd be great to move on to the actual spec PR change and discuss specifics there, as this proposal seems to have been sitting a while. I can also help with that, if needed. Thanks.

JanKaul commented 9 months ago

Hi @szehon-ho, thanks for helping to move the process of reaching consensus along. To be honest, I don't know how the community normally reaches consensus on these kinds of topics. But I still have the feeling that we are lacking feedback from some key stakeholders. Until now I was waiting to get more feedback, but it seems like creating the PRs is the right way to move the proposal forward.

In any case it might be good to bring this up at a community sync.

I would argue for 2 PRs, one for the changes to the view metadata and one for the changes to the table metadata.

Regarding the open questions

  1. Question: I agree that we mostly have a consensus there
  2. Question: I get the impression from the Google doc that people would prefer Option 1; also, you voted for 1
  3. Question: Most people seem to be for Option 2

Regarding your 2cents

  1. I totally agree, just using the storage table pointer is a cleaner solution.
  2. The important information here is the snapshot IDs of the source tables (base tables) corresponding to the last refresh operation of the materialized view. The materialized view requires this information to determine whether the precomputed data is still fresh. At the end of a refresh operation, the materialized view stores the snapshot IDs of its source tables as its lineage information. Later on, it can check whether those snapshot IDs still equal the current snapshot IDs of the source tables. If they're not equal, it knows the data changed and its precomputed data needs to be updated.
  3. As can be seen in the last point of https://docs.google.com/document/d/1UnhldHhe3Grz8JBngwXPA6ZZord1xMedY5ukEhZYF-A/edit?pli=1#heading=h.m5kli4l5q7ui, we agreed to leave the refresh strategy to the query engine. If available, an incremental refresh is always preferable to a full refresh.
szehon-ho commented 8 months ago

Great, this all makes sense to me except for the following points:

> Question: Most people seem to be for Option 1

As we clarified in the google doc, it seems (at least among the local thread) we are moving to Option 2: https://docs.google.com/document/d/1UnhldHhe3Grz8JBngwXPA6ZZord1xMedY5ukEhZYF-A/edit?pli=1&disco=AAABFByy0mU

> As can be seen in the last point of https://docs.google.com/document/d/1UnhldHhe3Grz8JBngwXPA6ZZord1xMedY5ukEhZYF-A/edit?pli=1#heading=h.m5kli4l5q7ui, we agreed to leave the refresh strategy to the query engine. If available, an incremental refresh is always preferable to a full refresh.

Did we consider making this property configurable? We should at least store which refresh strategy was used in the snapshot, for debugging. We can discuss on the Google doc.

So I'm open to either: we can make a PR, or fill out the Specification Draft part of the doc. The main barrier right now to more folks reviewing is that the doc is just a list of options and long threads, which makes it really hard to see where the consensus is; it took me quite some time to follow. Having a central section with the concrete additions we are proposing will greatly streamline the review experience. Feel free to take the summarized spec changes in my previous comment as a starting point.

Let me know if you need help with it. Thanks

JanKaul commented 8 months ago

Sorry, for Question 3 I mixed up Options 1 & 2. You are right, there is consensus on 2.

Great, it seems like all open questions have been answered. I will update the Specification Draft accordingly. Additionally, we can create the PRs to make the spec changes. If you are interested you can create the PRs; otherwise I will do it tomorrow.