[Improvement] Add Support for Inspecting Tables in Datalake Formats like Iceberg

apache / gravitino

World's most powerful open data catalog for building a high-performance, geo-distributed and federated metadata lake.

https://gravitino.apache.org

Apache License 2.0

[Improvement] Add Support for Inspecting Tables in Datalake Formats like Iceberg #4798

Open theoryxu opened 2 months ago

theoryxu commented 2 months ago

What would you like to be improved?

In addition to regular table functions, datalake table formats offer various capabilities for inspecting tables.

For instance, Iceberg can display valid snapshots for a table or show a table's current file manifests.

However, Gravitino catalogs currently lack this support, and there is no designated place for it in the General API hierarchy.

Incorporating this support into Gravitino would help users better manage their datalakes.
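
As a rough illustration of what a "designated place" for inspection could look like, here is a minimal sketch of an optional capability interface that a table might implement. Every name in it (`SupportsInspection`, `SnapshotInfo`, `ManifestInfo`, `InMemoryTable`) is hypothetical and is not part of the Gravitino or Iceberg APIs; the in-memory table only stands in for a real catalog-backed table.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch: none of these types exist in Gravitino or Iceberg.
public class InspectionSketch {

    /** One snapshot entry, loosely modeled on a snapshot listing. */
    public static final class SnapshotInfo {
        public final long snapshotId;
        public final long committedAtMillis;
        public final String operation;

        public SnapshotInfo(long snapshotId, long committedAtMillis, String operation) {
            this.snapshotId = snapshotId;
            this.committedAtMillis = committedAtMillis;
            this.operation = operation;
        }
    }

    /** One manifest file referenced by the table's current snapshot. */
    public static final class ManifestInfo {
        public final String path;
        public final long lengthBytes;

        public ManifestInfo(String path, long lengthBytes) {
            this.path = path;
            this.lengthBytes = lengthBytes;
        }
    }

    /** Optional, read-only capability that a table could expose. */
    public interface SupportsInspection {
        List<SnapshotInfo> snapshots();

        List<ManifestInfo> currentManifests();
    }

    /** In-memory stand-in for a catalog-backed table. */
    public static final class InMemoryTable implements SupportsInspection {
        private final List<SnapshotInfo> snapshots;
        private final List<ManifestInfo> manifests;

        public InMemoryTable(List<SnapshotInfo> snapshots, List<ManifestInfo> manifests) {
            // Defensive copies keep the inspection results immutable.
            this.snapshots = Collections.unmodifiableList(new ArrayList<>(snapshots));
            this.manifests = Collections.unmodifiableList(new ArrayList<>(manifests));
        }

        @Override
        public List<SnapshotInfo> snapshots() {
            return snapshots;
        }

        @Override
        public List<ManifestInfo> currentManifests() {
            return manifests;
        }
    }

    public static void main(String[] args) {
        Object table = new InMemoryTable(
            Arrays.asList(new SnapshotInfo(1L, 1_700_000_000_000L, "append")),
            Arrays.asList(new ManifestInfo("metadata/m0.avro", 4096L)));

        // Callers probe for the capability instead of assuming every table has it.
        if (table instanceof SupportsInspection) {
            SupportsInspection inspectable = (SupportsInspection) table;
            for (SnapshotInfo s : inspectable.snapshots()) {
                System.out.println(s.snapshotId + " " + s.operation);
            }
        }
    }
}
```

Modeling inspection as an optional capability means catalogs without snapshots or manifests would simply not implement the interface.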

How should we improve?

No response

FANNG1 commented 2 months ago

This is something like querying metadata tables. It seems reasonable to support it in Gravitino. My concern is that it may require too many resources to produce the metadata. We could leverage K8s, but that would introduce complexity for Gravitino. @jerryshao @caican00 @shaofengshi @xunliu WDYT?

jerryshao commented 2 months ago

I think we can design the API to support querying metadata tables first. Whether it is too costly depends on the underlying sources.

caican00 commented 2 months ago

I think we had best discuss the scope of this capability and the related scenarios first.

This issue is also related to this discussion, and IMO it seems reasonable to simply get the metadata of the metadata tables from Gravitino.

However, it seems unreasonable to read data from the metadata tables through Gravitino; the data should be read through the connector.

In addition, metadata tables should not support operations such as create, alter, and drop in Gravitino.
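
The scope proposed above (metadata tables can be loaded but not created, altered, or dropped) could be enforced with a small guard. This is a minimal sketch under stated assumptions: `TableRef`, `Operation`, and `checkAllowed` are hypothetical names for illustration and are not Gravitino classes.

```java
import java.util.Optional;

// Hypothetical sketch: these types are illustrative and are not Gravitino classes.
public class ReadOnlyMetadataTableSketch {

    /** Operations a catalog might receive for a table. */
    public enum Operation { LOAD, CREATE, ALTER, DROP }

    /** A table reference; metadataTable is set for names such as "snapshots". */
    public static final class TableRef {
        public final String schema;
        public final String table;
        public final Optional<String> metadataTable;

        public TableRef(String schema, String table, String metadataTableOrNull) {
            this.schema = schema;
            this.table = table;
            this.metadataTable = Optional.ofNullable(metadataTableOrNull);
        }

        public boolean isMetadataTable() {
            return metadataTable.isPresent();
        }
    }

    /** Allows only LOAD on metadata tables; regular tables are not restricted here. */
    public static void checkAllowed(TableRef ref, Operation op) {
        if (ref.isMetadataTable() && op != Operation.LOAD) {
            throw new UnsupportedOperationException(
                op + " is not supported on metadata table "
                    + ref.table + "." + ref.metadataTable.get());
        }
    }

    public static void main(String[] args) {
        TableRef snapshots = new TableRef("db", "events", "snapshots");
        checkAllowed(snapshots, Operation.LOAD); // read-only access passes

        try {
            checkAllowed(snapshots, Operation.DROP);
        } catch (UnsupportedOperationException e) {
            System.out.println("Rejected: " + e.getMessage());
        }
    }
}
```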

FANNG1 commented 2 months ago

Snapshots are similar to columns in that they are part of the table metadata. Whether we support modification operations, such as changing snapshots, can be determined based on user requirements.

caican00 commented 2 months ago

> Snapshots are similar to columns in that they are part of the table metadata. Whether we support modification operations, such as changing snapshots, can be determined based on user requirements.

Why not use Spark procedures directly?

If Gravitino supports modification operations on the metadata tables, for example deleting a snapshot, the corresponding data files also need to be deleted. If Gravitino performs this operation, the REST API call is likely to time out.

FANNG1 commented 2 months ago

> > Snapshots are similar to columns in that they are part of the table metadata. Whether we support modification operations, such as changing snapshots, can be determined based on user requirements.
>
> Why not use Spark procedures directly?
>
> If Gravitino supports modification operations on the metadata tables, for example deleting a snapshot, the corresponding data files also need to be deleted. If Gravitino performs this operation, the REST API call is likely to time out.

We could query table metadata with Spark or Flink, but that is heavyweight and hard for ordinary users. With a REST interface, it is simple for end users or other internal systems to query metadata such as snapshots.
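
To make the REST idea concrete, here is a minimal sketch of a read-only snapshot endpoint built on the JDK's built-in `com.sun.net.httpserver.HttpServer`. The path `/tables/demo/snapshots`, the JSON shape, and the hard-coded snapshot are all assumptions for illustration; this is not Gravitino's REST API, and a real implementation would read the snapshots from the underlying catalog.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch: the path and payload are illustrative, not Gravitino's REST API.
public class SnapshotRestSketch {

    /** Assumed endpoint path for listing one table's snapshots. */
    public static final String PATH = "/tables/demo/snapshots";

    /** Hard-coded payload standing in for snapshots read from a catalog. */
    public static String snapshotsJson() {
        return "[{\"snapshotId\":1,\"operation\":\"append\"}]";
    }

    /** Starts a server on the given port (0 picks a free port) and returns it. */
    public static HttpServer start(int port) throws IOException {
        HttpServer server = HttpServer.create(new InetSocketAddress("127.0.0.1", port), 0);
        server.createContext(PATH, exchange -> {
            // The endpoint is read-only: anything other than GET gets 405 with no body.
            if (!"GET".equals(exchange.getRequestMethod())) {
                exchange.sendResponseHeaders(405, -1);
                exchange.close();
                return;
            }
            byte[] body = snapshotsJson().getBytes(StandardCharsets.UTF_8);
            exchange.getResponseHeaders().set("Content-Type", "application/json");
            exchange.sendResponseHeaders(200, body.length);
            try (OutputStream out = exchange.getResponseBody()) {
                out.write(body);
            }
        });
        server.start();
        return server;
    }

    public static void main(String[] args) throws IOException {
        HttpServer server = start(0);
        int port = server.getAddress().getPort();
        System.out.println("Serving http://127.0.0.1:" + port + PATH + " (Ctrl+C to stop)");
    }
}
```

A client then needs only a plain HTTP GET against the printed URL, with no Spark or Flink session involved.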