Open theoryxu opened 2 months ago
This is something like querying metadata tables. It seems reasonable to support it in Gravitino. My concern is that producing the metadata may require too many resources. We could leverage K8s, but that would introduce complexity for Gravitino. @jerryshao @caican00 @shaofengshi @xunliu WDYT?
I think we can first design the API to support querying metadata tables. Whether it is too costly depends on the underlying sources.
I think we had best discuss the scope of this capability and the related scenarios first.
The issue is also related to this discussion, and IMO it seems reasonable to simply get the metadata of the metadata tables from Gravitino.
However, it seems unreasonable to read data from the metadata tables through Gravitino; the data should be read through the connector.
In addition, metadata tables should not support operations such as create, alter, and drop in Gravitino.
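The read-only scope proposed above could be sketched roughly as follows. This is only an illustration under stated assumptions: `MetadataTableCatalog`, `MetadataTableInfo`, and `UnsupportedOperation` are hypothetical names, not part of Gravitino's API, and the column list is an illustrative subset of what Iceberg's `snapshots` metadata table exposes.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


class UnsupportedOperation(Exception):
    """Raised when a caller tries to modify a metadata table."""


@dataclass(frozen=True)
class MetadataTableInfo:
    # Hypothetical descriptor: table name plus (column name, type) pairs.
    name: str
    columns: Tuple[Tuple[str, str], ...]


class MetadataTableCatalog:
    """Hypothetical read-only view over a table's metadata tables.

    It returns only the metadata (name and schema) of each metadata
    table; reading rows is left to the connector.
    """

    def __init__(self, tables: Dict[str, MetadataTableInfo]):
        self._tables = dict(tables)

    def list_metadata_tables(self) -> List[str]:
        return sorted(self._tables)

    def load_metadata_table(self, name: str) -> MetadataTableInfo:
        return self._tables[name]

    # Create, alter, and drop are rejected outright.
    def create_metadata_table(self, info: MetadataTableInfo) -> None:
        raise UnsupportedOperation("metadata tables cannot be created")

    def alter_metadata_table(self, name: str, **changes) -> None:
        raise UnsupportedOperation("metadata tables cannot be altered")

    def drop_metadata_table(self, name: str) -> None:
        raise UnsupportedOperation("metadata tables cannot be dropped")


# Illustrative subset of columns; the types here are assumptions.
SNAPSHOTS = MetadataTableInfo(
    name="snapshots",
    columns=(
        ("committed_at", "timestamp"),
        ("snapshot_id", "long"),
        ("operation", "string"),
    ),
)
catalog = MetadataTableCatalog({"snapshots": SNAPSHOTS})
```

With this shape, `catalog.load_metadata_table("snapshots")` returns the schema descriptor, while `catalog.drop_metadata_table("snapshots")` raises `UnsupportedOperation`.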
Snapshots are similar to columns in that they are part of the table metadata. Whether we support modification operations, such as changing snapshots, can be determined based on user requirements.
Why not use Spark procedures directly?
If Gravitino supports modification operations on the metadata tables, for example deleting a snapshot, the corresponding data files also need to be deleted. If Gravitino performs this operation, the REST API is likely to time out.
We could query table metadata with Spark or Flink, but that is heavy and hard for ordinary users. With a REST interface, it is simple for end users or other internal systems to query metadata such as snapshots.
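To make the REST idea concrete, here is a minimal client-side sketch. Everything in it is an assumption for illustration: the `/snapshots` resource path, the response shape, and the sample body are made up and do not describe an existing Gravitino endpoint.

```python
import json
from urllib.parse import quote


def snapshots_path(metalake: str, catalog: str, schema: str, table: str) -> str:
    """Build the (hypothetical) resource path for a table's snapshots."""
    # Percent-encode each segment so unusual names cannot break the path.
    parts = [quote(p, safe="") for p in (metalake, catalog, schema, table)]
    return "/api/metalakes/{}/catalogs/{}/schemas/{}/tables/{}/snapshots".format(*parts)


def parse_snapshots(body: str):
    """Extract (snapshot_id, operation) pairs from an assumed JSON response."""
    payload = json.loads(body)
    return [(s["snapshot_id"], s["operation"]) for s in payload.get("snapshots", [])]


# Made-up sample response, used only to exercise the parser.
SAMPLE_BODY = (
    '{"snapshots": ['
    '{"snapshot_id": 1, "operation": "append"}, '
    '{"snapshot_id": 2, "operation": "overwrite"}]}'
)
```

A caller would issue a plain HTTP GET against `snapshots_path(...)` and pass the response body to `parse_snapshots`, with no Spark or Flink session involved.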
What would you like to be improved?
In addition to regular table functions, datalake table formats offer various capabilities for inspecting tables.
For instance, Iceberg can display valid snapshots for a table or show a table's current file manifests.
However, Gravitino catalogs currently lack this support, and there is no designated place for it in the General API hierarchy.
Incorporating this support into Gravitino would help users better manage their datalakes.
How should we improve?
No response