apache / gravitino

World's most powerful open data catalog for building a high-performance, geo-distributed and federated metadata lake.
https://gravitino.apache.org
Apache License 2.0

[FEATURE] Provide Batch Load Api for Multiple Entities #3989

Open TEOTEO520 opened 2 months ago

TEOTEO520 commented 2 months ago

Describe the feature

At present, Gravitino only provides APIs for querying a single entity by its exact NameIdentifier (such as getTable, getPartition, loadFileset, etc.), but does not offer an API for batch-querying a subset of entities within a schema from a list of identifiers.

For users, batch querying is a very common requirement. Currently, if a user wants to retrieve the details of n tables, they must call the getTable API n times; this is wasteful, and factors such as network jitter may slow down some of the responses. A batch query interface would greatly reduce the number of RPC calls and significantly cut the time spent on RPC network round trips.

Familiar metadata management systems, such as Hive Metastore (which Gravitino already integrates with), provide batch query interfaces, for example:

List<Table> getTableObjectsByName(String dbName, List<String> tableNames);
List<Partition> getPartitionsByNames(String db_name, String tbl_name, List<String> part_names);

Therefore, we believe that for Gravitino, as a high-performance and federated metadata lake, it is necessary to implement a batch query interface for various entities (table, partition, fileset, topic, etc.) within a schema.

Driven by batch query requirements in our actual production applications, we have already implemented a preliminary batch query interface for filesets:

Fileset[] loadFilesetList(NameIdentifier[] idents);

I have already submitted a PR and would like to contribute it to the community if there is interest. We can discuss together whether this is a good implementation and whether it can be applied to other entity types and data sources.
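As a rough illustration of the shape such an interface could take, here is a minimal sketch in Java. The type names (`NameIdentifier`, `Fileset`, `FilesetCatalog`) are simplified stand-ins for the Gravitino types mentioned in this issue, and the default per-entity fallback is an assumption, not the actual implementation in the PR:

```java
// Minimal stand-ins so the sketch compiles; the real Gravitino types differ.
record NameIdentifier(String name) {}
record Fileset(String name) {}

interface FilesetCatalog {
    // Existing single-entity lookup.
    Fileset loadFileset(NameIdentifier ident);

    // Proposed batch variant, sketched here with a default per-entity
    // fallback so catalogs without native batch support (e.g. JDBC-backed
    // ones) could still satisfy the contract, while catalogs like HMS
    // could override it with a true batch RPC.
    default Fileset[] loadFilesetList(NameIdentifier[] idents) {
        return java.util.Arrays.stream(idents)
                .map(this::loadFileset)
                .toArray(Fileset[]::new);
    }
}
```

A default-method fallback like this would let the batch API be added without breaking existing catalog implementations, though it only saves RPCs when a catalog actually overrides it.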

Motivation

No response

Describe the solution

No response

Additional context

No response

mchades commented 2 months ago

When there are a large number of tables under a schema, performance issues may arise. How should this be resolved?

In addition, not all catalogs support batch retrieval of metadata, for example, JDBC-related catalogs generally do not support it.

TEOTEO520 commented 2 months ago

For the performance issue caused by a large number of tables under a schema, we can add a parameter that limits the maximum number of entities a user can query in one call.

Although some catalogs, such as JDBC ones, do not provide batch querying capabilities, other common catalogs such as HMS do support it, and this need genuinely exists in our users' actual scenarios. I believe that Gravitino, as a high-performance, geo-distributed, and federated metadata lake, should be capable of providing this functionality to users.
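The size-limit idea above could look something like the following sketch. The class, the limit value, and the error message are all hypothetical; in practice the limit would presumably come from a Gravitino configuration property:

```java
// Hypothetical server-side guard for batch requests; the limit and the
// error message are illustrative, not part of Gravitino's actual config.
final class BatchLimits {
    // Illustrative default; in practice this would be read from configuration.
    static final int MAX_BATCH_SIZE = 100;

    // Rejects oversized batch requests up front, before any metadata is
    // fetched, so one caller cannot load an entire large schema at once.
    static void checkBatchSize(int requested) {
        if (requested > MAX_BATCH_SIZE) {
            throw new IllegalArgumentException(
                "Requested batch size " + requested
                + " exceeds the configured limit " + MAX_BATCH_SIZE
                + "; split the request into smaller batches");
        }
    }
}
```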

TEOTEO520 commented 1 month ago

@mchades @jerryshao +cc

shaofengshi commented 4 weeks ago

+1, the requirement is solid; a batch interface is common.