apache / gravitino

World's most powerful open data catalog for building a high-performance, geo-distributed and federated metadata lake.
https://gravitino.apache.org
Apache License 2.0

[FEATURE] Add capability to fetch partitions from Iceberg tables #3840

Open SinghAsDev opened 3 months ago

SinghAsDev commented 3 months ago

Describe the feature

Add the capability to fetch partitions from Iceberg tables, providing an easy-to-use and efficient mechanism for fetching Iceberg table partitions.

Motivation

It is common for users to build use cases, such as partition waiters, that depend on a table's partition information. As users move from the Hive to the Iceberg table format, one of the blockers they see is the ease and speed of accessing partition information. For very large Iceberg tables (with multiple tens of thousands of partitions), fetching partitions takes over an hour and over 50 GB of memory.

Describe the solution

Add a REST endpoint to get Iceberg partitions.
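
As a rough illustration of how a client might use such an endpoint, here is a minimal sketch. The trailing `/partitions` segment and the JSON response shape are assumptions made up for this example; neither exists in the Iceberg REST specification or in Gravitino today. Only the `/v1/{prefix}/namespaces/{namespace}/tables/{table}` part mirrors the existing Iceberg REST table route, and a single-level namespace is assumed.

```python
import json
from typing import Any, Dict, List
from urllib.parse import quote


def partitions_url(base_url: str, prefix: str, namespace: str, table: str) -> str:
    """Build the URL of a hypothetical partitions endpoint for one table."""
    # Everything up to the table name mirrors the existing Iceberg REST
    # table route; the trailing "/partitions" segment is an assumption.
    parts = [quote(p, safe="") for p in (prefix, namespace, table)]
    return "{}/v1/{}/namespaces/{}/tables/{}/partitions".format(
        base_url.rstrip("/"), *parts
    )


def parse_partitions(body: str) -> List[Dict[str, Any]]:
    """Extract partition entries from a hypothetical JSON response body."""
    # Assumed shape: {"partitions": [{"values": {...}}, ...]}
    return json.loads(body).get("partitions", [])
```

A client would issue a GET against the URL returned by `partitions_url` and pass the response body to `parse_partitions`.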

Additional context

No response

FANNG1 commented 3 months ago

@SinghAsDev, thanks for proposing this. I have some questions:

  1. What's the benefit of implementing this on the Iceberg REST server? Is it a speed-up from caching? That may cost a huge amount of memory.
  2. Could you share the scenarios in which you would use it?
  3. Iceberg introduced the partition statistics file in 1.5.0; we should also consider this.

SinghAsDev commented 3 months ago

@FANNG1 please find answers below.

What's the benefit of implementing this on the Iceberg REST server? Is it a speed-up from caching? That may cost a huge amount of memory.

Having this on the REST server will provide a mechanism to fetch partitions quickly and efficiently. This can be achieved through optimizations like caching, skipping the reading of manifests with a single partition, etc. Another benefit is that different clients will be able to use this without each having to implement it.
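
To make the caching idea concrete, here is a minimal sketch, not the proposed implementation. It relies on the fact that an Iceberg snapshot is immutable, so a partition listing computed for a given snapshot ID stays valid. The `PartitionCache` class and the `load` callback (standing in for the expensive manifest scan) are hypothetical names. Keeping only one snapshot per table is one way to bound the memory cost raised in the question.

```python
from typing import Callable, Dict, List, Tuple

# (table identifier, snapshot id) -> partition names
_CacheKey = Tuple[str, int]


class PartitionCache:
    """Caches partition listings per table snapshot (illustrative only)."""

    def __init__(self, load: Callable[[str, int], List[str]]) -> None:
        # `load` stands in for the expensive manifest scan.
        self._load = load
        self._entries: Dict[_CacheKey, List[str]] = {}

    def get(self, table: str, snapshot_id: int) -> List[str]:
        key = (table, snapshot_id)
        if key not in self._entries:
            # Keep at most one snapshot per table to bound memory use.
            for old in [k for k in self._entries if k[0] == table]:
                del self._entries[old]
            self._entries[key] = self._load(table, snapshot_id)
        return self._entries[key]
```

Repeated requests for the same snapshot are served from memory, and a new snapshot ID triggers exactly one reload for that table.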

Could you share the scenarios in which you would use it?

Sure, it will enable existing partition waiters, partition discovery, and data freshness tooling to work for Iceberg tables as well as Hive tables.
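
For readers unfamiliar with the term, a partition waiter can be sketched roughly as follows. This is an illustrative assumption about how such a tool works, not code from any existing waiter. `list_partitions` is a hypothetical callback that would wrap whatever partition listing the server exposes, and the timeout and polling interval defaults are arbitrary.

```python
import time
from typing import Callable, Iterable


def wait_for_partition(
    list_partitions: Callable[[], Iterable[str]],
    wanted: str,
    timeout_s: float = 3600.0,
    poll_interval_s: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until `wanted` appears in the partition listing or time runs out."""
    deadline = time.monotonic() + timeout_s
    while True:
        if wanted in set(list_partitions()):
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(poll_interval_s)
```

Because the waiter calls the listing on every poll, the cost of each listing call directly determines how practical this pattern is on large tables.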

Iceberg introduced the partition statistics file in 1.5.0; we should also consider this.

Sure, that's another benefit of this approach: we can change or add optimizations over time.

FANNG1 commented 3 months ago

Got it, thanks for your reply.

JunpingDu commented 1 week ago

Thanks for the issue and patch, @SinghAsDev. Shall we move forward with the patch?