apache / iceberg

Apache Iceberg
https://iceberg.apache.org/
Apache License 2.0

When write.object-storage.enabled=true, it is difficult to gather information for individual partitions of partitioned tables #11488

Open borderlayout opened 2 weeks ago

borderlayout commented 2 weeks ago

Feature Request / Improvement

Hi all: When using Amazon S3 object storage with Iceberg, requests against the same path can be throttled. Setting write.object-storage.enabled=true hashes files that would otherwise share a path into different paths, which avoids the S3 throttling issue. (See: https://iceberg.apache.org/docs/nightly/docs/configuration/?h=write.object+storage.enabled#write-properties)

However, I encountered a problem: for partitioned tables, the hash value is inserted into the path before the partition name, which makes it difficult to gather information for an individual partition, such as its number of files or their sizes.

Is there a reason for designing it this way? Would putting the hash value after the partition fields be a better approach?

bucket/iceberg_test1/data/_44Xmw/parCol=2024-01-10/00295-2798-63356e4e-b4ec-4a80-ae3f-6888f2f7eac9-0-00003.parquet
bucket/iceberg_test1/data/_5l5dQ/parCol=2024-01-09/00063-2566-63356e4e-b4ec-4a80-ae3f-6888f2f7eac9-0-00006.parquet

==changed==>

bucket/iceberg_test1/data/parCol=2024-01-10/_44Xmw/00295-2798-63356e4e-b4ec-4a80-ae3f-6888f2f7eac9-0-00003.parquet
bucket/iceberg_test1/data/parCol=2024-01-09/_5l5dQ/00063-2566-63356e4e-b4ec-4a80-ae3f-6888f2f7eac9-0-00006.parquet

bucket/iceberg_test3/data/APigWw/parCol=2024-01-01/gender=male/00001-7234-7e44c302-a716-4da8-9ea0-0c44caf9a249-0-00003.parquet
bucket/iceberg_test3/data/4Z-_sw/parCol=2024-01-01/gender=male/00001-7234-7e44c302-a716-4da8-9ea0-0c44caf9a249-0-00001.parquet

==changed==>

bucket/iceberg_test3/data/parCol=2024-01-01/gender=male/APigWw/00001-7234-7e44c302-a716-4da8-9ea0-0c44caf9a249-0-00003.parquet
bucket/iceberg_test3/data/parCol=2024-01-01/gender=male/4Z-_sw/00001-7234-7e44c302-a716-4da8-9ea0-0c44caf9a249-0-00001.parquet
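To make the difference between the two layouts concrete, here is a rough Python sketch that recovers the partition from a data file path by keeping only the key=value directory segments. The helper names (partition_of, files_per_partition) are made up for illustration, and the sketch assumes hash directories never contain "=" and that the paths have already been listed from storage. It only counts objects found under the prefix, which is not necessarily the same as the files the table currently references.

```python
from collections import Counter


def partition_of(path: str) -> str:
    """Return the partition part of a data file path.

    Only directory segments shaped like key=value are kept, so the result is
    the same whether the hash directory sits before or after the partition.
    Assumption: hash directories never contain '='.
    """
    directories = path.split("/")[:-1]  # drop the file name
    return "/".join(segment for segment in directories if "=" in segment)


def files_per_partition(paths):
    """Count listed files per partition string."""
    return Counter(partition_of(p) for p in paths)


# The iceberg_test3 paths from this issue, in both layouts.
current_layout = [
    "bucket/iceberg_test3/data/APigWw/parCol=2024-01-01/gender=male/00001-7234-7e44c302-a716-4da8-9ea0-0c44caf9a249-0-00003.parquet",
    "bucket/iceberg_test3/data/4Z-_sw/parCol=2024-01-01/gender=male/00001-7234-7e44c302-a716-4da8-9ea0-0c44caf9a249-0-00001.parquet",
]
proposed_layout = [
    "bucket/iceberg_test3/data/parCol=2024-01-01/gender=male/APigWw/00001-7234-7e44c302-a716-4da8-9ea0-0c44caf9a249-0-00003.parquet",
    "bucket/iceberg_test3/data/parCol=2024-01-01/gender=male/4Z-_sw/00001-7234-7e44c302-a716-4da8-9ea0-0c44caf9a249-0-00001.parquet",
]

# Both layouts yield the same per-partition counts once the hash is ignored.
assert files_per_partition(current_layout) == files_per_partition(proposed_layout)
print(files_per_partition(current_layout))
```

The practical difference is in listing: with the hash first, every object under data/ has to be listed before filtering, whereas the proposed layout would allow listing a single partition prefix.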

Query engine

Spark

Willingness to contribute

RussellSpitzer commented 4 days ago

In general, you shouldn't use path information for this; use the Files or Partitions metadata tables instead. This matters because the storage layer holds the full history of the table, not just its current state. For example, a directory containing 10 files does not mean all 10 are live in the current table state.
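As a minimal sketch of the metadata-table approach, the Python below builds a Spark SQL query that groups the table's files metadata table by partition to get a file count and total size per partition. The table name my_catalog.db.iceberg_test1 and the helper names are placeholders, and the spark argument is assumed to be an existing SparkSession with an Iceberg catalog configured; the column names (partition, file_size_in_bytes) are those of the files metadata table as I understand it, so check them against your Iceberg version.

```python
def partition_summary_sql(table: str) -> str:
    """Build a Spark SQL query over the table's `files` metadata table.

    `table` is a fully qualified Iceberg table name; the value used below is
    a placeholder, not a real table.
    """
    return (
        "SELECT partition, "
        "count(*) AS file_count, "
        "sum(file_size_in_bytes) AS total_bytes "
        f"FROM {table}.files "
        "GROUP BY partition"
    )


def partition_summary(spark, table: str):
    """Run the query; `spark` is assumed to be an existing SparkSession."""
    return spark.sql(partition_summary_sql(table))


# Print the generated SQL for a hypothetical table name.
print(partition_summary_sql("my_catalog.db.iceberg_test1"))

# With a live session this would be used as:
# partition_summary(spark, "my_catalog.db.iceberg_test1").show(truncate=False)
```

Because the query reads table metadata rather than listing storage, the result does not depend on where the hash directory sits in the path.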