apache / paimon

Apache Paimon is a lake format that enables building a Realtime Lakehouse Architecture with Flink and Spark for both streaming and batch operations.
https://paimon.apache.org/
Apache License 2.0

[core] Introduce bucket entries to optimize Spark compact #4162

Closed JingsongLi closed 2 months ago

JingsongLi commented 2 months ago

Purpose

Currently, compacting a bucketed table in Spark plans all files just to find out which buckets exist. We can introduce BucketEntry, analogous to PartitionEntry, so that only per-bucket summaries are kept, which reduces memory usage.

This PR:

  1. Introduce bucket entries to optimize Spark compact
  2. Add BucketsTable system table to show bucket information.
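
To illustrate the idea, here is a minimal sketch of folding file-level entries into one summary per (partition, bucket) while streaming over them. This is not the code in this PR: `FileEntry`, the fields of this `BucketEntry`, and `merge_bucket_entries` are hypothetical names chosen for illustration, and Python is used only to keep the sketch short.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Tuple


@dataclass(frozen=True)
class FileEntry:
    """Hypothetical stand-in for one data-file entry read from a manifest."""
    partition: str
    bucket: int
    record_count: int
    file_size_in_bytes: int


@dataclass
class BucketEntry:
    """Hypothetical per-bucket summary; the fields are illustrative only."""
    partition: str
    bucket: int
    record_count: int = 0
    file_size_in_bytes: int = 0
    file_count: int = 0

    def add(self, f: FileEntry) -> None:
        # Fold one file into the running totals for this bucket.
        self.record_count += f.record_count
        self.file_size_in_bytes += f.file_size_in_bytes
        self.file_count += 1


def merge_bucket_entries(
    files: Iterable[FileEntry],
) -> Dict[Tuple[str, int], BucketEntry]:
    """Aggregate file entries into one BucketEntry per (partition, bucket).

    The input is consumed one entry at a time, so only the summaries are
    retained rather than the full list of files.
    """
    result: Dict[Tuple[str, int], BucketEntry] = {}
    for f in files:
        key = (f.partition, f.bucket)
        entry = result.get(key)
        if entry is None:
            entry = BucketEntry(f.partition, f.bucket)
            result[key] = entry
        entry.add(f)
    return result
```

With this shape, the retained state grows with the number of buckets instead of the number of files, and the keys of the result are exactly the buckets that a compaction would need to visit.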

Tests

  1. Existing tests.
  2. BucketsTableTest.

API and Format

Documentation