Open rdsarvar opened 2 weeks ago
I've opened a draft PR with a sample solution here: https://github.com/apache/iceberg/pull/11368. I'm open to feedback (or to throwing out that PR) if there's a cleaner way to implement this.
Note: It's nowhere near mergeable, but it gives the gist of one possible implementation.
Feature Request / Improvement
Context
Similar to how we can provide an explicit sort ordering OR rely on the existing sorting of the table, I would like to propose that compaction support explicit partition specifications.
Currently, compaction lets you specify the partition spec ID that you want to use through the `options` mapping. This is useful for enabling different partitioning for compaction, but it comes with the caveat that the partition spec must already have been applied to the table AND you must manually find that spec ID and pass it in.
The meat of the request is:
Motivation
In some cases folks want to be able to support partition tiering for long-term storage. As an example:
This lets us accept heavier metadata for recent partitions while shrinking the metadata for long-term storage, so that we can keep a single table over the long term instead of having to use multiple tables with a view sitting on top.
Technically this is already possible by first initializing the table with your 'archival' partition spec so that it generates an ID, and then swapping to your 'active' partition spec. The user can then grab the ID and pass it in through `options`, but that is an inconvenient process compared with just providing the spec and having Iceberg decide what actions to take.
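To make the manual step concrete, here is a rough sketch of the lookup a user has to do today. It is a toy model in plain Python, not Iceberg code: the `specs` dictionary, the field strings, and both helper functions are hypothetical stand-ins, and the `output-spec-id` key is my assumption about the name of the compaction option, so please check it against the version you run.

```python
from typing import Dict, Tuple

# Toy stand-in for a table's partition spec history: spec ID -> partition fields.
# The field strings are illustrative only; a real table keeps this in its metadata.
Spec = Tuple[str, ...]


def find_spec_id(specs: Dict[int, Spec], wanted: Spec) -> int:
    """Return the ID of the spec whose fields match `wanted` exactly."""
    for spec_id, fields in sorted(specs.items()):
        if fields == wanted:
            return spec_id
    # This is the caveat from the issue: the spec must already exist on the table.
    raise LookupError(f"spec {wanted!r} was never applied to this table")


def compaction_options(spec_id: int) -> Dict[str, str]:
    """Build the options mapping handed to compaction (the key name is an assumption)."""
    return {"output-spec-id": str(spec_id)}


# Spec 0 is the 'archival' layout created first; spec 1 is the 'active' layout.
specs = {0: ("month(event_ts)",), 1: ("day(event_ts)", "bucket(16, id)")}
archival_id = find_spec_id(specs, ("month(event_ts)",))
options = compaction_options(archival_id)
```

With the proposal, the user would hand over the desired spec directly, and the lookup (or the creation of a new spec when none matches) would happen inside Iceberg instead of in user code.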
Query engine
Spark
Willingness to contribute