feat: Safer PartitionSpec & SchemalessPartitionSpec

c-thiel commented 1 month ago

Fixes https://github.com/apache/iceberg-rust/issues/550

This PR is a result of the issue mentioned above, a Slack Discussion and the preparation for the new MetadataBuilder https://github.com/apache/iceberg-rust/pull/587.

Lets first establish, that PartitionSpec should have a bound schema because:

We need to bind the schema to to PartitonSpec to allow internal partition id handling (see @rdblue comment in slack).
It allows for consistent handling of schema related operations, such as determining the partition type without returning a Result.
A PartitionSpec is only valid for the schema is had been build against, and we should not imply otherwise.

If we agree on this, then we have a small Problem with TableMetadata: It contains historic PartitionSpecs that cannot be bound against the current_schema. As a result, we need a type that has the same properties as PartitionSpec but is not bound to a schema. I thought we could use UnboundPartitionSpec for this at first, but it serves a different purpose and would make the very important field_id Optional.

To solve this, I introduce a SchemalessPartitionSpec. Common functions between PartitionSpec and SchemalessPartitionSpec are made available via a common trait. While the trait is public and as such allows to develop functions that take either variant as an input, I am exposing the most important attributes (i.e. fields) still directly, so that the trait typically doesn't need to be imported.

For reference, Java solves this by force-binding potentially non-compatible schemas to a PartitionSpec

Happy to hear your opinions! This really is a bit of a fiddly topic. I don't like to introduce a new type, on the other hand using the API feels more natural to me now. This can also be seen in the tests, where we had a significant number of places where we passed PartitionSpec alongside its Schema around, just to call partition_type eventually.

c-thiel commented 1 month ago

Introducing SchemalessPartitionSpec might be our way to avoid https://github.com/apache/iceberg/issues/4563.

c-thiel commented 1 month ago

@liurenjie1024 , @Xuanwo could you maybe have a brief first look if this is going in a good direction? Then I could start refactoring the Builder.

Xuanwo commented 1 month ago

Hi, @c-thiel, thanks a lot for working on this. This issue does seem complex. I didn't join the discussion before, so I will take some time to go through all the references and then review this PR. Sorry for the delay.

c-thiel commented 1 month ago

Hey @liurenjie1024 , thanks for giving it a look! Regarding 1) they are not identical (see also my argument from the start. FieldId is optional for unbound spec, but it is not for "previously bound" partition specs which are present in TableMetadata. This is a very important field which is why I don't like using the unbound spec in place of the schemaless spec. It would lead to unnecessary error handling in various places where the field is needed. We also can't make the unbound field required.

I think we are OK to neither expose enum nor trait now. Would you be ok with using the trait internally (pub crate) for now so that I don't have to duplicate logic?

liurenjie1024 commented 4 weeks ago

Hi, @c-thiel

Regarding 1) they are not identical (see also my argument from the start. FieldId is optional for unbound spec, but it is not for "previously bound" partition specs which are present in TableMetadata. This is a very important field which is why I don't like using the unbound spec in place of the schemaless spec. It would lead to unnecessary error handling in various places where the field is needed. We also can't make the unbound field required.

This makes sense to me. Typically I don't want to introduce data structures that don't exists in java/python library, since it would be quite confusing for java/python community members to understand and review. But this maybe a special case, and I think java implementation's use of buildUnchecked maybe unsafe either.

I think we are OK to neither expose enum nor trait now. Would you be ok with using the trait internally (pub crate) for now so that I don't have to duplicate logic?

I'm fine with limiting trait to crate only without exposing them to user, but the XXInterface doesn't sound to follow rust's naming convention. There are also other methods for removing duplication logics, such as macro. I think macro maybe a better tool here?

Xuanwo commented 4 weeks ago

I believe using enum here is sufficient. It's acceptable for us to have some duplicated code since we know the APIs we need are limited and unlikely to expand significantly in the future.

I want to avoid using too many macros in our codebase, as they can make it difficult to read and maintain, particularly at our API level.

liurenjie1024 commented 4 weeks ago

I believe using enum here is sufficient. It's acceptable for us to have some duplicated code since we know the APIs we need are limited and unlikely to expand significantly in the future.

I want to avoid using too many macros in our codebase, as they can make it difficult to read and maintain, particularly at our API level.

enum also sounds good to me.

c-thiel commented 4 weeks ago

Enum it is then. I'll refractor it by the end of this week. Thanks @Xuanwo @liurenjie1024!

c-thiel commented 3 weeks ago

@Xuanwo, @liurenjie1024 ready for another round. Ditched both the trait and the enum. Bare with me:

As already mentioned earlier, the enum or the trait were just a vehicle to reduce duplication of logic and to clarify that there are three variants of partition specs with different nuances. However the enum should rarely be used directly, as individual variants are much more powerful with a clear hierarchy: Bound > Schemaless > Unbound. Thus, it would be bad to ship the enum around to functions that require the Bound or Schemaless variants (i.e. need a field_id). It would always require matching and error handling. Thus, we use the specific variants pretty much everywhere anyway.

If we agree that we shouldn't use the enum to pass PartitionSpecs, then having an enum with owned fields is bad, because it requires a lot of clone. This is why I oped for enum<'a> PartitionSpec<'a> in https://github.com/apache/iceberg-rust/commit/9562d22a2cc550ecaf04cb857b0ad15e67c94603.

I didn't like the enum though as I didn't want to use it anyway. So in my last commit I got rid of that as well. We gain a lot less code and API surface. We loose quite a bit of interoperability between different schema variants. What especially hurts is the is_compatible_with method, which used to be completely cross-functional for all variants. I implement it now only for the single combination needed by the TableMetadataBuilder now (Schemaless compatible with Unbound)

Let me know what you think!

liurenjie1024 commented 3 weeks ago

Also please merge with main branch to keep it updated.

liurenjie1024 commented 2 weeks ago

@Xuanwo, @liurenjie1024 ready for another round. Ditched both the trait and the enum. Bare with me:

As already mentioned earlier, the enum or the trait were just a vehicle to reduce duplication of logic and to clarify that there are three variants of partition specs with different nuances. However the enum should rarely be used directly, as individual variants are much more powerful with a clear hierarchy: Bound > Schemaless > Unbound. Thus, it would be bad to ship the enum around to functions that require the Bound or Schemaless variants (i.e. need a field_id). It would always require matching and error handling. Thus, we use the specific variants pretty much everywhere anyway.

If we agree that we shouldn't use the enum to pass PartitionSpecs, then having an enum with owned fields is bad, because it requires a lot of clone. This is why I oped for enum<'a> PartitionSpec<'a> in 9562d22.

I didn't like the enum though as I didn't want to use it anyway. So in my last commit I got rid of that as well. We gain a lot less code and API surface. We loose quite a bit of interoperability between different schema variants. What especially hurts is the is_compatible_with method, which used to be completely cross-functional for all variants. I implement it now only for the single combination needed by the TableMetadataBuilder now (Schemaless compatible with Unbound)

Let me know what you think!

Sorry I missed this comment. I agree that trait/enum just for reducing duplication of logic is somehow to antipattern, and sometimes a little duplication is acceptable.

liurenjie1024 commented 2 weeks ago

Thanks @c-thiel for this great pr!

apache / iceberg-rust

feat: Safer PartitionSpec & SchemalessPartitionSpec #645