Closed c-thiel closed 2 weeks ago
Introducing SchemalessPartitionSpec
might be our way to avoid https://github.com/apache/iceberg/issues/4563.
@liurenjie1024 , @Xuanwo could you maybe have a brief first look if this is going in a good direction? Then I could start refactoring the Builder.
Hi, @c-thiel, thanks a lot for working on this. This issue does seem complex. I didn't join the discussion before, so I will take some time to go through all the references and then review this PR. Sorry for the delay.
Hey @liurenjie1024 , thanks for giving it a look! Regarding 1) they are not identical (see also my argument from the start. FieldId is optional for unbound spec, but it is not for "previously bound" partition specs which are present in TableMetadata. This is a very important field which is why I don't like using the unbound spec in place of the schemaless spec. It would lead to unnecessary error handling in various places where the field is needed. We also can't make the unbound field required.
I think we are OK to neither expose enum nor trait now. Would you be ok with using the trait internally (pub crate) for now so that I don't have to duplicate logic?
Hi, @c-thiel
Regarding 1) they are not identical (see also my argument from the start. FieldId is optional for unbound spec, but it is not for "previously bound" partition specs which are present in TableMetadata. This is a very important field which is why I don't like using the unbound spec in place of the schemaless spec. It would lead to unnecessary error handling in various places where the field is needed. We also can't make the unbound field required.
This makes sense to me. Typically I don't want to introduce data structures that don't exists in java/python library, since it would be quite confusing for java/python community members to understand and review. But this maybe a special case, and I think java implementation's use of buildUnchecked
maybe unsafe either.
I think we are OK to neither expose enum nor trait now. Would you be ok with using the trait internally (pub crate) for now so that I don't have to duplicate logic?
I'm fine with limiting trait to crate only without exposing them to user, but the XXInterface
doesn't sound to follow rust's naming convention. There are also other methods for removing duplication logics, such as macro. I think macro maybe a better tool here?
I believe using enum
here is sufficient. It's acceptable for us to have some duplicated code since we know the APIs we need are limited and unlikely to expand significantly in the future.
I want to avoid using too many macros in our codebase, as they can make it difficult to read and maintain, particularly at our API level.
I believe using
enum
here is sufficient. It's acceptable for us to have some duplicated code since we know the APIs we need are limited and unlikely to expand significantly in the future.I want to avoid using too many macros in our codebase, as they can make it difficult to read and maintain, particularly at our API level.
enum
also sounds good to me.
Enum it is then. I'll refractor it by the end of this week. Thanks @Xuanwo @liurenjie1024!
@Xuanwo, @liurenjie1024 ready for another round. Ditched both the trait and the enum. Bare with me:
As already mentioned earlier, the enum or the trait were just a vehicle to reduce duplication of logic and to clarify that there are three variants of partition specs with different nuances. However the enum should rarely be used directly, as individual variants are much more powerful with a clear hierarchy: Bound > Schemaless > Unbound. Thus, it would be bad to ship the enum around to functions that require the Bound or Schemaless variants (i.e. need a field_id). It would always require matching and error handling. Thus, we use the specific variants pretty much everywhere anyway.
If we agree that we shouldn't use the enum to pass PartitionSpecs, then having an enum with owned fields is bad, because it requires a lot of clone
. This is why I oped for enum<'a> PartitionSpec<'a>
in https://github.com/apache/iceberg-rust/commit/9562d22a2cc550ecaf04cb857b0ad15e67c94603.
I didn't like the enum though as I didn't want to use it anyway. So in my last commit I got rid of that as well.
We gain a lot less code and API surface. We loose quite a bit of interoperability between different schema variants. What especially hurts is the is_compatible_with
method, which used to be completely cross-functional for all variants.
I implement it now only for the single combination needed by the TableMetadataBuilder
now (Schemaless compatible with Unbound)
Let me know what you think!
Also please merge with main branch to keep it updated.
@Xuanwo, @liurenjie1024 ready for another round. Ditched both the trait and the enum. Bare with me:
As already mentioned earlier, the enum or the trait were just a vehicle to reduce duplication of logic and to clarify that there are three variants of partition specs with different nuances. However the enum should rarely be used directly, as individual variants are much more powerful with a clear hierarchy: Bound > Schemaless > Unbound. Thus, it would be bad to ship the enum around to functions that require the Bound or Schemaless variants (i.e. need a field_id). It would always require matching and error handling. Thus, we use the specific variants pretty much everywhere anyway.
If we agree that we shouldn't use the enum to pass PartitionSpecs, then having an enum with owned fields is bad, because it requires a lot of
clone
. This is why I oped forenum<'a> PartitionSpec<'a>
in 9562d22.I didn't like the enum though as I didn't want to use it anyway. So in my last commit I got rid of that as well. We gain a lot less code and API surface. We loose quite a bit of interoperability between different schema variants. What especially hurts is the
is_compatible_with
method, which used to be completely cross-functional for all variants. I implement it now only for the single combination needed by theTableMetadataBuilder
now (Schemaless compatible with Unbound)Let me know what you think!
Sorry I missed this comment. I agree that trait/enum just for reducing duplication of logic is somehow to antipattern, and sometimes a little duplication is acceptable.
Thanks @c-thiel for this great pr!
Fixes https://github.com/apache/iceberg-rust/issues/550
This PR is a result of the issue mentioned above, a Slack Discussion and the preparation for the new MetadataBuilder https://github.com/apache/iceberg-rust/pull/587.
Lets first establish, that
PartitionSpec
should have a bound schema because:PartitonSpec
to allow internal partition id handling (see @rdblue comment in slack).Result
.If we agree on this, then we have a small Problem with
TableMetadata
: It contains historic PartitionSpecs that cannot be bound against thecurrent_schema
. As a result, we need a type that has the same properties asPartitionSpec
but is not bound to a schema. I thought we could useUnboundPartitionSpec
for this at first, but it serves a different purpose and would make the very importantfield_id
Optional.To solve this, I introduce a
SchemalessPartitionSpec
. Common functions betweenPartitionSpec
andSchemalessPartitionSpec
are made available via a common trait. While the trait is public and as such allows to develop functions that take either variant as an input, I am exposing the most important attributes (i.e.fields
) still directly, so that the trait typically doesn't need to be imported.For reference, Java solves this by force-binding potentially non-compatible schemas to a
PartitionSpec
TableMetadataBuilder
: https://github.com/apache/iceberg/blob/5a2c1c9c8dd8d61e08bbad713c24e6b726eee323/core/src/main/java/org/apache/iceberg/TableMetadata.java#L734TableMetadataParser
: https://github.com/apache/iceberg/blob/5a2c1c9c8dd8d61e08bbad713c24e6b726eee323/core/src/main/java/org/apache/iceberg/TableMetadataParser.java#L393Happy to hear your opinions! This really is a bit of a fiddly topic. I don't like to introduce a new type, on the other hand using the API feels more natural to me now. This can also be seen in the tests, where we had a significant number of places where we passed
PartitionSpec
alongside itsSchema
around, just to callpartition_type
eventually.