Open Xuanwo opened 1 week ago
Hi, @Fokko, I created an issue to make it easier for us to continue the discussion.
Also cc @liurenjie1024 and @c-thiel.
@Xuanwo Thanks, appreciate it. I'm also preparing a PR for Java that illustrates the solution more clearly.
I asked same question, and here is the answer from @c-thiel : https://github.com/apache/iceberg-rust/pull/645#issuecomment-2431923524
And the reason I this is reasonable is here: https://github.com/apache/iceberg-rust/pull/645#issuecomment-2434716448
Here is the reason why I accept it:
SchemalessPartitionSpec
is used when we load table metadata and build partition spec from table metadata. Since there is no schema id associated with partition spec, in java it's built with bindUncheck
with current schema in current snapshot. That's to say, java implementation assumes that the partiton specs are valid with current schema without checking. I'm skeptical that it's a safe approach.
I agree with @c-thiel that it's not appropriate to use UnboundPartitionSpec
since UnboundPartitionSpec
means not validated, user constructed partition spec, which is not the case for partiton specs from table metadata.
I think I have most of my motivation in https://github.com/apache/iceberg-rust/pull/645#issue-2543573501 and the comments referenced above. I follow the argument from Renjie.
For me UnboundPartitionSpec
has a different semantic than a previously bound spec, manifested also in the Optional Field ID. Using UnboundPartitionSpec
in TableMetadata
would introduce lines of error handling in other places, where we expect the FieldId
to be present. As field_id
is a required field in the spec for TableMetadata
, we would have to use SchemalessPartitionSpec
at least to deserialize the json before transforming it to UnboundPartitionSpec
.
@liurenjie1024, @Fokko, @Xuanwo we should probably come to a conclusion here before the next release. @Fokko is there already something in your Java PR which we can peek at? What do you have in mind for a solution?
Unbound (not bound to a schema) and schemaless are the same, so I find them confusing.
Regarding https://github.com/apache/iceberg-rust/pull/645#issue-2543573501 there are some incorrect assumptions there:
If we agree on this, then we have a small Problem with TableMetadata: It contains historic PartitionSpecs that cannot be bound against the current_schema. As a result, we need a type that has the same properties as PartitionSpec but is not bound to a schema. I thought we could use UnboundPartitionSpec for this at first, but it serves a different purpose and would make the very important
field_id
Optional.
I don't understand this statement. If I look at the UnboundPartitionSpec
:
#[derive(Debug, Serialize, Deserialize, PartialEq, Eq, Clone, TypedBuilder)]
#[serde(rename_all = "kebab-case")]
pub struct UnboundPartitionField {
/// A source column id from the table’s schema
pub source_id: i32,
/// A partition field id that is used to identify a partition field and is unique within a partition spec.
/// In v2 table metadata, it is unique across all partition specs.
#[builder(default, setter(strip_option))]
pub field_id: Option<i32>,
/// A partition name.
pub name: String,
/// A transform that is applied to the source column to produce a partition value.
pub transform: Transform,
}
/// Unbound partition spec can be built without a schema and later bound to a schema.
#[derive(Debug, Serialize, Deserialize, PartialEq, Eq, Clone, Default)]
#[serde(rename_all = "kebab-case")]
pub struct UnboundPartitionSpec {
/// Identifier for PartitionSpec
pub(crate) spec_id: Option<i32>,
/// Details of the partition spec
pub(crate) fields: Vec<UnboundPartitionField>,
}
A PartitionSpec is only valid for the schema is had been build against, and we should not imply otherwise.
This is not entirely correct. For example, when a field is being promoted in a valid way, then it is still valid to bind against the same source-id.
So going from the problem space to the solution space. I was looking at the Java code and the PyIceberg code, and at PyIceberg we took a slightly different approach that might be interesting for Iceberg-rust as well. As mentioned earlier, the source ID might evolve (assume compatible ones int -> long
, float -> double
, etc) over time. The most obvious example is the identity-partition where the sourceType
is equal to the resultType
. Therefore we allow binding with different schema's in PyIceberg: https://github.com/apache/iceberg-python/blob/b2f0a9e5cd7dd548e19cdcdd7f9205f03454369a/pyiceberg/partitioning.py#L203-L225 I think that might be a better approach for Rust as well, as this is now hard-coded:
When the field is not in the current spec anymore, it is also unlikely that you would filter on that field, therefore that field from the partition-spec could also be ignored. The downside is that with strongly typed languages, we need to know the type upfront, and otherwise we only know it when we read the actual Avro file (that has the schema with the type). Therefore I'm playing around with a solution for this on the Java side: https://github.com/apache/iceberg/pull/11542
I hope this helps!
Unbound (not bound to a schema) and schemaless are the same, so I find them confusing.
They are not quite - a previously bound schema has a required field_id
(reference to PartitionField
), while unbound has an optional field_id
(Option<i32>
) as it references UnboundPartitionField
. TableMetadata.partition_specs
requires field_id
, so we shouldn't use UnboundPartitionSpec
there.
A PartitionSpec is only valid for the schema is had been build against, and we should not imply otherwise.
This is not entirely correct. For example, when a field is being promoted in a valid way, then it is still valid to bind against the same source-id.
True, that was too strict. What I should have expressed it that it is not guaranteed to be compatible with the current schema, which is why Java uses bindUnchecked()
.
I think with the current implementation we are quite close to what you propose - binding a SchemalessPartitionSpec
or UnboundPartitionSpec
to a new schema is easily possible by using spec.to_unbound().bind(new_schema)
. Currently this binding can fail. What is missing is the notion that you described with
When the field is not in the current spec anymore, it is also unlikely that you would filter on that field, therefore that field from the partition-spec could also be ignored
This would require a more permissive bind
method, or an additional flag, that would ignore unknown fields, similar to what you try to incorporate into Java with Any
. The important point is that dynamic re-binding is already possible today.
The BoundPartitionSpec
serves two additional purposes - it is a way to compute the binding and store the result typesafe. This is useful for
default_spec
against the current_schema
- which is the most important combination. I don't see why we shouldn't store the resulting type as we need to assert compatibility anyway for this combination.addPartitionSpec
which eventually runs this code: https://github.com/apache/iceberg/blob/acd7cc1126b192ccb53ad8198bda37e983aa4c6c/core/src/main/java/org/apache/iceberg/TableMetadata.java#L768. Note that currently we don't have this semantic in the builder, which in java performs name-mapping to potentially change the source_id
. Instead we pass an UnboundSpec
which trusts the provides source_id
and not the field name.Summing it up, my approach would be to keep all three types - I simply don't see a good way around them.
However, we should add a more permissive bind method that incorporates better the semantic of binding to an incompatible schema. This would be an alternative to the bindUnchecked()
in Java. Maybe we can just follow the PR you are working on in Java once that's ready. This method would not necessarily be able to determine a type for each field in a spec
.
And the
PartitionField
andUnboundPartitionField
are basically the same.If we cannot find the field anymore, the best thing to do is to include the file in the query plan.
_Originally posted by @Fokko in https://github.com/apache/iceberg-rust/pull/645#discussion_r1836605349_