apache / iceberg-rust

Apache Iceberg
https://rust.iceberg.apache.org/
Apache License 2.0

Why shouldn't we return an `UnboundPartitionSpec` instead? #694

Open Xuanwo opened 1 week ago

Xuanwo commented 1 week ago
I'm not convinced that we need another type, such as `SchemalessPartitionSpec`. Why shouldn't we return an `UnboundPartitionSpec` instead? Looking at the signatures:
pub struct UnboundPartitionSpec {
    /// Identifier for PartitionSpec
    pub(crate) spec_id: Option<i32>,
    /// Details of the partition spec
    pub(crate) fields: Vec<UnboundPartitionField>,
}
pub struct SchemalessPartitionSpec {
    /// Identifier for PartitionSpec
    spec_id: i32,
    /// Details of the partition spec
    fields: Vec<PartitionField>,
}

And `PartitionField` and `UnboundPartitionField` are basically the same.
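To make "basically the same" concrete, here is a simplified side-by-side sketch of the two field types. These are stub definitions for illustration, not the crate's actual code (`Transform` in particular is stubbed down to one variant); the only structural difference is `field_id: i32` vs `field_id: Option<i32>`.

```rust
// Stub of the transform type; the real enum has many variants.
#[derive(Debug, Clone, PartialEq)]
enum Transform {
    Identity,
}

// Bound/schemaless form: field_id is required.
#[derive(Debug, Clone, PartialEq)]
struct PartitionField {
    source_id: i32,
    field_id: i32,
    name: String,
    transform: Transform,
}

// Unbound form: field_id is optional and may be assigned later when binding.
#[derive(Debug, Clone, PartialEq)]
struct UnboundPartitionField {
    source_id: i32,
    field_id: Option<i32>,
    name: String,
    transform: Transform,
}

fn main() {
    let bound = PartitionField {
        source_id: 1,
        field_id: 1000,
        name: "ts_day".to_string(),
        transform: Transform::Identity,
    };
    // Going bound -> unbound is lossless...
    let unbound = UnboundPartitionField {
        source_id: bound.source_id,
        field_id: Some(bound.field_id),
        name: bound.name.clone(),
        transform: bound.transform.clone(),
    };
    assert_eq!(unbound.field_id, Some(1000));
    // ...but going the other way forces callers to handle the None case.
}
```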

Introducing SchemalessPartitionSpec might be our way to avoid https://github.com/apache/iceberg/issues/4563.

If we cannot find the field anymore, the best thing to do is to include the file in the query plan.

_Originally posted by @Fokko in https://github.com/apache/iceberg-rust/pull/645#discussion_r1836605349_

Xuanwo commented 1 week ago

Hi, @Fokko, I created an issue to make it easier for us to continue the discussion.

Also cc @liurenjie1024 and @c-thiel.

Fokko commented 1 week ago

@Xuanwo Thanks, appreciate it. I'm also preparing a PR for Java that illustrates the solution more clearly.

liurenjie1024 commented 1 week ago

I asked the same question, and here is the answer from @c-thiel: https://github.com/apache/iceberg-rust/pull/645#issuecomment-2431923524

And the reason I think this is reasonable is here: https://github.com/apache/iceberg-rust/pull/645#issuecomment-2434716448

liurenjie1024 commented 1 week ago

Here is the reason why I accept it:

SchemalessPartitionSpec is used when we load table metadata and build partition specs from it. Since there is no schema id associated with a partition spec, in Java it's bound with bindUnchecked against the current schema of the current snapshot. That is to say, the Java implementation assumes that the partition specs are valid against the current schema without checking. I'm skeptical that this is a safe approach.

I agree with @c-thiel that it's not appropriate to use UnboundPartitionSpec, since UnboundPartitionSpec means a not-yet-validated, user-constructed partition spec, which is not the case for partition specs loaded from table metadata.

c-thiel commented 1 week ago

I think I have most of my motivation in https://github.com/apache/iceberg-rust/pull/645#issue-2543573501 and the comments referenced above. I follow Renjie's argument. For me, UnboundPartitionSpec has different semantics than a previously bound spec, which also manifests in the optional field_id. Using UnboundPartitionSpec in TableMetadata would introduce error handling in other places where we expect the field_id to be present. Since field_id is a required field in the spec for TableMetadata, we would have to use SchemalessPartitionSpec at least to deserialize the JSON before transforming it into an UnboundPartitionSpec.
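The error-handling cost mentioned above can be sketched in a few lines. This is a hypothetical example (the struct and `require_field_id` are illustrative, not the crate's API): if metadata carried the unbound type, every consumer needing a field id would have to unwrap an `Option`, even though table metadata JSON guarantees the id is present.

```rust
// Illustrative stand-in for the unbound field type.
#[derive(Debug, Clone)]
struct UnboundPartitionField {
    source_id: i32,
    field_id: Option<i32>, // optional in the unbound form
    name: String,
}

// Error handling that the Option forces onto every caller.
fn require_field_id(f: &UnboundPartitionField) -> Result<i32, String> {
    f.field_id
        .ok_or_else(|| format!("partition field '{}' has no field id", f.name))
}

fn main() {
    // A field loaded from table metadata always has an id...
    let from_metadata = UnboundPartitionField {
        source_id: 1,
        field_id: Some(1000),
        name: "ts_day".into(),
    };
    // ...while a user-constructed unbound field may not.
    let user_built = UnboundPartitionField {
        source_id: 1,
        field_id: None,
        name: "ts_day".into(),
    };
    assert_eq!(require_field_id(&from_metadata), Ok(1000));
    assert!(require_field_id(&user_built).is_err());
}
```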

c-thiel commented 5 days ago

@liurenjie1024, @Fokko, @Xuanwo we should probably come to a conclusion here before the next release. @Fokko is there already something in your Java PR which we can peek at? What do you have in mind for a solution?

Fokko commented 5 days ago

Unbound (not bound to a schema) and schemaless are the same, so I find them confusing.

Regarding https://github.com/apache/iceberg-rust/pull/645#issue-2543573501 there are some incorrect assumptions there:

If we agree on this, then we have a small problem with TableMetadata: it contains historic PartitionSpecs that cannot be bound against the current_schema. As a result, we need a type that has the same properties as PartitionSpec but is not bound to a schema. I thought we could use UnboundPartitionSpec for this at first, but it serves a different purpose and would make the very important field_id Optional.

I don't understand this statement. If I look at the UnboundPartitionSpec:

#[derive(Debug, Serialize, Deserialize, PartialEq, Eq, Clone, TypedBuilder)]
#[serde(rename_all = "kebab-case")]
pub struct UnboundPartitionField {
    /// A source column id from the table’s schema
    pub source_id: i32,
    /// A partition field id that is used to identify a partition field and is unique within a partition spec.
    /// In v2 table metadata, it is unique across all partition specs.
    #[builder(default, setter(strip_option))]
    pub field_id: Option<i32>,
    /// A partition name.
    pub name: String,
    /// A transform that is applied to the source column to produce a partition value.
    pub transform: Transform,
}

/// Unbound partition spec can be built without a schema and later bound to a schema.
#[derive(Debug, Serialize, Deserialize, PartialEq, Eq, Clone, Default)]
#[serde(rename_all = "kebab-case")]
pub struct UnboundPartitionSpec {
    /// Identifier for PartitionSpec
    pub(crate) spec_id: Option<i32>,
    /// Details of the partition spec
    pub(crate) fields: Vec<UnboundPartitionField>,
}

A PartitionSpec is only valid for the schema it had been built against, and we should not imply otherwise.

This is not entirely correct. For example, when a field is being promoted in a valid way, then it is still valid to bind against the same source-id.

So, going from the problem space to the solution space: I was looking at the Java code and the PyIceberg code, and in PyIceberg we took a slightly different approach that might be interesting for iceberg-rust as well. As mentioned earlier, the type behind a source ID might evolve over time through compatible promotions (int -> long, float -> double, etc.). The most obvious example is the identity partition, where the sourceType is equal to the resultType. Therefore we allow binding with different schemas in PyIceberg: https://github.com/apache/iceberg-python/blob/b2f0a9e5cd7dd548e19cdcdd7f9205f03454369a/pyiceberg/partitioning.py#L203-L225 I think that might be a better approach for Rust as well, as this is currently hard-coded:

https://github.com/apache/iceberg-rust/blob/50345196c87b00badc1a6490aef284e84f4c3e9a/crates/iceberg/src/spec/partition.rs#L60-L70
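The "compatible promotion" idea could look roughly like this. The enum and function below are a minimal sketch with hypothetical names, covering only two of the promotions the Iceberg spec allows (int -> long, float -> double); a real bind would consult the full promotion rules.

```rust
// Minimal stand-in for a primitive type enum.
#[derive(Debug, Clone, Copy, PartialEq)]
enum PrimitiveType {
    Int,
    Long,
    Float,
    Double,
}

// Returns true if `new` is the same type as `old`, or a legal promotion of it.
fn is_compatible_promotion(old: PrimitiveType, new: PrimitiveType) -> bool {
    use PrimitiveType::*;
    match (old, new) {
        (a, b) if a == b => true, // unchanged type is always fine
        (Int, Long) => true,      // widening integer promotion
        (Float, Double) => true,  // widening float promotion
        _ => false,               // anything else (including demotion) is rejected
    }
}

fn main() {
    use PrimitiveType::*;
    // A source column promoted from int to long can still be bound.
    assert!(is_compatible_promotion(Int, Long));
    // Demoting long back to int is not a legal evolution.
    assert!(!is_compatible_promotion(Long, Int));
}
```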

When the field is no longer in the current spec, it is also unlikely that you would filter on it, so that field of the partition spec could be ignored. The downside is that in strongly typed languages we need to know the type upfront; otherwise we only learn it when reading the actual Avro file (which carries the schema with the type). Therefore I'm playing around with a solution for this on the Java side: https://github.com/apache/iceberg/pull/11542

I hope this helps!

c-thiel commented 5 days ago

Unbound (not bound to a schema) and schemaless are the same, so I find them confusing.

They are not quite: a previously bound spec has a required field_id (it references PartitionField), while an unbound one has an optional field_id (Option<i32>), as it references UnboundPartitionField. TableMetadata.partition_specs requires field_id, so we shouldn't use UnboundPartitionSpec there.

A PartitionSpec is only valid for the schema it had been built against, and we should not imply otherwise.

This is not entirely correct. For example, when a field is being promoted in a valid way, then it is still valid to bind against the same source-id.

True, that was too strict. What I should have expressed is that it is not guaranteed to be compatible with the current schema, which is why Java uses bindUnchecked(). I think with the current implementation we are quite close to what you propose: binding a SchemalessPartitionSpec or UnboundPartitionSpec to a new schema is easily possible via spec.to_unbound().bind(new_schema). Currently this binding can fail. What is missing is the notion that you described with

When the field is not in the current spec anymore, it is also unlikely that you would filter on that field, therefore that field from the partition-spec could also be ignored

This would require a more permissive bind method, or an additional flag, that ignores unknown fields, similar to what you are trying to incorporate into Java with Any. The important point is that dynamic re-binding is already possible today.
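A "more permissive bind" could be sketched like this. Everything here is illustrative (the struct is simplified and `bind_permissive` is a hypothetical name, not the crate's API): fields whose source column still exists get bound, and the rest are silently dropped instead of failing the whole bind.

```rust
// Simplified partition field for the sketch.
#[derive(Debug, Clone, PartialEq)]
struct PartitionField {
    source_id: i32,
    field_id: i32,
    name: String,
}

// The schema is modeled here as just the set of existing column ids.
fn bind_permissive(fields: Vec<PartitionField>, schema_ids: &[i32]) -> Vec<PartitionField> {
    fields
        .into_iter()
        // a missing source column means the field is ignored, not an error
        .filter(|f| schema_ids.contains(&f.source_id))
        .collect()
}

fn main() {
    let fields = vec![
        PartitionField { source_id: 1, field_id: 1000, name: "ts_day".to_string() },
        PartitionField { source_id: 9, field_id: 1001, name: "dropped_col".to_string() },
    ];
    // Column 9 no longer exists in the schema, so its field is skipped.
    let bound = bind_permissive(fields, &[1, 2, 3]);
    assert_eq!(bound.len(), 1);
    assert_eq!(bound[0].name, "ts_day");
}
```

The trade-off discussed above still applies: dropping the field means its type is unknown until the Avro file is read, which is what the flag or Any-style type would have to paper over.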

The BoundPartitionSpec serves two additional purposes: it is a way to compute the binding and store the result in a type-safe way. This is useful for

  1. Pre-computing the bind, as we currently do for default_spec against the current_schema - which is the most important combination. I don't see why we shouldn't store the resulting type, as we need to assert compatibility for this combination anyway.
  2. Having a way to get the source field names for a partition spec. This is useful when a spec is transferred from one context to another. In Java this is used for addPartitionSpec, which eventually runs this code: https://github.com/apache/iceberg/blob/acd7cc1126b192ccb53ad8198bda37e983aa4c6c/core/src/main/java/org/apache/iceberg/TableMetadata.java#L768. Note that we currently don't have this semantic in the builder, which in Java performs name-mapping to potentially change the source_id. Instead we pass an UnboundSpec, which trusts the provided source_id and not the field name.
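The name-mapping step in point 2 can be sketched as follows. This is a hypothetical illustration (the schemas are modeled as plain maps and `remap_source_id` is an invented name): resolve the source column's *name* in the originating schema, then look that name up in the target schema to get the new source_id, rather than trusting the incoming id.

```rust
use std::collections::HashMap;

// Remap a source_id from an originating schema into a target schema by name.
fn remap_source_id(
    old_schema: &HashMap<i32, String>, // id -> name in the originating schema
    new_schema: &HashMap<String, i32>, // name -> id in the target schema
    source_id: i32,
) -> Result<i32, String> {
    // First recover the column name that the id refers to.
    let name = old_schema
        .get(&source_id)
        .ok_or_else(|| format!("source id {source_id} not in originating schema"))?;
    // Then trust the name, not the incoming id, in the target schema.
    new_schema
        .get(name)
        .copied()
        .ok_or_else(|| format!("column '{name}' not in target schema"))
}

fn main() {
    let old: HashMap<i32, String> = [(1, "event_ts".to_string())].into();
    let new: HashMap<String, i32> = [("event_ts".to_string(), 4)].into();
    // Same column name, but a different id in the target table.
    assert_eq!(remap_source_id(&old, &new, 1), Ok(4));
}
```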

Summing it up, my approach would be to keep all three types - I simply don't see a good way around them. However, we should add a more permissive bind method that better captures the semantics of binding to an incompatible schema. This would be an alternative to Java's bindUnchecked(). Maybe we can just follow the PR you are working on in Java once it's ready. This method would not necessarily be able to determine a type for each field in a spec.