Closed sdd closed 6 days ago
@sdd I think this is an issue on the PyIceberg side that has been fixed over the weekend: https://github.com/apache/iceberg-python/pull/846
Ok, I see. The spec is very confusing on this, with partition-specs being stored in different ways in different places. What's the reason for this?
Ok, I see. The spec is very confusing on this, with partition-specs being stored in different ways in different places. What's the reason for this?
cc @Fokko @rdblue
Ok, I see. The spec is very confusing on this
I think that the bug is a perfect example of this :)
For PyIceberg we don't really utilize this, but I can see three reasons:
The reason that both partition-spec
and partition-specs
are used is that partition-spec
was the original way the current spec was tracked in v1. However, because we wanted to enable partition evolution, we added partition-specs
and a default-spec-id
to track which one should be used when writing. In v2, we deprecated partition-spec
.
You should always use partition-specs
and default-spec-id
, unless the table is old and was written without partition-specs
. In that case, fall back to using partition-spec
. This is documented in the spec; note the deprecation for partition-spec
in the metadata table and the notes in the v2 section.
Thanks Ryan. I think there's some confusion here still though: you mention default-spec-id
but I'm talking about manifest files here, not the table metadata. I assume you mean partition-spec-id
rather than default-spec-id
?
The spec currently says that manifest files should contain partition-spec
and partition-spec-id
fields, with the value of partition-spec
consisting of a list of fields. But, I was encountering manifest files in the wild in which the value for partition-spec
was a JSON object of the format { "fields": [<list of fields>], "spec_id": 1 }
.
In my situation, iceberg writes were mainly being done by the Java iceberg lib but some were coming from pyiceberg, and I was only seeing the unexpectedly formatted manifest files occasionally, suggesting that these were coming from pyiceberg.
The confusion in the spec comes from the fact that in the table metadata and manifest file sections of the spec, partition specs are described as being serialized into an id and separate list of fields. But at the bottom of the spec in the "Partition Specs" section of Appendix C, it states that:
In some cases partition specs are stored using only the field list instead of the object format that includes the spec ID, like the deprecated partition-spec field in table metadata. The object format should be used unless otherwise noted in this spec.
Since I was seeing manifest files with partition spec metadata that appeared to conform to the object format mentioned here, and the wording here implies that this format should be used, I assumed that the manifest files I was seeing were correct.
Currently, when we deserialize an Avro manifest, we get the
PartitionSpec
by deserializingpartition-spec
into aVec<PartitionField>
, then deserializingpartition-spec-id
into a string, and then build aPartitionSpec
out of these two parts:https://github.com/apache/iceberg-rust/blob/854171d42199f756d2ad1a81a742ce61d9299f05/crates/iceberg/src/spec/manifest.rs#L744-L773
But the Iceberg spec expects partition-spec to be encoded as an object like this: https://iceberg.apache.org/spec/?h=avro#partition-specs
In reality we need to deserialize
partition-spec
directly into aPartitionSpec
(https://github.com/apache/iceberg-rust/blob/854171d42199f756d2ad1a81a742ce61d9299f05/crates/iceberg/src/spec/partition.rs#L50), not aVec<PartitionField>
.I can submit a PR to fix. But, we might want to:
partition-spec
into aPartitionSpec
as per the spec,How should we do this?