Parsing of `partition-spec` JSON from Avro manifest files is not to spec, causing deserialization to fail on files written by pyiceberg

sdd commented 1 week ago

Currently, when we deserialize an Avro manifest, we get the PartitionSpec by deserializing partition-spec into a Vec<PartitionField>, then deserializing partition-spec-id into a string, and then build a PartitionSpec out of these two parts:

https://github.com/apache/iceberg-rust/blob/854171d42199f756d2ad1a81a742ce61d9299f05/crates/iceberg/src/spec/manifest.rs#L744-L773

But the Iceberg spec expects partition-spec to be encoded as an object like this: https://iceberg.apache.org/spec/?h=avro#partition-specs

In reality we need to deserialize partition-spec directly into a PartitionSpec (https://github.com/apache/iceberg-rust/blob/854171d42199f756d2ad1a81a742ce61d9299f05/crates/iceberg/src/spec/partition.rs#L50), not a Vec<PartitionField>.

I can submit a PR to fix. But, we might want to:

First try to parse partition-spec into a PartitionSpec as per the spec,
If that fails, revert to the current behaviour

How should we do this?

Fokko commented 1 week ago

@sdd I think this is an issue on the PyIceberg side that has been fixed over the weekend: https://github.com/apache/iceberg-python/pull/846

sdd commented 6 days ago

Ok, I see. The spec is very confusing on this, with partition-specs being stored in different ways in different places. What's the reason for this?

liurenjie1024 commented 6 days ago

Ok, I see. The spec is very confusing on this, with partition-specs being stored in different ways in different places. What's the reason for this?

cc @Fokko @rdblue

Fokko commented 6 days ago

Ok, I see. The spec is very confusing on this

I think that the bug is a perfect example of this :)

For PyIceberg we don't really utilize this, but I can see three reasons:

Debugging: With Iceberg it can that different processes are writing to the table, this helps to debug and see what kind of schemas are being used. For example, if you have a streaming job that is writing some Parquet, in the meantime a compatible schema update is being applied, then the old version can still be written (since the change is compatible). It is nice to have some additional ways of tracking this.
Expiration of schemas and partition-specs: This is not done today, but with the design of Iceberg it is possible to expire older schemas, specs and sort orders at some point. As you can imagine the metadata JSON can grow quite big if there are a lot of schema updates.
Distributed computing You don't have to serialize all the schemas and send it to the workers, but you can read it from the Avro metadata.

rdblue commented 6 days ago

The reason that both partition-spec and partition-specs are used is that partition-spec was the original way the current spec was tracked in v1. However, because we wanted to enable partition evolution, we added partition-specs and a default-spec-id to track which one should be used when writing. In v2, we deprecated partition-spec.

You should always use partition-specs and default-spec-id, unless the table is old and was written without partition-specs. In that case, fall back to using partition-spec. This is documented in the spec; note the deprecation for partition-spec in the metadata table and the notes in the v2 section.

sdd commented 3 days ago

Thanks Ryan. I think there's some confusion here still though: you mention default-spec-id but I'm talking about manifest files here, not the table metadata. I assume you mean partition-spec-id rather than default-spec-id?

The spec currently says that manifest files should contain partition-spec and partition-spec-id fields, with the value of partition-spec consisting of a list of fields. But, I was encountering manifest files in the wild in which the value for partition-spec was a JSON object of the format { "fields": [<list of fields>], "spec_id": 1 }.

In my situation, iceberg writes were mainly being done by the Java iceberg lib but some were coming from pyiceberg, and I was only seeing the unexpectedly formatted manifest files occasionally, suggesting that these were coming from pyiceberg.

The confusion in the spec comes from the fact that in the table metadata and manifest file sections of the spec, partition specs are described as being serialized into an id and separate list of fields. But at the bottom of the spec in the "Partition Specs" section of Appendix C, it states that:

In some cases partition specs are stored using only the field list instead of the object format that includes the spec ID, like the deprecated partition-spec field in table metadata. The object format should be used unless otherwise noted in this spec.

Since I was seeing manifest files with partition spec metadata that appeared to conform to the object format mentioned here, and the wording here implies that this format should be used, I assumed that the manifest files I was seeing were correct.

apache / iceberg-rust

Parsing of `partition-spec` JSON from Avro manifest files is not to spec, causing deserialization to fail on files written by pyiceberg #419