Apache Iceberg
https://iceberg.apache.org/
Apache License 2.0

Spec: The spec for metadata key `schema-id` in manifest files does not match the library implementation #10927

Open hantangwangd opened 2 months ago

hantangwangd commented 2 months ago

Query engine

N/A

Question

Referring to https://iceberg.apache.org/spec/#manifests: the Iceberg spec says that a manifest file must store the partition spec and other metadata as properties in the Avro file's key-value metadata. Among these key-value properties, schema-id changed from optional to required in v2 metadata.

However, no version of ManifestWriter writes schema-id into the metadata of the corresponding Avro file at all. This looks inconsistent with the spec. Furthermore, there seems to be no need to write this key-value property, since it is not used anywhere. So should we fix this inconsistency? Or did I misunderstand something?
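For anyone who wants to check this on their own manifests, here is a minimal sketch in plain Python that reads the key-value metadata from an Avro object container file header and reports which keys are absent. It follows the Avro container layout (magic bytes, then a map of string to bytes), not Iceberg's own reader. The helper names (`read_file_metadata`, `missing_keys`, `build_header`) and the in-memory demo header are assumptions made for illustration; the demo header simply mimics the behaviour described above, where schema-id is not written.

```python
import io

MAGIC = b"Obj\x01"  # Avro object container file magic bytes


def write_long(value: int) -> bytes:
    """Encode an int as an Avro long (zigzag, then variable-length)."""
    zz = (value << 1) ^ (value >> 63)
    out = bytearray()
    while zz & ~0x7F:
        out.append((zz & 0x7F) | 0x80)
        zz >>= 7
    out.append(zz)
    return bytes(out)


def read_long(stream) -> int:
    """Decode one Avro long from a binary stream."""
    shift, result = 0, 0
    while True:
        byte = stream.read(1)
        if not byte:
            raise EOFError("truncated Avro header")
        result |= (byte[0] & 0x7F) << shift
        if not byte[0] & 0x80:
            break
        shift += 7
    return (result >> 1) ^ -(result & 1)


def read_bytes(stream) -> bytes:
    """Read a length-prefixed Avro bytes/string value."""
    length = read_long(stream)
    data = stream.read(length)
    if len(data) != length:
        raise EOFError("truncated Avro header")
    return data


def read_file_metadata(stream) -> dict:
    """Return the header's key-value metadata as {str: bytes}."""
    if stream.read(4) != MAGIC:
        raise ValueError("not an Avro object container file")
    meta = {}
    while True:
        count = read_long(stream)
        if count == 0:  # a zero-count block terminates the map
            break
        if count < 0:  # negative count is followed by the block's byte size
            count = -count
            read_long(stream)
        for _ in range(count):
            key = read_bytes(stream).decode("utf-8")
            meta[key] = read_bytes(stream)
    return meta


def missing_keys(meta, required=("schema-id", "partition-spec-id")):
    """List the keys from `required` that are absent from the metadata."""
    return [key for key in required if key not in meta]


def build_header(meta: dict) -> bytes:
    """Hypothetical helper: build an in-memory header for the demo only."""
    out = bytearray(MAGIC)
    if meta:
        out += write_long(len(meta))
        for key, value in meta.items():
            kb = key.encode("utf-8")
            out += write_long(len(kb)) + kb + write_long(len(value)) + value
    out += write_long(0) + bytes(16)  # end of map, placeholder sync marker
    return bytes(out)


# Demo header with illustrative values and no schema-id key.
header = build_header({
    "schema": b'{"type":"struct","schema-id":0,"fields":[]}',
    "partition-spec": b"[]",
    "partition-spec-id": b"0",
})
meta = read_file_metadata(io.BytesIO(header))
print(missing_keys(meta))  # prints ['schema-id']
```

To inspect a real manifest, pass a file opened in binary mode to `read_file_metadata` instead of the `io.BytesIO` demo header.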

ajantha-bhat commented 2 months ago

@hantangwangd: Good catch.

The JSON representation of the schema already has the schema-id. So I am not sure why we need one more schema-id field in the Avro metadata.

Also, partition-spec is the JSON representation of the spec's fields, not the entire spec. Hence, we need partition-spec-id.

The schema-id field seems irrelevant in the spec. Let's see what others think.
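The distinction between the two keys can be sketched as follows. This assumes, as described above, that the `schema` value is a JSON object that carries its own schema-id, while the `partition-spec` value is only a JSON list of fields. The field names and values are illustrative and not taken from a real manifest, and the two lookup helpers are hypothetical.

```python
import json

# Illustrative manifest key-value metadata (values made up for this sketch).
manifest_metadata = {
    "schema": json.dumps({
        "type": "struct",
        "schema-id": 3,
        "fields": [
            {"id": 1, "name": "ts", "required": True, "type": "timestamp"},
        ],
    }),
    "partition-spec": json.dumps([
        {"name": "ts_day", "transform": "day", "source-id": 1, "field-id": 1000},
    ]),
    "partition-spec-id": "0",
}


def schema_id_from_schema_json(meta):
    """The schema JSON is an object, so it can carry its own id."""
    return json.loads(meta["schema"]).get("schema-id")


def spec_id_from_spec_json(meta):
    """A bare list of partition fields has nowhere to carry a spec id."""
    parsed = json.loads(meta["partition-spec"])
    return parsed.get("spec-id") if isinstance(parsed, dict) else None


schema_id = schema_id_from_schema_json(manifest_metadata)
spec_id = spec_id_from_spec_json(manifest_metadata)
print(schema_id, spec_id)  # prints: 3 None
```

Under these assumptions the schema id is recoverable from `schema` alone, whereas the spec id has to come from the separate `partition-spec-id` key.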