apache / iceberg

Apache Iceberg
https://iceberg.apache.org/
Apache License 2.0

Inconsistent id definition on Flink resolvedSchema conversion to iceberg schema #11128

Open tonycox opened 1 month ago

tonycox commented 1 month ago

Apache Iceberg version

1.6.1 (latest release)

Query engine

Flink

Please describe the bug 🐞

When I try to convert a Flink ResolvedSchema to an Iceberg Schema via

import org.apache.iceberg.flink.FlinkSchemaUtil
FlinkSchemaUtil.convert(tableEnv.fromDataStream(dataStream).resolvedSchema)

it returns the following schema definition:

table {
  0: event_time: optional timestamptz
  1: name: optional string
  2: json_map: optional map<string, string>
}

I believe this is incorrect: whenever I call catalog.loadTable(id).schema(), it returns

table {
  1: event_time: optional timestamptz
  2: name: optional string
  3: json_map: optional map<string, string>
}

As a result, id validation fails if, say, I try to update the table schema with the one extracted from the Flink table.

The ids are assigned in these lines: https://github.com/apache/iceberg/blob/799120636e8f5f19c1d7f217ab4968f524bb1246/flink/v1.20/flink/src/main/java/org/apache/iceberg/flink/FlinkTypeToType.java#L187-L189
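
To make the mismatch concrete, here is a minimal, standalone sketch of re-aligning the converted ids with the table's ids by matching field names. It does not use Iceberg classes: `Field` and `reassignIdsByName` are hypothetical names introduced for illustration, and only top-level columns are modeled. Iceberg's `TypeUtil.reassignIds(schema, idSourceSchema)` appears to do the same by-name reassignment for real schemas, but check the Javadoc of your version before relying on it.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ReassignIdsSketch {
  /** Hypothetical stand-in for a top-level column: id, name, and type as plain text. */
  record Field(int id, String name, String type) {}

  /** Returns `converted` with each id replaced by the id of the same-named field in `table`. */
  static List<Field> reassignIdsByName(List<Field> converted, List<Field> table) {
    Map<String, Integer> idsByName = new HashMap<>();
    for (Field f : table) {
      idsByName.put(f.name(), f.id());
    }
    List<Field> result = new ArrayList<>();
    for (Field f : converted) {
      Integer id = idsByName.get(f.name());
      if (id == null) {
        throw new IllegalArgumentException("No field named " + f.name() + " in the table schema");
      }
      result.add(new Field(id, f.name(), f.type()));
    }
    return result;
  }

  public static void main(String[] args) {
    // Ids as printed for the schema converted from Flink (starting at 0).
    List<Field> fromFlink = List.of(
        new Field(0, "event_time", "timestamptz"),
        new Field(1, "name", "string"),
        new Field(2, "json_map", "map<string, string>"));
    // Ids as printed for the schema loaded from the catalog (starting at 1).
    List<Field> fromTable = List.of(
        new Field(1, "event_time", "timestamptz"),
        new Field(2, "name", "string"),
        new Field(3, "json_map", "map<string, string>"));
    // After reassignment the converted fields carry the table's ids.
    System.out.println(reassignIdsByName(fromFlink, fromTable).equals(fromTable));
  }
}
```

Matching by name only works while every converted field already exists in the table; a field that is missing on the table side is reported as an error here rather than being given a guessed id.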

pvary commented 1 month ago

If you already have an Iceberg table, the source of truth is the Iceberg table. Other conversions are there for generating the schema for the Iceberg table creation.

Generating the same ids is not easy, because schema evolution can leave "skipped" ids.

tonycox commented 1 month ago

@pvary In the example the schema is the same, but in my case I wanted to have an "implicit" schema evolution on write. Say I add an additional field to the source event; at the deployment step, once the pipeline detects that the schema has changed, it evolves the target schema as well. Right now I skip ids in schema validation everywhere, even in unit tests, because they are always inconsistent, and I rely only on the ordering of the fields and their presence or absence.
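
The "implicit" evolution described here can be sketched as a union by name: fields the table already has keep their ids, and fields seen only in the incoming schema get fresh ids above the table's last assigned id. This is a standalone model, not Iceberg code; `Field`, `unionByName`, and the `source` column are hypothetical, and `lastColumnId` is passed explicitly because in a real table nested elements (such as a map's key and value) also consume ids. Iceberg's `UpdateSchema.unionByNameWith(Schema)` looks like the built-in counterpart, assuming it is available in your version.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class UnionByNameSketch {
  /** Hypothetical stand-in for a top-level column: id, name, and type as plain text. */
  record Field(int id, String name, String type) {}

  /**
   * Returns the table's fields plus every incoming field whose name is new.
   * Existing fields keep their ids; new fields get ids above lastColumnId.
   * The ids carried by the incoming fields are ignored on purpose.
   */
  static List<Field> unionByName(List<Field> table, List<Field> incoming, int lastColumnId) {
    Set<String> known = new HashSet<>();
    for (Field f : table) {
      known.add(f.name());
    }
    List<Field> evolved = new ArrayList<>(table);
    int nextId = lastColumnId + 1;
    for (Field f : incoming) {
      if (known.add(f.name())) {
        evolved.add(new Field(nextId++, f.name(), f.type()));
      }
    }
    return evolved;
  }

  public static void main(String[] args) {
    List<Field> table = List.of(
        new Field(1, "event_time", "timestamptz"),
        new Field(2, "name", "string"));
    // The source event gained a hypothetical "source" column; its converted ids start at 0.
    List<Field> incoming = List.of(
        new Field(0, "event_time", "timestamptz"),
        new Field(1, "name", "string"),
        new Field(2, "source", "string"));
    for (Field f : unionByName(table, incoming, 2)) {
      System.out.println(f.id() + ": " + f.name() + ": " + f.type());
    }
  }
}
```

With this approach the ids produced by the Flink conversion never need to agree with the table's ids, since only names and types are read from the incoming schema.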

pvary commented 1 month ago

I'm facing a similar challenge. See: https://lists.apache.org/thread/vyw595d0747p33qg886b1o82mcw40523

The visitors could be used to traverse the schema, but you need to match the fields by name. This becomes problematic when column names are reused.
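
A small standalone model of why reused column names break matching by name, under the assumption that a table never hands out the same id twice: dropping `name` and adding a column called `name` again yields a different id, so the name alone no longer identifies which column is meant. `TableModel` and its methods are hypothetical and only track top-level columns.

```java
import java.util.ArrayList;
import java.util.List;

public class NameReuseSketch {
  /** Hypothetical stand-in for a top-level column. */
  record Field(int id, String name) {}

  /** Hypothetical flat table model: current columns plus the highest id ever assigned. */
  static class TableModel {
    final List<Field> fields = new ArrayList<>();
    int lastColumnId = 0;

    /** Adds a column under a fresh id and returns that id. */
    int addColumn(String name) {
      lastColumnId++;
      fields.add(new Field(lastColumnId, name));
      return lastColumnId;
    }

    /** Drops a column by name; its id stays retired and is not reused. */
    void dropColumn(String name) {
      fields.removeIf(f -> f.name().equals(name));
    }

    /** Looks up the id of the column currently carrying this name. */
    int idOf(String name) {
      for (Field f : fields) {
        if (f.name().equals(name)) {
          return f.id();
        }
      }
      throw new IllegalArgumentException("No column named " + name);
    }
  }

  public static void main(String[] args) {
    TableModel table = new TableModel();
    table.addColumn("event_time");
    table.addColumn("name");
    int originalId = table.idOf("name");

    // Drop the column, then add a new one under the same name.
    table.dropColumn("name");
    table.addColumn("name");
    int reusedId = table.idOf("name");

    System.out.println("original id: " + originalId + ", id after re-adding: " + reusedId);
  }
}
```

In this model the original `name` column had id 2 and the re-added one has id 3, so id 2 is also one of the "skipped" ids: a converter that numbers fields by position cannot reproduce the table's ids, and a by-name lookup silently resolves to the new column.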