Netflix / iceberg

Iceberg is a table format for large, slow-moving tabular data
Apache License 2.0

Add an API to maintain external schema mappings #72

Closed rdblue closed 5 years ago

rdblue commented 5 years ago

Once Iceberg supports external schema mappings, it should also support an easy way to maintain those mappings by notifying Iceberg when an external schema changes. Iceberg would update its mapping when notified.

For example, starting with this mapping:

[ {"field-id": 1, "names": ["id"]},
  {"field-id": 2, "names": ["data"]} ]

Suppose a new Avro schema is registered that renames id to obj_id and adds a ts field. Iceberg would add an unmapped entry for ts and add obj_id to the id mapping, based on the Avro schema's field alias that indicates id and obj_id are the same field. The updated mapping would be:

[ {"field-id": 1, "names": ["obj_id", "id"]},
  {"field-id": 2, "names": ["data"]},
  {"names": ["ts"]} ]
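
As a rough sketch of the update described above (not an existing Iceberg API), the notification step could look something like the following. The function name update_mapping_from_avro and the plain-dict representation of the mapping and of the Avro fields are assumptions made for illustration. The only Avro feature relied on is the "aliases" attribute that the Avro specification allows on record fields.

```python
import copy


def update_mapping_from_avro(mapping, avro_fields):
    """Hypothetical helper: return a new mapping updated for a newly
    registered Avro schema, leaving the input mapping untouched.

    mapping     -- list of entries like {"field-id": 1, "names": ["id"]}
    avro_fields -- list of Avro field dicts with "name" and optional "aliases"
    """
    updated = copy.deepcopy(mapping)
    for field in avro_fields:
        name = field["name"]
        aliases = field.get("aliases", [])

        # The name is already known to some entry: nothing to change.
        if any(name in entry["names"] for entry in updated):
            continue

        # A renamed field: find the entry that knows one of the aliases.
        match = next(
            (e for e in updated if any(a in e["names"] for a in aliases)),
            None,
        )
        if match is not None:
            # Put the newest name first, keeping older names as fallbacks.
            match["names"].insert(0, name)
        else:
            # A brand-new field: record it without a field ID for now.
            updated.append({"names": [name]})
    return updated


mapping = [
    {"field-id": 1, "names": ["id"]},
    {"field-id": 2, "names": ["data"]},
]
# New Avro schema: id renamed to obj_id (with an alias), ts added.
avro_fields = [
    {"name": "obj_id", "aliases": ["id"]},
    {"name": "data"},
    {"name": "ts"},
]
updated = update_mapping_from_avro(mapping, avro_fields)
```

With these inputs, updated holds the three entries shown above: obj_id is added in front of id, data is unchanged, and ts is recorded without a field ID.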

Next, if the Iceberg table schema is updated to add ts, the mapping would be updated by matching the new Iceberg column to the unmatched mapping entry to produce this mapping:

[ {"field-id": 1, "names": ["obj_id", "id"]},
  {"field-id": 2, "names": ["data"]},
  {"field-id": 3, "names": ["ts"]} ]
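
The matching step on the Iceberg side could be sketched in the same spirit. Again, complete_mapping_from_iceberg and the (field ID, name) tuples standing in for Iceberg columns are hypothetical and only illustrate the idea of filling in an unmapped entry by column name.

```python
import copy


def complete_mapping_from_iceberg(mapping, iceberg_columns):
    """Hypothetical helper: assign field IDs of new Iceberg columns to
    mapping entries that do not have a field ID yet, matching by name.

    mapping         -- list of entries; unmapped ones lack "field-id"
    iceberg_columns -- list of (field_id, name) tuples from the table schema
    """
    updated = copy.deepcopy(mapping)
    known_ids = {e["field-id"] for e in updated if "field-id" in e}

    for field_id, name in iceberg_columns:
        # Columns that already have a mapping entry are left alone.
        if field_id in known_ids:
            continue

        # Look for an entry without a field ID that carries this name.
        pending = next(
            (e for e in updated if "field-id" not in e and name in e["names"]),
            None,
        )
        if pending is not None:
            pending["field-id"] = field_id
        else:
            # Column added in Iceberg first: start a fresh entry for it.
            updated.append({"field-id": field_id, "names": [name]})
        known_ids.add(field_id)
    return updated


mapping_with_pending = [
    {"field-id": 1, "names": ["obj_id", "id"]},
    {"field-id": 2, "names": ["data"]},
    {"names": ["ts"]},
]
# Iceberg table schema after adding the ts column with ID 3.
iceberg_columns = [(1, "id"), (2, "data"), (3, "ts")]
completed = complete_mapping_from_iceberg(mapping_with_pending, iceberg_columns)
```

Here the ts entry picks up field ID 3 while the other two entries are left as they were. The else branch covers the opposite order, where a column appears in the Iceberg schema before any Avro schema mentions it.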

This would maintain compatibility with new Avro data files without making changes to the Iceberg table other than the mapping. A column can be added in Iceberg or in Avro first; its mapping entry is completed by column name once the column exists in both schemas.

rdblue commented 5 years ago

This is a follow-up to #71.

rdblue commented 5 years ago

This has been moved to the ASF project: https://github.com/apache/incubator-iceberg/issues/41