Netflix / iceberg

Iceberg is a table format for large, slow-moving tabular data
Apache License 2.0

Add external schema mappings for data without field IDs #71

Closed rdblue closed 5 years ago

rdblue commented 5 years ago

Files written by Iceberg writers contain Iceberg field IDs that are used for column projection. Iceberg doesn't currently support tracking data files that were written by other systems and added to Iceberg tables through the API, because those files have no field IDs. To support files written by non-Iceberg writers, Iceberg could support a table-level mapping from a source schema to Iceberg IDs.

For example, a table with 2 columns might have an Avro schema mapping like this one, encoded as JSON in table properties:

[ {"field-id": 1, "names": ["id"]},
  {"field-id": 2, "names": ["data"]} ]

When reading an Avro file, the read schema would be produced using the file's schema and the field IDs from the mapping. The names in each field mapping are a list so that aliases can be handled.
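To make the idea concrete, here is a minimal Python sketch of how such a mapping could be applied to a file's Avro schema. It is not Iceberg's API: the helper names `build_name_index` and `assign_field_ids`, the sample file schema, and the choice to drop unmapped fields are all assumptions made for illustration, and only flat record schemas are handled.

```python
import json

# Mapping JSON as written in the example above. Where it is stored
# (for instance in a table property) is part of the proposal, not shown here.
MAPPING_JSON = '[ {"field-id": 1, "names": ["id"]}, {"field-id": 2, "names": ["data"]} ]'


def build_name_index(mapping_json):
    """Return a dict from every name or alias to its field ID (hypothetical helper)."""
    index = {}
    for entry in json.loads(mapping_json):
        for name in entry["names"]:
            if name in index:
                raise ValueError(f"name {name!r} is mapped to more than one field ID")
            index[name] = entry["field-id"]
    return index


def assign_field_ids(file_schema, name_index):
    """Copy a flat Avro record schema, attaching a field ID to each mapped field.

    Assumption: fields without a mapping are dropped, because they
    could not be projected by ID.
    """
    fields = []
    for field in file_schema["fields"]:
        field_id = name_index.get(field["name"])
        if field_id is not None:
            fields.append({**field, "field-id": field_id})
    return {**file_schema, "fields": fields}


# Hypothetical schema of a file written by a non-Iceberg writer.
file_schema = {
    "type": "record",
    "name": "row",
    "fields": [
        {"name": "id", "type": "long"},
        {"name": "data", "type": "string"},
        {"name": "extra", "type": "int"},
    ],
}

read_schema = assign_field_ids(file_schema, build_name_index(MAPPING_JSON))
print(json.dumps(read_schema["fields"]))
```

Because each entry carries a list of names, an alias only needs to be appended to that list, for example `{"field-id": 1, "names": ["id", "identifier"]}`, and both names resolve to the same ID.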

govi20 commented 5 years ago

I would like to work on this issue.

YuvalItzchakov commented 5 years ago

@govi20 I have already started working on this issue. I'd love to pair up if you want :)

rdblue commented 5 years ago

I've moved this to https://github.com/apache/incubator-iceberg/issues/40