Netflix / iceberg

Iceberg is a table format for large, slow-moving tabular data
Apache License 2.0
476 stars 59 forks source link

(WIP) Implement external mapping of IDs #80

Closed YuvalItzchakov closed 5 years ago

YuvalItzchakov commented 5 years ago

Initial fix for https://github.com/Netflix/iceberg/issues/71

This implementation introduces AvroNamedSchemaVisitor<T> and the AvroToIcebergSchemaVisitor<iceberg.Schema> types in order to overcome the fact that AvroSchemaVisitor only passes the schema at the field level, neglecting to pass in the field names which are needed generate the schema.

There are a few "workarounds" in the implementation of the record method which I don't really like, still need to iterate on that.

There are still open issues which need to be resolved:

  1. Add Parquet visitor with the same logic of sequential IDs
  2. The field ids mappings start with 0, and they also map structs in a table with ids (is this desired?)
  3. Field id mappings assign ids to the top level fields first, and then iterate complex type
  4. There is still a gap regarding the table properties update, which seems essential in the open issue, but I still don't have the full mental picture to implement it
  5. More tests need to be created for all types of Avro schemas (currently only Record is tested, need to add tests for Map, Union, Array, etc..)
  6. Need to make sure the logic for assigning ids holds for all these types
rdblue commented 5 years ago

@YuvalItzchakov, I think the main goal of this PR is to take a mapping and an Avro schema without IDs and produce something that can be passed into buildAvroProjection. That method takes a file schema, an Iceberg read schema, and a set of renames.

The work here produces an Iceberg schema with IDs from an Avro schema, but that's not exactly what is needed to be passed into buildAvroProjection. So I think this should actually produce an Avro schema with the right IDs from a field name to ID mapping.

YuvalItzchakov commented 5 years ago

@rdblue Thank you for taking the time to review everything! Frankly, I was not sure this code was in the right direction at all, because as you've mentioned, SchemaToType does exactly the work which is done here (I learned that after I wrote the code).

I'll start working on the changes.

rdblue commented 5 years ago

@YuvalItzchakov, if you want to continue working on this, please re-open it in the apache/incubator-iceberg repository. That's the project's new home. Thanks!