Name mapping is used when the Parquet files in a table don't have field IDs encoded in them. For example, when files are added through `add_files` during a table migration from Hive, the Parquet files carry no field IDs. In this case we want to make use of name mapping: https://iceberg.apache.org/spec/#name-mapping-serialization The name mapping is a JSON blob that's stored alongside the table in a table property.

This issue is solely about the deserialization of the JSON blob into an in-memory structure. Tests can be found here: https://github.com/apache/iceberg-python/blob/main/tests/table/test_name_mapping.py
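To make the target of the deserialization concrete, here is a minimal Python sketch of parsing the JSON blob into a recursive in-memory structure. The JSON keys (`field-id`, `names`, `fields`) and the sample blob follow the name-mapping section of the spec; the `MappedField` class and the two `parse_*` helpers are hypothetical names chosen for illustration and are not the PyIceberg API.

```python
import json
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class MappedField:
    # Hypothetical in-memory node; the attributes mirror the JSON keys in the spec.
    names: List[str]
    field_id: Optional[int] = None  # "field-id" is optional in the spec
    fields: List["MappedField"] = field(default_factory=list)  # nested mappings


def parse_mapped_field(obj: dict) -> MappedField:
    """Convert one JSON object into a MappedField, recursing into "fields"."""
    return MappedField(
        names=list(obj.get("names", [])),
        field_id=obj.get("field-id"),
        fields=[parse_mapped_field(child) for child in obj.get("fields", [])],
    )


def parse_name_mapping(blob: str) -> List[MappedField]:
    """Parse the JSON blob from the table property into a list of top-level fields."""
    return [parse_mapped_field(obj) for obj in json.loads(blob)]


# Sample blob in the shape shown in the spec: a list of mapped fields,
# where a struct column carries its children under "fields".
EXAMPLE = """
[
  {"field-id": 1, "names": ["id", "record_id"]},
  {"field-id": 2, "names": ["data"]},
  {"field-id": 3, "names": ["location"], "fields": [
    {"field-id": 4, "names": ["latitude", "lat"]},
    {"field-id": 5, "names": ["longitude", "long"]}
  ]}
]
"""

mapping = parse_name_mapping(EXAMPLE)
```

Keeping the children on each node, rather than in a separate lookup table, means the parsed result has the same shape as the JSON and can be walked alongside a schema.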
Future tip: It is best to store this in a recursive field so it can be traversed using a `VisitorWithParent`, where both a `Schema` and a `NameMapping` can be traversed at once. This is important because we cannot flatten the name mapping: field names may themselves contain dots, which prevents us from splitting a flattened name into fields and subfields. This is done in PyIceberg here: https://github.com/apache/iceberg-python/pull/1014
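The problem with dots can be shown with a small, self-contained Python sketch. It is not the `VisitorWithParent` approach from the linked PR; it only illustrates, under assumed names (`flatten` and `lookup` are hypothetical helpers), why a dotted flat map loses information while a recursive walk over name segments does not.

```python
# Two different columns that collide once names are joined with ".":
# a top-level field literally named "a.b", and a field "b" nested under "a".
name_mapping = [
    {"field-id": 1, "names": ["a.b"]},
    {"field-id": 2, "names": ["a"], "fields": [
        {"field-id": 3, "names": ["b"]},
    ]},
]


def flatten(fields, prefix=""):
    """Naive flattening to dotted-name -> field-id; later entries overwrite earlier ones."""
    flat = {}
    for f in fields:
        for name in f["names"]:
            full = prefix + name
            flat[full] = f.get("field-id")
            flat.update(flatten(f.get("fields", []), full + "."))
    return flat


def lookup(fields, path):
    """Resolve a non-empty path given as separate name segments by walking the tree."""
    head, rest = path[0], path[1:]
    for f in fields:
        if head in f["names"]:
            if not rest:
                return f.get("field-id")
            return lookup(f.get("fields", []), rest)
    return None


flat = flatten(name_mapping)

# The flat map has only two keys for three fields: "a.b" was written twice,
# so field 1 is no longer reachable.
collided = len(flat) < 3

# The recursive walk keeps the two columns apart.
top_level_id = lookup(name_mapping, ["a.b"])
nested_id = lookup(name_mapping, ["a", "b"])
```

In the recursive form the path is a sequence of segments, so a dot inside a segment is just another character and never has to be interpreted as a separator.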