Netflix / iceberg

Iceberg is a table format for large, slow-moving tabular data
Apache License 2.0
478 stars 60 forks

Allow Specifying Partitioning Function for External Mappings #100

Open omervk opened 5 years ago

omervk commented 5 years ago

(this is dependent upon the completion of #71 and #72)

The partition function for external mappings is derived by parsing the paths of data files, à la Hive's format.

For instance, the structure:

/date=2018-11-12/file.avsc
/date=2018-11-13/file.avsc

would create a new column `date` with the string values 2018-11-12 and 2018-11-13, and Iceberg would assume the partitioning function is identity(date). There is no way to derive the partition from another field instead (e.g. as a function of the date part of a timestamp column).
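To make the current behavior concrete, here is a minimal Python sketch of Hive-style path parsing. It is an illustration only, not Iceberg's actual code; `parse_hive_partitions` is a hypothetical helper name.

```python
def parse_hive_partitions(path: str) -> dict[str, str]:
    """Collect key=value directory segments from a Hive-style file path.

    Hypothetical helper for illustration; not part of Iceberg.
    """
    partitions = {}
    # Every segment except the last one (the file name) is a directory.
    for segment in path.strip("/").split("/")[:-1]:
        key, sep, value = segment.partition("=")
        if sep and key:
            # The value stays a plain string, which is why the column
            # ends up looking like identity(date) over strings.
            partitions[key] = value
    return partitions


print(parse_hive_partitions("/date=2018-11-12/file.avsc"))
# {'date': '2018-11-12'}
```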

Iceberg should let users specify their own partitioning function, based on existing columns.

rdblue commented 5 years ago

I think what you're trying to accomplish would be done a little differently. I understand the term "partitioning function" to mean the partition transformations that are part of a partition spec.

A partition transform is not the right place to do this, because we don't need to add extra representations of a date to the manifest files. Instead, a process importing files from an external source should parse the strings and produce the right data value for the date (the day ordinal, where 1970-01-01 = 0). Then Iceberg would use the same partition code for these files as for any others.
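As a rough sketch of that conversion, an importing process could turn the path string into a day ordinal along these lines. This is Python for illustration only, and `date_string_to_day_ordinal` is a hypothetical name, not an Iceberg API.

```python
from datetime import date

# Day 0 of the ordinal scale described above.
EPOCH = date(1970, 1, 1)


def date_string_to_day_ordinal(value: str) -> int:
    """Convert an ISO date string (YYYY-MM-DD) to days since 1970-01-01.

    Hypothetical helper for illustration; raises ValueError on bad input.
    """
    return (date.fromisoformat(value) - EPOCH).days


print(date_string_to_day_ordinal("1970-01-01"))  # 0
print(date_string_to_day_ordinal("2018-11-12"))  # 17847
```

The importer would then record this integer as the partition value for each file, rather than keeping the original string.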