Netflix / iceberg

Iceberg is a table format for large, slow-moving tabular data
Apache License 2.0

Support Customizing The Location Of Data Files Written By The Spark Data Source #93

Closed: mccheah closed this issue 5 years ago

mccheah commented 5 years ago

Currently the Iceberg Data Source Writer requires files to be written to a location relative to the table's metadata files. However, this is an artificial requirement: the manifest records full URIs for data files, which are completely independent of the file system that holds the table's metadata. For example, one might want the table metadata stored in HDFS but the data files stored in S3.

We propose supporting a data source option, iceberg.spark.writer.dataLocation, to allow overriding the base directory URI of the data files to be written.
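A minimal sketch of the proposed resolution logic, assuming the option simply replaces the default base directory derived from the table location (the function name and the `data/` default are illustrative, not Iceberg's actual code):

```python
def data_file_location(table_location, filename, data_location_override=None):
    """Resolve where a new data file is written.

    By default, data files land under <table_location>/data. When the
    proposed iceberg.spark.writer.dataLocation option is set, its value
    replaces the base URI, so metadata can live in HDFS while data files
    go to S3.
    """
    base = data_location_override or f"{table_location.rstrip('/')}/data"
    return f"{base.rstrip('/')}/{filename}"


# Default: data files sit beside the table's metadata.
data_file_location("hdfs://nn/warehouse/db/t", "part-00000.parquet")
# → "hdfs://nn/warehouse/db/t/data/part-00000.parquet"

# With the override: metadata stays in HDFS, data is redirected to S3.
data_file_location("hdfs://nn/warehouse/db/t", "part-00000.parquet",
                   data_location_override="s3://bucket/db/t")
# → "s3://bucket/db/t/part-00000.parquet"
```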

rdblue commented 5 years ago

Iceberg doesn't have a requirement for file locations. The Spark writer currently creates paths based on the table location, but it also has a strategy for sharding across S3 prefixes.

I'm fine with adding ways to control data and metadata locations. Is your motivation to use Hadoop tables with S3 locations? If so, then you might want to use iceberg-hive instead.

Last, if we add a property like this, then it should apply to all writers, not just Spark. We are trying to standardize behavior across engines, so the setting should be configured on the table itself, not on the table only in certain contexts.
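The table-level alternative being suggested can be sketched as a table property that every engine consults when resolving the write location (the property name `write.data.location` below is hypothetical, chosen only to contrast with the engine-specific Spark option proposed above):

```python
DEFAULT_DATA_DIR = "data"


def resolve_data_location(table_properties, table_location):
    # A table-level property (name hypothetical) travels with the table,
    # so Spark, Hive, and any other engine writing to it all resolve the
    # same data location, instead of each engine carrying its own option.
    override = table_properties.get("write.data.location")
    if override is not None:
        return override.rstrip("/")
    return f"{table_location.rstrip('/')}/{DEFAULT_DATA_DIR}"
```

Configuring the table once, rather than each writer, keeps behavior consistent no matter which engine performs the write.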

mccheah commented 5 years ago

In https://github.com/Netflix/iceberg/issues/92#issuecomment-439499151 I describe the case where we're using Iceberg as a temporary representation in local disk and then converting the metadata written by Iceberg to metadata in our internal store. This use case is what's driving this as well.

rdblue commented 5 years ago

This has been submitted to the ASF project: https://github.com/apache/incubator-iceberg/pull/6