Closed: mccheah closed this issue 5 years ago.
Iceberg doesn't impose any requirement on file locations. The Spark writer currently creates paths based on the table location, but it also has a strategy for sharding data file paths across S3 prefixes.
I'm fine with adding ways to control data and metadata locations. Is your motivation to use Hadoop tables with S3 locations? If so, then you might want to use iceberg-hive instead.
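For context, a rough sketch of that suggestion, assuming iceberg-hive is on the classpath and a Hive metastore is reachable through the Hadoop configuration. The database/table names, the S3 path, and the HiveCatalog constructor used here are illustrative, not a prescribed setup:

```scala
import java.util.Collections

import org.apache.hadoop.conf.Configuration
import org.apache.iceberg.PartitionSpec
import org.apache.iceberg.Schema
import org.apache.iceberg.catalog.TableIdentifier
import org.apache.iceberg.hive.HiveCatalog
import org.apache.iceberg.types.Types

// Create a metastore-tracked Iceberg table whose location points at S3,
// so both metadata and data live under the S3 prefix while the metastore
// only tracks the table entry.
val catalog = new HiveCatalog(new Configuration())

val schema = new Schema(
  Types.NestedField.required(1, "id", Types.LongType.get()),
  Types.NestedField.optional(2, "data", Types.StringType.get()))

catalog.createTable(
  TableIdentifier.of("db", "table"),
  schema,
  PartitionSpec.unpartitioned(),
  "s3://my-bucket/db/table",              // explicit table location
  Collections.emptyMap[String, String]())
```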
Last, if we add a property like this, it should apply to all writers, not just Spark. We are trying to standardize behavior across engines, so the configuration should live on the table itself, not on the table as used in certain contexts.
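As an illustration of table-level configuration rather than a per-writer option, something along these lines could work through the existing table API. The property key here is a placeholder for this sketch; no such property existed at the time of this discussion:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.iceberg.Table
import org.apache.iceberg.hadoop.HadoopTables

// Load an existing Hadoop table and record the data location on the table
// itself, so any engine writing to it can honor the same setting.
val tables = new HadoopTables(new Configuration())
val table: Table = tables.load("hdfs://namenode:8020/warehouse/db/table")

// "write.data.path" is a hypothetical key for this sketch, not an agreed-upon name.
table.updateProperties()
  .set("write.data.path", "s3://my-bucket/db/table/data")
  .commit()
```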
In https://github.com/Netflix/iceberg/issues/92#issuecomment-439499151 I describe the case where we use Iceberg as a temporary representation on local disk and then convert the metadata written by Iceberg into metadata in our internal store. That use case is what's driving this request as well.
This has been submitted to the ASF project: https://github.com/apache/incubator-iceberg/pull/6
Currently, the Iceberg data source writer requires data files to be written to a location relative to the table's metadata location. However, this is an artificial requirement: manifests record full data file URIs, which are entirely independent of the file system holding the table's metadata. For example, one might want the table metadata stored in HDFS but the data files stored in S3.
We propose supporting a data source option, `iceberg.spark.writer.dataLocation`, to allow overriding the base directory URI of the data files that are to be written.
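A minimal sketch of how the proposed option might be used from Spark, assuming the option name above is adopted (it is a proposal in this issue, not an existing Iceberg option; the paths and table names are examples):

```scala
import org.apache.spark.sql.SparkSession

object DataLocationExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("iceberg-data-location-example")
      .getOrCreate()

    // Rows to append; the schema must match the target table's schema.
    val df = spark.range(100).toDF("id")

    // Table metadata stays under the HDFS table location, while new data files
    // would be written under the S3 prefix given by the proposed option.
    df.write
      .format("iceberg")
      .option("iceberg.spark.writer.dataLocation", "s3://my-bucket/db/table/data")
      .mode("append")
      .save("hdfs://namenode:8020/warehouse/db/table")

    spark.stop()
  }
}
```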