Netflix / iceberg

Iceberg is a table format for large, slow-moving tabular data
Apache License 2.0

Support Custom Hadoop Properties In The Data Source #91

Closed mccheah closed 5 years ago

mccheah commented 5 years ago

The Iceberg data source currently uses only the Spark session's global Hadoop configuration when constructing FileSystem objects in HadoopTableOperations, Reader, and Writer. We propose supporting additional reader- and writer-specific options for the Hadoop configuration. The data source can pick out options with the prefix iceberg.spark.hadoop.* and apply them to the Hadoop configuration that is passed to every use of the Hadoop FileSystem API in the Spark data source.
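A minimal sketch of the proposed prefix handling, not the actual patch. It assumes the prefix is stripped before the key is applied (by analogy with spark.hadoop.*), and it uses plain dicts as stand-ins for the Hadoop Configuration and the data source options so that it runs without Spark or Hadoop. The function names and the sample values are hypothetical.

```python
# Prefix proposed in this issue; everything after it is treated as a Hadoop key.
HADOOP_PREFIX = "iceberg.spark.hadoop."


def extract_hadoop_overrides(options, prefix=HADOOP_PREFIX):
    """Return the options that start with `prefix`, with the prefix stripped."""
    return {
        key[len(prefix):]: value
        for key, value in options.items()
        if key.startswith(prefix) and len(key) > len(prefix)
    }


def with_overrides(base_conf, options):
    """Copy `base_conf` and apply the prefixed overrides on top of the copy.

    The session-wide configuration is left untouched, so overrides given for
    one read or write do not leak into other jobs in the same session.
    """
    conf = dict(base_conf)
    conf.update(extract_hadoop_overrides(options))
    return conf


# Stand-in for the Spark session's global Hadoop configuration (sample values).
session_conf = {
    "fs.s3a.connection.maximum": "15",
    "fs.defaultFS": "hdfs://namenode:8020",
}

# Stand-in for the options passed to a single read or write.
write_options = {
    "iceberg.spark.hadoop.fs.s3a.connection.maximum": "100",
    "some-other-option": "value",  # hypothetical non-Hadoop option, ignored here
}

effective = with_overrides(session_conf, write_options)
```

Copying the configuration before applying overrides is a design choice of this sketch: it keeps per-read and per-write settings scoped to the operation that supplied them.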

rdblue commented 5 years ago

So this would be similar to the behavior of spark.hadoop properties?

I think that would make sense. We want to configure writers primarily through Iceberg table configuration, and it would be good to have a way to do that other than just a small list of known and supported properties.

mccheah commented 5 years ago

Originally I was thinking we could pass it as sparkSession.write.option(iceberg.spark.hadoop.<opt>), but now that you mention it, it could make more sense for this to live in the table's own properties. I can put up a patch for this.

rdblue commented 5 years ago

We want configuration to come from multiple places. A write may want to use a different format, so that would be passed like your example. But most things we want to apply at a table level and not at a job level. Parquet tuning parameters are a good example of that -- we want to set them once and know that all writers know what to do.
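One way to picture this layering is the sketch below. It assumes a precedence of job-level write option first, then table property, then a built-in default. That ordering, the function name, and the keys used are assumptions for illustration rather than settled Iceberg behavior, and plain dicts stand in for the table properties and write options.

```python
def resolve_setting(key, write_options, table_properties, default=None):
    """Resolve one setting: job-level option, then table property, then default."""
    if key in write_options:
        return write_options[key]
    if key in table_properties:
        return table_properties[key]
    return default


# Table-level configuration: set once, seen by every writer (hypothetical keys).
table_properties = {
    "format": "parquet",
    "parquet.row-group-size-bytes": "134217728",
}

# A single job that wants a different file format for this write only.
job_options = {"format": "avro"}

job_format = resolve_setting("format", job_options, table_properties)
other_job_format = resolve_setting("format", {}, table_properties)
row_group_size = resolve_setting(
    "parquet.row-group-size-bytes", job_options, table_properties
)
compression = resolve_setting(
    "compression", job_options, table_properties, default="gzip"
)
```

Here only the job that passes its own option sees the different format. Every other writer falls back to the table-level values, which matches the goal of setting tuning parameters once per table.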

A PR would be great. Thanks!