
Using the Iceberg catalog in your file system #10326

Open 911432 opened 4 months ago

911432 commented 4 months ago

Feature Request / Improvement

Just as we can store the Iceberg catalog in HDFS today, we would also like to store it in other file systems such as S3. The query engine could then be packaged quickly as a container image, with the catalog and tables kept on the file system.

Query engine

None

nastra commented 4 months ago

@911432 can you please elaborate what the goal here is? Everything you described is already possible today.
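For example, a catalog whose warehouse lives on S3 can already be configured. A minimal sketch (the catalog and bucket names below are placeholders, and this assumes the iceberg-aws module and the AWS SDK are on the classpath):

# placeholder names; requires the iceberg-aws module on the classpath
spark.sql.catalog.my_catalog = org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.my_catalog.catalog-impl = org.apache.iceberg.aws.glue.GlueCatalog
spark.sql.catalog.my_catalog.warehouse = s3://my-bucket/warehouse/path
spark.sql.catalog.my_catalog.io-impl = org.apache.iceberg.aws.s3.S3FileIO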

911432 commented 4 months ago

I would like to package the query engine as a container image and keep the Iceberg tables and the Iceberg catalog on a file system. Let's take this Spark page as an example. The configuration below already works:

spark.sql.catalog.hadoop_prod = org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.hadoop_prod.type = hadoop
spark.sql.catalog.hadoop_prod.warehouse = hdfs://nn:8020/warehouse/path

I wish the configuration below worked as well:

spark.sql.catalog.s3 = org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.s3.type = s3
spark.sql.catalog.s3.warehouse = s3://nn:8020/warehouse/path
spark.sql.catalog.file = org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.file.type = file
spark.sql.catalog.file.warehouse = file://warehouse/path

I think this would make the spark-quickstart page simpler, and it would separate compute from storage more clearly.

nastra commented 4 months ago

spark.sql.catalog.<catalogName>.type refers to the catalog implementation type and is not related to the name of the catalog. It is basically just a shortcut for specifying the fully-qualified implementation class via spark.sql.catalog.<catalogName>.catalog-impl=org.apache.iceberg.hadoop.HadoopCatalog.

Available catalog types are:

hadoop -> org.apache.iceberg.hadoop.HadoopCatalog
hive -> org.apache.iceberg.hive.HiveCatalog
rest -> org.apache.iceberg.rest.RESTCatalog
glue -> org.apache.iceberg.aws.glue.GlueCatalog
nessie -> org.apache.iceberg.nessie.NessieCatalog
jdbc -> org.apache.iceberg.jdbc.JdbcCatalog
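So, for example, the two configurations below define the same catalog; the first uses the type shortcut and the second names the implementation class directly:

# shortcut form
spark.sql.catalog.hadoop_prod = org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.hadoop_prod.type = hadoop

# equivalent explicit form
spark.sql.catalog.hadoop_prod = org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.hadoop_prod.catalog-impl = org.apache.iceberg.hadoop.HadoopCatalog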

911432 commented 4 months ago

As far as I know, spark.sql.catalog.hadoop_prod.uri does not exist for the hadoop catalog. Similarly, for s3 and file, I hope spark.sql.catalog.<catalogName>.warehouse would be sufficient without a spark.sql.catalog.<catalogName>.uri.

BsoBird commented 4 months ago

Hi, I've done some work on fixing hadoop_catalog before.

In my experience, a filesystem-based catalog currently relies on the file system providing an atomic rename operation. Object stores often do not offer atomic operations, so to use a filesystem catalog with object storage you must add middleware that makes file system operations atomic.
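To make the caveat concrete: nothing stops you from pointing the hadoop catalog at an object store today (the bucket name below is a placeholder, and this assumes the Hadoop S3A connector is on the classpath). But the commit then depends on renaming the new metadata file into place, and S3A implements rename as a non-atomic copy-and-delete, so concurrent writers can silently lose commits:

# not recommended: rename on s3a is copy + delete, not atomic
spark.sql.catalog.unsafe = org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.unsafe.type = hadoop
spark.sql.catalog.unsafe.warehouse = s3a://my-bucket/warehouse/path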

In addition, this type of middleware often exposes multiple access protocols, such as HDFS, S3, and POSIX. When you access the object store through such a middleware proxy, the existing hadoop catalog already seems sufficient.
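In that setup the Spark configuration is unchanged from the plain HDFS case; only the warehouse URI points at the middleware endpoint (the hostname below is hypothetical):

# hypothetical HDFS-compatible middleware endpoint fronting the object store
spark.sql.catalog.proxied = org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.proxied.type = hadoop
spark.sql.catalog.proxied.warehouse = hdfs://middleware-proxy:8020/warehouse/path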

Of course, this is just the status quo. I think a lot of work would be needed to implement basic catalog management on an object store that lacks atomic operations. We can discuss this further if you are interested.

But please keep in mind that this is not recommended in the current version.

@911432 @nastra

@911432 Also, I see that you have submitted some PRs for Apache Paimon, and I'm sure you'd like Paimon to have similar functionality. Unfortunately, Paimon still has consistency issues with its filesystem catalog on S3, for the same reason: the object store does not provide atomic operations. If you are interested, you can try it.