
Apache Iceberg
https://iceberg.apache.org/
Apache License 2.0

REST Catalog S3 Signer Endpoint should be Catalog specific #11608

Open c-thiel opened 1 day ago

c-thiel commented 1 day ago

Apache Iceberg version

1.7.0 (latest release)

Query engine

Spark

Please describe the bug 🐞

Currently, when configuring two REST catalogs in Spark, the s3.signer.uri of the first catalog is also used for the second catalog.

During the initial connection to the REST catalog, the server may return an s3.signer.uri attribute as part of the overrides in the /v1/config response. This property seems to be applied globally to the Spark session: whichever catalog I use first, signing requests for the second catalog are sent to the signing endpoint of the first. Using each catalog separately works perfectly fine.
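For reference, this is the shape of a /v1/config response as defined by the Iceberg REST catalog spec, with a signer endpoint in the overrides (the URL here is illustrative, not from my setup):

```json
{
  "defaults": {},
  "overrides": {
    "s3.signer.uri": "https://signer.catalog1.example.com"
  }
}
```

Per the spec, overrides are meant to take precedence over client configuration for that catalog only, which is why leaking them across catalogs is surprising.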

I tested with a single Lakekeeper, which uses a different signing endpoint for each warehouse, as well as with two Nessie instances. In my tests the warehouses share the same bucket but use different path prefixes.

My spark configuration looks like this:

    "spark.sql.catalog.catalog1": "org.apache.iceberg.spark.SparkCatalog",
    "spark.sql.catalog.catalog1.type": "rest",
    "spark.sql.catalog.catalog1.uri": CATALOG_1_URL,
    "spark.sql.catalog.catalog1.warehouse": "warehouse_1",
    "spark.sql.catalog.catalog1.io-impl": "org.apache.iceberg.aws.s3.S3FileIO",
    "spark.sql.catalog.catalog1.s3.remote-signing-enabled": "true",
    "spark.sql.catalog.catalog2": "org.apache.iceberg.spark.SparkCatalog",
    "spark.sql.catalog.catalog2.type": "rest",
    "spark.sql.catalog.catalog2.uri": CATALOG_2_URL,
    "spark.sql.catalog.catalog2.warehouse": "warehouse_2",
    "spark.sql.catalog.catalog2.io-impl": "org.apache.iceberg.aws.s3.S3FileIO",
    "spark.sql.catalog.catalog1.s3.remote-signing-enabled": "true",

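The suspected mechanism can be sketched abstractly (hypothetical Python, not actual Iceberg code): if the signer URI returned by a catalog's /v1/config handshake lands in session-wide shared state instead of per-catalog state, the first catalog to connect wins and the second resolves the wrong endpoint:

```python
# Hypothetical sketch of the suspected bug: config overrides cached in
# shared session state rather than per catalog. Names are illustrative.

session_state = {}  # shared across all catalogs (the suspected culprit)

def connect(catalog_name, config_overrides):
    """Simulate the /v1/config handshake for one catalog."""
    # Buggy behavior: overrides land in the shared map, first writer wins,
    # so later catalogs cannot set their own value for the same key.
    for key, value in config_overrides.items():
        session_state.setdefault(key, value)
    return dict(session_state)

props1 = connect("catalog1", {"s3.signer.uri": "https://signer-1.example.com"})
props2 = connect("catalog2", {"s3.signer.uri": "https://signer-2.example.com"})

# catalog2 ends up signing against catalog1's endpoint:
print(props2["s3.signer.uri"])  # https://signer-1.example.com
```

With correctly isolated per-catalog property maps, props2 would instead contain catalog2's own signer URI.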
If required, I can add a docker compose example as well. If someone could point me in the right direction, I might be able to create a fix PR.

Willingness to contribute

c-thiel commented 16 hours ago

This is not only a problem with Spark; it also affects at least StarRocks. A user on our Discord reports the same behavior I describe for Spark above:

I can confirm that both catalogs (lake and lake2) work perfectly fine when set up and used individually in StarRocks. I can create tables, insert data, and query without any issues when only one catalog is active at a time.

However, the problem arises when both catalogs are configured simultaneously. At that point, operations on the second catalog (like INSERT) fail.