apache / iceberg

Apache Iceberg
https://iceberg.apache.org/
Apache License 2.0
6.43k stars 2.22k forks source link

REST Catalog does not validate "to" identifier on rename table #11154

Open haizhou-zhao opened 1 month ago

haizhou-zhao commented 1 month ago

Query engine

Spark

Question

Background

Spark will pass catalog name to renameTable operations as part of its to identifier, and if that catalog name is not handled (i.e. stripped), then it will be treated as part of the namespace name. This will cause spark to rename the table into undesirable namespaces (or even just encounter namespace not exist issue and fail).

Traditional catalog like HiveCatalog has special pre-validation in renameTable operation for this purpose (ref: https://github.com/apache/iceberg/pull/1156), while RESTCatalog and JdbcCatalog (ref REST server implementation is adapter on JdbcCatalog) does not have it. This is causing spark command like ALTER TABLE ${tbl} RENAME TO ${tbl_rename} to fail on RESTCatalog/JdbcCatalog such as in this integration test: https://github.com/apache/iceberg/blob/316f0a1/spark/v3.3/spark/src/test/java/org/apache/iceberg/spark/sql/TestAlterTable.java#L296

Part of https://github.com/apache/iceberg/issues/11079

Questions

  1. Should RESTCatalog (client) validate to identifier in renameTable operation just like what HiveCatalog is doing?
  2. If RESTCatalog, like hive, always strip the first level of namespace for to identifier, then what if the intention is to rename to a legitimate multi-level namespace?
nastra commented 1 month ago

I analyzed this and RenameTableExec indeed adds the namespace to the to identifier: https://github.com/apache/spark/blob/branch-3.5/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/RenameTableExec.scala#L46-L51.

Stripping the catalog name was added to Hive with #1156, but what I think we need to investigate here is why this issue only happens in testing and not when running through the Spark quickstart example, which also uses the REST catalog.

haizhou-zhao commented 1 month ago

Most likely it's related to whether you provide catalog name to rename statement or not.

Here's what I did:

  1. Spin up the docker compose, so that an REST catalog is up on localhost:8181
  2. Start spark-shell with the following config:
    spark.sql.catalog.iceberg=org.apache.iceberg.spark.SparkCatalog
    spark.sql.catalog.iceberg.catalog-impl=org.apache.iceberg.rest.RESTCatalog
    spark.sql.catalog.iceberg.uri=http://localhost:8181/
    spark.sql.catalog.iceberg.io-impl=org.apache.iceberg.io.ResolvingFileIO
    spark.sql.catalog.iceberg.warehouse=file://tmp/warehouse/
    spark.sql.defaultCatalog=iceberg
  3. Run the following two sets of commands, even though they are expected to yield the same result, case 1 will fail, yet case 2 will succeed. The difference is whether the catalog name is explicitly mentioned or not. Case 1:
    spark.sql("CREATE NAMESPACE iceberg.ns1")
    spark.sql("CREATE NAMESPACE iceberg.ns2")
    spark.sql("CREATE TABLE iceberg.ns1.tbl1 (id INT, data STRING)")
    spark.sql("ALTER TABLE iceberg.ns1.tbl1 RENAME TO iceberg.ns2.tbl2")

    Case 2:

    spark.sql("CREATE NAMESPACE ns1")
    spark.sql("CREATE NAMESPACE ns2")
    spark.sql("CREATE TABLE ns1.tbl1 (id INT, data STRING)")
    spark.sql("ALTER TABLE ns1.tbl1 RENAME TO ns2.tbl2")

Explanation: If catalog name is explicitly mentioned, then it will be submitted as part of the to identifier to construct RenameTableExec. As RESTCatalog doesn't have capability to remove catalog name from the identifier, the rename operation will fail.