Open haizhou-zhao opened 1 month ago
I analyzed this and RenameTableExec
indeed adds the namespace to the to
identifier: https://github.com/apache/spark/blob/branch-3.5/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/RenameTableExec.scala#L46-L51.
Stripping the catalog name was added to Hive with #1156, but what I think we need to investigate here is why this issue only happens in testing and not when running through the Spark quickstart example, which also uses the REST catalog.
Most likely it's related to whether you provide catalog name to rename statement or not.
Here's what I did:
spark.sql.catalog.iceberg=org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.iceberg.catalog-impl=org.apache.iceberg.rest.RESTCatalog
spark.sql.catalog.iceberg.uri=http://localhost:8181/
spark.sql.catalog.iceberg.io-impl=org.apache.iceberg.io.ResolvingFileIO
spark.sql.catalog.iceberg.warehouse=file://tmp/warehouse/
spark.sql.defaultCatalog=iceberg
spark.sql("CREATE NAMESPACE iceberg.ns1")
spark.sql("CREATE NAMESPACE iceberg.ns2")
spark.sql("CREATE TABLE iceberg.ns1.tbl1 (id INT, data STRING)")
spark.sql("ALTER TABLE iceberg.ns1.tbl1 RENAME TO iceberg.ns2.tbl2")
Case 2:
spark.sql("CREATE NAMESPACE ns1")
spark.sql("CREATE NAMESPACE ns2")
spark.sql("CREATE TABLE ns1.tbl1 (id INT, data STRING)")
spark.sql("ALTER TABLE ns1.tbl1 RENAME TO ns2.tbl2")
Explanation:
If catalog name is explicitly mentioned, then it will be submitted as part of the to
identifier to construct RenameTableExec. As RESTCatalog doesn't have capability to remove catalog name from the identifier, the rename operation will fail.
Query engine
Spark
Question
Background
Spark will pass
catalog
name torenameTable
operations as part of itsto
identifier, and if thatcatalog
name is not handled (i.e. stripped), then it will be treated as part of thenamespace
name. This will cause spark to rename the table into undesirable namespaces (or even just encounter namespace not exist issue and fail).Traditional catalog like HiveCatalog has special pre-validation in renameTable operation for this purpose (ref: https://github.com/apache/iceberg/pull/1156), while RESTCatalog and JdbcCatalog (ref REST server implementation is adapter on JdbcCatalog) does not have it. This is causing spark command like
ALTER TABLE ${tbl} RENAME TO ${tbl_rename}
to fail on RESTCatalog/JdbcCatalog such as in this integration test: https://github.com/apache/iceberg/blob/316f0a1/spark/v3.3/spark/src/test/java/org/apache/iceberg/spark/sql/TestAlterTable.java#L296Part of https://github.com/apache/iceberg/issues/11079
Questions
to
identifier inrenameTable
operation just like what HiveCatalog is doing?to
identifier, then what if the intention is to rename to a legitimate multi-level namespace?