Open rosspalmer opened 11 months ago
+1 this makes it hard to create tests for code that uses multiple catalogs in production
+1
+1 Any workaround to be able to test 3-layer-namespace?
Hi! We're encountering the same problem. Have you made any progress or found a workaround to resolve this issue?
Any update, or does anyone have a workaround for this?
+1
Just confirming I am still being blocked by this. We have a "workaround" where we squish the catalog and database names together when running locally, but it's not pretty...
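Purely as an illustration of that kind of workaround, a hypothetical name-squishing helper (the function name, separator, and example table name are made up here, not from anyone's actual codebase) that folds the catalog segment into the database name so only spark_catalog is used locally:

def squish_table_name(full_name: str, local: bool = True) -> str:
    # Production: keep the real three-part name, e.g. catalog_b.here.my_table.
    # Local tests: fold the catalog into the schema, e.g. catalog_b__here.my_table,
    # so everything lives in the single spark_catalog session catalog.
    parts = full_name.split(".")
    if local and len(parts) == 3:
        catalog, schema, table = parts
        return f"{catalog}__{schema}.{table}"
    return full_name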
I am also facing this issue
I'd also like to see this addressed for unit tests.
I've confirmed that:
(1) using a Delta catalog and an Iceberg catalog works ✅
(2) using two Iceberg catalogs works ✅
(3) using two Delta catalogs fails ❌
WIP: investigating why.
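For anyone trying to reproduce that matrix, a minimal sketch of the catalog registrations involved; the catalog names, warehouse paths, and the omission of the required jars and SQL extensions are all illustrative simplifications:

from pyspark.sql import SparkSession

builder = (
    SparkSession.builder
    .master("local[*]")
    # Two Iceberg catalogs: this combination works.
    .config("spark.sql.catalog.ice_a", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.ice_a.type", "hadoop")
    .config("spark.sql.catalog.ice_a.warehouse", "/tmp/ice_a")
    .config("spark.sql.catalog.ice_b", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.ice_b.type", "hadoop")
    .config("spark.sql.catalog.ice_b.warehouse", "/tmp/ice_b")
    # Two Delta catalogs: the session catalog works on its own, but referencing
    # the second DeltaCatalog is the case that fails.
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .config("spark.sql.catalog.delta_b", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
)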
+1, I'm facing the same problem while using 2 Delta catalogs.
+1
+1
+1
+1
Is there any feedback on this? It seems like Iceberg is the only way to do this?
+1
+1. In PySpark I have the same issue in the test below; at the end it throws a parsing exception like:
[PARSE_SYNTAX_ERROR] Syntax error at or near '.'. (line 1, pos 20)
== SQL ==
spark_catalog.source.source_table_join
--------------------^^^
The same does not work for DeltaTable.createOrReplace if I use the fully qualified name catalog.schema.table:
import os
import shutil
import sys

import pytest
from delta import DeltaTable, configure_spark_with_delta_pip
from pyspark.sql import SparkSession
from pyspark.sql.types import StringType, StructField, StructType


@pytest.fixture(scope="session")
def spark_session():
    # Local Delta-enabled session; clean the warehouse before and after the run.
    shutil.rmtree("spark-warehouse", ignore_errors=True)
    os.environ["PYSPARK_PYTHON"] = sys.executable
    os.environ["PYSPARK_DRIVER_PYTHON"] = sys.executable
    builder = (
        SparkSession.builder
        .master("local[*]")
        .config("spark.jars.packages", "io.delta:delta-core_2.12:2.3.0")
        .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
        .config("spark.databricks.delta.schema.autoMerge.enabled", "true")
        .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
        .appName("test")
    )
    yield configure_spark_with_delta_pip(builder).getOrCreate()
    shutil.rmtree("spark-warehouse", ignore_errors=True)


@pytest.mark.usefixtures("spark_session")
def test_join_operation_with_catalog(spark_session: SparkSession):
    source_schema = StructType([
        StructField("id", StringType(), True),
        StructField("derived_column", StringType(), True),
        StructField("filter_column", StringType(), True),
    ])
    spark_session.sql("CREATE SCHEMA source")
    spark_session.sql("DROP TABLE IF EXISTS spark_catalog.source.source_table_join")
    spark_session.catalog.setCurrentCatalog("spark_catalog")
    spark_session.catalog.setCurrentDatabase("source")
    # Works with the unqualified table name; fails with PARSE_SYNTAX_ERROR when the
    # fully qualified spark_catalog.source.source_table_join name is used instead.
    DeltaTable.createOrReplace(spark_session).tableName("source_table_join").addColumns(
        source_schema
    ).execute()
    try:
        print(DeltaTable.forName(spark_session, "source.source_table_join").toDF().collect())
        print("SUCCESS")
    except Exception as err:
        print("FAILURE")
        print(err)
@scottsand-db I've tested this with Spark 4.0-preview2 and the Delta Lake 4.0 preview; same issue. This should be fixed before shipping Delta 4.0, at the least.
Bug
Which Delta project/connector is this regarding?
Describe the problem
As of Spark 3.4.0, native support for 3-layer namespaces for tables was added to the SQL API, allowing multiple catalogs to be accessed by using a full table name following the <catalog>.<schema>.<table> convention. Multiple catalogs can be set using the spark.sql.catalog.<catalog_name>=... spark config. This works when using the Apache Iceberg example below, but does not work when utilizing multiple Delta catalogs. While the SparkSession is initiated with the catalog present in the session, when a second, non-spark_catalog catalog is referenced, the following exception is thrown:
[INTERNAL_ERROR] The Spark SQL phase analysis failed with an internal error. You hit a bug in Spark or the Spark plugins you use. Please, report this bug to the corresponding communities or vendors, and provide the full stack trace.
Here is a recent StackOverflow post experiencing the same issue with PySpark: https://stackoverflow.com/questions/77751057/multiple-catalogs-in-spark
Steps to reproduce
I am running this on my local machine, in client mode, using my local filesystem to host data.
Here is my SparkSession generator:
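A minimal sketch of such a generator, assuming a local Delta setup where catalog_b (the name used in the observed results below) is registered as a second Delta catalog; the function name, warehouse path, and app name are illustrative rather than the exact code from this report:

import os
import sys

from delta import configure_spark_with_delta_pip
from pyspark.sql import SparkSession


def build_spark_session() -> SparkSession:
    os.environ["PYSPARK_PYTHON"] = sys.executable
    os.environ["PYSPARK_DRIVER_PYTHON"] = sys.executable
    builder = (
        SparkSession.builder
        .master("local[*]")
        .appName("local-delta-lake")
        .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
        .config("spark.sql.warehouse.dir", "/tmp/local-warehouse")
        # Session catalog backed by Delta: works on its own.
        .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
        # Second Delta catalog: referencing it triggers the INTERNAL_ERROR above.
        .config("spark.sql.catalog.catalog_b", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    )
    return configure_spark_with_delta_pip(builder).getOrCreate()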
Here is a batch of testing code:
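And a minimal sketch of the testing code described in the observed results, using the generator sketched above: switch to the second catalog, create a schema there, and try to write to it (the table name and sample row are illustrative):

spark = build_spark_session()

spark.catalog.setCurrentCatalog("catalog_b")   # this call works
spark.sql("create schema here")                # this throws the INTERNAL_ERROR during analysis

# Intended follow-up once the schema exists:
spark.createDataFrame([(1, "a")], ["id", "value"]) \
    .write.format("delta") \
    .saveAsTable("catalog_b.here.example_table")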
Observed results
When running with the example above, the spark.catalog.setCurrentCatalog("catalog_b") command works, but the following spark.sql("create schema here") command then throws the [INTERNAL_ERROR] exception quoted above.
Expected results
I would expect this to create a schema here in the catalog catalog_b and allow me to save data to it.
Further details
This is an effort to create a local "delta lake" for testing that is compatible with Databricks' three-layer namespace used by their Unity Catalog.
Environment information
Willingness to contribute
The Delta Lake Community encourages bug fix contributions. Would you or another member of your organization be willing to contribute a fix for this bug to the Delta Lake code base?