[BUG][SPARK] listTables() fails after createOrReplaceTempView('abc') called with PARSE_SYNTAX_ERROR #2610

Open richardcerny opened 7 months ago

richardcerny commented 7 months ago


Describe the problem

After upgrading Spark from 3.3.2 to 3.4.1, spark.catalog.listTables() always fails once createOrReplaceTempView has been called. See the code snippet below.

Steps to reproduce

from pyspark.sql import Row, SparkSession

spark = (SparkSession.builder
        .appName("Python Spark SQL basic example")
        .config("spark.jars.packages", "")
        .config("spark.sql.extensions", "")
        .config("spark.sql.catalog.spark_catalog", "")
        .getOrCreate())

SILVER_DB = 'silver_test'
view_name_fact = 'abc'

print(f"DBs before: {spark.catalog.listDatabases()}") # ok
print(f"Tables_before: {spark.catalog.listTables()}") # OK
print(f"Catalogs before: {spark.catalog.listCatalogs()}")  # ok
print(f'Current catalog before: {spark.catalog.currentCatalog()}')  # ok
print(f"Tables after silver: {spark.catalog.listTables(SILVER_DB)}") # ok

df_fact_fixture1 = spark.createDataFrame([Row('1', 'A', 'A', 100.0)])  # OK
df_fact_fixture1.createOrReplaceTempView(view_name_fact) # OK, but once createOrReplaceTempView is called, every later spark.catalog.listTables() call fails

spark.sql(f"select * from {view_name_fact}").show() # OK
df = spark.sql(f"select * from {view_name_fact}") # OK
assert 1 == df.count() # OK
print(f"DBs after: {spark.catalog.listDatabases()}")  # OK
print(f"Catalogs after: {spark.catalog.listCatalogs()}")   # OK
print(f'Current catalog after: {spark.catalog.currentCatalog()}')  # OK
print(f"Tables after: {spark.catalog.listTables()}") # ERROR
print(f"Tables after silver: {spark.catalog.listTables(SILVER_DB)}")  # ERROR

Observed results

DBs before: [Database(name='default', catalog='spark_catalog', description='default database', locationUri='file:/workspaces/repository-pipeline/spark-warehouse'), Database(name='silver_test', catalog='spark_catalog', description='', locationUri='file:/workspaces/repository-pipeline/spark-warehouse/silver_test.db')]
Tables_before: []
Catalogs before: [CatalogMetadata(name='spark_catalog', description=None)]
Current catalog before: spark_catalog
Tables after silver: []
| _1| _2| _3|   _4|
|  1|  A|  A|100.0|

DBs after: [Database(name='default', catalog='spark_catalog', description='default database', locationUri='file:/workspaces/repository-pipeline/spark-warehouse'), Database(name='silver_test', catalog='spark_catalog', description='', locationUri='file:/workspaces/repository-pipeline/spark-warehouse/silver_test.db')]
Catalogs after: [CatalogMetadata(name='spark_catalog', description=None)]
Current catalog after: spark_catalog

>   print(f"Tables after: {spark.catalog.listTables()}")


/usr/local/lib/python3.10/site-packages/pyspark/sql/ in listTables
    iter = self._jcatalog.listTables(dbName).toLocalIterator()
/usr/local/lib/python3.10/site-packages/py4j/ in __call__
    return_value = get_return_value(
a = ('xro77', <py4j.clientserver.JavaClient object at 0x7f98286213f0>, 'o36', 'listTables'), kw = {}, converted = ParseException()

    def deco(*a: Any, **kw: Any) -> Any:
        try:
            return f(*a, **kw)
        except Py4JJavaError as e:
            converted = convert_exception(e.java_exception)
            if not isinstance(converted, UnknownException):
                # Hide where the exception came from that shows a non-Pythonic
                # JVM exception message.
>               raise converted from None
E               pyspark.errors.exceptions.captured.ParseException: 
E               [PARSE_SYNTAX_ERROR] Syntax error at or near end of input.(line 1, pos 0)
E               == SQL ==
E               ^^^

/usr/local/lib/python3.10/site-packages/pyspark/errors/exceptions/ ParseException

Expected results

Shows list of tables.

Further details

Removing the following configuration from the Spark session makes the code work, but the catalog extension is necessary for other features.

        .config("spark.jars.packages", "")
        .config("spark.sql.extensions", "")
        .config("spark.sql.catalog.spark_catalog", "")

Environment information

mdrakiburrahman commented 4 months ago

We're hitting this as well, @richardcerny were you able to get to a resolution?

mdrakiburrahman commented 4 months ago

Found a workaround:

Went from:

def listTables(databaseName: String): Array[String] = {
    if (databaseExists(databaseName)) {
      return spark.catalog.listTables(databaseName).collect().map(

To this:

def listTables(databaseName: String): Array[String] = {
    if (databaseExists(databaseName)) {

      // Delta 2.4.0 has a regression with Spark 3.4.1 that makes
      // spark.catalog.listTables calls fail
      // >>>
      return spark
        .sql(s"SHOW TABLES IN $databaseName")
        .map(row => row.getAs[String]("tableName"))

richardcerny commented 4 months ago

Thank you @mdrakiburrahman. We have used the same workaround.
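For PySpark users, the same SHOW TABLES workaround might look roughly like the sketch below. The helper name list_table_names is hypothetical, and spark is assumed to be an already created SparkSession; the database name is interpolated unquoted, as in the Scala version above.

```python
from typing import Any, List


def list_table_names(spark: Any, database: str) -> List[str]:
    """Return table names in `database` without calling spark.catalog.listTables."""
    # SHOW TABLES yields the columns namespace, tableName and isTemporary.
    rows = spark.sql(f"SHOW TABLES IN {database}").collect()
    # Keep only the table name, mirroring row.getAs[String]("tableName").
    return [row["tableName"] for row in rows]
```

Usage would be along the lines of list_table_names(spark, SILVER_DB). Note that SHOW TABLES may also list temp views such as 'abc', so callers that only want persistent tables may need extra filtering.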

felipepessoto commented 2 days ago

It seems the problem is this line: val isTemp = row.getBoolean(2). It returns false when the catalog is set to DeltaCatalog.

You can see it by starting a spark-shell with and without Delta and running:


val namespace = Seq("spark_catalog", "default")
val plan = org.apache.spark.sql.catalyst.plans.logical.ShowTables(org.apache.spark.sql.catalyst.analysis.UnresolvedNamespace(namespace), None)

val tables = spark.sessionState.executePlan(plan).toRdd.collect().map { row =>
  val tableName = row.getString(1)
  val namespaceName = row.getString(0)
  val isTemp = row.getBoolean(2)
  if (isTemp) {
    // Temp views do not belong to any catalog. We shouldn't prepend the catalog name here.
    // val ns = if (namespaceName.isEmpty) Nil else Seq(namespaceName)
    // makeTable(ns :+ tableName)
  } else {
    // val ns = parseIdent(namespaceName)
    val ns = spark.sessionState.sqlParser.parseMultipartIdentifier(namespaceName)
    // makeTable( +: ns :+ tableName)
  }
}
@cloud-fan I have seen some contribs you did for Delta and Spark related to catalog. Any insights?

cloud-fan commented 1 day ago

@felipepessoto thanks for providing the repro! What was the error you hit? And can you also post the result of spark.sessionState.executePlan(plan).analyzed.treeString?

felipepessoto commented 1 day ago

@cloud-fan it is the same error that @richardcerny reported. In spark-shell, using my repro code:

[PARSE_SYNTAX_ERROR] Syntax error at or near end of input.(line 1, pos 0)

== SQL ==


  at org.apache.spark.sql.catalyst.parser.ParseException.withCommand(ParseDriver.scala:306)
  at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:144)
  at org.apache.spark.sql.execution.SparkSqlParser.parse(SparkSqlParser.scala:52)
  at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parseMultipartIdentifier(ParseDriver.scala:67)
  at $anonfun$tables$1(<console>:37)
  at $anonfun$tables$1$adapted(<console>:23)
  at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286)
  at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
  at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
  at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198)
  at scala.collection.mutable.ArrayOps$
  ... 64 elided

Calling spark.catalog.listTables().show():

[PARSE_SYNTAX_ERROR] Syntax error at or near end of input.(line 1, pos 0)

== SQL ==


  at org.apache.spark.sql.catalyst.parser.ParseException.withCommand(ParseDriver.scala:306)
  at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:144)
  at org.apache.spark.sql.execution.SparkSqlParser.parse(SparkSqlParser.scala:52)
  at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parseMultipartIdentifier(ParseDriver.scala:67)
  at org.apache.spark.sql.internal.CatalogImpl.parseIdent(CatalogImpl.scala:49)
  at org.apache.spark.sql.internal.CatalogImpl.$anonfun$listTables$1(CatalogImpl.scala:132)
  at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286)
  at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
  at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
  at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198)
  at scala.collection.mutable.ArrayOps$
  at org.apache.spark.sql.internal.CatalogImpl.listTables(CatalogImpl.scala:123)
  at org.apache.spark.sql.internal.CatalogImpl.listTables(CatalogImpl.scala:98)
  ... 47 elided
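
The trace shows parseIdent being called from listTables on an empty SQL string. A toy pure-Python model of that branch logic, not Spark's actual code, may help explain why a wrong isTemporary flag ends in a parse error. It assumes the temp view row comes back with an empty namespace; ShowTablesRow, parse_multipart_identifier and to_qualified_name are hypothetical stand-ins.

```python
from typing import List, NamedTuple


class ShowTablesRow(NamedTuple):
    namespace: str      # assumed to be empty for a temp view
    table_name: str
    is_temporary: bool


def parse_multipart_identifier(text: str) -> List[str]:
    """Toy stand-in for the SQL parser: an empty identifier is a syntax error."""
    if not text:
        raise ValueError("[PARSE_SYNTAX_ERROR] Syntax error at or near end of input.")
    return text.split(".")


def to_qualified_name(row: ShowTablesRow, catalog: str = "spark_catalog") -> List[str]:
    if row.is_temporary:
        # Temp views belong to no catalog, so the namespace is never parsed.
        ns = [row.namespace] if row.namespace else []
        return ns + [row.table_name]
    # Non-temp rows take this branch, which parses the namespace string.
    return [catalog] + parse_multipart_identifier(row.namespace) + [row.table_name]
```

In this model, ShowTablesRow("", "abc", True) resolves to ["abc"], while the same row with is_temporary set to False raises the parse error, which matches the behaviour reported with DeltaCatalog.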


scala> println(spark.sessionState.executePlan(plan).analyzed.treeString)
ShowTables [namespace#2, tableName#3, isTemporary#4]
+- ResolvedNamespace, [default]

cloud-fan commented 18 hours ago

One workaround is to set spark.sql.legacy.useV1Command to true. Ideally, DeltaCatalog should not return views in listTables.
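In PySpark, that setting could be applied roughly as sketched below. This assumes the option can be changed on a live session through spark.conf.set; if it cannot in your version, passing the same key and value through .config(...) on the session builder is the alternative. The helper name enable_v1_commands is hypothetical.

```python
from typing import Any

V1_COMMAND_CONF = "spark.sql.legacy.useV1Command"


def enable_v1_commands(spark: Any) -> None:
    """Set the legacy flag suggested above on an existing SparkSession."""
    # Assumption: the flag is a runtime SQL conf, so conf.set takes effect
    # for subsequent commands in this session.
    spark.conf.set(V1_COMMAND_CONF, "true")
```

After calling enable_v1_commands(spark), rerunning spark.catalog.listTables() from the reproduction should show whether the workaround helps in a given environment.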