dbt-labs / dbt-spark

dbt-spark contains all of the code enabling dbt to work with Apache Spark and Databricks
https://getdbt.com
Apache License 2.0

[Bug] `spark__list_relations_without_caching` expects legacy `schema` field #1048

Open JCZuurmond opened 5 months ago

JCZuurmond commented 5 months ago

Is this a new bug in dbt-spark?

Current Behavior

spark__list_relations_without_caching expects the legacy field `relation.schema`:

{% macro spark__list_relations_without_caching(relation) %}
  {% call statement('list_relations_without_caching', fetch_result=True) -%}
    show table extended in {{ relation.schema }} like '*'
  {% endcall %}

  {% do return(load_result('list_relations_without_caching').table) %}
{% endmacro %}

Expected Behavior

spark__list_relations_without_caching should render the relation itself:

{% macro spark__list_relations_without_caching(relation) %}
  {% call statement('list_relations_without_caching', fetch_result=True) -%}
    show table extended in {{ relation }} like '*'
  {% endcall %}

  {% do return(load_result('list_relations_without_caching').table) %}
{% endmacro %}

Steps To Reproduce

N.A.

Relevant log output

No response

Environment

Irrelevant

Additional Context

See Spark SQL migration guide

jtcohen6 commented 4 months ago

Hey @JCZuurmond, good to hear from you!

Here's my understanding of the situation:

I think the right next step is to support catalog and namespace as official aliases for database and schema, respectively.
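A minimal sketch of what such aliasing could look like (the `Relation` class below is a hypothetical stand-in, not dbt's actual `BaseRelation`; the property names follow the proposal above):

```python
# Hypothetical sketch: expose `catalog` and `namespace` as read-only
# aliases for `database` and `schema` on a simplified relation object.
class Relation:
    def __init__(self, database: str, schema: str, identifier: str):
        self.database = database
        self.schema = schema
        self.identifier = identifier

    @property
    def catalog(self) -> str:
        # proposed alias: catalog -> database
        return self.database

    @property
    def namespace(self) -> str:
        # proposed alias: namespace -> schema
        return self.schema
```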

Is that something you'd be interested in contributing?

stegus64 commented 1 month ago

This issue is the root cause of this problem: https://github.com/dbt-labs/spark-utils/issues/38

This code does not work any more:

https://github.com/dbt-labs/spark-utils/blob/f792c519e68b64e3411508bfa5f41a02e8646372/macros/maintenance_operation.sql#L4

{% for database in spark__list_schemas('not_used') %}
  {% for table in spark__list_relations_without_caching(database[0]) %}

The value returned by `spark__list_schemas()` is the result of `SHOW DATABASES`, which contains only a single column named `databaseName`.

This means that `relation.schema` in `spark__list_relations_without_caching` renders as an empty string, so

show table extended in {{ relation.schema }} like '*'

produces invalid SQL and causes a syntax error.
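The failure mode can be reproduced outside dbt with a few lines of Jinja (a minimal sketch, assuming Jinja2's default silent `Undefined`; `my_db` is a made-up schema name standing in for `database[0]`):

```python
# Minimal reproduction: the values yielded by spark__list_schemas are
# plain strings, so `.schema` on them is undefined and, under Jinja2's
# default Undefined, silently renders as an empty string.
from jinja2 import Template

current = Template("show table extended in {{ relation.schema }} like '*'")
expected = Template("show table extended in {{ relation }} like '*'")

relation = "my_db"  # what database[0] actually is: a string, not a Relation

print(current.render(relation=relation))
# -> show table extended in  like '*'   (schema missing: Spark SQL syntax error)

print(expected.render(relation=relation))
# -> show table extended in my_db like '*'
```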

I am not sure why `.schema` was added in #972. For my purposes, simply changing `relation.schema` to `relation` fixes the issue.

I do not know what other problems such a change might cause.

It seems that #972 is a breaking change.