dbt-labs / dbt-spark

dbt-spark contains all of the code enabling dbt to work with Apache Spark and Databricks
https://getdbt.com
Apache License 2.0

[Bug] dbt run will fail if default namespace doesn't exist. #1036

Closed 2ult4n closed 4 months ago

2ult4n commented 4 months ago

Is this a new bug in dbt-spark?

Current Behavior

I'm running a cluster with no default namespace, but I'm trying to connect to a different namespace, nyc. Whenever I try to run or debug, I encounter the following error:

(.venv) salsultan@Sultans-MacBook-Pro-2 gg % dbt debug
06:52:15  Running with dbt=1.7.13
06:52:15  dbt version: 1.7.13
06:52:15  python version: 3.11.9
06:52:15  python path: /Users/salsultan/Documents/Repos/dbt-iceberg-2/.venv/bin/python3.11
06:52:15  os info: macOS-13.2.1-arm64-arm-64bit
06:52:15  Using profiles dir at /Users/salsultan/.dbt
06:52:15  Using profiles.yml file at /Users/salsultan/.dbt/profiles.yml
06:52:15  Using dbt_project.yml file at /Users/salsultan/Documents/Repos/dbt-iceberg-2/gg/dbt_project.yml
06:52:15  adapter type: spark
06:52:15  adapter version: 1.7.1
06:52:15  Configuration:
06:52:15    profiles.yml file [OK found and valid]
06:52:15    dbt_project.yml file [OK found and valid]
06:52:15  Required dependencies:
06:52:15   - git [OK found]

06:52:15  Connection:
06:52:15    host: localhost
06:52:15    port: 10000
06:52:15    cluster: None
06:52:15    endpoint: None
06:52:15    schema: nyc
06:52:15    organization: 0
06:52:15  Registered adapter: spark=1.7.1
06:52:15    Connection test: [ERROR]

06:52:15  1 check failed:
06:52:15  dbt was unable to connect to the specified database.
The database returned the following error:

  >Runtime Error
  Database Error
    failed to connect

Check your database credentials and try again. For more information, visit:
https://docs.getdbt.com/docs/configure-your-profile
(.venv) salsultan@Sultans-MacBook-Pro-2 gg % dbt run 
06:52:20  Running with dbt=1.7.13
06:52:20  Registered adapter: spark=1.7.1
06:52:20  Found 2 models, 4 tests, 0 sources, 0 exposures, 0 metrics, 439 macros, 0 groups, 0 semantic models
06:52:20  
06:52:20  
06:52:20  Finished running  in 0 hours 0 minutes and 0.05 seconds (0.05s).
06:52:20  Encountered an error:
Runtime Error
  Runtime Error
    Database Error
      failed to connect

But if I create a default namespace in Spark, everything works, even though the profile never tells dbt to connect to it:

(.venv) salsultan@Sultans-MacBook-Pro-2 gg % dbt debug
07:32:20  Running with dbt=1.7.13
07:32:20  dbt version: 1.7.13
07:32:20  python version: 3.11.9
07:32:20  python path: /Users/salsultan/Documents/Repos/dbt-iceberg-2/.venv/bin/python3.11
07:32:20  os info: macOS-13.2.1-arm64-arm-64bit
07:32:20  Using profiles dir at /Users/salsultan/.dbt
07:32:20  Using profiles.yml file at /Users/salsultan/.dbt/profiles.yml
07:32:20  Using dbt_project.yml file at /Users/salsultan/Documents/Repos/dbt-iceberg-2/gg/dbt_project.yml
07:32:20  adapter type: spark
07:32:20  adapter version: 1.7.1
07:32:20  Configuration:
07:32:20    profiles.yml file [OK found and valid]
07:32:20    dbt_project.yml file [OK found and valid]
07:32:20  Required dependencies:
07:32:20   - git [OK found]

07:32:20  Connection:
07:32:20    host: localhost
07:32:20    port: 10000
07:32:20    cluster: None
07:32:20    endpoint: None
07:32:20    schema: nyc
07:32:20    organization: 0
07:32:20  Registered adapter: spark=1.7.1
07:32:20    Connection test: [OK connection ok]

07:32:20  All checks passed!
(.venv) salsultan@Sultans-MacBook-Pro-2 gg % dbt run 
07:32:28  Running with dbt=1.7.13
07:32:28  Registered adapter: spark=1.7.1
07:32:28  Found 2 models, 4 tests, 0 sources, 0 exposures, 0 metrics, 439 macros, 0 groups, 0 semantic models
07:32:28  
07:32:28  Concurrency: 1 threads (target='dev')
07:32:28  
07:32:28  1 of 2 START sql table model nyc.my_first_dbt_model ............................ [RUN]
07:32:28  1 of 2 OK created sql table model nyc.my_first_dbt_model ....................... [OK in 0.33s]
07:32:28  2 of 2 START sql table model nyc.my_second_dbt_model ........................... [RUN]
07:32:29  2 of 2 OK created sql table model nyc.my_second_dbt_model ...................... [OK in 0.22s]
07:32:29  
07:32:29  Finished running 2 table models in 0 hours 0 minutes and 0.86 seconds (0.86s).
07:32:29  
07:32:29  Completed successfully
07:32:29  
07:32:29  Done. PASS=2 WARN=0 ERROR=0 SKIP=0 TOTAL=2

Expected Behavior

dbt should be able to debug and run successfully regardless of whether the default namespace exists.

Steps To Reproduce

Using the following Docker setup for Spark: https://github.com/tabular-io/docker-spark-iceberg

The following profiles.yml:

gg:
  outputs:
    dev:
      host: localhost
      method: thrift
      port: 10000
      schema: nyc
      type: spark
  target: dev

The dbt_project.yml:


# Name your project! Project names should contain only lowercase characters
# and underscores. A good package name should reflect your organization's
# name or the intended use of these models
name: 'gg'
version: '1.0.0'
config-version: 2

# This setting configures which "profile" dbt uses for this project.
profile: 'gg'

# These configurations specify where dbt should look for different types of files.
# The `model-paths` config, for example, states that models in this project can be
# found in the "models/" directory. You probably won't need to change these!
model-paths: ["models"]
analysis-paths: ["analyses"]
test-paths: ["tests"]
seed-paths: ["seeds"]
macro-paths: ["macros"]
snapshot-paths: ["snapshots"]

clean-targets:         # directories to be removed by `dbt clean`
  - "target"
  - "dbt_packages"

# Configuring models
# Full documentation: https://docs.getdbt.com/docs/configuring-models

# In this example config, we tell dbt to build all models in the example/
# directory as views. These settings can be overridden in the individual model
# files using the `{{ config(...) }}` macro.
models:
  gg:
    # Config indicated by + and applies to all files under models/example/
    example:
      +materialized: table

Relevant log output

No response

Environment

- OS: macOS Ventura 13.2.1
- Python: 3.11
- dbt-core: 1.7.13
- dbt-spark: 1.7.1

Additional Context

From what I gathered, there is an initial check fired before run or debug which executes the following query to test the connection: select 1 as id

I think that query gets executed without specifying a namespace, so it runs in the default namespace, which doesn't exist. The check therefore prevents both run and debug from succeeding even when the configuration is correct.
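For reference, creating the missing namespace once works around this (this is what I did above; a sketch assuming a spark-sql session against the same catalog the thrift server uses):

```sql
-- One-time workaround: create the namespace that dbt's unqualified
-- pre-flight query implicitly runs in.
CREATE NAMESPACE IF NOT EXISTS default;
```

After this, both dbt debug and dbt run succeed with the profile unchanged, as shown in the second log above.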

IMO the check should use the schema from the profile to prevent this kind of error.
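Something like the following during the connection test would avoid the dependency on default (just a sketch of what I mean, not what the adapter currently runs):

```sql
-- Sketch: set the session to the profile's schema before probing,
-- so the check no longer depends on the `default` namespace existing.
USE nyc;
select 1 as id;
```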

Thanks

jtcohen6 commented 4 months ago

@2ult4n Thanks for opening, appreciate the thorough write-up.

My guess is that dbt is failing when it tries to run a metadata query (show table extended), and as you say, there is no default namespace in which to run it. You can look in the debug-level logs to confirm exactly what query dbt is running when it encounters the error (dbt run --debug or logs/dbt.log).

This doesn't feel like a priority for us to fix, but it would be a reasonable thing to document here:

I'd welcome you to open a PR against the documentation repo, and to link back to this issue for context.