databricks / dbt-databricks

A dbt adapter for Databricks.
https://databricks.com
Apache License 2.0

1.8.0 not able to detect existing tables anymore #691

Closed thijs-nijhuis closed 3 weeks ago

thijs-nijhuis commented 1 month ago

Describe the bug

Since dbt-databricks 1.8.0, dbt no longer seems able to determine whether a table already exists. I have noticed this with both seeds and incremental models.

Seeds

We have a seed in our project called 'country'. When running dbt build --select country, it succeeds if the country table has not been created yet. But the second time you run it, it fails with this error message:

Runtime Error in seed country (seeds\example\country.csv)
  [TABLE_OR_VIEW_ALREADY_EXISTS] Cannot create table or view `schema_name`.`country` because it already exists.
  Choose a different name, drop or replace the existing object, add the IF NOT EXISTS clause to tolerate pre-existing objects, or add the OR REFRESH clause to refresh the existing streaming table. SQLSTATE: 42P07

When I reinstall dbt-databricks 1.7.14, I can run the command over and over again, and the incremental models do create merge statements. I saw there was a fix for rerunning seeds in 1.8.1, but that doesn't solve it for me. Also, we don't have 'persist_docs' set, nor do we have a description for this seed.

Incremental

After upgrading to 1.8.0, our incremental models always run as a 'create or replace' statement instead of a merge. I see that locally in my target\run folder. I also see it in UC when running 'describe history catalog.schema.table_name', where all recent changes are 'create or replace' instead of 'merge'.

Steps To Reproduce

1. Add a seed to the project called country.csv (a minimal example is shown below).
2. Use dbt_project.yml to set its target catalog and schema (not sure if that is required).
3. Run dbt build --select country or simply dbt seed; this should work.
4. Run the same command again; it should fail.
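
Any trivial seed file is enough; these example contents are purely illustrative:

    country_code,country_name
    NL,Netherlands
    BE,Belgium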

For the incremental models, simply create a model, set it to incremental, and add this line: {{ log("For model '"~model.name~"' is_incremental() is set to '"~is_incremental()~"'", True) }}. The second time the model is run, is_incremental() should be True, but it is not.
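
A minimal model along these lines shows the behavior (the file name, columns, and unique_key are just illustrative):

    -- models/incremental_repro.sql (hypothetical repro model)
    {{ config(materialized='incremental', unique_key='id') }}

    {{ log("For model '"~model.name~"' is_incremental() is set to '"~is_incremental()~"'", True) }}

    with source_data as (
        select 1 as id, current_timestamp() as loaded_at
    )

    select * from source_data
    {% if is_incremental() %}
    -- on a working install this branch is taken from the second run onward, producing a merge;
    -- under this bug it is never taken, so every run is a create or replace
    where loaded_at > (select max(loaded_at) from {{ this }})
    {% endif %}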

Expected behavior

Automated detection of whether the table already exists.

Screenshots and log output

See error output above.

System information

The output of dbt --version:

Core:
  - installed: 1.8.1
  - latest:    1.8.1 - Up to date!

Plugins:
  - databricks: 1.8.1 - Up to date!
  - spark:      1.8.0 - Up to date!

The operating system you're using: Windows 11 Enterprise
The output of python --version: Python 3.11.4

Additional context

I tried to debug the dbt-databricks seed materialization locally. This line yields 'None' for me, where I would expect it to return the relation if the table exists. Strangely enough, the parameters used (database, schema and identifier) all have the correct values. I see the same thing happening when running from a Databricks workflow. It uses a job cluster, so it gets a fresh install of dbt-databricks 1.8.1 on each job run. No other packages are installed.
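
To check the detection path without stepping through the adapter code, a throwaway macro can exercise adapter.get_relation with the same values (the macro name and file are mine, purely for debugging):

    -- macros/debug_get_relation.sql (hypothetical debugging helper)
    {% macro debug_get_relation(database, schema, identifier) %}
        {% set rel = adapter.get_relation(database=database, schema=schema, identifier=identifier) %}
        {# rel comes back as None when the adapter cannot see the existing table #}
        {{ log("get_relation(" ~ database ~ "." ~ schema ~ "." ~ identifier ~ ") -> " ~ rel, info=True) }}
    {% endmacro %}

Running it with dbt run-operation debug_get_relation --args '{database: catalog_name, schema: schema_name, identifier: country}' logs None for me even though the table exists.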

benc-db commented 1 month ago

@thijs-nijhuis-shell can you email me your artifacts? I have not seen any report like this, and I use incremental models with 1.8.x daily. ben.cassell@databricks.com

thijs-nijhuis commented 1 month ago

@benc-db , I think I found the problem.

If I look into the dbt.log I see this query somewhere at the top:

select
      table_name,
      if(table_type in ('EXTERNAL', 'MANAGED', 'MANAGED_SHALLOW_CLONE'), 'table', lower(table_type)) as table_type,
      lower(data_source_format) as file_format,
      table_owner
    from `catalog_name`.`information_schema`.`tables`
    where table_schema = 'schema_name'

I assume that one is used to fetch all the existing objects at once. But when I run that query I get no results, and I think the reason is similar to this discussion: our catalog was renamed at some point.

When I change the query into this, I do see all the objects:

select
      table_name,
      if(table_type in ('EXTERNAL', 'MANAGED', 'MANAGED_SHALLOW_CLONE'), 'table', lower(table_type)) as table_type,
      lower(data_source_format) as file_format,
      table_owner
    from system.`information_schema`.`tables`
    where table_catalog = 'catalog_name' and table_schema = 'schema_name'

benc-db commented 1 month ago

Interesting. Would you mind filing a ticket through your company's Databricks contact about this issue with information_schema, giving the context about how it's breaking dbt? I can code a fallback like I did for materialized views, but it would be more performant if Databricks just fixed the information_schema behavior to work as expected.