dbt-labs / dbt-external-tables

dbt macros to stage external sources
https://hub.getdbt.com/dbt-labs/dbt_external_tables/latest/
Apache License 2.0

get_external_build_plan does not use the 'source' macro as 'create_external_table' does #201

Closed thijsnijhuis-schiphol closed 10 months ago

thijsnijhuis-schiphol commented 1 year ago

Describe the bug

The 'spark__get_external_build_plan' macro does not use the 'source' macro to check whether the table already exists, although the package does use it in 'spark__create_external_table' to create the table. Therefore, when overruling the source object naming via a custom 'source' macro, the package creates the right objects but fails to correctly identify whether they already exist. If I set 'ext_full_refresh' to true, it does not generate an error, but it always drops and recreates all tables.
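
To make the mismatch concrete, the two macros resolve the relation differently, roughly like this (a paraphrase of the package macros as I understand them, not the exact source; lines may differ by version):

    {# spark__get_external_build_plan: the existence check reads schema and
       identifier straight off the graph node, so a project-level 'source'
       override is ignored here #}
    {% set old_relation = adapter.get_relation(
        database = none,
        schema = source_node.schema,
        identifier = source_node.identifier
    ) %}

    {# spark__create_external_table: the relation is rendered via source(),
       so the override does apply when the table is created #}
    create table {{ source(source_node.source_name, source_node.name) }} ...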

Steps to reproduce

Use Azure Databricks and create an external table by simply pointing the location to a storage account path (a sketch of such a source definition follows the macro below). Add a 'source' macro to your project and make some change to the default naming of the source objects. In our case, we always want our sources to be named src_<source_name>__<table_name> and to be created in the same schema as all other tables. I have created this macro to do so:

{% macro source(source_name, table_name) %}
  {# This macro makes sure that external tables for sources are generated as: schema_name.src_<source_name>__<source_table_name> #}

  {% set rel = builtins.source(source_name, table_name) %}

  {# Not all sources need external tables. These are only added when a source has the 'external' property specified. If this is not the case, we do not need to overrule the naming and can keep the default #}
  {%- if graph.sources -%}
    {# The only way to determine whether the 'external' property is specified is by querying the graph #}
    {%- set source_node = graph.sources.values() | selectattr("source_name", "eq", source_name) | selectattr("name", "eq", table_name) | first() -%}
    {%- if source_node and source_node.external -%}
      {%- set rel = rel.replace_path(schema=target.schema).replace_path(identifier="src_"~source_name~"__"~table_name) -%}
    {%- endif -%} 
  {%- endif -%} 

  {# Return the result, altered or not. #}
  {% do return(rel) %}

{% endmacro %}
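
For reference, a source definition of the kind described in the reproduce steps would look something like this (all names and the storage location are placeholders, and 'using: parquet' is an assumption, not taken from my actual project):

    version: 2
    sources:
      - name: landing
        tables:
          - name: customers
            external:
              location: "abfss://container@account.dfs.core.windows.net/customers/"
              using: parquet

With the override in place, running 'dbt run-operation stage_external_sources' creates the table as <target schema>.src_landing__customers instead of landing.customers.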

Expected results

The macro that checks whether the table exists should use the same naming as the macro that creates the table.

Actual results

The macro that checks for existence uses a different table name than the one that creates the table when 'source' is used to overrule the name.

Screenshots and log output

n.a.

System information

The contents of your packages.yml file:

Which database are you using dbt with? databricks/spark

The output of dbt --version:

Core:
  - installed: 1.4.5
  - latest:    1.5.0 - Update available!

  Your version of dbt-core is out of date!
  You can find instructions for upgrading here:
  https://docs.getdbt.com/docs/installation

Plugins:
  - databricks: 1.4.3 - Up to date!
  - spark:      1.4.1 - Update available!

  At least one plugin is out of date or incompatible with dbt-core.
  You can find instructions for upgrading here:
  https://docs.getdbt.com/docs/installation

The operating system you're using: Windows 11

The output of python --version: Python 3.10.8

Additional context

I solved this myself by changing lines 7 and 8 of macros/plugins/spark/get_external_build_plan.sql from:

    schema = source_node.schema,
    identifier = source_node.identifier

into:

    schema = source(source_node.source_name, source_node.name).schema,
    identifier = source(source_node.source_name, source_node.name).identifier
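
In context, the patched lookup resolves the relation through 'source', so the existence check and the create path agree on the name (again a paraphrase; the surrounding macro may differ slightly between package versions):

    {# Resolve schema and identifier via source() so a project-level
       override applies here exactly as it does in spark__create_external_table #}
    {% set old_relation = adapter.get_relation(
        database = none,
        schema = source(source_node.source_name, source_node.name).schema,
        identifier = source(source_node.source_name, source_node.name).identifier
    ) %}
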
github-actions[bot] commented 10 months ago

This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please comment on the issue or else it will be closed in 7 days.

github-actions[bot] commented 10 months ago

Although we are closing this issue as stale, it's not gone forever. Issues can be reopened if there is renewed community interest. Just add a comment to notify the maintainers.