databricks / dbt-databricks

A dbt adapter for Databricks.
https://databricks.com
Apache License 2.0
226 stars 119 forks source link

Help with an example of the materialization type `clone` #832

Open johnson-jay-l opened 1 month ago

johnson-jay-l commented 1 month ago

Describe the bug

I would like to try using the materialization type of clone so that I can have a model that "publishes" tested data runs downstream from the actual model being tested. The clone should have the same table name as the upstream model but should be in a different database schema.

This materialization type is not documented and I can't figure out how to get it to work for my use case.

Code for the clone materialization: https://github.com/databricks/dbt-databricks/blob/main/dbt/include/databricks/macros/materializations/clone.sql

Steps To Reproduce

Expected behavior

Screenshots and log output

n/a

System information

The output of dbt --version:

dbt@06f0d6eceaea:~$ dbt --version
Core:
  - installed: 1.8.6

  The latest version of dbt-core could not be determined!
  Make sure that the following URL is accessible:
  https://pypi.org/pypi/dbt-core/json

Plugins:
  - redshift:   1.8.1 - Up to date!
  - databricks: 1.8.5 - Update available!
  - spark:      1.8.0 - Up to date!
  - postgres:   1.8.2 - Up to date!

  At least one plugin is out of date or incompatible with dbt-core.
  You can find instructions for upgrading here:
  https://docs.getdbt.com/docs/installation

The operating system you're using:

The output of python --version:

dbt@06f0d6eceaea:~$ python --version
Python 3.9.20

Additional context

johnson-jay-l commented 1 month ago

Here's what I've tried:

test_table.sql

{{
  config(
    materialized='table'
    )
}}

select 1 as col1
union
select 2 as col2

test_table_clone.sql

{{
  config(
    materialized='clone',
    this_relation='test_table',
    defer_relation='test_table',
    schema='bi'
    )
}}

dbt run --target databricks-dev-local --select test_table

04:36:00  ========== Begin Summary ==========
04:36:00  node: model.calm_main.test_table; status: success (error: )
04:36:00  ========== End Summary ==========
04:36:00  1 of 1 START hook: calm_main.on-run-end.0 ...................................... [RUN]
04:36:00  1 of 1 OK hook: calm_main.on-run-end.0 ......................................... [OK in 0.00s]
04:36:00  
04:36:01  
04:36:01  Finished running 1 table model, 1 project hook in 0 hours 0 minutes and 30.69 seconds (30.69s).
04:36:03  
04:36:03  Completed successfully
04:36:03  
04:36:03  Done. PASS=1 WARN=0 ERROR=0 SKIP=0 TOTAL=1

dbt run --target databricks-dev-local --select test_table_clone

04:36:25  Found 1254 models, 40 snapshots, 23 seeds, 1 operation, 2795 data tests, 1773 sources, 1050 macros
04:36:25  
04:36:34  Concurrency: 4 threads (target='databricks-dev-local')
04:36:34  
04:36:34  1 of 1 START sql clone model sandbox.test_table_clone .................. [RUN]
04:36:34  No relation found in state manifest for model.calm_main.test_table_clone
04:36:34  1 of 1 ERROR creating sql clone model sandbox.test_table_clone ......... [ERROR in 0.04s]

Question - why is it looking for the state manifest? Is there a way to ignore the state manifest and create the clone as long as the upstream task was successful? We often have downstream models running in separate container environments that don't have awareness of the upstream models' state manifests because the success/failure state is managed outside of dbt.

benc-db commented 1 month ago

I'm reasonably certain that we don't actually support clone as a materialization, but as support for dbt clone

johnson-jay-l commented 1 month ago

In theory it seems like a custom materialization could work that uses cloning. I got pretty far with this example below based on table.sql and then calling clone() from a model.sql file instead of using the standard config block.

Would this general direction potentially work as a model materialization type? Or, what would it take to make the clone.sql logic work with models and not just the CLI? Looking for guidance on which direction makes the most sense.

{% macro clone(source_model_name, source_schema, schema, tags, pre_hook, post_hook) -%}
    {%- set source_model_name = source_model_name -%}
    {%- set source_schema = source_schema -%}
    {%- set schema = schema -%}
    {%- set tags = tags | default([]) -%}
    {%- set pre_hook = pre_hook | default('') -%}
    {%- set post_hook = post_hook | default('') -%}

    {{
      config(
            materialized = 'databricks_clone',
            source_model_name = source_model_name,
            source_schema = source_schema,
            schema = schema,
            tags = tags,
            pre_hook = pre_hook,
            post_hook = post_hook
      )
    }}
    -- depends_on: {{ ref(source_model_name) }}

{%- endmacro %}

{% materialization databricks_clone, adapter = 'databricks' %}
  {{ log("MATERIALIZING CLONE") }}
  {%- set source_model_name = config.get('source_model_name') -%}
  {%- set source_schema = config.get('source_schema') -%}
  {%- set target_schema = schema -%}
  {%- set grant_config = config.get('grants') -%}
  {%- set tags = config.get('databricks_tags') -%}

  {{ log("database: " ~ database, info=True) }}
  {{ log("source_model_name: " ~ source_model_name, info=True) }}
  {{ log("source_schema: " ~ source_schema, info=True) }}
  {{ log("target_schema: " ~ schema, info=True) }}

  {%- set target_relation = adapter.get_relation(database=database, schema=target_schema, identifier=source_model_name, needs_information=True) -%}
  {{ log("target_relation: " ~ target_relation, info=True) }}

  {{ log(target_relation, info=True) }}

  {%- set source_relation = adapter.get_relation(database=database, schema=source_schema, identifier=source_model_name, needs_information=True) -%}

  {{ log("source_relation: " ~ source_relation, info=True) }}

  {{ run_hooks(pre_hooks) }}

  -- build model

  {%- call statement('main', language='sql') -%}
    create or replace
    table {{ target_relation }}
    shallow clone {{ source_relation }}
  {%- endcall -%}

  {% set should_revoke = should_revoke(old_relation, full_refresh_mode=True) %}
  {% do apply_grants(target_relation, grant_config, should_revoke) %}
  {%- do apply_tags(target_relation, tags) -%}

  {% do persist_docs(target_relation, model, for_relation=language=='python') %}

  {% do persist_constraints(target_relation, model) %}

  {% do optimize(target_relation) %}

  {{ run_hooks(post_hooks) }}

  {{ return({'relations': [target_relation]})}}

{% endmaterialization %}
benc-db commented 1 month ago

First we'd have to remove the blockers to custom materializations (another open ticket) :/. I suggest raising the issue of clone as a supported materialization with dbt Labs, because your use case seems reasonable to me.

johnson-jay-l commented 1 month ago

I'm new to custom materializations and was going off this doc which implies that custom materializations are supported: https://docs.getdbt.com/guides/create-new-materializations?step=1

Could you elaborate on the blockers? Happy to open another ticket and/or contribute to the code but I need to better understand what I should be asking for vs what is already supported.

johnson-jay-l commented 1 month ago

I went ahead and created the dbt-core issue and linked it to this one

benc-db commented 1 month ago

There are places in dbt-databricks that explicitly enumerate supported materializations that we need to remove.