Incorrect identifiers in snapshot schema test when overriding the ref() macro

MartinGuindon commented 3 years ago

Describe the bug

When overriding the ref() macro to render identifiers without a database (as documented here), schema tests defined for snapshots in a snapshot properties file are rendered without a database.

As snapshots are "static" based on the snapshot configuration and not environment-based, I'm a bit surprised about this behavior. I would have assumed that schema tests on snapshots would be unaffected by ref() override, just like schema tests on sources are unaffected.

I therefore assume that this is a bug.

Steps To Reproduce

Override the ref() macro by using the documented alternative:


-- Macro to override ref and to render identifiers without a database.

{% macro ref(model_name) %}

{% do return(builtins.ref(model_name).include(database=false)) %}

{% endmacro %}


2. Define a schema test in a snapshot:

version: 2

snapshots:

name: orders_snapshot columns:
- name: id tests:
  - not_null

Inspect the compiled test: the FROM clause is rendered without a database, and therefore would potentially fail as snapshots are often stored in a different database than the dev or prod environment, especially on Snowflake.

Expected behavior

Schema tests for snapshots should ignore the ref() function configuration and simply point to the configured database/schema of the specific snapshot.

System information

Which database are you using dbt with?

[ ] postgres
[ ] redshift
[ ] bigquery
[x] snowflake
[ ] other (specify: ____)

The output of dbt --version:

installed version: 0.18.1
   latest version: 0.18.1

Up to date!

The operating system you're using: MacOS 10.15.7

The output of python --version: Python 3.7.6

jtcohen6 commented 3 years ago

@MartinGuindon Thanks for the detailed writeup!

I would have assumed that schema tests on snapshots would be unaffected by ref() override, just like schema tests on sources are unaffected.

The reason here is simple: you use the ref() macro to reference snapshots, and the source() macro to reference sources. You're overriding the ref() macro but not the source() macro.

I don't think we should make a dbt change here: if you construct your own ref() macro on top of builtins.ref, and then proceed call the ref() macro, dbt shouldn't attempt to subvert your custom definition. I do think you have a few potential approaches here:

Define a custom ref macro, just for referencing snapshots.


{% macro ref_snapshot(snapshot_name) %}

{% do return(builtins.ref(model_name)) %}

{% endmacro %}

Use `{{ ref_snapshot(snapshot_name) }}` in lieu of `{{ ref('snapshot_name) }} ` throughout your project.

2. Use a common naming convention for your snapshots, and add conditional logic to your custom `ref` function.
```sql
{% macro ref(model_name) %}

  {% if model_name.startswith('snap_') %}

    {% do return(builtins.ref(model_name)) %}

  {% else %}

    {% do return(builtins.ref(model_name).include(database=false)) %}

  {% endif %}

{% endmacro %}

I'm going to close this, but I remain curious to hear what you think!

MartinGuindon commented 3 years ago

Hi @jtcohen6,

I'm confused. I'm not explicitly using the ref() function, as I'm not talking about a model leveraging a snapshot. If I were to use the ref() function in a model, pointing to a snapshot, I fully understand that it would be totally normal (and hence why snapshots tables are configured as sources prior to using them in models).

I'm talking about defining a schema test like not_null within in the snapshot properties YAML file. Its the value that is returned to the schema test through the {{ model }} argument that is incorrect. I imagine that its using the ref() function under the hood even though its not apparent in the schema test.

So option 1 is not possible. Option 2 works, I just tested it. However, I feel like this is a workaround and not a long term solution. Shouldn't schema tests set to a snapshot render properly to the configured database/schema, since those are not dynamic unlike the models?

jtcohen6 commented 3 years ago

Wow! I'm sorry, I completely skipped over the fact that this is about schema tests. Let me re-open and think about this some more.

jtcohen6 commented 3 years ago

@MartinGuindon Apologies again for misunderstanding your original issue. This is... tricky!

I admit it feels a bit weird that overriding the builtin ref macro changes the behavior of how all non-source relations are rendered, specifically {{ model }} in schema test definitions. I do think it's the intended behavior, however. We could either:

Create special handling for snapshots, similar to what we have for sources. We create relations differently for a resource, depending on whether it's a source vs. any other node type.
Write a more complex version of the custom ref macro to handle this conditional, along the lines of option #2 above. This is what feels most right to me: overriding a builtin is already a high-difficulty move, and there should be knobs to turn in user-land that give the dbt developer the ability to tweak this behavior.

In most projects, snapshots live in one or more stable, set-aside databases. Is that true in your case as well? Given that, here's the best answer I've got right now:

{% macro ref(model_name) %}

  {% set rel = builtins.ref(model_name) %}
  {% do log(rel.values(), info = true) %}

  {% if rel.database == 'snapshots' %}
    {% do return(builtins.ref(model_name)) %}
  {% else %}
    {% do return(builtins.ref(model_name).include(database=false)) %}
  {% endif %}

{% endmacro %}

This worked when I tested it locally, running schema tests against snapshots configured with target_database = 'snapshots'.

What I'd love to do here is make the conditional check {% if rel.resource_type == 'snapshot' %}. I don't believe that's possible today; although it directly corresponds to resources like models and snapshots, the dbt Relation object itself (which ref() returns) only contains information about database representations. It feels plausible, though. I'm not sure how tricky it would be to implement.

MartinGuindon commented 3 years ago

@jtcohen6 I'd think that a special handling for snapshots would be ideal, but in the mean time using {% if rel.database == 'snapshots' %} works for me, since we're on Snowflake, we do have a dedicated database for snapshots.

github-actions[bot] commented 2 years ago

This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please remove the stale label or comment on the issue, or it will be closed in 7 days.

dbt-labs / dbt-core