dbt-labs / dbt-snowflake

dbt-snowflake contains all of the code enabling dbt to work with Snowflake
https://getdbt.com
Apache License 2.0

[ADAP-1058] [Feature] Implement refresh strategy for dynamic tables #859

Closed: mikealfare closed this 2 months ago

mikealfare commented 10 months ago

Describe the feature

Provide an option for analytics engineers to delay the refresh of data until the next scheduled refresh instead of forcing a refresh upon object creation. The primary benefit of this feature is that the entire pipeline can be built and the logic can be validated without needing to run a potentially expensive query to initially populate the dynamic table. The proposed name for this configuration is refresh_strategy, as identified in #603.
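
For illustration, a model-level sketch of what such a configuration might look like. The refresh_strategy name and its value are hypothetical (taken from the proposal above, not an existing dbt-snowflake config); the model and warehouse names are illustrative:

{# refresh_strategy below is hypothetical; target_lag and snowflake_warehouse are existing dynamic table configs #}
{{ config(
    materialized='dynamic_table',
    target_lag='8 hours',
    snowflake_warehouse='TRANSFORMING',
    refresh_strategy='on_schedule'
) }}

select * from {{ ref('stg_orders') }}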

Describe alternatives you've considered

No response

Who will this benefit?

This will benefit analytics engineers who heavily use dynamic tables and want to validate logic in their pipelines without reloading data. This also allows a little more flexibility and control over when data is populated. For example, the engineer may not want to run an expensive query in the middle of the day, but also may not want to wait until overnight to deploy the dynamic table.

Are you interested in contributing this feature?

No response

Anything else?

It's worth noting that Snowflake does not currently support this functionality, making this impossible. However, if Snowflake were to add this, then we can work that into dbt-snowflake. This issue is a placeholder for if/when that happens.

tim-sparq commented 10 months ago

@mikealfare

Please can you explain this in a little more detail?

It's worth noting that Snowflake does not currently support this functionality, making this impossible. However, if Snowflake were to add this, then we can work that into dbt-snowflake. This issue is a placeholder for if/when that happens.

I found this issue while looking for a reason why dbt deployments containing chains of dynamic tables are taking so long to deploy.

The problem, as far as I can tell, is that every dynamic table deployment executed by dbt includes an ALTER DYNAMIC TABLE <blah> REFRESH statement after the dynamic table is created.

What this means is that, given a DAG of dynamic tables like this:

A <- B <- C <- D

where <- indicates a SELECT FROM, every deployment of a dynamic table along the DAG requires the refresh of everything before it, i.e. the deployment process proceeds like this:

  1. Deploy A
  2. Refresh A
  3. Deploy B
  4. Refresh A
  5. Refresh B
  6. Deploy C
  7. Refresh A
  8. Refresh B
  9. Refresh C
  10. Deploy D
  11. Refresh A
  12. Refresh B
  13. Refresh C
  14. Refresh D

My question I suppose is, what is the reason that ALTER DYNAMIC TABLE <blah> REFRESH must ALWAYS be included in the dynamic table deployment script?
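
For reference, the statements in question look roughly like this for each dynamic table in the chain (a sketch with illustrative names, not the exact SQL the materialization renders):

create or replace dynamic table my_db.my_schema.b
    target_lag = '8 hours'
    warehouse = transforming
    as
    select * from my_db.my_schema.a;

alter dynamic table my_db.my_schema.b refresh;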

mikealfare commented 10 months ago

My question I suppose is, what is the reason that ALTER DYNAMIC TABLE <blah> REFRESH must ALWAYS be included in the dynamic table deployment script?

When dynamic tables were first introduced, they did not refresh automatically, which required us to run the refresh statement to populate them; that statement should therefore still be present in 1.6. Snowflake then decided to always run the initial refresh automatically, so we should have removed that statement in 1.7. The former behavior would have allowed us to control whether a dynamic table was initially refreshed; the latter does not, as Snowflake runs the initial refresh internally. I suspect at some point Snowflake will make this configurable and we can tie into that.

tim-sparq commented 10 months ago

@mikealfare thanks for explaining. I have upgraded to 1.7 and can see that the refresh statement is no longer present in the deployment scripts.

sfc-gh-dflippo commented 9 months ago

I have a related recommendation: when we are not creating/replacing a dynamic table, we should still issue a refresh, to ensure that any downstream table materializations and tests query current information during the dbt batch run.

I have customers using the new feature, and we discovered that after switching tables to dynamic table materializations, our downstream table models could be querying stale data: although a downstream dynamic table can trigger a refresh of an upstream dynamic table, the same is not true for other queries. We have been adding post hooks to trigger the refresh as an interim solution.
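
A minimal sketch of such a post hook on a dynamic table model (model and warehouse names are illustrative); the hook forces a refresh at the end of the model's run so downstream non-dynamic models read current data:

{{ config(
    materialized='dynamic_table',
    target_lag='1 hour',
    snowflake_warehouse='TRANSFORMING',
    post_hook="alter dynamic table {{ this }} refresh"
) }}

select * from {{ ref('stg_orders') }}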

I welcome feedback if this was intentional, but otherwise I recommend we add two lines to the snowflake__get_alter_dynamic_table_as_sql macro that call snowflake__refresh_dynamic_table, similar to the existing call in replace.sql:

{% macro snowflake__get_alter_dynamic_table_as_sql(
    existing_relation,
    configuration_changes,
    target_relation,
    sql
) -%}
    {{- log('Applying ALTER to: ' ~ existing_relation) -}}

    {% if configuration_changes.requires_full_refresh %}
        {{- get_replace_sql(existing_relation, target_relation, sql) -}}

    {% else %}

        {%- set target_lag = configuration_changes.target_lag -%}
        {%- if target_lag -%}{{- log('Applying UPDATE TARGET_LAG to: ' ~ existing_relation) -}}{%- endif -%}
        {%- set snowflake_warehouse = configuration_changes.snowflake_warehouse -%}
        {%- if snowflake_warehouse -%}{{- log('Applying UPDATE WAREHOUSE to: ' ~ existing_relation) -}}{%- endif -%}

        alter dynamic table {{ existing_relation }} set
            {% if target_lag %}target_lag = '{{ target_lag.context }}'{% endif %}
            {% if snowflake_warehouse %}warehouse = {{ snowflake_warehouse.context }}{% endif %}
        ;
        {{ snowflake__refresh_dynamic_table(target_relation) }}

    {%- endif -%}

{%- endmacro %}

I also suggest that we should update the snowflake__dynamic_table_get_build_sql macro to perform a refresh as the default when there are no other changes:

{% macro snowflake__dynamic_table_get_build_sql(existing_relation, target_relation) %}

    {% set full_refresh_mode = should_full_refresh() %}

    -- determine the scenario we're in: create, full_refresh, alter, refresh data
    {% if existing_relation is none %}
        {% set build_sql = get_create_sql(target_relation, sql) %}
    {% elif full_refresh_mode or not existing_relation.is_dynamic_table %}
        {% set build_sql = get_replace_sql(existing_relation, target_relation, sql) %}
    {% else %}

        -- get config options
        {% set on_configuration_change = config.get('on_configuration_change') %}
        {% set configuration_changes = snowflake__get_dynamic_table_configuration_changes(existing_relation, config) %}

        {% if configuration_changes is none %}
            {% set build_sql = snowflake__refresh_dynamic_table(target_relation) %}
            {{ exceptions.warn("No configuration changes were identified on: `" ~ target_relation ~ "`. Refreshing.") }}

        {% elif on_configuration_change == 'apply' %}
            {% set build_sql = snowflake__get_alter_dynamic_table_as_sql(existing_relation, configuration_changes, target_relation, sql) %}
        {% elif on_configuration_change == 'continue' %}
            {% set build_sql = snowflake__refresh_dynamic_table(target_relation) %}
            {{ exceptions.warn("Configuration changes were identified and `on_configuration_change` was set to `continue` for `" ~ target_relation ~ "`. Refreshing.") }}
        {% elif on_configuration_change == 'fail' %}
            {{ exceptions.raise_fail_fast_error("Configuration changes were identified and `on_configuration_change` was set to `fail` for `" ~ target_relation ~ "`") }}

        {% else %}
            -- this only happens if the user provides a value other than `apply`, 'continue', 'fail'
            {{ exceptions.raise_compiler_error("Unexpected configuration scenario: `" ~ on_configuration_change ~ "`") }}

        {% endif %}

    {% endif %}

    {% do return(build_sql) %}

{% endmacro %}

sfc-gh-yostrinsky commented 9 months ago

@tim-sparq @mikealfare

I suspect at some point Snowflake will make this configurable and we can tie into that.

It's now part of the CREATE statement: there is a new INITIALIZE parameter that controls whether the dynamic table should be refreshed on creation or not. Can this be added as a configuration (alongside the lag and warehouse), and any dbt logic around the refresh removed? There is also a new parameter to enforce a refresh mode; can that be added as well? https://docs.snowflake.com/en/sql-reference/sql/create-dynamic-table
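
For reference, a sketch of the Snowflake DDL with those parameters (names and values are illustrative; see the linked docs for the full grammar):

create or replace dynamic table my_db.my_schema.my_model
    target_lag = '8 hours'
    warehouse = transforming
    refresh_mode = incremental
    initialize = on_schedule
    as
    select * from my_db.my_schema.stg_orders;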

Thanks!

mikealfare commented 9 months ago

@Fleid I want to bring this to your attention as I slotted this for refinement. Let us know where this falls regarding prioritization. Also, while I was reviewing the docs linked above, I noticed that there's also a refresh mode option now (separate feature).

yanithx commented 3 months ago

@tim-sparq Snowflake is changing the behavior of the default refresh mode and they suggest explicitly setting the refresh mode on all production dynamic tables. Could we add the option to specify the refresh_mode on the dynamic table config?

mikealfare commented 3 months ago

I put this back into triage because I think it's actually completed. We added refresh_mode as a config, which is equivalent to this feature. I would like Product to chime in (or close this as done) in case I'm missing something.
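
For anyone landing here later, a sketch of setting that config on a model (assuming a dbt-snowflake version that supports the refresh_mode config; model, warehouse, and values are illustrative):

{{ config(
    materialized='dynamic_table',
    target_lag='8 hours',
    snowflake_warehouse='TRANSFORMING',
    refresh_mode='INCREMENTAL'
) }}

select * from {{ ref('stg_orders') }}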

dataders commented 2 months ago

@mikealfare I'm happy to close this as completed by #1081. Sounds like the thing that still cannot be done is to create a DT that does not auto-populate (i.e. without a refresh right away).

Folks are welcome to comment on this issue should they still want this.