I'm reaching out for @joellabes assistance with an issue we're experiencing with dbt and Snowflake.
We are attempting to use the Snowflake COPY command in dbt to load data from named stage files and have encountered the following scenarios:
When using the COPY command directly in Snowflake, it processes only the latest files. For example, if I have 3 files in the internal stage and all 3 are processed, then 2 new files are added, the COPY command should process only the remaining 2 files as expected as it indicates the same functionality in snowflake UI.
When we try to execute the COPY command using a macro in dbt, the first run processes the initial 3 files, and after adding the 2 new files, the command processes all 5 files again. This behaviour is not as expected.
Tried with different materializations as well when digged into the problem , we're encountering a problem with data duplication.
When we use the "INCREMENTAL" materialization, the data is loaded successfully into the table, but then an INSERT query is performed from a temporary view, duplicating the data. Similarly, when we use the "TABLE" materialization, the table is recreated, wiping out the load metadata, and the data is reloaded, resulting in duplication.
We've identified the root cause of the issue and believe that modifying the macro to skip the unnecessary INSERT query in "INCREMENTAL" materialization or using CREATE TABLE IF NOT EXISTS in "TABLE" mode could resolve the issue.
Could you please provide guidance on how to adapt the queries being sent from dbt to Snowflake to avoid data duplication? We'd appreciate any assistance or recommendations you can offer to resolve this issue. Below is the code used for loading the data from named stage to a target table.
Macro :
{% macro copy_files(stage_name, target_schema,table_name, pattern) %}
COPY INTO {{ target_schema }}.{{ table_name }}
FROM @{{ stage_name }}
pattern = '{{ pattern }}'
FILE_FORMAT = (FORMAT_NAME= 'my_csv_format',error_on_column_count_mismatch=false)
MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE
force=false;
{% endmacro %}
sql:
{{
config(
materialized= 'table', or 'incremental'
target_table = 'active_teller_accounttypes_full',
I'm reaching out for @joellabes assistance with an issue we're experiencing with dbt and Snowflake. We are attempting to use the Snowflake COPY command in dbt to load data from named stage files and have encountered the following scenarios:
When using the COPY command directly in Snowflake, it processes only the latest files. For example, if I have 3 files in the internal stage and all 3 are processed, then 2 new files are added, the COPY command should process only the remaining 2 files as expected as it indicates the same functionality in snowflake UI.
When we try to execute the COPY command using a macro in dbt, the first run processes the initial 3 files, and after adding the 2 new files, the command processes all 5 files again. This behaviour is not as expected.
Tried with different materializations as well when digged into the problem , we're encountering a problem with data duplication.
When we use the "INCREMENTAL" materialization, the data is loaded successfully into the table, but then an INSERT query is performed from a temporary view, duplicating the data. Similarly, when we use the "TABLE" materialization, the table is recreated, wiping out the load metadata, and the data is reloaded, resulting in duplication.
We've identified the root cause of the issue and believe that modifying the macro to skip the unnecessary INSERT query in "INCREMENTAL" materialization or using CREATE TABLE IF NOT EXISTS in "TABLE" mode could resolve the issue.
Could you please provide guidance on how to adapt the queries being sent from dbt to Snowflake to avoid data duplication? We'd appreciate any assistance or recommendations you can offer to resolve this issue. Below is the code used for loading the data from named stage to a target table. Macro : {% macro copy_files(stage_name, target_schema,table_name, pattern) %}
COPY INTO {{ target_schema }}.{{ table_name }} FROM @{{ stage_name }} pattern = '{{ pattern }}' FILE_FORMAT = (FORMAT_NAME= 'my_csv_format',error_on_column_count_mismatch=false) MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE force=false;
{% endmacro %}
sql:
{{ config( materialized= 'table', or 'incremental' target_table = 'active_teller_accounttypes_full',
}}
select * from {{this}}