dbt-labs / dbt-snowflake

dbt-snowflake contains all of the code enabling dbt to work with Snowflake
https://getdbt.com
Apache License 2.0

[Feature] Concurrency issues in Snowflake incremental models #1123

Open · rattata2me opened this issue 3 months ago

rattata2me commented 3 months ago

Is this a new bug in dbt-snowflake?

Current Behavior

When running an incremental model with the on_schema_change policy set to append_new_columns, a race condition can occur. If two jobs perform the column check at the same time, both generate the same ALTER TABLE statement, and the slower executor fails with a SQL compilation error:

SQL compilation error:
  column '<new_column>' already exists

The core issue is that nothing guarantees the column schema remains unchanged between the column check and the ALTER TABLE statement that follows it.
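
A minimal sketch of what both runs end up executing (the table and column names here are hypothetical):

    -- Both runs inspect the target schema, see that new_col is missing,
    -- and generate the identical statement:
    alter table analytics.my_model add column new_col varchar;

    -- The faster run succeeds; the slower run then fails with:
    --   SQL compilation error: column 'NEW_COL' already exists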

Expected Behavior

To avoid this issue, the process should either use the IF NOT EXISTS / IF EXISTS clauses that Snowflake provides for ALTER TABLE, or wrap the column check and the ALTER TABLE in a transaction so that the check-and-update is atomic.
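
For illustration, a guarded version of the generated statement (same hypothetical names as above) would be idempotent under concurrent runs:

    -- IF NOT EXISTS turns the duplicate statement into a no-op
    -- instead of a compilation error:
    alter table analytics.my_model
      add column if not exists new_col varchar;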

Steps To Reproduce

  1. Create an incremental model: define and create an incremental model in your dbt project with on_schema_change set to append_new_columns (a minimal example is sketched below).
  2. Add a column: add a new column to the model's select list.
  3. Concurrently execute dbt run: run the dbt job in two separate processes at the same time.
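
A minimal model that reproduces the setup might look like this (the model, column, and source names are hypothetical):

    -- models/my_model.sql
    {{
        config(
            materialized='incremental',
            unique_key='id',
            on_schema_change='append_new_columns'
        )
    }}

    select
        id,
        ingested_at,
        payload  -- new column added in step 2
    from {{ source('raw', 'events') }}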

Relevant log output

No response

Environment

- OS: Ubuntu 22.04
- Python: 3.9
- dbt-core: 1.5.7
- dbt-snowflake: 1.5.7

Additional Context

No response

amychen1776 commented 2 months ago

@rattata2me Could you say more about your use case here? Is there a reason why you would want concurrent execution against the same target (especially given that you are changing the target table's schema with a new column)?

rattata2me commented 2 months ago

> @rattata2me Could you say more about your use case here? Is there a reason why you would want concurrent execution against the same target (especially given that you are changing the target table's schema with a new column)?

I have to dynamically update the table contents every time there is a new data ingestion, and ingestion can be concurrent, coming from different users at the same time. This is a critical process that has to have 100% uptime, so I depend on the on_schema_change logic to roll out column changes while keeping the service running.

amychen1776 commented 2 months ago

Thank you for the explanation!