airbytehq / airbyte

The leading data integration platform for ETL / ELT data pipelines from APIs, databases & files to data warehouses, data lakes & data lakehouses. Both self-hosted and Cloud-hosted.
https://airbyte.com

[destination-snowflake] tables are created with `CHANGE_TRACKING = FALSE` #34322

Open dtiesling opened 9 months ago

dtiesling commented 9 months ago

Connector Name

destination-snowflake

Connector Version

3.4.22

What step the error happened?

During the sync

Relevant information

In my Snowflake data pipeline I am using Dynamic Tables on top of the destination tables created by Airbyte. The dynamic tables rely on CHANGE_TRACKING being enabled. Each time I trigger a "full overwrite" sync, the new Airbyte tables have CHANGE_TRACKING disabled, which breaks the Dynamic Tables.

Snowflake is pushing the Dynamic Table feature heavily, so I suspect this will start to impact more and more users.

It looks like it could be a one-line update to the SQL statement that creates tables in the connector, if this does not need to be configurable on the destination.
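
For illustration, here is a minimal sketch of what that one-line change might look like; the table and columns below are placeholders, not the connector's actual DDL:

CREATE TABLE IF NOT EXISTS my_schema.my_stream (
  _airbyte_raw_id VARCHAR,
  _airbyte_data VARIANT
)
CHANGE_TRACKING = TRUE;  -- hypothetical addition; Snowflake defaults this to FALSE when omitted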

Relevant log output

No response


marcosmarxm commented 9 months ago

It looks like the default method for creating tables in Snowflake uses the FALSE flag for CHANGE_TRACKING, and the solution is to run ALTER TABLE t1 SET CHANGE_TRACKING = TRUE; (see the example on the Snowflake documentation page). I'm not familiar with this feature or the impact it could have on the current connector implementation. I asked the destination team to take a look. Thanks for opening the discussion to make the connector better, @dtiesling 🥇
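
For reference, a hedged sketch of that ALTER plus a way to verify the flag afterwards; the identifier mydb.myschema.users is a placeholder:

ALTER TABLE mydb.myschema.users SET CHANGE_TRACKING = TRUE;
-- SHOW TABLES output includes a change_tracking column (ON/OFF)
SHOW TABLES LIKE 'users' IN SCHEMA mydb.myschema;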

evantahler commented 9 months ago

As a note, the reason this happens is that a full refresh sync is accomplished by creating a new table each sync, so that while we are loading the new table, the old table and its data are still usable. The last step of the sync is a begin; drop table old; rename table new to old; commit sequence (sketched below).
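
In pseudocode, and with placeholder table names, that sequence looks roughly like this; because the replacement table is created fresh, it picks up the default CHANGE_TRACKING = FALSE, which is what breaks the dynamic table:

CREATE TABLE users_new (id VARCHAR, data VARIANT);  -- new table defaults to CHANGE_TRACKING = FALSE
-- ... load all records into users_new ...
BEGIN;
DROP TABLE users;
ALTER TABLE users_new RENAME TO users;  -- the renamed table keeps CHANGE_TRACKING = FALSE
COMMIT;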

evantahler commented 9 months ago

@dtiesling - do you happen to have any insight into how much more Snowflake will charge to enable change tracking? It looks like under the hood this would add two 'virtual' columns to the table, so there would be increased storage cost... but what about the compute cost to calculate this data on write?

Furthermore, with the underlying tables actually being dropped under the hood, would change tracking still work?

dtiesling commented 9 months ago

@evantahler Change tracking will work, since dynamic tables just require that it's enabled. If there is no history, they will just do a full write.

That being said, it would be good to confirm that the dynamic table still works after the drop/rename. I'm not 100% confident it won't throw a different error after CHANGE_TRACKING is flipped back on. I can do some testing next week. If you'd like to do any testing on your side, here's a simple set of commands to create a dynamic table and test the refresh.

-- TARGET_LAG and WAREHOUSE are required when creating a dynamic table; the values here are placeholders.
CREATE OR REPLACE DYNAMIC TABLE dynamic_table_test
  TARGET_LAG = '1 hour'
  WAREHOUSE = my_wh
  AS SELECT * FROM <airbyte_table>;
ALTER DYNAMIC TABLE dynamic_table_test REFRESH;

Re: pricing - change tracking is a cloud services layer feature, so there should be no metered cost associated with it. There would only be a marginal increase in storage cost for the data.

dtiesling commented 3 weeks ago

Sorry to have let this die. We did confirm that the dynamic tables continue to work when the table is dropped and remade. Our current workaround is a cron'd Snowflake job that enables change tracking on all of the Airbyte tables (sketched below). Is there any appetite for adding an option to Airbyte to enable change tracking, or for just setting it to true on all of the tables?
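
For anyone else who needs the workaround in the meantime, here is a hedged sketch of such a scheduled job as a Snowflake task running a Snowflake Scripting block; the warehouse, database, and schema names are placeholders, not our actual setup:

CREATE OR REPLACE TASK enable_change_tracking
  WAREHOUSE = my_wh
  SCHEDULE = 'USING CRON 0 * * * * UTC'  -- hourly
AS
DECLARE
  c CURSOR FOR
    SELECT table_name
    FROM airbyte_db.information_schema.tables
    WHERE table_schema = 'AIRBYTE_SCHEMA'
      AND table_type = 'BASE TABLE';
BEGIN
  FOR rec IN c DO
    -- Re-enable change tracking on each Airbyte-managed table
    EXECUTE IMMEDIATE 'ALTER TABLE airbyte_db.airbyte_schema."' || rec.table_name || '" SET CHANGE_TRACKING = TRUE';
  END FOR;
END;

-- Tasks are created suspended, so resume it once:
ALTER TASK enable_change_tracking RESUME;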

csechrist commented 2 weeks ago

Hey all! I just opened a quick PR to add this as an option within Airbyte. This is something our team really needed as well, and as was said above, I think Snowflake is going to push this feature even more!