duckdb / dbt-duckdb

dbt (http://getdbt.com) adapter for DuckDB (http://duckdb.org)
Apache License 2.0

Add workaround for temporary tables in remote database when running incremental model #326

Status: Closed (closed by guenp 5 months ago)

guenp commented 5 months ago

A few months ago, an issue was raised in the MotherDuck community Slack: using CREATE TEMPORARY TABLE in incremental models was causing slowdowns in dbt runs.

Summary of the underlying issue:

This PR contains a quick suggestion for a workaround until we support temporary tables in MotherDuck. It creates a "mock" temporary table in the database and deletes it afterwards. Some thoughts for @jwills and @nicku33:
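The workaround described above could be sketched roughly as follows. This is an illustrative Python sketch of the pattern only (the function name and signature are hypothetical, not the PR's actual code): instead of issuing CREATE TEMPORARY TABLE, which the remote database does not support, generate a CREATE for a regular "mock" temp table and a matching DROP to run after the materialization finishes.

```python
def mock_temp_table_sql(table_name: str, select_sql: str) -> tuple:
    """Sketch of the workaround: since CREATE TEMPORARY TABLE is not
    supported against the remote database, create a regular ("mock" temp)
    table instead, and drop it once the incremental run is done.
    Hypothetical helper for illustration, not the PR's actual code."""
    create = f"CREATE TABLE {table_name} AS ({select_sql})"
    drop = f"DROP TABLE IF EXISTS {table_name}"
    return create, drop

# Example: build the statements for a staging table used by an incremental run
create_stmt, drop_stmt = mock_temp_table_sql(
    "my_model__dbt_tmp", "SELECT * FROM my_source"
)
```

The key property is that the DROP is always emitted alongside the CREATE, so the mock table never outlives the model run.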

jwills commented 5 months ago

The approach here makes sense to me-- how are we testing this out?

guenp commented 5 months ago

> The approach here makes sense to me-- how are we testing this out?

Well, I'm working on adding a unit test for this, based on the manual test I ran. I noticed the test files test_motherduck.py and test_incremental.py exist but are still empty. I think it's out of scope for this PR to fully implement test_incremental.py for all backends, so I can add the test to test_motherduck.py for now. Does that sound good?

Also, I saw that the md tox environment is currently not tested in CI/CD (for obvious reasons), so I'll just aim for a local pass if that works for you.

jwills commented 5 months ago

That sounds great— and yes, I trust that you will not break your own product here. 😉


guenp commented 5 months ago

Thanks @jwills for the review! I'll address your comments shortly. I also wanted to note that I talked to @nicku33 about this:

> The mock temp table is created in the same database/schema as the target table(s). We might want to change that?

We decided to go with a separate schema dbt_temp for the temp_relations used by incremental models. This way users won't accidentally query them in BI tools.

What are your thoughts?

jwills commented 5 months ago

> Thanks @jwills for the review! I'll address your comments shortly. I also wanted to note that I talked to @nicku33 about this:
>
> > The mock temp table is created in the same database/schema as the target table(s). We might want to change that?
>
> We decided to go with a separate schema dbt_temp for the temp_relations used by incremental models. This way users won't accidentally query them in BI tools.
>
> What are your thoughts?

I am good with that!

jwills commented 5 months ago

@guenp apologies to pile this on, but would you mind changing this code to also use your new is_motherduck function? Just realized I forgot I had introduced some duplication of this already! https://github.com/duckdb/dbt-duckdb/blob/master/dbt/adapters/duckdb/environments/local.py#L53
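For context, a minimal sketch of what such a helper might look like, assuming MotherDuck databases are detected by their connection-path prefix (a common convention for MotherDuck paths; the actual implementation in the repo may differ):

```python
def is_motherduck(path: str) -> bool:
    # MotherDuck databases are typically addressed with an "md:" or
    # "motherduck:" path prefix rather than a local file path.
    # Sketch for illustration; check the adapter source for the real helper.
    return path.startswith(("md:", "motherduck:"))
```

Centralizing this check in one function is exactly the deduplication being asked for here: any code path that needs MotherDuck-specific behavior calls the helper instead of repeating the prefix check.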

guenp commented 5 months ago

> @guenp apologies to pile this on, but would you mind changing this code to also use your new is_motherduck function? Just realized I forgot I had introduced some duplication of this already! https://github.com/duckdb/dbt-duckdb/blob/master/dbt/adapters/duckdb/environments/local.py#L53

Ah yes, I totally forgot about that, thank you for the reminder! It's done.

I've now pushed all my changes. Besides addressing your comments, here's the main thing I've added since you last reviewed:

The behavior now looks something like this:

(1) While running the unit test:

(screenshot)

(2) After the unit test is finished and all test-related and "temp" tables are cleaned up:

(screenshot)

@nicku33 and I decided against dropping dbt_temp in the end, because it might interfere with concurrent processes. We also discussed making the name dbt_temp configurable. I was thinking of implementing this via the model config, but do you know if there's a more global way to configure it?

jwills commented 5 months ago

The model config is the right place for setting/overriding the temp schema name, and that setting can be applied globally to all models if need be by specifying it in e.g. the dbt_project.yml file; see https://docs.getdbt.com/reference/model-configs#model-specific-configurations
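For illustration, a global override in dbt_project.yml might look like the following. The config key name `temp_schema_name` and the project name are assumptions here, not confirmed by this thread; check the adapter's docs for the exact key:

```yaml
# dbt_project.yml -- hypothetical key and project names, for illustration only
models:
  my_project:
    +temp_schema_name: dbt_temp
```

With dbt's `+` config syntax, the setting cascades to every model under that project key, while individual models can still override it in their own config block.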

guenp commented 5 months ago

Oops, I didn't actually run the unit tests with the default profile. 😁 It looks like it failed because of the post_model_hook! I need to drop for the day but will push a fix & also implement the temp schema name config.

guenp commented 5 months ago

All checks are green! I also added tests/functional/plugins/test_motherduck.py to the MotherDuck tests. I didn't realize your CI/CD ran against a live version of the service, so this will help make sure I don't break our product ;)

jwills commented 5 months ago

Thank you very much @guenp!

guenp commented 5 months ago

Yay! 🎉 Could we hold off on pushing a release until @nicku33 has had the chance to test it on a real-life workload? We just want to make sure this workaround is good on our end as well. Thanks!

jwills commented 5 months ago

of course, will wait for you all to sign off on it first

guenp commented 4 months ago

We had the chance to do a full end-to-end test with "real life" data, so this is signed off on our end! FYI, here's what we tested:

1) Using a source from an internal share, make a reasonably sized agg model, something on the order of 1M rows for incremental. Run it once to get a datetime agg for, say, 14 days, then modify it so that incremental kicks in for a good-size row set and has to update rows.
2) Same, but for missing rows: modify the target table so that upon incremental refresh, new rows are filled in.
3) Same, but for deleting rows: modify the source (maybe a copy of a bunch of rows) so that the update produces the correct gaps.
4) Test the above for the two different incremental styles: delete + insert and append.
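As a rough illustration of the difference between the two styles in item 4, here is a simplified Python sketch of the statements each strategy issues against the target table (the adapter's real incremental macros are more involved; names here are hypothetical):

```python
def incremental_statements(strategy, target, staging, unique_key="id"):
    """Simplified sketch of the SQL issued by the two incremental styles
    tested above. Illustrative only; not the adapter's actual macros."""
    if strategy == "delete+insert":
        # Replace rows whose keys appear in the new batch, then append the batch.
        return [
            f"DELETE FROM {target} WHERE {unique_key} IN "
            f"(SELECT {unique_key} FROM {staging})",
            f"INSERT INTO {target} SELECT * FROM {staging}",
        ]
    if strategy == "append":
        # Simply append the new batch; existing rows are never touched.
        return [f"INSERT INTO {target} SELECT * FROM {staging}"]
    raise ValueError(f"unknown incremental strategy: {strategy}")
```

This is why the test plan exercises updates, backfills, and deletions separately: delete + insert must produce correct gaps and overwrites, while append only has to get new rows right.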