Closed guenp closed 5 months ago
The approach here makes sense to me-- how are we testing this out?
> The approach here makes sense to me-- how are we testing this out?
Well, I'm working on adding a unit test for this, based on the manual test I ran. I noticed there are test files `test_motherduck.py` and `test_incremental.py`, but they're still empty. I think it's out of scope for this PR to implement `test_incremental.py` fully for all backends. I can add it to `test_motherduck.py` for now. Does that sound good?

Also, I saw that the `md` tox environment is currently not tested in CI/CD (for obvious reasons), so I'll just aim for a local pass if that works for you.
That sounds great -- and yes, I trust that you will not break your own product here. 😉
Thanks @jwills for the review! I'll address your comments shortly. I also wanted to note that I talked to @nicku33 about this:
> The mock temp table is created in the same database/schema as the target table(s). We might want to change that?

We decided to go with a separate schema `dbt_temp` for the `temp_relation`s used by incremental models. This way users won't accidentally query them in BI tools.
What are your thoughts?
I am good with that!
@guenp apologies to pile this on, but would you mind changing this code to also use your new `is_motherduck` function? Just realized I forgot I had introduced some duplication of this already! https://github.com/duckdb/dbt-duckdb/blob/master/dbt/adapters/duckdb/environments/local.py#L53
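For readers following along, a helper like this is plausibly just a prefix check on the connection path. This is a hedged sketch, not the actual dbt-duckdb implementation; the assumption here is that MotherDuck paths use the `md:` or `motherduck:` prefix:

```python
def is_motherduck(path: str) -> bool:
    """Return True if a DuckDB database path points at MotherDuck.

    Sketch only: assumes MotherDuck connection strings begin with the
    "md:" or "motherduck:" prefix.
    """
    return path.startswith(("md:", "motherduck:"))
```

Centralizing this in one function is exactly the deduplication being asked for: both the environment setup and the adapter can then share a single definition of "is this a MotherDuck connection".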
Ah yes, I totally forgot about that, thank you for the reminder! It's done.
I've now pushed all my changes. Mainly what I've added since you last reviewed, in addition to addressing your comments:

- `DuckDBAdapter.get_temp_relation_path(model)`. This method is used in the `incremental.sql` macro and will create a new unique name for a `temp_relation` that is to be dropped at the end of the macro, with identifier `dbt_temp.<schema_name>__<target_relation_name>`.
- `DuckDBAdapter.post_model_hook(config, context)`. This method fixes a memory leak issue that arises if the incremental macro doesn't run successfully (e.g. when it runs into a `CompilationError`, as it does in this test) and never reaches the lines where it deletes the `temp_relation`. The method makes sure to drop the `temp_relation` if it still exists after the model failed (see the `try-except` block that intercepts this here).
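For intuition, here is a stripped-down sketch of what these two pieces might look like. The names `temp_schema` and `connection`, and the dict-shaped `model`, are illustrative stand-ins; the real adapter methods work with dbt's config/context and relation objects:

```python
def get_temp_relation_path(model: dict, temp_schema: str = "dbt_temp") -> str:
    """Build a unique temp-relation identifier of the form
    dbt_temp.<schema_name>__<target_relation_name>."""
    return f'{temp_schema}.{model["schema"]}__{model["name"]}'

def post_model_hook(connection, model: dict) -> None:
    """After a model runs (successfully or not), drop the temp relation
    if a failed incremental run left it behind."""
    path = get_temp_relation_path(model)
    connection.execute(f"DROP TABLE IF EXISTS {path}")
```

The key design point is that the cleanup lives in a hook that runs even when the macro itself errors out, so a `CompilationError` mid-macro can no longer leak the temp table.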
The behavior now looks something like this:

1. While running the unit test: (screenshot)
2. After the unit test is finished and all test-related and "temp" tables are cleaned up: (screenshot)
@nicku33 and I decided against dropping `dbt_temp` in the end because it might interfere with concurrent processes.

We also discussed making the name `dbt_temp` configurable. I was thinking to implement this via the model `config`, but do you know if there may be a more global way to configure this?
The model `config` is the right place for setting/overriding the temp schema name, and that setting can be applied globally for all models if need be by specifying it in e.g. the `dbt_project.yml` file; see https://docs.getdbt.com/reference/model-configs#model-specific-configurations
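To make this concrete, a project-wide override in `dbt_project.yml` might look like the following. The `+temp_schema_name` key is hypothetical here, standing in for whatever name this config ends up with; `my_project` is a placeholder project name:

```yaml
# dbt_project.yml -- hedged sketch of a global model config.
# Configs set at this level cascade to all models in the project
# and can be overridden per-model or per-folder below it.
models:
  my_project:
    +temp_schema_name: dbt_temp
```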
Oops, I didn't actually run the unit tests with the default profile. 😁
It looks like it failed because of the `post_model_hook`! I need to drop for the day but will push a fix & also implement the temp schema name config.
All checks are green! I also added `tests/functional/plugins/test_motherduck.py` to the MotherDuck tests. I didn't realize your CI/CD ran against a live version of the service, so this will help make sure I don't break our product ;)
Thank you very much @guenp!
Yay! 🎉 Could we hold off on pushing a release until @nicku33 has had the chance to test it on a real-life workload? We just want to make sure this workaround is good on our end as well. Thanks!
Of course, will wait for you all to sign off on it first.
We had the chance to do a full end-to-end test with "real life" data, so this is signed off on our end!
FYI here's what we tested:
1) Using a source from an internal share, make a reasonably sized agg model, something on the order of 1M rows for incremental. Run it once to get a datetime agg for, say, 14 days, then modify it so that incremental kicks in for a good-sized row set and has to update rows.
2) Same, but for missing rows: modify the target table so that upon incremental refresh, new rows are filled in.
3) Same, but for deleting rows: modify the source (maybe a copy of a bunch of rows) so that the update produces the correct gaps.
4) Test the above for the two different incremental styles: `delete+insert` and `append`.
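For readers less familiar with dbt, the models exercised above would look roughly like this sketch. The model name, source, and column names are made up for illustration; only the `config` block shape and `is_incremental()` pattern follow standard dbt conventions:

```sql
-- models/daily_agg.sql -- hypothetical incremental aggregation model
{{ config(
    materialized='incremental',
    incremental_strategy='delete+insert',
    unique_key='event_date'
) }}

select event_date, count(*) as n_events
from {{ source('internal_share', 'events') }}
{% if is_incremental() %}
  -- on incremental runs, only reprocess days at or after the latest
  -- day already present in the target table
  where event_date >= (select max(event_date) from {{ this }})
{% endif %}
group by 1
```

Swapping `incremental_strategy` between `delete+insert` and `append` covers the two styles in step 4.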
A few months ago an issue was raised in the MotherDuck community Slack that `CREATE TEMPORARY TABLE` when using incremental models was causing slowdowns on `dbt`.

Summary of the underlying issue:

- In a `duckdb` instance, all temporary tables are created locally
- `dbt` incremental models create a temporary table for appending tuples to an existing table
- When connected to a remote database (e.g. `"md:my_database"`), a new table is created locally, tuples are transported from the server into the new local table, and then sent back to the server when the final `INSERT INTO` is executed. This adds a nontrivial round trip time causing said slowdown.

This PR contains a quick suggestion for a workaround until we support temporary tables in MotherDuck. It creates a "mock" temporary table in the database and deletes it afterwards. Some thoughts for @jwills and @nicku33:

- [...] `temp`.
- [...] `dbt` to run successfully after a mock temp table is created
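At the SQL level, the workaround described above amounts to something like the following. The database, schema, and table names are illustrative, not the exact statements the adapter emits:

```sql
-- Instead of CREATE TEMPORARY TABLE (which would materialize the temp
-- table on the local client when connected to MotherDuck), create a
-- regular "mock" temp table server-side, under the dedicated schema,
-- and drop it once the incremental macro is done.
create table my_db.dbt_temp.my_schema__my_model as
    select * from my_db.my_schema.new_source_rows;  -- compiled model SQL

-- The merge/insert now happens entirely on the server, with no
-- round trip of tuples through the local machine.
insert into my_db.my_schema.my_model
    select * from my_db.dbt_temp.my_schema__my_model;

drop table if exists my_db.dbt_temp.my_schema__my_model;
```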