databricks / dbt-databricks

A dbt adapter for Databricks.

Extend Merge Capabilities #739

Closed mi-volodin closed 3 months ago

mi-volodin commented 4 months ago

Resolves #245
Resolves #645
Resolves #707

Motivation

The MERGE INTO capabilities currently available in dbt (data build tool) fall significantly short of what can be achieved through the pure SQL interface. While it is sometimes possible to circumvent these limitations with complex SQL techniques, doing so increases code complexity. Additionally, because Spark does not execute a workaround plan identically to a direct merge, such workarounds can also lead to performance issues.
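
For illustration (the table and column names below are invented), Databricks SQL itself accepts merges with multiple, conditional WHEN clauses, including WHEN NOT MATCHED BY SOURCE; this is the kind of statement this PR aims to make configurable:

    merge into target as tgt
    using source as src
    on src.id = tgt.id
    when matched and src.is_deleted then delete   -- conditional matched step
    when matched then update set *                -- catch-all matched step
    when not matched then insert *
    when not matched by source then delete        -- prune rows absent from source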

Description

This PR aims to add the following capabilities:

DOD checklist

Tests

Open questions

Docs

Checklist

mi-volodin commented 3 months ago

ā— My local tests are failing mostly on the snapshot steps. And the reason for it concealed in dbt base tests, which I found buggy.

It works like this:

  1. On the first run we generate the snapshot from scratch, populating the dbt_ fields, including dbt_unique_key.

  2. On the next run, when we update the snapshot, the dbt-core snapshot logic (see dbt/include/global_project/macros/materializations/snapshots/helpers.sql) executes the following:

    snapshotted_data as (
    
        select *,
            {{ strategy.unique_key }} as dbt_unique_key
    
        from {{ target_relation }}
        where dbt_valid_to is null
    
    ),
  3. If we look at the test definitions of dbt, the unique_key field is the id of a seed (l. 123 of dbt/tests/adapter/basic/files.py). Therefore, the following SQL is generated during tests:

    snapshotted_data as (
    
        select *,
            id as dbt_unique_key
    
        from [EXISTING_SNAPSHOT]
        where dbt_valid_to is null
    
    ),

This cannot be resolved, because snapshotted_data now has a duplicated dbt_unique_key attribute, and when that attribute is referenced further down, the query fails.
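
For illustration, here is a minimal standalone reproduction of the failure mode; the table name is invented, and the failure is the ambiguous-reference error Spark raises in such cases:

    -- The first run persisted dbt_unique_key into the snapshot table via
    -- "select *" over a CTE that added the column, so the second run's CTE
    -- produces two columns with the same name.
    with snapshotted_data as (
        select *,                 -- already contains dbt_unique_key
            id as dbt_unique_key  -- adds a second dbt_unique_key
        from existing_snapshot
        where dbt_valid_to is null
    )
    select dbt_unique_key         -- fails with an AMBIGUOUS_REFERENCE error
    from snapshotted_data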

benc-db commented 3 months ago

ā— My local tests are failing mostly on the snapshot steps. And the reason for it concealed in dbt base tests, which I found buggy.

It works like this:

  1. First run we generate the snapshot from scratch, populating the dbt_ fields. Also dbt_unique_id.
  2. Next run when we update snapshot dbt core snapshot logic (see dbt/include/global_project/macros/materializations/snapshots/helpers.sql) executes the following

    snapshotted_data as (
    
       select *,
           {{ strategy.unique_key }} as dbt_unique_key
    
       from {{ target_relation }}
       where dbt_valid_to is null
    
    ),
  3. If we look in the tests definition of dbt, the unique_key field will be id of a seed. (l.123 dbt/tests/adapter/basic/files.py). And therefore the following sql is generated during tests

    snapshotted_data as (
    
       select *,
           id as dbt_unique_key
    
       from [EXISTING_SNAPSHOT]
       where dbt_valid_to is null
    
    ),

It cannot be resolved, because snapshotted_data now has duplicated dbt_unique_key attribute, and when it is referenced below - query fails.

Can we clarify: is the test buggy, or our implementation? Is this something we should raise with dbt Labs?

benc-db commented 3 months ago

@mi-volodin if it's not too painful, can you rebase against the 1.9.latest branch next week? This feature (and a couple of others I've worked on) is big enough to increment the feature version. I'll get 1.9.latest updated today to include the latest changes from main.

mi-volodin commented 3 months ago

@benc-db sure thing. TBH I have never done such an operation before, so I am curious how painful it will be 😄

Let me know when I can start. I see neither a tag nor a branch with 1.9... Maybe I am looking in the wrong place. Anyway, I'd appreciate it if you ping me once the time comes.
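
(For reference, the usual rebase flow is roughly the following sketch; it assumes the fork is the origin remote, databricks/dbt-databricks is configured as upstream, and my-feature-branch stands in for the PR branch:)

    # fetch upstream branches, then replay the feature branch onto 1.9.latest
    git fetch upstream
    git rebase upstream/1.9.latest my-feature-branch
    # rebasing rewrites history, so the PR branch must be force-pushed
    git push --force-with-lease origin my-feature-branch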

mi-volodin commented 3 months ago

Can we clarify: is the test buggy, or our implementation? Is this something we should raise with dbt Labs?

For now it looks purely like a dbt-side bug, which is suspicious... In dbt-databricks, both the snapshot code and the tests are taken from dbt-core. So let me finish the development and then I will finalise the investigation.

I planned to raise it in dbt-core; I wrote this note here just to explain that it is not related to my changes.

mi-volodin commented 3 months ago

@benc-db rebased

mi-volodin commented 3 months ago

@benc-db sorry, I rebased but haven't tested yet. Let me fix some issues and finalise the tests before running GH Actions.

benc-db commented 3 months ago

@mi-volodin I just ran the unit tests; these are super lightweight, I just wanted to see where we were at. No worries :)

mi-volodin commented 3 months ago

@benc-db Now I am done with the code and testing. Before switching to the changelog and documentation, there's an open question regarding pre-checking certain conditions at compile time.

For instance, I can encode certain "impossible" cases to be checked at compile time, and throw an exception if a check fails.

I am asking because I am not sure it is really needed. Basically, I will be able to add some simple checks, like requiring at least one matched / not matched / not matched by source step, but definitely not catch everything that can possibly go wrong.

What do you think?

benc-db commented 3 months ago

For instance, I can encode certain "impossible" cases to be checked at compile time, and throw an exception if a check fails. I am asking because I am not sure it is really needed. Basically, I will be able to add some simple checks, like requiring at least one matched / not matched / not matched by source step, but definitely not catch everything that can possibly go wrong.

This is a really good question. In general, we let Databricks tell the user what went wrong; the exception is when it's very easy to misconfigure something, or when the Databricks exception is misleading. So, in those impossible cases, do you think it's easy to read the error and figure out what you need to change to fix it? If not, it would be a good idea to raise compilation errors that tell users what they need to do to configure things correctly.
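
(As a hypothetical sketch of such a guard, a dbt macro can raise a compilation error; the macro name and the shape of the steps config below are invented, not the PR's actual code:)

    {% macro validate_merge_steps(steps) %}
        {#- Hypothetical check: fail compilation unless at least one
            recognised merge step is configured. -#}
        {% set allowed = ['matched', 'not matched', 'not matched by source'] %}
        {% if steps | select('in', allowed) | list | length == 0 %}
            {{ exceptions.raise_compiler_error(
                'merge requires at least one of: ' ~ allowed | join(', ')
            ) }}
        {% endif %}
    {% endmacro %}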

mi-volodin commented 3 months ago

Yes, in my cases a syntax error will be raised, and Databricks will point that out. I think I will skip these exceptions for now, then.

I have one more stupid question: I cannot find any trace of documentation in that repo. Is it tracked separately?

benc-db commented 3 months ago

I have one more stupid question: I cannot find any trace of documentation in that repo. Is it tracked separately?

Yes, I submit a PR to dbt for that part. If you include the doc locally, I'll make the PR with 1.9 to get the doc updated on the site.

mi-volodin commented 3 months ago

@benc-db By including locally, do you mean writing an MD file with explanations and putting it in the ./docs folder?

I also can create a PR for updating the documentation in dbt-core repo, no problem. Just wanted to know what would be the most convenient way.

benc-db commented 3 months ago

@benc-db By including locally, do you mean writing an MD file with explanations and putting it in the ./docs folder?

Yes, exactly. I will already need to submit a doc update for 1.9 anyway, so I'll fold in the doc you put in the docs folder.

mi-volodin commented 3 months ago

@benc-db let me know if I can do anything else to support the testing process.

Meanwhile I can investigate and report the snapshot test issue.

benc-db commented 3 months ago

Running functional tests today, then will start reviewing. Thanks for your efforts!

mi-volodin commented 3 months ago

@benc-db I see some tests are failing due to hardcoded "DBTINTERNAL*" aliases. I will take care of it. Apologies that I forgot to look into it beforehand... I had a thought about it, but something distracted me.

mi-volodin commented 3 months ago

Regarding the OnSchemaChange test: this is the (correctly) generated merge statement (model_a has fields up to field4):

merge into
    `hive_metastore`.`test17229350022373340845_test_incremental_on_schema_change`.`incremental_ignore` as tgt
using
    `incremental_ignore__dbt_tmp` as src -- select id, field1, field2, field3, field4 from model_a
on
    src.id <=> tgt.id
when matched
    then update set
        * -- this is supposed to ignore field3 and field4, but it is never executed
when not matched
    then insert
        * -- this adds field3, field4 and evolves the schema
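
(Side note, as a sketch: this behaviour matches Delta Lake's automatic schema evolution for merge. Assuming the environment enables that by default, disabling the standard conf for the session should stop insert * from widening the target schema:)

    -- with automatic schema evolution for merge disabled,
    -- "insert *" can no longer add field3/field4 to the target
    set spark.databricks.delta.schema.autoMerge.enabled = false
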
mi-volodin commented 3 months ago

@benc-db I fixed the unit tests. Among the failing tests I looked through, the majority is my snapshot issue. I would be interested to know if it also fails on your side.

There are two more tests that are failing. The first one, as described above, is supposed to test the "ignore" on-schema-change behaviour. But I think it doesn't work by design if schema evolution is enabled by default in the environment: in that case the merge operation executes only inserts, and those inserts extend the schema to include field3 and field4.

The later diff test generates code to compare against a "target" that has only field1 and field2, and it expectedly fails while querying for field3 and field4. This is how the result for the calculated table looks:

[screenshot of the calculated table omitted]

If you confirm that it should be fixed, I can fix it within this PR.

The last failing test is test_wpm, which I think shouldn't be included in my test set at all (I don't have warehouses and UC). I can also fix (exclude) it here.

benc-db commented 3 months ago

Apologies for the delay, I'm on call and haven't had much time to review PRs. Will get to this shortly.

benc-db commented 3 months ago

Rerunning functional tests

benc-db commented 3 months ago

All functional tests pass. Just need to find time to actually review the code :P. Thanks so much!

benc-db commented 3 months ago

@mi-volodin thank you for your hard work. I will be going on vacation this week, and will start working on the 1.9 release when I return in September. Just wanted to let you know.

mi-volodin commented 3 months ago

@benc-db thanks! And I have just returned from mine 😄 Very happy to see it merged, and I still plan to look into Python models in tests. Also, please loop me in if any issues are uncovered. Have a nice vacation!

benc-db commented 3 weeks ago

@mi-volodin I'm going to be opening a new PR to make the default match the prior aliases. In beta testing, some users have macro overrides that assume the previous naming, leading to failures.