databricks / dbt-databricks

A dbt adapter for Databricks.

Extend Merge Capabilities #739

Closed mi-volodin closed 3 months ago

mi-volodin commented 4 months ago

Resolves #245
Resolves #645
Resolves #707

Motivation

The MERGE INTO capabilities currently available in dbt (data build tool) fall significantly short of what can be achieved through the pure SQL interface. While it is sometimes possible to circumvent these limitations with complex SQL techniques, doing so increases code complexity. Additionally, because Spark does not execute a workaround plan identically to a direct merge, such workarounds can also lead to performance issues.
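
For illustration (the table and column names below are invented), Databricks SQL itself accepts merges with multiple, conditional WHEN clauses, including WHEN NOT MATCHED BY SOURCE; this is the kind of statement this PR aims to make configurable:

    merge into target as tgt
    using source as src
    on src.id = tgt.id
    when matched and src.is_deleted then delete   -- conditional matched step
    when matched then update set *                -- catch-all matched step
    when not matched then insert *
    when not matched by source then delete        -- prune rows absent from source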

Description

This PR aims to add the following capabilities:

DOD checklist

Tests

Open questions

Docs

Checklist

mi-volodin commented 3 months ago

ā— My local tests are failing mostly on the snapshot steps. And the reason for it concealed in dbt base tests, which I found buggy.

It works like this:

  1. On the first run we generate the snapshot from scratch, populating the dbt_ fields, including dbt_unique_key.

  2. On the next run, when we update the snapshot, the dbt-core snapshot logic (see dbt/include/global_project/macros/materializations/snapshots/helpers.sql) executes the following:

    snapshotted_data as (
    
        select *,
            {{ strategy.unique_key }} as dbt_unique_key
    
        from {{ target_relation }}
        where dbt_valid_to is null
    
    ),
  3. If we look at the test definitions of dbt, the unique_key field is the id of a seed (l. 123 of dbt/tests/adapter/basic/files.py). Therefore, the following SQL is generated during tests:

    snapshotted_data as (
    
        select *,
            id as dbt_unique_key
    
        from [EXISTING_SNAPSHOT]
        where dbt_valid_to is null
    
    ),

This cannot be resolved, because snapshotted_data now has a duplicated dbt_unique_key attribute, and when that attribute is referenced further down, the query fails.
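
For illustration, here is a minimal standalone reproduction of the failure mode; the table name is invented, and the failure is the ambiguous-reference error Spark raises in such cases:

    -- The first run persisted dbt_unique_key into the snapshot table via
    -- "select *" over a CTE that added the column, so the second run's CTE
    -- produces two columns with the same name.
    with snapshotted_data as (
        select *,                 -- already contains dbt_unique_key
            id as dbt_unique_key  -- adds a second dbt_unique_key
        from existing_snapshot
        where dbt_valid_to is null
    )
    select dbt_unique_key         -- fails with an AMBIGUOUS_REFERENCE error
    from snapshotted_data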

benc-db commented 3 months ago

ā— My local tests are failing mostly on the snapshot steps. And the reason for it concealed in dbt base tests, which I found buggy.

It works like this:

  1. First run we generate the snapshot from scratch, populating the dbt_ fields. Also dbt_unique_id.
  2. Next run when we update snapshot dbt core snapshot logic (see dbt/include/global_project/macros/materializations/snapshots/helpers.sql) executes the following

    snapshotted_data as (
    
       select *,
           {{ strategy.unique_key }} as dbt_unique_key
    
       from {{ target_relation }}
       where dbt_valid_to is null
    
    ),
  3. If we look in the tests definition of dbt, the unique_key field will be id of a seed. (l.123 dbt/tests/adapter/basic/files.py). And therefore the following sql is generated during tests

    snapshotted_data as (
    
       select *,
           id as dbt_unique_key
    
       from [EXISTING_SNAPSHOT]
       where dbt_valid_to is null
    
    ),

It cannot be resolved, because snapshotted_data now has duplicated dbt_unique_key attribute, and when it is referenced below - query fails.

Can we clarify: is the test buggy, or our implementation? Is this something we should raise with dbt Labs?

benc-db commented 3 months ago

@mi-volodin if it's not too painful, can you rebase against the 1.9.latest branch next week? This feature (and a couple of others I've worked on) is big enough to increment the feature version. I'll get 1.9.latest updated today to include the latest changes from main.

mi-volodin commented 3 months ago

@benc-db sure thing. TBH I have never done such an operation before, so I am curious how painful it will be 😄

Let me know when I can start. I see neither a tag nor a branch with 1.9... Maybe I am looking in the wrong place. Anyway, I'd appreciate it if you ping me once the time comes.
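
(For reference, the usual rebase flow is roughly the following sketch; it assumes the fork is the origin remote, databricks/dbt-databricks is configured as upstream, and my-feature-branch stands in for the PR branch:)

    # fetch upstream branches, then replay the feature branch onto 1.9.latest
    git fetch upstream
    git rebase upstream/1.9.latest my-feature-branch
    # rebasing rewrites history, so the PR branch must be force-pushed
    git push --force-with-lease origin my-feature-branch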

mi-volodin commented 3 months ago

Can we clarify: is the test buggy, or our implementation? Is this something we should raise with dbt Labs?

For now it looks purely like a dbt-side bug, which is suspicious... In dbt-databricks, both the snapshot code and the tests are taken from dbt-core. So let me finish the development and then I will finalise the investigation.

I planned to raise it in dbt-core; I wrote this note here just to explain that it is not related to my changes.

mi-volodin commented 3 months ago

@benc-db rebased

mi-volodin commented 3 months ago

@benc-db sorry, I rebased but haven't tested yet. Let me fix some issues and finalise the tests before running GH Actions.

benc-db commented 3 months ago

@mi-volodin I just ran the unit tests; these are super lightweight, I just wanted to see where we were at. No worries :)

mi-volodin commented 3 months ago

@benc-db Now I am done with the code and testing. Before switching to the changelog and documentation, there's an open question regarding pre-checking certain conditions at compile time.

For instance, I can encode certain "impossible" cases to be checked at compile time, and throw an exception if a check fails.

I am asking because I am not sure it is really needed. Basically, I will be able to add some simple checks, like requiring at least one matched / not matched / not matched by source step, but definitely not catch everything that can possibly go wrong.

What do you think?

benc-db commented 3 months ago

For instance, I can encode certain "impossible" cases to be checked at compile time, and throw an exception if a check fails. I am asking because I am not sure it is really needed. Basically, I will be able to add some simple checks, like requiring at least one matched / not matched / not matched by source step, but definitely not catch everything that can possibly go wrong.

This is a really good question. In general, we let Databricks tell the user what went wrong; the exception is when it's very easy to misconfigure something, or when the Databricks exception is misleading. So, in those impossible cases, do you think it's easy to read the error and figure out what you need to change to fix it? If not, it would be a good idea to raise compilation errors that tell users what they need to do to configure things correctly.
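
(As a hypothetical sketch of such a guard, a dbt macro can raise a compilation error; the macro name and the shape of the steps config below are invented, not the PR's actual code:)

    {% macro validate_merge_steps(steps) %}
        {#- Hypothetical check: fail compilation unless at least one
            recognised merge step is configured. -#}
        {% set allowed = ['matched', 'not matched', 'not matched by source'] %}
        {% if steps | select('in', allowed) | list | length == 0 %}
            {{ exceptions.raise_compiler_error(
                'merge requires at least one of: ' ~ allowed | join(', ')
            ) }}
        {% endif %}
    {% endmacro %}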

mi-volodin commented 3 months ago

Yes, in my cases a syntax error will be raised, and Databricks will point that out. I think I will skip these exceptions for now, then.

I have one more stupid question: I cannot find any trace of documentation in that repo. Is it tracked separately?

benc-db commented 3 months ago

I have one more stupid question: I cannot find any trace of documentation in that repo. Is it tracked separately?

Yes, I submit a PR to dbt for that part. If you include the doc locally, I'll make the PR with 1.9 to get the doc updated on the site.

mi-volodin commented 3 months ago

@benc-db By including locally, do you mean writing an MD file with explanations and putting it in the ./docs folder?

I also can create a PR for updating the documentation in dbt-core repo, no problem. Just wanted to know what would be the most convenient way.

benc-db commented 3 months ago

@benc-db By including locally, do you mean writing an MD file with explanations and putting it in the ./docs folder?

Yes, exactly. I will already need to submit a doc update for 1.9 anyway, so I'll fold in the doc you put in the docs folder.

mi-volodin commented 3 months ago

@benc-db let me know if I can do anything else to support the testing process.

Meanwhile I can investigate and report the snapshot test issue.

benc-db commented 3 months ago

Running functional tests today, then will start reviewing. Thanks for your efforts!

mi-volodin commented 3 months ago

@benc-db I see some tests are failing due to hardcoded "DBTINTERNAL*" aliases. I will take care of it. Apologies that I forgot to look into it beforehand... I had a thought about it, but something distracted me.

mi-volodin commented 3 months ago

Regarding the OnSchemaChange test: this is the (correctly) generated merge statement (model_a has fields up to field4):

merge into
    `hive_metastore`.`test17229350022373340845_test_incremental_on_schema_change`.`incremental_ignore` as tgt
using
    `incremental_ignore__dbt_tmp` as src -- select id, field1, field2, field3, field4 from model_a
on
    src.id <=> tgt.id
when matched
    then update set
        * -- this is supposed to ignore field3 and field4, but it is never executed
when not matched
    then insert
        * -- this adds field3, field4 and evolves the schema
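
(Side note, as a sketch: this behaviour matches Delta Lake's automatic schema evolution for merge. Assuming the environment enables that by default, disabling the standard conf for the session should stop insert * from widening the target schema:)

    -- with automatic schema evolution for merge disabled,
    -- "insert *" can no longer add field3/field4 to the target
    set spark.databricks.delta.schema.autoMerge.enabled = false
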
mi-volodin commented 3 months ago

@benc-db I fixed the unit tests. Among the failing tests I looked through, the majority is my snapshot issue. I would be interested to know if it also fails on your side.

There are two more tests that are failing. The first one, as described above, is supposed to test the "ignore" on-schema-change behaviour. But I think it doesn't work by design if schema evolution is enabled by default in the environment: in that case the merge operation executes only inserts, and those inserts extend the schema to include field3 and field4.

The later diff test generates code to compare against a "target" that has only field1 and field2, and it expectedly fails while querying for field3 and field4. This is how the result for the calculated table looks:

[screenshot of the calculated table omitted]

If you confirm that it should be fixed, I can fix it within this PR.

The last failing test is test_wpm, which I think shouldn't be included in my test set at all (I don't have warehouses and UC). I can also fix (exclude) it here.

benc-db commented 3 months ago

Apologies for the delay, I'm on call and haven't had much time to review PRs. Will get to this shortly.

benc-db commented 3 months ago

Rerunning functional tests

benc-db commented 3 months ago

All functional tests pass. Just need to find time to actually review the code :P. Thanks so much!

benc-db commented 3 months ago

@mi-volodin thank you for your hard work. I will be going on vacation this week, and will start working on the 1.9 release when I return in September. Just wanted to let you know.

mi-volodin commented 3 months ago

@benc-db thanks! And I have just returned from mine 😄 Very happy to see it merged, and I still plan to look into Python models in tests. Also, please loop me in if any issues are uncovered. Have a nice vacation!

benc-db commented 3 weeks ago

@mi-volodin I'm going to be opening a new PR to make the default match the prior aliases. In beta testing, some users have macro overrides that assume the previous naming, leading to failures.