Closed mi-volodin closed 3 months ago
My local tests are failing mostly on the snapshot steps. And the reason is concealed in dbt's base tests, which I found buggy.
It works like this:
- On the first run we generate the snapshot from scratch, populating the dbt_ fields, including dbt_unique_key.
- On the next run, when we update the snapshot, dbt-core's snapshot logic (see dbt/include/global_project/macros/materializations/snapshots/helpers.sql) executes the following:
snapshotted_data as (
select *,
{{ strategy.unique_key }} as dbt_unique_key
from {{ target_relation }}
where dbt_valid_to is null
),
If we look at dbt's test definitions, the unique_key field will be the id of a seed (l.123 of dbt/tests/adapter/basic/files.py). Therefore, the following SQL is generated during tests:
snapshotted_data as (
select *,
id as dbt_unique_key
from [EXISTING_SNAPSHOT]
where dbt_valid_to is null
),
It cannot be resolved, because snapshotted_data now has a duplicated dbt_unique_key attribute, and when it is referenced below, the query fails.
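The duplicated column is easy to reproduce outside Databricks. Below is a minimal sketch using Python's built-in sqlite3; the table name and columns mirror the case above, but this is purely an illustration, not dbt's actual code:

```python
import sqlite3

# A snapshot table that, after the first run described above, already
# persists a dbt_unique_key column alongside the seed's id column.
con = sqlite3.connect(":memory:")
con.execute(
    "create table existing_snapshot (id int, dbt_unique_key int, dbt_valid_to text)"
)

# The helpers.sql CTE pattern: select * plus the aliased unique key.
cur = con.execute(
    "select *, id as dbt_unique_key from existing_snapshot "
    "where dbt_valid_to is null"
)
names = [d[0] for d in cur.description]
print(names)  # dbt_unique_key appears twice
assert names.count("dbt_unique_key") == 2
```

Any later reference to dbt_unique_key against this result set is ambiguous, which is exactly the failure the tests hit.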
Can we clarify: is the test buggy, or is it our implementation? Is this something we should raise with dbt Labs?
@mi-volodin if it's not too painful, can you rebase against the 1.9.latest branch next week? This feature (and a couple of others I've worked on) is big enough to increment the feature version. I'll get 1.9.latest updated today to include the latest changes from main.
@benc-db sure thing. TBH I've never done such an operation before, so I'm curious how painful it will be.
Let me know when I can start. I see neither a tag nor a branch with 1.9... Maybe I am looking in the wrong place. Anyway, I'd appreciate it if you ping me once the time comes.
Can we clarify: is the test buggy, or is it our implementation? Is this something we should raise with dbt Labs?
For now it looks purely like a dbt-side bug. Which is suspicious... In dbt-databricks both the snapshot code and the tests are taken from dbt-core. So let me finish the development and I will finalise the investigation.
I planned to raise it in dbt-core; I wrote this note here just to explain that it is not related to my changes.
@benc-db rebased
@benc-db sorry, rebased, but not tested yet. Let me fix some issues and finalise the tests before running GH actions.
@mi-volodin just ran the unit tests; these are super lightweight, I just wanted to see where we were at. No worries :)
@benc-db Now I am done with code and testing. Before switching to the changelog and documentation, there's an open question regarding pre-checking certain conditions at compile time.
For instance, I can encode certain "impossible" cases to be checked at compile time, and if a check fails, throw an exception.
I am asking because I am not sure it is really needed. Basically, I will be able to put in some simple checks, like requiring at least one matched / not matched / not matched by source step, but definitely not everything that can possibly be wrong.
What do you think?
For instance, I can encode certain "impossible" cases to be checked at compile time, and if a check fails, throw an exception. I am asking because I am not sure it is really needed. Basically, I will be able to put in some simple checks, like requiring at least one matched / not matched / not matched by source step, but definitely not everything that can possibly be wrong.
This is a really good question. In general, we allow Databricks to tell the user what went wrong; the exception is if it's very easy to misconfigure something, or if the Databricks exception is misleading. So, in those impossible cases, do you think it's easy to read the error and figure out what you need to change to fix it? If not, it would be a good idea to raise compilation errors that tell the users what they need to do to configure things correctly.
Yes, in my cases the syntax error will be raised. And Databricks will point that out. I think then I will skip these exceptions for now.
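For the record, the kind of compile-time guard discussed above could look roughly like this. This is a hypothetical sketch: validate_merge_config and CompilationError are illustrative names, not part of the codebase; only the flag names match this PR.

```python
class CompilationError(Exception):
    """Illustrative stand-in for dbt's compilation error."""

def _is_true(config: dict, key: str) -> bool:
    # Flags count as active only when they equal a true string (case-insensitive).
    return str(config.get(key, "")).lower() == "true"

def validate_merge_config(config: dict) -> None:
    # "Impossible" case: every step is skipped and no delete-by-source action,
    # so the generated MERGE would contain no WHEN clauses at all.
    no_steps = (
        _is_true(config, "skip_matched_step")
        and _is_true(config, "skip_not_matched_step")
        and config.get("not_matched_by_source_action") != "delete"
    )
    if no_steps:
        raise CompilationError(
            "merge needs at least one matched / not matched / "
            "not matched by source step"
        )

# Fine: the not matched (insert) step is still present.
validate_merge_config({"skip_matched_step": "TRUE"})
```

As discussed, since Databricks raises a syntax error for this case anyway, such a guard is optional; it is only worthwhile if the native error is hard to act on.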
I have one more stupid question: I cannot find any trace of documentation in that repo. Is it tracked separately?
I have one more stupid question: I cannot find any trace of documentation in that repo. Is it tracked separately?
Yes, I submit a PR to dbt for that part. If you include the doc locally, I'll make the PR with 1.9 to get the doc updated on the site.
@benc-db By including locally, do you mean writing an MD file with explanations and putting it in the ./docs folder?
I also can create a PR for updating the documentation in dbt-core repo, no problem. Just wanted to know what would be the most convenient way.
@benc-db By including locally, do you mean writing an MD file with explanations and putting it in the ./docs folder?
Yes, exactly. I will already need to submit a doc update for 1.9 anyway, so I'll fold in the doc you put into the docs folder.
@benc-db let me know if I can do anything else to support the process of testing.
Meanwhile I can investigate and report the snapshot test issue.
Running functional tests today, then will start reviewing. Thanks for your efforts!
@benc-db I see some tests are failing due to hardcoded "DBTINTERNAL*" aliases. I will take care of it. Apologies that I forgot to look into it beforehand... I had a thought, but something distracted me.
tox -e linter
tox -e unit

- TestPersistDocsWithSeeds failed with error
  [DELTA_CANNOT_CREATE_LOG_PATH] Cannot create s3a://[redacted]-warehouse/mnt/dbt_databricks/seeds/persist_seed/_delta_log
  Expected, as this path does not exist. It should probably work for external testing.
- TestSnapshot.test_inserts_are_captured_by_snapshot failed with an ambiguous reference error (see above)
- TestSnapshot.test_deletes_are_captured_by_snapshot: same
- TestSnapshot.test_new_column_captured_by_snapshot: same
- TestIncrementalOnSchemaChange.test_run_incremental_ignore failed with the following error
  [UNRESOLVED_COLUMN.WITH_SUGGESTION] A column, variable, or function parameter with name `field3` cannot be resolved. Did you mean one of the following? [`field1`, `field2`, `id`]. SQLSTATE: 42703; line 7 pos 40
  Fails for a reason. The test is organised as follows: we create an ignore table with 3 attributes and try to merge a model_a with 5 attributes, with the on_schema_change='ignore' parameter set. But the test is designed to actually only insert. So if schema evolution is enabled (as it is by default), it will insert field3 and field4. Later, the DIFF test generates code based on the attribute set of the result, and that is why it fails: the example against which we validate doesn't have field3 and field4.
- TestSnapshotCheckCols.test_snapshot_check_cols: same snapshot error
- TestSnapshotTimestamp.test_snapshot_timestamp: same snapshot error
- TestSnapshot.test_revives_are_captured_by_snapshot: same snapshot error
- TestWarehousePerModel.test_wpm: ??? (no output)

Regarding the OnSchemaChange test: this is the (correctly) generated merge statement (model_a has up to field4):
merge into
`hive_metastore`.`test17229350022373340845_test_incremental_on_schema_change`.`incremental_ignore` as tgt
using
`incremental_ignore__dbt_tmp` as src -- select id, field1, field2, field3, field4 from model_a
on
src.id <=> tgt.id
when matched
then update set
* -- this is supposed to ignore field3 and field4, but it is never executed
when not matched
then insert
* -- this adds field3, field4 and evolves the schema
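A toy sketch of why the 'ignore' expectation breaks when the engine auto-evolves the schema on insert; this is plain Python with schemas modelled as sets, mirroring the test's column names, and nothing here is dbt code:

```python
# Target starts with the 3-attribute schema of the `ignore` table.
target_schema = {"id", "field1", "field2"}
target_rows = []

# model_a rows carry 5 attributes; no id matches the target, so only the
# `when not matched then insert *` branch ever fires.
source_rows = [
    {"id": 1, "field1": "a", "field2": "b", "field3": "c", "field4": "d"},
]

for row in source_rows:
    # With schema evolution enabled (the environment default described above),
    # insert * silently widens the target schema.
    target_schema.update(row.keys())
    target_rows.append(row)

# The later diff step queries field3/field4 against an expected table that
# still has only 3 attributes - hence the UNRESOLVED_COLUMN failure.
print(sorted(target_schema))  # ['field1', 'field2', 'field3', 'field4', 'id']
```

The `when matched` branch, which would enforce the ignore behaviour, never runs because the test data contains no matching keys.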
@benc-db I fixed the unit tests. Among the failing tests that I looked through, the majority hit my snapshot issue. I would be interested to know if it also fails on your side.
There are two more tests that are failing. The first one, as described above, is supposed to test the "ignore_schema" behaviour. But I think it doesn't work by design if schema evolution is enabled by default in the environment. I.e. in that case the merge operation executes only inserts, and those inserts extend the schema to have field3 and field4.
Later, the diff test generates the code to compare against a "target" which has only field1 and field2, and expectedly fails while querying for field3 and field4. This is how the result for the calculated table looks:
If you confirm that it should be fixed, I can fix it within this PR.
The last test is test_wpm, which fails because, I think, it shouldn't be included in my test set at all (I don't have warehouses and UC). I can also fix it here (exclude it).
Apologies for delay, I'm on-call and haven't had much time to review PRs. Will get to this shortly.
Rerunning functional tests
All functional tests pass. Just need to find time to actually review the code :P . Thanks so much!
@mi-volodin thank you for your hard work. I will be going on vacation this week, and will start working on the 1.9 release when I return in September. Just wanted to let you know.
@benc-db thanks! And I have just returned from mine. Very happy to see it merged, and I still plan to look into python models in tests. Also, please loop me in if any issues are uncovered. Have a nice vacation!
@mi-volodin I'm going to be opening a new PR to make the default match the prior aliases. In beta testing, some users have macro overrides that assume the previous naming, leading to failures.
Resolves #245
Resolves #645
Resolves #707
Motivation
The current MERGE INTO capabilities available in dbt (data build tool) significantly differ from what can be achieved through a pure SQL interface. While it is sometimes possible to circumvent these limitations using complex SQL techniques, this approach often increases code complexity. Additionally, because Spark does not execute some operations identically when comparing direct merges with workaround solutions, this can lead to performance issues.
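As an illustration of how the options added by this PR might be used together, here is a hypothetical sketch of a model config; the model body, column names, and condition expressions are made up, and only the parameter names come from this PR:

```sql
{{
  config(
    materialized = 'incremental',
    incremental_strategy = 'merge',
    unique_key = 'id',
    merge_with_schema_evolution = 'true',
    matched_conditions = 'src.updated_at > tgt.updated_at',
    not_matched_by_source_action = 'delete',
    not_matched_by_source_condition = 'tgt.archived = false'
  )
}}

select * from {{ ref('my_source_model') }}
```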
Description
This PR aims to add the following capabilities:
- skip_matched_step: allows skipping the matched step in the merge statement. This and the other flags mentioned here are considered active only if they equal a true string (case-insensitive).
- skip_not_matched_step: allows skipping the not matched step in the merge statement.
- merge_with_schema_evolution: allows the new MERGE capability to evolve the schema.
- matched_conditions and not_matched_conditions: additional conditions (boolean logic with and / or, functions) for the matched and not matched steps. The source and target relations are aliased as src and tgt respectively, but these can be changed via the target_alias or source_alias parameters.
- a delete option for merge. This is triggered by two parameters: not_matched_by_source_action, which should be set to delete, and optionally not_matched_by_source_condition, which defines additional predicates that should resolve to true for the delete to happen.

DOD checklist
Tests
- skip_matched_step == true updates nothing.
- skip_not_matched_step == true inserts nothing.
- … null.
- matched_conditions and not_matched_conditions tests that cover all cases of the logic (matched / not matched, conditions met / not met), also covering custom aliases.
- not_matched_by_source removes the record that is not in the source, and keeps the record which is not in the source but for which not_matched_by_source_condition is not met.

Open questions
Docs
docs/databricks-merge.md
Checklist
I have updated CHANGELOG.md and added information about my change to the "dbt-databricks next" section.