Open annav00 opened 5 months ago
Hello,
I can confirm this bug, which is a bit annoying. When we run models in parallel, we end up with duplicates in the tables. As a workaround, I run these queries periodically:
create or replace table elementary.dbt_tests as (
select * from elementary.dbt_tests qualify row_number() over (partition by unique_id order by generated_at) = 1
);
create or replace table elementary.dbt_models as (
select * from elementary.dbt_models qualify row_number() over (partition by unique_id order by generated_at) = 1
);
create or replace table elementary.dbt_sources as (
select * from elementary.dbt_sources qualify row_number() over (partition by unique_id order by generated_at) = 1
);
create or replace table elementary.dbt_exposures as (
select * from elementary.dbt_exposures qualify row_number() over (partition by unique_id order by generated_at) = 1
);
create or replace table elementary.dbt_columns as (
select * from elementary.dbt_columns qualify row_number() over (partition by unique_id order by generated_at) = 1
);
Hi @annav00 and @MICHM137 , Sorry for the delay in responding here, can you please confirm if this issue is still relevant to you? Also - which databases are you using?
I'll mark this as high priority on our end.
In the meantime, a workaround you can consider is setting the var cache_artifacts = False
- this will force a full replace of the artifacts on every run, which I think should actually prevent duplicates (though it can increase on_run_end duration).
(When caching is enabled we only insert a diff - and I think there's probably a race there)
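For reference, a var like this is typically set in dbt_project.yml (a sketch of the suggested workaround; exact placement depends on your project setup):

```yaml
# dbt_project.yml -- disable elementary's artifact caching so every
# on_run_end does a full replace instead of inserting only a diff
vars:
  cache_artifacts: False
```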
Hey (it's MICHM137 here), I would avoid setting cache_artifacts = False because the on_run_end hook already takes a lot of time and makes our pipelines way longer than without elementary. Do you plan on optimizing the on_run_end hook? Thanks for your answer
Hi @mattxxi , Yeah, that makes sense. We implemented the cache for performance reasons; I just pointed out the alternative.
I think the duplicate entries when the cache is enabled result from a race in the delete_and_insert macro, which we need to fix.
I don't have an immediate time frame for it but I'm guessing we'll prioritize it in the near future.
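The race described above can be sketched as follows. This is a hypothetical illustration of a read-diff-then-insert race, not the macro's actual implementation: two parallel runs each read the artifact table, compute the rows they think are missing, and both insert them.

```python
# Simulated artifact table: a plain list, since warehouse tables like
# dbt_tests have no enforced uniqueness constraint on unique_id.
table = []
rows = ["model_a"]  # the row each parallel run wants to record

# Both parallel runs read the (still empty) table and compute their
# diff BEFORE either one has inserted anything.
diff_run1 = [r for r in rows if r not in table]
diff_run2 = [r for r in rows if r not in table]

# Each run then inserts its diff: the same row lands twice.
table.extend(diff_run1)
table.extend(diff_run2)
print(table)  # ['model_a', 'model_a'] -- a duplicate

# Keeping only the first occurrence per key mirrors the effect of the
# "qualify row_number() ... = 1" cleanup queries posted above.
deduped = list(dict.fromkeys(table))
print(deduped)  # ['model_a']
```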
Do you have a timeline for the fix? Is the issue happening in the latest version of elementary as well?
We're observing pretty similar behavior but for Slack Alerts. Would love to have this fixed, it'd be huge for cleaning up our notifications and making them more actionable.
Describe the bug
The on-run-end hook saves data to the artifact tables. When running tests/models in parallel, duplicate entries sometimes occur in these tables. Because of this, duplicate test records appear when generating a report.
To Reproduce
Steps to reproduce the behavior:
Expected behavior
The report contains one entry for each test.
Screenshots
Example: elementary_test_results has 1 record, dbt_sources has 2 records, dbt_tests has 6 records -> there are 12 records in the report.
Environment (please complete the following information):
Additional context
Perhaps it is possible to prevent the duplication in the artifact tables when running in parallel. Alternatively, records could be deduplicated when querying the data for the report.