elementary-data / elementary

The dbt-native data observability solution for data & analytics engineers. Monitor your data pipelines in minutes. Available as self-hosted or cloud service with premium features.
https://www.elementary-data.com/
Apache License 2.0

Duplicate entries in the report. #1420

Open annav00 opened 5 months ago

annav00 commented 5 months ago

Describe the bug

The on-run-end hook saves data to the artifact tables. When tests/models run in parallel, duplicate entries sometimes appear in the artifact tables. As a result, the generated report contains duplicate records with test information.

To Reproduce

Steps to reproduce the behavior:

  1. Run tests/models in parallel as separate queries.
  2. Generate the report.

Expected behavior

The report contains one entry for each test.

Screenshots

Example: elementary_test_results has 1 record, dbt_sources has 2 records, dbt_tests has 6 records -> there are 12 records in the report.



Additional context

Perhaps the duplication in the artifact tables could be prevented when runs execute in parallel. Alternatively, records could be deduplicated when querying the data for the report.
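Deduplicating at read time is straightforward on warehouses that support `qualify` (e.g. Snowflake, BigQuery). A hedged sketch, assuming the column names match the workaround queries later in this thread:

```sql
-- Sketch: deduplicate when reading, instead of rewriting the artifact table.
-- Keeps the latest row per unique_id; flip the ordering if the earliest
-- row is the one you want.
select *
from elementary.dbt_tests
qualify row_number() over (partition by unique_id order by generated_at desc) = 1;
```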

MICHM137 commented 3 months ago

Hello,

I can confirm this bug, which is a bit annoying. When we run models in parallel we end up with duplicates in the tables. As a workaround, I run these queries periodically:

-- Keep one row per unique_id (the earliest generated_at) in each artifact table.
create or replace table elementary.dbt_tests as (
    select * from elementary.dbt_tests qualify row_number() over (partition by unique_id order by generated_at) = 1 
);
create or replace table elementary.dbt_models as (
    select * from elementary.dbt_models qualify row_number() over (partition by unique_id order by generated_at) = 1 
);
create or replace table elementary.dbt_sources as (
    select * from elementary.dbt_sources qualify row_number() over (partition by unique_id order by generated_at) = 1 
);
create or replace table elementary.dbt_exposures as (
    select * from elementary.dbt_exposures qualify row_number() over (partition by unique_id order by generated_at) = 1
);
create or replace table elementary.dbt_columns as (
    select * from elementary.dbt_columns qualify row_number() over (partition by unique_id order by generated_at) = 1
);
haritamar commented 1 month ago

Hi @annav00 and @MICHM137, sorry for the delay in responding here. Can you please confirm whether this issue is still relevant for you? Also, which databases are you using?

I'll mark this as high priority on our end.

In the meantime, a workaround you can consider is setting the var cache_artifacts = False - this forces a full replace of the artifacts on every run, which I think should actually prevent duplicates (though it can increase on_run_end duration).

(When caching is enabled we only insert a diff - and I think there's probably a race there)
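In dbt, package vars are typically set in `dbt_project.yml`; a minimal sketch of the suggested workaround (config fragment, assuming the var is read at the project level):

```yaml
# dbt_project.yml (fragment)
vars:
  # Disable elementary's artifact cache: every run fully replaces the
  # artifact tables, avoiding duplicate rows at the cost of a slower
  # on-run-end hook.
  cache_artifacts: false
```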

mattxxi commented 1 month ago

Hey (this is MICHM137). I would avoid setting cache_artifacts = False, because the on_run_end hook already takes a long time and makes our pipelines much longer than they were without elementary. Do you plan to optimize the on_run_end hook? Thanks for your answer.

haritamar commented 1 month ago

Hi @mattxxi, yeah, that makes sense. We implemented the cache for performance reasons; I just wanted to point out the alternative.

I think the duplicate entries when the cache is enabled result from a race in the delete_and_insert macro, which we need to fix. I don't have an immediate time frame for it, but I'm guessing we'll prioritize it in the near future.
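The race is easy to picture if delete_and_insert performs the delete and the insert as separate, non-atomic statements. A hypothetical interleaving of two parallel runs (pseudocode, not the actual macro):

```sql
-- run A: delete from elementary.dbt_tests where unique_id in (...);
-- run B: delete from elementary.dbt_tests where unique_id in (...);  -- no-op, A already deleted
-- run A: insert into elementary.dbt_tests select ...;
-- run B: insert into elementary.dbt_tests select ...;  -- same rows land a second time
```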

sindhuthirugnanam commented 3 weeks ago

Do you have a timeline for the fix? Is the issue happening in the latest version of elementary as well?

cbear99 commented 6 days ago

We're observing pretty similar behavior but for Slack Alerts. Would love to have this fixed, it'd be huge for cleaning up our notifications and making them more actionable.