elementary-data / elementary

The dbt-native data observability solution for data & analytics engineers. Monitor your data pipelines in minutes. Available as self-hosted or cloud service with premium features.
https://www.elementary-data.com/
Apache License 2.0
1.9k stars 159 forks source link

detection_period propery works wrong #1523

Open dmitrii-khr opened 4 months ago

dmitrii-khr commented 4 months ago

Describe the bug I'd like to complain again about the detection_period configuration property. The following setup

 detection_period:
                period: day
                count: 1

transforms to the following statement in the elementary query which decides whether alert or not alert:

bucket_end >= 
    dateadd(day, cast('-1' as integer), cast(max_bucket_end as timestamp))

The non-strict condition means that it will alert for 2 days. But it is expected that the alert will be only one 1 day. Now it is not possible to configure a time series test, to alert it for the current day and not for issues that were yesterday. 0 value causes test failure.

To Reproduce setup time-series test with bucket size 1 day and detection_period 1 day

Expected behavior With detection period of 1 day, test will not fail if start date of failed time bucket is more than 24 hours ago. Condition which detects if test should fail have to be strict:

bucket_end >
    dateadd(day, cast('-1' as integer), cast(max_bucket_end as timestamp))

Environment (please complete the following information):

dmitrii-khr commented 4 months ago

wrong repo

dmitrii-khr commented 4 months ago

https://github.com/elementary-data/dbt-data-reliability/issues/711

haritamar commented 4 months ago

HI @dmitrii-khr - actually we prefer Elementary issues to be concentrated in this repo so it's good that you opened it here. I'll close the other one actually.

Currently, this is expected behavior, though I understand the confusion. When you use a detection period of 1 day, we actually include the last full bucket as the detection - in that sense the test will work only on yesterday and not on today. So it's actually not a bug - however daily buckets are currently always from midnight to midnight, and not the "last 24 hours".

If you'd like to make the test more real-time, you may want to consider decreasing the time bucket from daily to less than that. For example, hourly buckets can be set like this:

time_bucket:
    period: hour
    count: 1

Please let me know if this makes sense. Thanks!

dmitrii-khr commented 4 months ago

Hi! Thank you for the reaction!

When you use a detection period of 1 day, we actually include the last full bucket as the detection - in that sense the test will work only on yesterday and not on today.

What I can see from logs works in other way.

Let's consider an example.
Column anomaly test. Timestamp column with time. No detection delay. Data is updated many times a day including the current day (2024-05-21). Test considers complete buckets only. The first query determines the following boundaries for buckets: image Notice that today's values are not going to be in the buckets at all.

Going forward with logs and intermediate results we can get the following picture: image Bucket 2024-05-19 to 2024-05-20 marked as is_anomalous. It is not the last full bucket. It is not yesterday, it is 2 days ago.

It happens because of the non-strict condition in the anomaly_scores_with_is_anomalous CTE: and bucket_end >= dateadd(day, cast('-1' as integer), cast(max_bucket_end as timestamp))

Overall test fails: image

dmitrii-khr commented 3 months ago

Any thoughts?

haritamar commented 3 months ago

Hi @dmitrii-khr , Thanks for sharing and for the details. I agree that this is a bug.

Would you like to contribute the fix? (I imagine this is just changing the condition as you wrote + testing that it produces the required results)

ewengillies commented 2 months ago

hi, thanks for summarizing @dmitrii-khr - we're facing the same issues.

@haritamar this is a critical bug that means we're constantly getting notified of issues that we resolved the day before. It amounts to "however many days you add in your detection period, elementary is going to go ahead and add 'one' to this". This is counter to all of Elementary's documentation.

Even worse, as a "hack", neither of these configurations work since they don't pass the validate_mandatory_configuration macro (which makes sense, we should never have to put a zero value):

 detection_period:
     period: day
     count: 0

nor the old setting

backfill_days: 0

Given the intended functionality doesn't work and the hack doesn't work, we're completely stuck, as are all your other users. I hate to sound critical, but the trivial one-line fix for this critical bug was then delegated to an open source contributor who has already spent the time carefully explaining & documenting this issue.

Two months after it was surfaced, and three weeks after the issue was fixed, we're still stuck with a critical bug.

Where is the accountability? Where is the patch that should have been shipped three weeks ago, if not two months ago? We like your package but with this level of remediation it's looking like it'd be easier to re-implement anomaly detection macros and maintain an internal copy ourselves.

Is there anyway for us to override this erroneous behavior while we wait for a patch?