elementary-data / elementary

The dbt-native data observability solution for data & analytics engineers. Monitor your data pipelines in minutes. Available as self-hosted or cloud service with premium features.
https://www.elementary-data.com/
Apache License 2.0

Applying `hour_of_week` seasonality causes `days_back` and `training_period` configurations to be ignored #1349

Open garfieldthesam opened 9 months ago

garfieldthesam commented 9 months ago

Describe the bug

When using `hour_of_week` seasonality, the training set goes back to 2016, in violation of the `days_back` and/or `training_period` configurations.

To Reproduce

Steps to reproduce the behavior:

1. Run an anomaly test without seasonality. Example:

- elementary.dimension_anomalies:
   name: insert_name
   dimensions: [dimension_name]
   time_bucket:
     period: hour
     count: 1
   timestamp_column: created_ts
   days_back: 366
   backfill_days: 1
   anomaly_direction: both   
   anomaly_sensitivity: 2 
   severity: warn 
   error_if: '>4' 
   warn_if: '>2'  

Look at the compiled code: in the create-table statement's first CTE, this is the code used to generate the date buckets:

(
  select
    explode(
      sequence(
        cast('2023-01-02T00:00:00+00:00' as timestamp),
        cast('2024-01-03T22:00:00+00:00' as timestamp),
        interval 1 hour
      )
    ) as edr_bucket_start
)
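As a sanity check, this range is consistent with the window start simply being the run time minus `days_back`, truncated to the start of the day (an assumption for illustration, not Elementary's actual macro):

```python
from datetime import datetime, timedelta

# Hypothetical reconstruction of the window start, NOT Elementary's
# actual macro: run time minus days_back, truncated to midnight.
run_time = datetime(2024, 1, 3, 22)           # end of the compiled range
start = (run_time - timedelta(days=366)).replace(hour=0)
print(start)  # 2023-01-02 00:00:00 -- matches the compiled CTE above
```

So without seasonality, the bucket range honors `days_back: 366` as configured.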

2. Run an anomaly test with this seasonality

Example:

- elementary.dimension_anomalies:
   name: insert_name
   dimensions: [dimension_name]
   time_bucket:
     period: hour
     count: 1
   timestamp_column: created_ts
   days_back: 366
   backfill_days: 1
   anomaly_direction: both   
   anomaly_sensitivity: 2 
   severity: warn 
   error_if: '>4' 
   warn_if: '>2'  
   seasonality: hour_of_week

Look at the compiled code: in the create-table statement's first CTE, this is the code used to generate the date buckets:

(
  select
    explode(
      sequence(
        cast(
          date_trunc('hour', cast('2016-12-28 00:00:00' as timestamp)) as timestamp
        ),
        cast('2024-01-03 22:09:07.509656+00:00' as timestamp),
        interval 1 hour
      )
    ) as edr_bucket_start
)
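For comparison, the 2016-12-28 start lands exactly `days_back * 7` days before the run time, which suggests the window is being naively multiplied by 7 once seasonality is set (a sketch under that assumption, not Elementary's actual code):

```python
from datetime import datetime, timedelta

# Hypothesis: with seasonality set, the window expands to days_back * 7.
run_time = datetime(2024, 1, 3)
start = run_time - timedelta(days=366 * 7)    # 2562 days
print(start.date())  # 2016-12-28 -- matches the compiled CTE above
```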

Expected behavior

I would not expect the window for this query to extend back 6 years. One year already yields roughly 52 samples for each hour-of-week bucket, which should be more than enough to generate a reasonably reliable seasonality adjustment.

6 years would make more sense if the test were for week-of-year-and-hour seasonality rather than hour-of-week seasonality, but that doesn't make sense to me as a way to compute this seasonality, and it isn't my understanding of how it works after reading the Elementary docs.
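The arithmetic behind this expectation: `hour_of_week` has 7 × 24 = 168 distinct buckets, so one year already gives about 52 observations per bucket, while the compiled ~7-year window gives about 366:

```python
from datetime import datetime

HOURS_PER_WEEK = 7 * 24                       # 168 hour_of_week buckets

one_year_hours = 366 * 24
per_bucket_one_year = one_year_hours // HOURS_PER_WEEK
print(per_bucket_one_year)                    # 52

# The compiled window instead spans 2016-12-28 .. 2024-01-03:
span_days = (datetime(2024, 1, 3) - datetime(2016, 12, 28)).days
per_bucket_compiled = span_days * 24 // HOURS_PER_WEEK
print(per_bucket_compiled)                    # 366
```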

Environment (please complete the following information):

dbt packages:

packages:
  - package: dbt-labs/dbt_utils
    version: 1.1.1
  - package: LewisDavies/upstream_prod
    version: 0.7.2
  - package: calogica/dbt_expectations
    version: 0.10.0
  - package: elementary-data/elementary
    version: 0.13.0

dbt version: 1.6.7
dbt-databricks version: 1.6.6

Maayan-s commented 9 months ago

Hey @garfieldthesam, I understand why this is unexpected, but it is actually intentional. When we first introduced seasonality, users didn't configure `days_back` accordingly and kept seeing a "not enough data to calculate an anomaly" result. We went with a naive solution:

[screenshot of the Elementary docs describing the `days_back` adjustment for seasonality]

I believe the more mature approach would be to increase the training set size only if needed, and only up to the minimal training set size. For example:

For this configuration:

days_back: 60
seasonality: day_of_week
min_training_set_size: 14

We should not naively do `days_back * 7`, but instead:

if days_back < min_training_set_size * 7 (
  days_back = min_training_set_size * 7
)
else days_back

@ellakz @dapollak what do you think?

garfieldthesam commented 9 months ago

@Maayan-s ah, I see. One easy thing to fix: the docs you pictured above say this applies only to `day_of_week`, not `hour_of_week`.

I believe the more mature approach would be to increase the training set size only if needed, and only to reach the minimal training set size.

I agree with this

But, regarding this proposal:

if days_back < min_training_set_size * 7 (
  days_back = min_training_set_size * 7
)
else days_back

I disagree. I'm pretty sure most users think of `min_training_set_size` as a scalar number that the Elementary package will never change. For example, if I set `min_training_set_size: 70`, `days_back: 70`, and `seasonality: day_of_week`, I expect Elementary to run the test using 70 days (i.e., 10 weeks) of data.

The docs may make clear that this isn't how `min_training_set_size` works, but not everyone is going to read the docs for every parameter they use, particularly if the name feels self-explanatory.