elementary-data / elementary

The dbt-native data observability solution for data & analytics engineers. Monitor your data pipelines in minutes. Available as self-hosted or cloud service with premium features.
https://www.elementary-data.com/
Apache License 2.0

Add --hours-back CLI option for edr monitor #1548

Open miktros opened 5 months ago

miktros commented 5 months ago

Is your feature request related to a problem? Please describe.

The documentation for volume_anomalies lists hour as an option for configuring detection_period. However, configuring detection_period using hour results in a compilation error: Missing mandatory configuration: ['backfill_days']

Describe the solution you'd like

Elementary tests such as volume_anomalies allow configuring time_bucket by the hour. I would like to be able to configure detection_period in hours as well, so that each test run compares the row count of the most recent hourly time bucket against the row counts of the time buckets from the last training_period days, and an anomaly alert is emitted only when that most recent bucket fails.

Describe alternatives you've considered

Introduce a new CLI option, --hours-back, for edr monitor that optionally limits, in hours, how far back edr monitor looks for pending alerts. If provided, it overrides --days-back.

I have a POC implementation that seems to work. PR to add optional --hours-back for edr monitor here.
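The override semantics proposed above can be sketched as follows. This is a minimal illustration, not the code from the PR: the function name `alert_lookback_cutoff` and the default of one day for `days_back` are assumptions made for the example.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def alert_lookback_cutoff(
    now: datetime,
    days_back: int = 1,
    hours_back: Optional[int] = None,
) -> datetime:
    """Return the earliest timestamp from which pending alerts are considered.

    Hypothetical helper: when hours_back is given, it overrides days_back.
    """
    if hours_back is not None:
        if hours_back <= 0:
            raise ValueError("hours_back must be a positive integer")
        return now - timedelta(hours=hours_back)
    return now - timedelta(days=days_back)


now = datetime(2024, 6, 1, 12, 0, tzinfo=timezone.utc)

# Only --days-back: look back one full day.
day_cutoff = alert_lookback_cutoff(now, days_back=1)
print(day_cutoff)   # 2024-05-31 12:00:00+00:00

# --hours-back given: it takes precedence over --days-back.
hour_cutoff = alert_lookback_cutoff(now, days_back=1, hours_back=1)
print(hour_cutoff)  # 2024-06-01 11:00:00+00:00
```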

Additional context

None.

Would you be willing to contribute this feature?

I am open to contributing to this feature and would appreciate any guidance you can provide.

ellakz commented 4 months ago

Hi @miktros. Indeed, when we added the detection_period and training_period params, we translated them under the hood to "days_back" and "backfill_days", which is why the "hour" unit is not supported. The solution is to change that underlying logic in the dbt package so that the queries actually use hours for training and detection. However, your PR adds an "hours_back" option to the monitor CLI, so I'm not sure I understand how it solves the same problem. If you could shed some light on it, I would be happy to review. Thanks!

miktros commented 4 months ago

By splitting data into 1-hour time buckets in volume_anomalies tests, we were hoping to run dbt test and then edr monitor every hour, so that an alert is emitted only when the row count in the most recent 1-hour time bucket is lower than the average of the past 21 days (a time bucket failure). We use the "hours-back" option, edr monitor --hours-back 1, to achieve this alerting behavior. With the "days-back" option, an alert is emitted every time edr monitor runs if there was a time bucket failure at any point during the day, even when the most recent time bucket has no failure. In short, we want per-hour alert notifications that reflect the failure condition of the most recent hour.

ellakz commented 4 months ago

I'm not sure this behavior really works. If the failed test ran in the last hour, an alert will be fired even if the failed metric isn't from the last hour: you have a detection period of 1 day, so any failure in the past day will fail the test, and if that test run happens in the hour before edr monitor runs, an alert will be fired.
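A small simulation may make this point concrete. It is a sketch under simplifying assumptions, not Elementary's implementation: the names `Bucket`, `detection_test_fails`, and `alert_is_sent` are hypothetical, the test is assumed to fail when any bucket inside the detection period is anomalous, and the alert filter is assumed to look only at when the test ran.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import List


@dataclass
class Bucket:
    start: datetime   # start of the 1-hour time bucket
    anomalous: bool   # whether this bucket's row count was flagged


def detection_test_fails(buckets: List[Bucket], now: datetime,
                         detection_period: timedelta) -> bool:
    """Fail if ANY bucket inside the detection period is anomalous."""
    window_start = now - detection_period
    return any(b.anomalous for b in buckets if window_start <= b.start < now)


def alert_is_sent(test_failed: bool, test_run_at: datetime,
                  now: datetime, lookback: timedelta) -> bool:
    """The alert filter uses the test run time, not the failing bucket's time."""
    return test_failed and test_run_at >= now - lookback


now = datetime(2024, 6, 1, 12, 0, tzinfo=timezone.utc)
# 24 hourly buckets; only the one that started 10 hours ago is anomalous.
buckets = [Bucket(now - timedelta(hours=h), anomalous=(h == 10)) for h in range(1, 25)]
ran_at = now - timedelta(minutes=5)  # the test ran just before edr monitor

# Detection period of 1 day: the old anomaly fails the fresh test run,
# so a 1-hour alert lookback still produces an alert.
failed_1d = detection_test_fails(buckets, now, timedelta(days=1))
print(alert_is_sent(failed_1d, ran_at, now, timedelta(hours=1)))  # True

# Detection period of 1 hour: only the latest bucket is checked, so no alert.
failed_1h = detection_test_fails(buckets, now, timedelta(hours=1))
print(alert_is_sent(failed_1h, ran_at, now, timedelta(hours=1)))  # False
```

Under these assumptions, narrowing the alert lookback alone does not isolate the most recent bucket; the detection period itself has to shrink to one hour.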

A different workaround I can suggest is to use the alert_suppression_interval flag in your CLI. Setting it to 24, for example, will alert on the same issue only once every 24 hours, preventing duplicate alerts on the same issue. Does this help?
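The effect of a suppression interval can be illustrated with a short sketch. This is not how Elementary implements it: the `AlertSuppressor` class and the string used to identify "the same issue" are assumptions made for the example.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict


class AlertSuppressor:
    """Hypothetical helper: send at most one alert per issue per interval."""

    def __init__(self, interval_hours: int) -> None:
        self.interval = timedelta(hours=interval_hours)
        self._last_sent: Dict[str, datetime] = {}

    def should_send(self, issue_id: str, detected_at: datetime) -> bool:
        last = self._last_sent.get(issue_id)
        if last is not None and detected_at - last < self.interval:
            return False  # still inside the suppression window
        self._last_sent[issue_id] = detected_at
        return True


start = datetime(2024, 6, 1, 0, 0, tzinfo=timezone.utc)
suppressor = AlertSuppressor(interval_hours=24)

# The same test fails on every hourly run for 26 hours.
sent_hours = [
    h for h in range(26)
    if suppressor.should_send("volume_anomalies.orders", start + timedelta(hours=h))
]
print(sent_hours)  # [0, 24]
```

With a 24-hour interval, the repeated hourly failure produces one alert at the first run and the next one only 24 hours later.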

If not, and if I understand correctly, what you want to do is open a PR in the dbt-package repo that changes the way the detection_period param is handled in the test queries. Here you can see that the 'detection_period' param is translated to 'backfill_days', which is then piped into the rest of the macros that handle the test query. The required change is to actually handle the chosen unit: translate the period into that unit and build the query accordingly. I understand this is not the easiest contribution to make, and at some point we will probably do it ourselves, but unfortunately I can't commit to a timeline at the moment.
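The idea of honoring the chosen unit, rather than collapsing everything to days, can be sketched in Python. The real change would live in the dbt package's macros, so this only illustrates the logic. The helper names are hypothetical, the `period` and `count` keys are assumed to match the shape of the detection_period config, and `month` is left out because it has no fixed length.

```python
from datetime import datetime, timedelta, timezone
from typing import Any, Dict

# Fixed-length units only; "month" is deliberately not handled in this sketch.
_HOURS_PER_UNIT = {"hour": 1, "day": 24, "week": 24 * 7}


def period_to_timedelta(period: Dict[str, Any]) -> timedelta:
    """Convert a {'period': unit, 'count': n} mapping into a timedelta."""
    unit = period["period"]
    count = int(period["count"])
    if unit not in _HOURS_PER_UNIT:
        raise ValueError(f"unsupported period unit: {unit!r}")
    if count <= 0:
        raise ValueError("count must be a positive integer")
    return timedelta(hours=count * _HOURS_PER_UNIT[unit])


def detection_window_start(now: datetime, detection_period: Dict[str, Any]) -> datetime:
    """Earliest bucket start that the test should evaluate."""
    return now - period_to_timedelta(detection_period)


now = datetime(2024, 6, 1, 12, 0, tzinfo=timezone.utc)
hourly_start = detection_window_start(now, {"period": "hour", "count": 1})
daily_start = detection_window_start(now, {"period": "day", "count": 2})
print(hourly_start)  # 2024-06-01 11:00:00+00:00
print(daily_start)   # 2024-05-30 12:00:00+00:00
```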