elastic / elasticsearch

Free and Open Source, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch

[ML] Improve handling of annual time changes (daylight savings) in anomaly detection #107115

Open maxhniebergall opened 8 months ago

maxhniebergall commented 8 months ago

Description

Annual time changes for daylight saving time happen on different dates in different parts of the globe. They shift usage patterns relative to the time of day, which can trigger false-positive anomaly alerts. This feature request is to discuss improving the handling of annual time changes to avoid or reduce false-positive alerts.

elasticsearchmachine commented 8 months ago

Pinging @elastic/ml-core (Team:ML)

tveasey commented 7 months ago

Currently, we test for time shifts, including +/- 1 hr, and update the model as soon as we are confident a shift has occurred. Unfortunately, in the case of daylight saving (DLS) we do not suppress anomalies while this is happening, and it takes us some time to become confident that there has been a shift.
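As a rough illustration of the shift test described above, the Python sketch below compares a learned hour-of-day profile against recent observations under candidate shifts of -1, 0 and +1 hours. It accepts a non-zero shift only once that shift beats "no shift" by a clear margin, which stands in for "as soon as we are confident". This is not the ml-cpp implementation: `best_time_shift`, the hourly profile representation and the `min_improvement` margin are all hypothetical.

```python
from typing import Sequence


def best_time_shift(
    profile: Sequence[float],             # hypothetical: learned mean value per hour of day (24 entries)
    recent: Sequence[tuple[int, float]],  # recent (hour_of_day, observed value) pairs
    candidate_shifts: Sequence[int] = (-1, 0, 1),
    min_improvement: float = 0.5,         # hypothetical confidence margin
) -> int:
    """Return the shift in hours that best explains the recent data.

    A shift of +1 means the daily pattern now appears one hour later than
    the profile expects. A non-zero shift is accepted only if it cuts the
    squared error by more than `min_improvement` (as a fraction) compared
    with no shift; otherwise 0 is returned.
    """

    def error(shift: int) -> float:
        # Compare each observation with the profile value `shift` hours earlier.
        return sum((value - profile[(hour - shift) % 24]) ** 2 for hour, value in recent)

    baseline = error(0)
    best = min(candidate_shifts, key=error)
    # Strict comparison: ties with the unshifted model keep the shift at 0.
    if best != 0 and error(best) < (1.0 - min_improvement) * baseline:
        return best
    return 0
```

For a profile whose peak sits at 09:00, recent data peaking at 10:00 yields +1, while data matching the profile yields 0. The margin is the design choice that trades detection speed against false shift detections.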

Some thoughts:

  1. One issue is that in older versions of the code base we have no way of saying "update the model but suppress results" using calendar events, as we can for rules. In newer versions we simply down-weight the update greatly, but still check for certain events such as time shifts (this has different pros and cons). The cleanest approach would be to use the same notion of "suppress results" or "suppress updates" for calendar events. This would allow people to say cleanly "I don't care to be told about alerts in this time period, but I would like the model to adapt", which provides a route to configuring a better experience for DLS.
  2. We have some challenges with doing this automatically: i) different time series experience DLS on different dates, even within the same job, depending on where the data came from; ii) some time series experience no DLS at all because, for example, they monitor something using fixed-interval scheduling. Whatever we do, I don't think we can blindly apply a DLS time shift based on a calendar.
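
To make point 2(i) concrete, this sketch computes daylight saving change dates from simplified current rules for the US (in force since 2007) and the EU, plus a series with no DLS at all. The `DST_RULES` table and the helper names are hypothetical, transition times of day are ignored, and a real implementation would rely on the IANA time zone database rather than hard-coded rules.

```python
import calendar
from datetime import date

# Hypothetical simplified rules as (month, n) pairs: the n-th Sunday of the
# month, or the last Sunday when n == -1. Times of day are ignored.
DST_RULES = {
    "US": ((3, 2), (11, 1)),    # second Sunday in March, first Sunday in November
    "EU": ((3, -1), (10, -1)),  # last Sunday in March, last Sunday in October
    "UTC": None,                # e.g. fixed-interval scheduling: no DLS at all
}


def nth_sunday(year: int, month: int, n: int) -> date:
    """Return the n-th Sunday of the month (n >= 1), or the last one if n == -1."""
    days_in_month = calendar.monthrange(year, month)[1]
    sundays = [
        date(year, month, day)
        for day in range(1, days_in_month + 1)
        if date(year, month, day).weekday() == calendar.SUNDAY
    ]
    return sundays[n - 1] if n > 0 else sundays[-1]


def dst_change_dates(region: str, year: int) -> list[date]:
    """Return the local dates on which clocks change for `region` in `year`."""
    rule = DST_RULES[region]
    if rule is None:
        return []
    return [nth_sunday(year, month, n) for month, n in rule]
```

For 2024 this gives March 10 and November 3 for the US but March 31 and October 27 for the EU, so one job ingesting both sources sees shifts on four different days, while the "UTC" series sees none.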

A possible enhancement:

Allow users to specify a calendar to use for DLS (i.e. the days on which to expect it). On a DLS event day we would:

  1. Parameterise our time shift test to make it quicker to apply a +/- 1 hr shift
  2. Compute anomalies for each time series using both the shifted (+/- 1 hr) and unshifted time series, and use the minimum of the resulting anomaly scores
  3. (Nice to have) Try to learn the DLS calendar that applies to the job.

Point 2 is important because any event that really is a DLS time shift should then generate no anomalies, yet we would not disable detection altogether.
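
A minimal sketch of the proposed scoring on DLS event days, assuming a hypothetical user-supplied set of DLS dates and a scoring function keyed by hour of day. Neither `dls_aware_score` nor its parameters exist in Elasticsearch; the sketch only shows the "minimum over shifted and unshifted" idea.

```python
from datetime import date
from typing import Callable


def dls_aware_score(
    hour: int,
    value: float,
    day: date,
    dls_calendar: set[date],               # hypothetical user-supplied DLS event days
    score: Callable[[int, float], float],  # anomaly score of `value` at an hour of day
) -> float:
    """Score an observation, tolerating a +/- 1 hour shift on DLS event days."""
    if day not in dls_calendar:
        return score(hour, value)
    # Score as unshifted, shifted one hour later and shifted one hour earlier,
    # then keep the minimum, so a pure DLS shift produces no anomaly.
    return min(score((hour - shift) % 24, value) for shift in (0, 1, -1))
```

With a toy profile peaking at 09:00, a peak that moves to 10:00 on a DLS day scores 0, whereas a value far above the profile still scores high, so detection stays on.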

Point 3 is a nice-to-have that removes the need for configuration, but it adds implementation complexity and still needs some time to learn the calendar, so users would have to accept anomalies on the first few DLS events.
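
Under stated assumptions, point 3 could look something like the sketch below: given the days on which the shift test fired and a set of candidate calendars, pick the calendar that explains the most detections, allowing for detection lag. The function, the `max_lag_days` tolerance and the `min_matches` threshold are hypothetical; the threshold is what makes users sit through the first few DLS events before a calendar is learned.

```python
from datetime import date
from typing import Optional


def learn_dls_calendar(
    detected_shift_days: list[date],
    candidates: dict[str, set[date]],
    max_lag_days: int = 3,  # hypothetical: the shift test fires a few days late
    min_matches: int = 2,   # hypothetical: evidence required before trusting a calendar
) -> Optional[str]:
    """Return the candidate calendar that explains the most detected shifts.

    A detection counts for a calendar if it falls between 0 and
    `max_lag_days` days after one of that calendar's DLS days.
    Returns None until some calendar has at least `min_matches` hits.
    """
    best_name: Optional[str] = None
    best_hits = 0
    for name, days in candidates.items():
        hits = sum(
            1
            for detected in detected_shift_days
            if any(0 <= (detected - day).days <= max_lag_days for day in days)
        )
        if hits > best_hits:
            best_name, best_hits = name, hits
    return best_name if best_hits >= min_matches else None
```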

tveasey commented 7 months ago

I had a look through the code. We already handle the event type correctly in the anomaly detection backend, so supporting skip_result for scheduled events simply requires us to:

  1. Add a type field to scheduled events (defaulting to skip_model_update)
  2. Pass the configured type to the backend rather than hard-coding skip_model_update. (We already pass these events as detection rules, so this doesn't require us to change the Java-to-autodetect communication.)

This is relatively low effort and we should do this first.
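
The two steps can be sketched as follows. The actual change would live in the Elasticsearch Java code; this Python version is only illustrative. The `ScheduledEvent` class, its field names and the rule dictionary (loosely modeled on the public detection rule JSON with `actions` and `conditions`) are assumptions, not the real API.

```python
from dataclasses import dataclass
from typing import Any

# Hypothetical set of allowed event types, mirroring the rule actions named above.
VALID_TYPES = {"skip_result", "skip_model_update"}


@dataclass
class ScheduledEvent:
    description: str
    start_time: int  # hypothetical: epoch seconds
    end_time: int
    # Step 1: new field, defaulting to the current hard-coded behaviour.
    type: str = "skip_model_update"

    def __post_init__(self) -> None:
        if self.type not in VALID_TYPES:
            raise ValueError(f"unsupported event type: {self.type}")

    def to_detection_rule(self) -> dict[str, Any]:
        # Step 2: pass the configured type through instead of hard-coding it.
        return {
            "actions": [self.type],
            "conditions": [
                {"applies_to": "time", "operator": "gte", "value": self.start_time},
                {"applies_to": "time", "operator": "lt", "value": self.end_time},
            ],
        }
```

Keeping skip_model_update as the default means existing calendars behave as before, while a DLS calendar could opt into skip_result so the model still adapts during the event.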