earthgecko / skyline

Anomaly detection
http://earthgecko-skyline.readthedocs.io/en/latest/

How Does Skyline Handle Seasonal Data. #231

Closed dvshah closed 4 years ago

dvshah commented 4 years ago

I am interested in knowing how Skyline handles seasonal data. I have looked through the 3-sigma algorithms and it seems FIRST_HOUR_AVERAGE is the only algorithm that can detect seasonal sequence anomalies. For example, if the data on a particular day has lower peaks than the previous day, the Skyline algorithms won't mark it as an anomaly. Or am I understanding the algorithms wrong? Can you please give me some insight on this?

earthgecko commented 4 years ago

Hi @dvshah rather than any specific algorithm handling seasonality, Skyline relies on the data exhibiting seasonality. By default, Skyline Analyzer runs analysis on 24 hours worth of data; however, 24 hours of data rarely includes the seasonal patterns. To overcome this, Skyline allows you to define what the metric's seasonality is, which lets Skyline Analyzer push any anomalies triggered in the 24 hour data off to the Skyline Mirage module. Mirage will surface the metric data from Graphite at whatever seasonality you have defined on the metric (for example, let us say 7 days) and analyse that data set to see if the event that triggered in the 24 hr data is also anomalous over the seasonality defined for the metric; if it is, it will be classed as anomalous.

The same algorithms are applied, they are just applied to a longer data set which would generally include any seasonal patterns that existed in the metric.

Further to that, it is possible to configure Skyline to run MIRAGE_PERIODIC_CHECK so that every seasonal (e.g. Mirage) metric defined in MIRAGE_PERIODIC_CHECK_NAMESPACES is analysed by Mirage at the defined seasonality at least once every MIRAGE_PERIODIC_CHECK_INTERVAL. This generally handles most of the:

if data on a particular day has low peaks than the previous day than skyline algorithms won't mark it as an anomaly

To handle any kind of seasonality, metrics must be defined as Mirage metrics in the ALERTS tuple, with a SECOND_ORDER_RESOLUTION_HOURS defined for the metric or metric namespace in the alert tuple itself. Obviously you need to enable and run the Mirage service to handle this.
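As a concrete illustration, a Mirage metric is declared by adding SECOND_ORDER_RESOLUTION_HOURS as a fourth element to the metric's alert tuple. A hypothetical settings.py fragment follows; the metric namespace and values here are illustrative only, adjust them to your own setup:

```python
# Hypothetical settings.py fragment -- the namespace and values are
# illustrative only, not from a real deployment.
ALERTS = (
    # (metric namespace, alerter, EXPIRATION_TIME, SECOND_ORDER_RESOLUTION_HOURS)
    # The 4th element makes this a Mirage metric analysed at 168 hours (7 days)
    ('example.sales', 'smtp', 3600, 168),
)
```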

I hope that helps.

For further info see: https://earthgecko-skyline.readthedocs.io/en/latest/mirage.html https://earthgecko-skyline.readthedocs.io/en/latest/mirage.html#periodic-checks https://earthgecko-skyline.readthedocs.io/en/latest/skyline.html#settings.ALERTS

dvshah commented 4 years ago

Ok, I don't think I am able to understand what you are trying to say. My point is that 3 sigma algorithms do not work well for seasonal data. Let's say the data has 1-day seasonality. Then it makes more sense to compare my data point with the data points from 1 day ago or 2 days ago...

In most of the skyline algorithms, even if I shuffle the previous data those algorithms will give the same results, implying that those algorithms do not take the pattern of the data into account (shuffling does not change the mean and standard deviation of the data).
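This shuffling argument can be demonstrated with a minimal 3-sigma check. The function below is an illustrative sketch of the general technique, not Skyline's actual algorithm code, and the data is made up:

```python
import random
import statistics

def three_sigma_anomalous(series, value):
    # Is `value` more than 3 standard deviations from the mean of
    # `series`? The order of `series` is irrelevant to the result.
    mean = statistics.mean(series)
    stdev = statistics.pstdev(series)
    return abs(value - mean) > 3 * stdev

history = [10, 12, 11, 9, 10, 11, 13, 10, 12, 11]
latest = 30

shuffled = history[:]
random.shuffle(shuffled)

# Identical verdicts for the ordered and shuffled history, because
# mean and standard deviation ignore ordering.
print(three_sigma_anomalous(history, latest))   # True
print(three_sigma_anomalous(shuffled, latest))  # True
```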

earthgecko commented 4 years ago

Hi @dvshah indeed 3 sigma analysis has its limitations and some downfalls, and irritatingly it is sometimes dependent on the data itself; it works better with some data than others.

If you have an example data set, I would be grateful if you could share it with me so I can take a look, along with any ideas you may have on any algorithm/s that may be better suited to the data you are describing. A description of how you are shuffling the data and testing the algorithms would be helpful too.

Depending on the range of the data, a 50% lower peak in a day may not trigger as anomalous in terms of 3 sigma, and depending on the data that can be expected. 3 sigma is not fantastic at detecting decreases in fairly low range data, and unfortunately I have not yet discovered a method to handle this type of detection better. Although, in all honesty, I have not put a lot of time into finding one either; it is something that would be nice to solve.

However, due to these limitations, the Boundary module was specifically added to let you monitor things like this in a manner that is not data dependent and whose performance does not vary with past variability in the data, by allowing you to define simple, specific algorithms for metrics. Take a sales metric for example: alert if x < y in z seconds, with the additional ability to define x as a minimum average over z seconds/data points. Boundary is there to serve metrics where the varying, unsupervised 3 sigma reliability is not sufficient, letting you define absolute ranges (supervised models) on key metrics. The Boundary analysis and the Analyzer/Mirage analysis are independent of each other and happily coexist as independent services using the same Redis data.
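The kind of rule Boundary expresses can be sketched in a few lines. The function below only illustrates the "average over the last z data points drops below an absolute threshold" idea; it is not Skyline's actual Boundary implementation, and the metric values are made up:

```python
def breaches_lower_boundary(datapoints, threshold, window):
    # Trigger if the average of the last `window` data points falls
    # below the absolute `threshold` (a supervised, data-independent rule).
    recent = datapoints[-window:]
    if len(recent) < window:
        return False  # not enough data to evaluate the rule
    return sum(recent) / window < threshold

sales = [120, 115, 130, 118, 42, 38, 35]
print(breaches_lower_boundary(sales, threshold=100, window=3))  # True
```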

But to further explain what is meant by allowing the data to describe the seasonality, let us take the following example 24 hour time series.

[image: Graphite 24 hour time series]

We can see there is a significantly large deviation after 8PM. If we analyse this data with the 3 sigma algorithms using the 24 hour data that analyzer uses, we get the following anomalies with an EXPIRATION_TIME of 3600 seconds and a CONSENSUS of 6 has been applied.

[image: Crucible 3 sigma analysis of the 24 hour data]

The EXPIRATION_TIME defines that a further anomaly should not be labelled within 1 hour after an anomaly has been labelled, and the CONSENSUS of 6 means that at least 6 of the 9 3 sigma algorithms must trigger for a data point to be labelled as anomalous. Under these conditions we would get 4 alerts/anomalies. Not ideal.
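How EXPIRATION_TIME and CONSENSUS combine can be sketched as follows; this is an illustrative simplification, not Skyline's actual code, and the trigger counts are invented:

```python
def consensus_anomalies(trigger_counts, consensus, expiration_time, interval):
    # `trigger_counts` holds, per data point, how many of the 9
    # algorithms triggered; `interval` is the seconds between points.
    anomalies = []
    last_anomaly_ts = None
    for i, count in enumerate(trigger_counts):
        ts = i * interval
        if count >= consensus:
            # suppress anomalies within EXPIRATION_TIME of the last one
            if last_anomaly_ts is None or ts - last_anomaly_ts >= expiration_time:
                anomalies.append(i)
                last_anomaly_ts = ts
    return anomalies

# 60s data: points 5 and 10 both reach consensus, but point 10 is only
# 300s after point 5, inside the 3600s expiration window, so suppressed.
counts = [0] * 5 + [7] + [0] * 4 + [8] + [0] * 5
print(consensus_anomalies(counts, consensus=6, expiration_time=3600, interval=60))  # [5]
```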

Now let us say that we define this metric as having a seasonality of 7 days. We know that we can expect certain behaviour (peaks and troughs) over the 7 day period which may not be described in the 24 hour period. If we surface the 7 day time series:

[image: Graphite 7 day time series]

It is immediately evident that similar spikes are seen almost daily and that the spike seen after 8PM in the 24 hour data, when considered against the 7 day data, is not so significant. If we now analyse the same metric as Mirage does, at 7 days (with some normal Graphite down sampling/aggregation, in this case from 1 data point per 60s to 1 data point per 600s), using the same algorithms, EXPIRATION_TIME and CONSENSUS:

[image: Crucible 3 sigma analysis of the 7 day data]

It is clear that the 8-9PM peak would not be labelled as anomalous in the 7 day time series, as the peak is within the normal range of the similar periodic peaks which occur throughout the period. In fact, there would be no alerts/anomalies triggered for the last 24 hours when the metric is analysed in the context of 7 days. If it were only analysed in the 24 hour context, there would have been 4 false positive alerts/anomalies.

This method works fairly well at handling normal variations that are seen quite frequently over a longer period but rarely in the short period (e.g. 24 hr). If the data were only analysed at the short period, these seasonal/periodic expected variations would not be accounted for; defining the metric seasonality as 168 hours changes which data points are labelled as anomalous.

However the same applies even over 7 days, 3 sigma is not great at detecting gradual decreases, especially in relatively low range time series.

I hope this helps explain the concept of seasonality in Skyline and I look forward to getting an example data set and some good ideas on how to solve the "3 sigma does not work too well on lower range, gradually decreasing time series" problem :)

dvshah commented 4 years ago

This is the data file; the timestamp is in total seconds. Data is hourly. data.zip

Now, if we look at the data on the second Tuesday (~200th data point), there is a huge dip that should be detected as an anomaly. But if we consider this data point with a one day window then the data point is not anomalous. Also, the data on the second Friday is very low compared to the previous weekdays, hence it should also be marked as an anomaly.

Also, the peak on the third Monday can be considered anomalous, and I believe that Skyline will be able to detect it, because the data value at that point is completely outside the range of the previous data points.

What can be done is to only consider data points at the same hour while calculating anomalies (with a 1-2 week window). But that is just one option. Does Skyline have the functionality to detect these types of anomalies? I am currently running the algorithms on data from the same hour and it is working fine (with a low consensus of course, since many algorithms won't work on a low number of data points, and I did some tweaking to reduce false positives).
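The same-hour idea described above can be sketched like this. It is an illustrative sketch only, not Skyline code, and the series is synthetic:

```python
import statistics

def same_hour_anomalous(hourly_series, index, weeks=2, sigma=3):
    # Compare the point at `index` against the points at the same hour
    # of day over the previous `weeks` weeks, using a 3-sigma test.
    history = [hourly_series[i]
               for i in range(index - weeks * 7 * 24, index, 24) if i >= 0]
    if len(history) < 3:
        return False  # too few same-hour samples for a meaningful test
    mean = statistics.mean(history)
    stdev = statistics.pstdev(history)
    if stdev == 0:
        return hourly_series[index] != mean
    return abs(hourly_series[index] - mean) > sigma * stdev

# 14 days of hourly data with a daily peak at hour 12, then a day
# where the hour-12 peak collapses to 15.
series = [100 if hour == 12 else 10 for _ in range(14) for hour in range(24)]
series += [10] * 12 + [15] + [10] * 11
print(same_hour_anomalous(series, 14 * 24 + 12))  # True
```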

earthgecko commented 4 years ago

Hi @dvshah the consensus based 3 sigma is not going to trigger any of those data points in that time series, as none of the data points are anomalous in terms of the 3 sigma algorithms, no matter what sample size is used; 13 to 30 Jan, 7 days, 3 days or 24 hours makes no difference. histogram_bins is the only algorithm that is going to trigger, especially if you are just using the analyzer algorithms, as they are expecting higher frequency data than one data point per hour. Using crucible and the crucible algorithms, unlike analyzer, determines the FULL_DURATION of the time series rather than just using the default 86400, and adjusts the data used in the algorithm to match the frequency. However crucible will not trigger any anomalies on this data set either, even with adjustments to suit the frequency of the data.

Adding your own algorithm is not too difficult, however it is not as easy as it should be. It sounds like you have managed to do so for testing, although it is difficult to make Skyline use your custom CONSENSUS without changing the code.

I will add some functionality to make it easier to add custom algorithms and define custom requirements and CONSENSUS, etc for custom algorithms. I will let you know.

earthgecko commented 4 years ago

Hi @dvshah you can now add your custom algorithm as of commit 7e604212cc7d6370198c1b3e1c8fca7f246b8182 (ensure you diff your settings.py with the one in the commit to determine the new settings variables that are required). Note this is a py3 feature only.

When you upgrade to 7e604212cc7d6370198c1b3e1c8fca7f246b8182 ensure that you install the new dependency in your Skyline Python environment:

pip install timeout-decorator==0.4.1

This custom algorithm functionality allows you to add your algorithm and run it, overriding the 3 sigma CONSENSUS, with settings.CUSTOM_ALGORITHMS set to something like this example:

    CUSTOM_ALGORITHMS = {
        'last_same_hours': {
            'namespaces': ['dvshah.sales'],
            'algorithm_source': '/home/dvshah/skyline/custom_algorithms/last_same_hours.py',
            # Pass the argument 604800 for the sample_period parameter and
            # enable debug_logging in the algorithm itself
            'algorithm_parameters': {
              'sample_period': 604800,
              'debug_logging': True
            },
            'max_execution_time': 0.3,
            'consensus': 1,
            'algorithms_allowed_in_consensus': [],
            'run_3sigma_algorithms': False,
            # This does not run on analyzer as it is weekly data
            'use_with': ['mirage', 'crucible'],
            'debug_logging': True,
        },
    }

Use the example algorithm referenced in https://earthgecko-skyline.readthedocs.io/en/latest/algorithms/custom-algorithms.html#last-same-hours and modify it with your algorithm code. You also need to ensure that the metric is defined as a Mirage enabled metric in the normal way, ensuring it matches an smtp alert defined in settings.ALERTS with a SECOND_ORDER_RESOLUTION_HOURS declared, as for a normal Mirage metric. The final step required in your situation is to add it to the settings.MIRAGE_ALWAYS_METRICS list; this ensures that even if Analyzer's 3 sigma algorithms do not trigger an anomaly, the metric will still be sent to Mirage to be analysed on every Analyzer run. Restart Mirage and that should be it.
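For illustration, the core analysis step of a last_same_hours.py could look like the sketch below, consuming the sample_period value passed via algorithm_parameters. The exact wrapper signature and error handling Skyline expects from a custom algorithm are described in the custom-algorithms documentation; this is only a hypothetical sketch of the analysis logic, not the documented example:

```python
import statistics

def last_same_hours(timeseries, algorithm_parameters):
    # Test the latest (timestamp, value) against values recorded at the
    # same hour of day within the past sample_period seconds.
    sample_period = algorithm_parameters.get('sample_period', 604800)
    last_ts, last_value = timeseries[-1]
    window_start = last_ts - sample_period
    target_hour = (last_ts // 3600) % 24
    history = [v for ts, v in timeseries[:-1]
               if ts >= window_start and (ts // 3600) % 24 == target_hour]
    if len(history) < 3:
        return None  # insufficient same-hour data to judge
    mean = statistics.mean(history)
    stdev = statistics.pstdev(history)
    if stdev == 0:
        return last_value != mean
    return abs(last_value - mean) > 3 * stdev
```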

Please read the full documentation which can be found here: https://earthgecko-skyline.readthedocs.io/en/latest/algorithms/custom-algorithms.html

If you have any problems after reading the documentation and implementing your algorithm, just reopen this.

Good luck and happy anomaly hunting.