elastic / ml-cpp

Machine learning C++ code

[ML] Improve handling of bucket count variation for mean value anomaly detection #1386

Open tveasey opened 4 years ago

tveasey commented 4 years ago

Currently, we use a worst-case estimate of how changes in the count of values in a bucket affect the variance of their mean. This is safe in the sense that it does not generate false positives, but it can lead to large changes in the model plot bounds, and potentially to false negatives, when the count of values in the bucket is low.

Specifically, we assume all measurements are independent, so the variance of the mean statistic is proportional to 1 / "number of samples" in the bucket. If the rate of values is highly variable, this can greatly increase the width of the model bounds when the count is low. Unfortunately, this assumption is not calibrated to the actual data behaviour. For example, at the other extreme, if all measurements in each bucket were perfectly correlated, the variance of the mean statistic would not change with bucket count at all.
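
To make the two extremes concrete, here is a minimal sketch. It assumes an equicorrelation model in which every pair of measurements in a bucket shares the same correlation `rho`; that model, and the function name, are illustrative assumptions rather than anything ml-cpp implements. With `rho = 0` it gives the 1 / n scaling described above, and with `rho = 1` the variance of the mean does not depend on the count.

```python
def variance_of_mean(sigma2: float, n: int, rho: float) -> float:
    """Variance of the mean of n samples.

    Each sample has variance sigma2 and every pair of samples has
    correlation rho (an assumed equicorrelation model):
        Var(mean) = sigma2 * (1 + (n - 1) * rho) / n
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    return sigma2 * (1.0 + (n - 1) * rho) / n


if __name__ == "__main__":
    # Compare the independent and perfectly correlated cases as the
    # bucket count grows.
    for n in (1, 10, 100):
        independent = variance_of_mean(1.0, n, 0.0)
        correlated = variance_of_mean(1.0, n, 1.0)
        print(n, independent, correlated)
```

Real data will usually sit somewhere between these extremes, which is why the scaling needs to be estimated rather than assumed.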

It would be possible to estimate the relationship between the bucket count and the sufficient statistics related to data variation for all the residual distributions we fit, since we know the sample count for each bucket. This would also be a more accurate way of calibrating heavy-tailed distributions, such as the log-normal, to observed changes in the seasonal variation.

A computationally feasible formulation would be to use linear regression. For example, for the normal model we could fit the linear model (x_i - m)^2 = [c_i s_i] [p_1 p_2]^t for parameters p_1 and p_2, where x_i are the observed bucket values, m is the mean of the x_i, and c_i and s_i are the bucket count and seasonal variance scale, respectively. If we solve this in the least squares sense, we only need to maintain a small set of statistics rather than all bucket values, which is the key to this being tractable for us in the streaming setting. The same formulation carries over to the log mean and log variance we estimate for the log-normal distribution, and so on.
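
The least squares formulation above can be sketched as follows. This is an illustration only, not the ml-cpp implementation: the class name `VarianceRegression` is hypothetical, the mean `m` is assumed to be supplied by the caller (in practice it would itself be an online estimate), and `c` and `s` are taken as given regressors. The point is that only five running sums are needed to solve the 2x2 normal equations.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class VarianceRegression:
    """Streaming least squares fit of (x - m)^2 = p1 * c + p2 * s.

    Only the sums entering the normal equations are stored, so memory
    use is constant in the number of buckets seen.
    """
    scc: float = 0.0  # sum of c_i * c_i
    scs: float = 0.0  # sum of c_i * s_i
    sss: float = 0.0  # sum of s_i * s_i
    scy: float = 0.0  # sum of c_i * y_i
    ssy: float = 0.0  # sum of s_i * y_i

    def add(self, x: float, m: float, c: float, s: float) -> None:
        """Fold one bucket value into the running sums."""
        y = (x - m) ** 2
        self.scc += c * c
        self.scs += c * s
        self.sss += s * s
        self.scy += c * y
        self.ssy += s * y

    def solve(self) -> Optional[Tuple[float, float]]:
        """Return (p1, p2), or None if the system is near singular."""
        det = self.scc * self.sss - self.scs * self.scs
        if abs(det) <= 1e-12 * max(1.0, self.scc * self.sss):
            return None
        p1 = (self.sss * self.scy - self.scs * self.ssy) / det
        p2 = (self.scc * self.ssy - self.scs * self.scy) / det
        return p1, p2
```

A near singular system arises when c_i and s_i are proportional across all buckets, for example when neither varies, in which case the two parameters cannot be separated and a fallback is needed.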

sophiec20 commented 4 years ago

The above applies to the mean, median and variance functions.

LucaWintergerst commented 2 years ago

Tom was just helping me troubleshoot this problem with one of our out-of-the-box APM jobs.

This is what this behaviour looks like:

[image: Screenshot 2022-06-30 at 13 14 08]

And this is a fixed version, where we used a summary_count_field_name field that always had the value 1. In this particular case I was running a cardinality agg on a field with a cardinality of 1, but there should be lots of other options too. Some kind of scripted agg might even be quicker.
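
For anyone wanting to try the same workaround, the configuration might look roughly like the sketch below, written as Python dicts for the job and datafeed bodies. Every name other than summary_count_field_name is a placeholder assumption: the index pattern, the `latency` metric field, the `constant_field` keyword field (assumed to hold a single distinct value) and the 15m bucket span are all hypothetical and not taken from the actual APM job.

```python
# Hypothetical job body: the detector reads a pre-aggregated mean, and the
# summary count is taken from an aggregation that always evaluates to 1.
job = {
    "analysis_config": {
        "bucket_span": "15m",
        "summary_count_field_name": "unit_count",
        "detectors": [{"function": "mean", "field_name": "latency"}],
    },
    "data_description": {"time_field": "@timestamp"},
}

# Hypothetical datafeed body: one date_histogram bucket per bucket span.
datafeed = {
    "indices": ["my-apm-index-*"],  # placeholder index pattern
    "aggregations": {
        "buckets": {
            "date_histogram": {"field": "@timestamp", "fixed_interval": "15m"},
            "aggregations": {
                # Latest timestamp in the bucket, used as the record time.
                "@timestamp": {"max": {"field": "@timestamp"}},
                # The metric being modelled, averaged per bucket.
                "latency": {"avg": {"field": "latency"}},
                # Cardinality of a field with one distinct value, so the
                # reported count is 1 whenever the bucket has documents.
                "unit_count": {"cardinality": {"field": "constant_field"}},
            },
        }
    },
}
```

The design idea is simply that, with the count pinned to 1, the model no longer rescales the variance of the mean as the true document count changes.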

[image: Screenshot 2022-06-30 at 17 50 38]

As you can tell from the screenshots, the new version looks much better.

valeriy42 commented 4 months ago

The current behavior:

[image]