elastic / ml-cpp

Machine learning C++ code

[ML] Don't try and correct for sample count when estimating statistic variances for anomaly detection #2677

Open · tveasey opened this pull request 2 months ago

tveasey commented 2 months ago

Currently, we:

  1. Try to ensure each metric sample contains the same number of values
  2. Correct the variance, assuming the raw values are independent samples from some distribution, when we compute time bucket anomaly scores

This adds significant complexity to sampling metrics and creates a disconnect between the data we show in visualisations and the data we use for anomaly detection. Furthermore, the independence assumption frequently does not hold, in which case our current behaviour can lead to false negatives. This is particularly problematic for data where outages are accompanied by a significant fall in data rate. The choice to correct the variance predates our modelling of periodic variance, which now better accounts for the most common case: a periodic data rate.
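
To make the failure mode concrete, here is a toy numerical sketch (a simplification for illustration, not the ml-cpp implementation; all constants are made up): if the model has learned the spread of bucket means computed from n values, and the correction rescales that variance by n/m for a bucket containing only m values, then the same drop in the bucket mean looks progressively less anomalous as the data rate falls.

import numpy as np

# Toy illustration only: the scaling below is the textbook variance-of-the-mean
# correction under independence, not the exact ml-cpp behaviour.
sigma_value = 50.0          # per-value standard deviation of the metric
n_model = 360               # values per bucket at the normal data rate
sigma_mean = sigma_value / np.sqrt(n_model)  # learned spread of bucket means

def count_corrected_sigma(m):
    """Bucket-mean standard deviation rescaled for a bucket with m values,
    assuming the raw values are independent draws."""
    return sigma_mean * np.sqrt(n_model / m)

deviation = 200.0  # an outage: the bucket mean drops from 200 to 0
for m in (360, 90, 15):
    z = deviation / count_corrected_sigma(m)
    print(f"{m:4d} values in bucket -> drop of {z:5.1f} corrected standard deviations")

# Prints roughly 75.9, 37.9 and 15.5: the fewer values the bucket contains, the
# more the corrected variance inflates and the less surprising the drop appears,
# which is exactly when an outage is suppressing the data rate.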

In this PR I have reverted to using the raw time bucket statistics for model updates and anomaly detection, relying on periodic variance estimation to handle (common instances of) a time-varying data rate. This is a step towards #1386.
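
For intuition on what periodic variance estimation buys here, a hypothetical sketch (again an illustration, not the actual C++ model): if the variance of the raw bucket statistic is learned as a function of time of day, buckets that fall in low-rate hours are naturally assigned wider bounds, with no per-bucket sample-count correction.

import pandas as pd

def periodic_variance_profile(bucket_means: pd.Series) -> pd.Series:
    """Estimate the bucket-statistic variance per hour of day.

    bucket_means: raw per-bucket means indexed by bucket start time
    (a DatetimeIndex). Hypothetical helper for illustration only.
    """
    return bucket_means.groupby(bucket_means.index.hour).var()

# Hours in which the data rate is low produce noisier bucket means, so the
# profile learns a larger variance there and the confidence bounds widen at
# those times of day, absorbing the rate-driven variability directly.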

valeriy42 commented 1 month ago

Behaviour before the change:

(screenshot)

Behaviour after the change:

(screenshot)

I use the following script to generate synthetic data:

import pandas as pd
import numpy as np

def generate_variable_frequency_data():
    """Generate variable frequency throughput data with a failure scenario.

    Returns:
        pandas.DataFrame: A DataFrame containing the generated data with two columns:
            - '@timefield': Timestamps of the data points.
            - 'transaction_throughput': Throughput values at each timestamp.
    """
    # Define start and end dates
    start_date = pd.to_datetime("2024-04-01")
    end_date = pd.to_datetime("2024-04-21")  # 20-day period

    # Initialize lists to store timestamps and throughput values
    timestamps = []
    throughput_values = []

    # Initial timestamp
    current_time = start_date

    while current_time <= end_date:
        # Append the current timestamp
        timestamps.append(current_time)

        # Generate a throughput value with normal variability
        throughput = np.random.normal(200, 50)
        throughput = max(0, throughput)  # Ensure non-negative throughput
        throughput_values.append(throughput)

        # Generate the next inter-arrival interval: sinusoidal with a 24-hour period, plus noise
        base_frequency = 10  # base inter-arrival interval in seconds
        sinusoidal_variation = 50 * np.sin(
            2 * np.pi * current_time.hour / 24
        )  # sinusoidal variation
        noise = np.random.normal(0, 5)  # noise
        interval = base_frequency + sinusoidal_variation + noise

        # Simulate an outage: the data rate drops sharply for one day
        if current_time > pd.to_datetime(
            "2024-04-18"
        ) and current_time < pd.to_datetime("2024-04-19"):
            interval *= 25  # Stretch the interval by a factor of 25 (data rate drops ~25x)
            throughput_values[-1] = 0
        # Calculate the next timestamp
        current_time += pd.to_timedelta(abs(interval), unit="s")

    return pd.DataFrame(
        {"@timefield": timestamps, "transaction_throughput": throughput_values}
    )

if __name__ == "__main__":

    # Generate data
    data = generate_variable_frequency_data()

    # Save the data to a CSV file
    data.to_csv("variable_frequency_throughput_data.csv", index=False)

Hence, while the data frequency is time-dependent, the metric value (throughput) is itself drawn from a fixed normal distribution. After the change its confidence interval is estimated correctly, whereas before the change the confidence bounds varied over time, which is wrong.
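
As a quick check of that claim (assuming the CSV and column names produced by the script above), one can bucket the generated data and compare the per-bucket counts and means:

import pandas as pd

data = pd.read_csv(
    "variable_frequency_throughput_data.csv", parse_dates=["@timefield"]
)
buckets = data.set_index("@timefield").resample("15min")["transaction_throughput"]
summary = buckets.agg(["count", "mean", "std"])

# Bucket counts swing with the sinusoidal data rate, while the bucket means
# stay close to 200 outside the simulated outage window.
print(summary["count"].describe())
print(summary["mean"].describe())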

/cc @tveasey