JoeyHendricks / STATS-PAL

A relatively simple but powerful heuristic that can automate performance test result analysis by using powerful statistics.
GNU General Public License v3.0

Trouble getting started #5

Open andy-tarr opened 1 year ago

andy-tarr commented 1 year ago

I've just started looking at the project to trial it in my CI/CD pipeline, but I'm hitting a few issues:

The README section "Quickly get started comparing the results of two performance tests" references StatisticalDistanceTest, but this class doesn't exist. The file is here: https://github.com/JoeyHendricks/automated-performance-test-result-analysis/blob/master/heuristics/kolmogorov_smirnov_and_wasserstein.py but it doesn't contain a class with that name. Should it be 'StatisticalDistance'?

Assuming the above is correct, running the sample code raises an error (I changed location_hendricks_set_001 to my own file path):

from heuristics.kolmogorov_smirnov_and_wasserstein import StatisticalDistance
from data import location_hendricks_set_001  # <-- My primary example data set.
from data.wranglers import ConvertCsvResultsIntoDictionary

# As an example I provided a way to quickly convert a csv file into a Python dictionary.
raw_data = ConvertCsvResultsIntoDictionary(location_hendricks_set_001).data

# Run the distance test against the given data.
stats_distance_test = StatisticalDistance(
    population_a=raw_data["RID-1"]["response_times"],
    population_b=raw_data["RID-2"]["response_times"]
)

# Below printed information can be used to control a CI/CD pipeline. 
print(stats_distance_test.kolmogorov_smirnov_distance)  # >> 0.096
print(stats_distance_test.wasserstein_distance)  # >> 0.100
print(stats_distance_test.score)  # >> 89.70
print(stats_distance_test.rank)  # >> C

Errors:

Traceback (most recent call last):
  File "C:\Users\Andy Tarr\AppData\Roaming\JetBrains\IntelliJIdea2023.1\plugins\python\helpers\pydev\pydevconsole.py", line 364, in runcode
    coro = func()
  File "<input>", line 1, in <module>
  File "C:\Users\Andy Tarr\AppData\Roaming\JetBrains\IntelliJIdea2023.1\plugins\python\helpers\pydev\_pydev_bundle\pydev_umd.py", line 198, in runfile
    pydev_imports.execfile(filename, global_vars, local_vars)  # execute the script
  File "C:\Users\Andy Tarr\AppData\Roaming\JetBrains\IntelliJIdea2023.1\plugins\python\helpers\pydev\_pydev_imps\_pydev_execfile.py", line 18, in execfile
    exec(compile(contents+"\n", file, 'exec'), glob, loc)
  File "C:\Github\automated-performance-test-result-analysis\simulations\Trial.py", line 9, in <module>
    stats_distance_test = StatisticalDistance(
TypeError: __init__() got an unexpected keyword argument 'population_a'

I'm not very familiar with Python (Java background). Thanks in advance!
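
A note for readers who hit the same TypeError: the message means the constructor has no parameter named population_a, so the README example and the class no longer match. One quick way to see which keyword arguments a class accepts is Python's standard inspect.signature. The sketch below uses a hypothetical stand-in class, not the project's real code; its parameter names are taken from the maintainer's reply in this thread. In practice you would inspect the real StatisticalDistance instead.

```python
import inspect


class StatisticalDistanceStandIn:
    """Hypothetical stand-in, NOT the project's class.

    The parameter names mirror the keywords used in the maintainer's reply.
    """

    def __init__(self, baseline_ecdf, benchmark_ecdf, heuristic):
        self.baseline_ecdf = baseline_ecdf
        self.benchmark_ecdf = benchmark_ecdf
        self.heuristic = heuristic


def accepted_keywords(cls) -> list:
    """Return the constructor's parameter names, excluding 'self'."""
    parameters = inspect.signature(cls.__init__).parameters
    return [name for name in parameters if name != "self"]


keywords = accepted_keywords(StatisticalDistanceStandIn)
print(keywords)

# Passing a keyword that is not in that list reproduces the kind of error
# shown in the traceback above.
try:
    StatisticalDistanceStandIn(population_a=[1.0], population_b=[2.0])
    raised = False
except TypeError as error:
    raised = True
    print(f"TypeError: {error}")
```

Coming from Java, this is roughly the equivalent of checking a constructor's signature before calling it; Python only reports the mismatch at run time.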

JoeyHendricks commented 1 year ago

Hi Andy, glad we had a chat. As I said during our call, the problem is that the object is not receiving the right arguments. If you follow the procedure I outlined during our chat, which I have also written out in the code below, you should be able to get it running:

from xxx import StatisticalDistance
import numpy as np
import pandas as pd

def calculate_ecdf(filtered_sample: list) -> pd.DataFrame:
    """
    Will calculate the eCDF to find the empirical distribution of our population.
    This function will then create a dataframe which will contain the measure
    and its probability.

    Furthermore, this function drops the upper tail of the new distribution
    (every row with a probability of 0.98 or higher) to filter out extreme outliers.
    More info about empirical cumulative distribution functions can be found here:
    https://en.wikipedia.org/wiki/Empirical_distribution_function

    :param filtered_sample: A normalized list of measurements
    :return: a data frame containing the ECDF
    """
    ecdf = pd.DataFrame(
        {
            'measure': np.sort(filtered_sample),
            'probability': np.arange(len(filtered_sample)) / float(len(filtered_sample)),
        }
    ).fillna(0.00)
    return ecdf[~(ecdf['probability'] >= 0.98)]

def outlier_filter(data: list) -> list:
    """
    Will filter out every value above the 95th percentile of the test.
    :return: A list of floats with the outliers excluded.
    """
    percentile = np.percentile(data, 95)
    return [value for value in data if value <= percentile]

# Placeholder names: replace them with your own lists of response times.
x = outlier_filter(baseline_measurements)
y = outlier_filter(benchmark_measurements)

StatisticalDistance(
    baseline_ecdf=calculate_ecdf(x),
    benchmark_ecdf=calculate_ecdf(y),
    heuristic={
        "rank":
            [
                {"wasserstein_boundary": 0.020, "kolmogorov_smirnov_boundary": 0.075, "rank": "S"},
                {"wasserstein_boundary": 0.030, "kolmogorov_smirnov_boundary": 0.090, "rank": "A"},
                {"wasserstein_boundary": 0.050, "kolmogorov_smirnov_boundary": 0.120, "rank": "B"},
                {"wasserstein_boundary": 0.055, "kolmogorov_smirnov_boundary": 0.140, "rank": "C"},
                {"wasserstein_boundary": 0.075, "kolmogorov_smirnov_boundary": 0.160, "rank": "D"},
                {"wasserstein_boundary": 0.090, "kolmogorov_smirnov_boundary": 0.180, "rank": "E"},
                {"wasserstein_boundary": 0.105, "kolmogorov_smirnov_boundary": 0.200, "rank": "F"}
            ],
        "score":
            {
                "wasserstein_distance":
                    {
                        "matrix_size": 100,
                        "start_value": 0.001,
                        "increment": 0.001
                    },
                "kolmogorov_smirnov_distance":
                    {
                        "matrix_size": 100,
                        "start_value": 0.010,
                        "increment": 0.001
                    }
            }
    }
)

# Recap of the steps:
# 1. Optionally normalize the data.
# 2. Keep only the values at or below the 95th percentile.
# 3. Calculate the eCDF of each filtered sample.
# 4. Pass both eCDFs to the distance test.
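
To see the pre-processing idea in isolation, the steps above can be sketched using only the standard library. This is a minimal illustration, not the project's implementation: the samples are synthetic, the percentile helper re-implements linear interpolation by hand, and ks_distance is the textbook two-sample Kolmogorov-Smirnov statistic, which may differ from what StatisticalDistance computes internally.

```python
import math
import random
from bisect import bisect_right


def percentile(values, q):
    """Percentile with linear interpolation between the two nearest ranks."""
    ordered = sorted(values)
    position = (len(ordered) - 1) * q / 100.0
    lower, upper = math.floor(position), math.ceil(position)
    fraction = position - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * fraction


def outlier_filter(data):
    """Keep only the values at or below the 95th percentile."""
    cutoff = percentile(data, 95)
    return [value for value in data if value <= cutoff]


def ks_distance(sample_a, sample_b):
    """Largest vertical gap between the two empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    return max(
        abs(bisect_right(a, x) / len(a) - bisect_right(b, x) / len(b))
        for x in a + b
    )


# Synthetic response times (assumption: purely illustrative data).
random.seed(42)
baseline = [random.gauss(100, 10) for _ in range(1000)]
benchmark = [random.gauss(110, 10) for _ in range(1000)]

filtered_baseline = outlier_filter(baseline)
filtered_benchmark = outlier_filter(benchmark)

same = ks_distance(filtered_baseline, filtered_baseline)
shifted = ks_distance(filtered_baseline, filtered_benchmark)
print(f"identical runs: {same:.3f}, shifted run: {shifted:.3f}")
```

Identical samples give a distance of zero, and the distance grows as the benchmark distribution drifts away from the baseline, which is the property a CI/CD gate can threshold on.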

I also wrote about pre-processing your raw data before analysis in my latest article linked here:

Let me know if you need any additional help.

Cheers, Joey