Trouble getting started

JoeyHendricks / STATS-PAL

A relatively simple but powerful heuristic that can automate performance test result analysis by using powerful statistics.

GNU General Public License v3.0


from heuristics.kolmogorov_smirnov_and_wasserstein import StatisticalDistance
from data import location_hendricks_set_001  # <-- My primary example data set.
from data.wranglers import ConvertCsvResultsIntoDictionary

# As an example I provided a way to quickly convert a csv file into a Python dictionary.
raw_data = ConvertCsvResultsIntoDictionary(location_hendricks_set_001).data

# Run the distance test against the given data.
stats_distance_test = StatisticalDistance(
    population_a=raw_data["RID-1"]["response_times"],
    population_b=raw_data["RID-2"]["response_times"]
)

# Below printed information can be used to control a CI/CD pipeline.
print(stats_distance_test.kolmogorov_smirnov_distance)  # >> 0.096
print(stats_distance_test.wasserstein_distance)  # >> 0.100
print(stats_distance_test.score)  # >> 89.70
print(stats_distance_test.rank)  # >> C

Traceback (most recent call last):
  File "C:\Users\Andy Tarr\AppData\Roaming\JetBrains\IntelliJIdea2023.1\plugins\python\helpers\pydev\pydevconsole.py", line 364, in runcode
    coro = func()
  File "<input>", line 1, in <module>
  File "C:\Users\Andy Tarr\AppData\Roaming\JetBrains\IntelliJIdea2023.1\plugins\python\helpers\pydev\_pydev_bundle\pydev_umd.py", line 198, in runfile
    pydev_imports.execfile(filename, global_vars, local_vars)  # execute the script
  File "C:\Users\Andy Tarr\AppData\Roaming\JetBrains\IntelliJIdea2023.1\plugins\python\helpers\pydev\_pydev_imps\_pydev_execfile.py", line 18, in execfile
    exec(compile(contents+"\n", file, 'exec'), glob, loc)
  File "C:\Github\automated-performance-test-result-analysis\simulations\Trial.py", line 9, in <module>
    stats_distance_test = StatisticalDistance(
TypeError: __init__() got an unexpected keyword argument 'population_a'
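A TypeError like the one above means the constructor was called with a keyword it does not define. One quick way to see which keywords a class accepts is `inspect.signature`. The sketch below is only an illustration: `StandInDistance` is a hypothetical stand-in and not the real `StatisticalDistance` class, and the helper names `accepted_keywords` and `call_fails_with_type_error` are made up for this example.

```python
import inspect


class StandInDistance:
    """Hypothetical stand-in, NOT the real StatisticalDistance class."""

    def __init__(self, baseline_ecdf, benchmark_ecdf, heuristic=None):
        self.baseline_ecdf = baseline_ecdf
        self.benchmark_ecdf = benchmark_ecdf
        self.heuristic = heuristic


def accepted_keywords(cls) -> list:
    """Return the constructor's parameter names, excluding 'self'."""
    parameters = inspect.signature(cls.__init__).parameters
    return [name for name in parameters if name != "self"]


def call_fails_with_type_error(cls, **kwargs) -> bool:
    """Return True when constructing cls with these keywords raises TypeError."""
    try:
        cls(**kwargs)
    except TypeError:
        return True
    return False


# List the keywords the stand-in constructor accepts.
print(accepted_keywords(StandInDistance))

# An unknown keyword such as population_a triggers the same kind of TypeError.
print(call_fails_with_type_error(StandInDistance, population_a=[1.0], population_b=[2.0]))
```

Running the same `accepted_keywords` helper against the class you actually imported should show which argument names your installed version expects.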

Hi Andy, glad we had a chat. As I said during our call, the problem is that the object is not receiving the right arguments: StatisticalDistance expects two ECDF data frames (baseline_ecdf and benchmark_ecdf) rather than the raw populations. If you follow the procedure I outlined during our chat, shown in the code below, you should be able to get started:

from xxx import StatisticalDistance
import numpy as np
import pandas as pd

def calculate_ecdf(filtered_sample: list) -> pd.DataFrame:
    """
    Will calculate the eCDF to find the empirical distribution of our population.
    This function will then create a dataframe which will contain the measure
    and its probability.

    Furthermore, this function trims the extreme upper end of the new distribution
    by dropping every row with a probability of 0.98 or higher.
    More info about empirical cumulative distribution functions can be found here:
    https://en.wikipedia.org/wiki/Empirical_distribution_function

    :param filtered_sample: A normalized list of measurements
    :return: a data frame containing the ECDF
    """
    ecdf = pd.DataFrame(
        {
            'measure': np.sort(filtered_sample),
            'probability': np.arange(len(filtered_sample)) / float(len(filtered_sample)),
        }
    ).fillna(0.00)
    return ecdf[~(ecdf['probability'] >= 0.98)]

def outlier_filter(data: list) -> list:
    """
    Filters out outliers by dropping every value above the 95th percentile.
    :param data: A list of measurements
    :return: A list of floats with the outliers excluded.
    """
    percentile = np.percentile(data, 95)
    return [value for value in data if value <= percentile]

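# Pass in your own lists of measurements here, for example the response
# times from the raw_data dictionary loaded in your script.
x = outlier_filter(raw_data["RID-1"]["response_times"])
y = outlier_filter(raw_data["RID-2"]["response_times"])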

StatisticalDistance(
    baseline_ecdf=calculate_ecdf(x),
    benchmark_ecdf=calculate_ecdf(y),
    heuristic={
        "rank":
            [
                {"wasserstein_boundary": 0.020, "kolmogorov_smirnov_boundary": 0.075, "rank": "S"},
                {"wasserstein_boundary": 0.030, "kolmogorov_smirnov_boundary": 0.090, "rank": "A"},
                {"wasserstein_boundary": 0.050, "kolmogorov_smirnov_boundary": 0.120, "rank": "B"},
                {"wasserstein_boundary": 0.055, "kolmogorov_smirnov_boundary": 0.140, "rank": "C"},
                {"wasserstein_boundary": 0.075, "kolmogorov_smirnov_boundary": 0.160, "rank": "D"},
                {"wasserstein_boundary": 0.090, "kolmogorov_smirnov_boundary": 0.180, "rank": "E"},
                {"wasserstein_boundary": 0.105, "kolmogorov_smirnov_boundary": 0.200, "rank": "F"}
            ],
        "score":
            {
                "wasserstein_distance":
                    {
                        "matrix_size": 100,
                        "start_value": 0.001,
                        "increment": 0.001
                    },
                "kolmogorov_smirnov_distance":
                    {
                        "matrix_size": 100,
                        "start_value": 0.010,
                        "increment": 0.001
                    }
            }
    }
)

# Summary of the steps:
# 1. Optionally normalize the data.
# 2. Filter the data, keeping only the values at or below the 95th percentile.
# 3. Calculate the ECDF of each filtered sample.
# 4. Give both ECDFs to the distance test.
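
To get a feel for the steps above without the library, here is a minimal NumPy-only sketch. It is an illustration under stated assumptions, not the STATS-PAL implementation: the response times are synthetic, the helper names `ecdf_at`, `ks_distance` and `wasserstein_1d` are hypothetical, and the distances are computed on the filtered raw samples, so the numbers will not necessarily match what `StatisticalDistance` reports.

```python
import numpy as np


def outlier_filter(data) -> np.ndarray:
    """Keep only the values at or below the 95th percentile."""
    values = np.asarray(data, dtype=float)
    return values[values <= np.percentile(values, 95)]


def ecdf_at(sample: np.ndarray, points: np.ndarray) -> np.ndarray:
    """Evaluate the empirical CDF of `sample` at each of `points`."""
    return np.searchsorted(np.sort(sample), points, side="right") / len(sample)


def ks_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Largest vertical gap between the two empirical CDFs."""
    pooled = np.concatenate([a, b])
    return float(np.max(np.abs(ecdf_at(a, pooled) - ecdf_at(b, pooled))))


def wasserstein_1d(a: np.ndarray, b: np.ndarray) -> float:
    """Area between the two empirical CDFs (1-D Wasserstein-1 distance)."""
    pooled = np.sort(np.concatenate([a, b]))
    gaps = np.abs(ecdf_at(a, pooled[:-1]) - ecdf_at(b, pooled[:-1]))
    return float(np.sum(gaps * np.diff(pooled)))


# Synthetic response times in seconds, made up purely for illustration.
rng = np.random.default_rng(seed=42)
baseline = outlier_filter(rng.gamma(shape=2.0, scale=0.5, size=1000))
benchmark = outlier_filter(rng.gamma(shape=2.0, scale=0.6, size=1000))

print("KS distance:", ks_distance(baseline, benchmark))
print("Wasserstein distance:", wasserstein_1d(baseline, benchmark))
```

Both distances are zero for identical samples and grow as the two distributions drift apart, which is the property the rank boundaries in the heuristic rely on.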

I also wrote about pre-processing your raw data before analysis in my latest article linked here:

https://www.linkedin.com/pulse/statistical-magic-spells-automate-performance-test-result-hendricks/

Let me know if you need any additional help.

Cheers, Joey


Trouble getting started #5