Open andy-tarr opened 1 year ago
Hi Andy, glad we had a chat. As I said during our call, the problem is that the object is not receiving the right argument. If you follow the procedure I outlined during our chat, shown in code below, you should be able to get it working:
from xxx import StatisticalDistance
import numpy as np
import pandas as pd
def calculate_ecdf(filtered_sample: list) -> pd.DataFrame:
    """
    Will calculate the eCDF to find the empirical distribution of our population.
    This function will then create a dataframe which will contain the measure
    and its probability.
    It also trims the upper tail of the new distribution by dropping every row
    with a probability of 0.98 or higher.
    More info about empirical cumulative distribution functions can be found here:
    https://en.wikipedia.org/wiki/Empirical_distribution_function
    :param filtered_sample: A normalized list of measurements
    :return: a data frame containing the ECDF
    """
    ecdf = pd.DataFrame(
        {
            'measure': np.sort(filtered_sample),
            'probability': np.arange(len(filtered_sample)) / float(len(filtered_sample)),
        }
    ).fillna(0.00)
    return ecdf[~(ecdf['probability'] >= 0.98)]


def outlier_filter(data: list) -> list:
    """
    Will filter out any and all outliers above the 95th percentile of the test.
    :param data: A list of raw measurements
    :return: A list of floats with the outliers excluded.
    """
    percentile = np.percentile(data, 95)
    return [value for value in data if value <= percentile]
x = outlier_filter([your array here])
y = outlier_filter([your array here])
StatisticalDistance(
    baseline_ecdf=calculate_ecdf(x),
    benchmark_ecdf=calculate_ecdf(y),
    heuristic={
        "rank": [
            {"wasserstein_boundary": 0.020, "kolmogorov_smirnov_boundary": 0.075, "rank": "S"},
            {"wasserstein_boundary": 0.030, "kolmogorov_smirnov_boundary": 0.090, "rank": "A"},
            {"wasserstein_boundary": 0.050, "kolmogorov_smirnov_boundary": 0.120, "rank": "B"},
            {"wasserstein_boundary": 0.055, "kolmogorov_smirnov_boundary": 0.140, "rank": "C"},
            {"wasserstein_boundary": 0.075, "kolmogorov_smirnov_boundary": 0.160, "rank": "D"},
            {"wasserstein_boundary": 0.090, "kolmogorov_smirnov_boundary": 0.180, "rank": "E"},
            {"wasserstein_boundary": 0.105, "kolmogorov_smirnov_boundary": 0.200, "rank": "F"}
        ],
        "score": {
            "wasserstein_distance": {
                "matrix_size": 100,
                "start_value": 0.001,
                "increment": 0.001
            },
            "kolmogorov_smirnov_distance": {
                "matrix_size": 100,
                "start_value": 0.010,
                "increment": 0.001
            }
        }
    }
)
# Steps:
# 1. Optionally normalize the data.
# 2. Filter out the values above the 95th percentile.
# 3. Calculate the ECDF.
# 4. Pass the ECDF data frames to StatisticalDistance.
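For anyone who wants to try the pre-processing steps without the library installed, here is a minimal sketch that runs the two helper functions from above on made-up data and checks that the result is a data frame of the expected shape. The synthetic samples, the seed, and the variable names (`baseline_raw`, `benchmark_raw`, `baseline_ecdf`, `benchmark_ecdf`) are assumptions for illustration only; the normalization step is skipped, and the sketch stops before the `StatisticalDistance` call.

```python
import numpy as np
import pandas as pd


def outlier_filter(data: list) -> list:
    """Keep only the values at or below the 95th percentile."""
    percentile = np.percentile(data, 95)
    return [value for value in data if value <= percentile]


def calculate_ecdf(filtered_sample: list) -> pd.DataFrame:
    """Build the eCDF and drop rows with a probability of 0.98 or higher."""
    ecdf = pd.DataFrame(
        {
            "measure": np.sort(filtered_sample),
            "probability": np.arange(len(filtered_sample)) / float(len(filtered_sample)),
        }
    ).fillna(0.00)
    return ecdf[~(ecdf["probability"] >= 0.98)]


# Synthetic response times, invented purely for this example.
rng = np.random.default_rng(seed=42)
baseline_raw = rng.lognormal(mean=0.0, sigma=0.25, size=1000).tolist()
benchmark_raw = rng.lognormal(mean=0.1, sigma=0.25, size=1000).tolist()

# Step 1: filter outliers. Step 2: calculate the eCDF of what is left.
baseline_ecdf = calculate_ecdf(outlier_filter(baseline_raw))
benchmark_ecdf = calculate_ecdf(outlier_filter(benchmark_raw))

# Both results should be data frames with 'measure' and 'probability'
# columns; these are the objects the snippet above passes as
# baseline_ecdf and benchmark_ecdf.
print(type(baseline_ecdf).__name__, list(baseline_ecdf.columns))
print(baseline_ecdf.head())
```

If the printed type is not `DataFrame`, or one of the two columns is missing, the input is most likely still a raw list or a file path rather than the output of `calculate_ecdf`.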
I also wrote about pre-processing your raw data before analysis in my latest article linked here:
Let me know if you need any additional help.
Cheers, Joey
I've just started looking at the project to trial it with my CI/CD pipeline, but I'm hitting a few issues:
The README section "Quickly get started comparing the results of two performance tests" references StatisticalDistanceTest, but this class doesn't exist. The file is here: https://github.com/JoeyHendricks/automated-performance-test-result-analysis/blob/master/heuristics/kolmogorov_smirnov_and_wasserstein.py but it doesn't contain a class with that name. Should it be 'StatisticalDistance'?
Assuming the above is correct, running the sample code gives errors (location_hendricks_set_001 changed to my file path).
Errors:
I'm not very familiar with Python (Java background). Thanks in advance!