Closed ETA444 closed 7 months ago
Implementation Summary:
The distribution_analysis method within the explore_num() function analyses the distribution characteristics of numerical variables: skewness, kurtosis, and the results of normality tests. The goal is to provide insight into how the data is distributed, which matters for statistical analyses that assume a particular distribution, such as normality.
Purpose:
The purpose of this implementation is to analyse the distribution characteristics of numerical data, including skewness, kurtosis, and normality tests.
Code Breakdown:
Title and Overview:
result.append(f"\n<<______DISTRIBUTION ANALYSIS______>>\n")
result.append(f"✎ Overview of Results*\n")
Define and Initialize Variables:
stats_functions = [
    'min', 'max', 'mean', 'median', 'mode', 'variance',
    'std_dev', 'skewness', 'kurtosis', 'shapiro_p', 'anderson_stat'
]
stats_dict = {stat: [] for stat in stats_functions}
Calculate Statistics:
# skew, kurtosis, shapiro and anderson come from scipy.stats
for variable_name in numerical_variables:
    # Missing values are dropped before any statistic is computed
    data = df[variable_name].dropna()
    var_min, var_max = data.min(), data.max()
    mean = data.mean()
    median = data.median()
    mode = data.mode().tolist()
    variance = data.var()
    std_dev = data.std()
    skewness = skew(data)
    kurt = kurtosis(data)
    shapiro_stat, shapiro_p = shapiro(data)
    anderson_stat = anderson(data)
    # Store one value per variable, in the same order as numerical_variables
    stats_dict['min'].append(var_min)
    stats_dict['max'].append(var_max)
    stats_dict['mean'].append(mean)
    stats_dict['median'].append(median)
    stats_dict['mode'].append(mode[0] if mode else pd.NA)
    stats_dict['variance'].append(variance)
    stats_dict['std_dev'].append(std_dev)
    stats_dict['skewness'].append(skewness)
    stats_dict['kurtosis'].append(kurt)
    stats_dict['shapiro_p'].append(shapiro_p)
    stats_dict['anderson_stat'].append(anderson_stat.statistic)
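To make the skewness and kurtosis values easier to read, here is a minimal standard-library sketch of the quantities involved. It assumes SciPy's defaults for skew() and kurtosis(), which as far as I know are the biased (population) moment estimators, with kurtosis reported as excess kurtosis so that a normal distribution scores about 0. The helper name sample_moments and the two small datasets are hypothetical and are not part of explore_num().

```python
def sample_moments(values):
    """Return (skewness, excess_kurtosis) from biased central moments.

    Illustrative stand-in for scipy.stats.skew / kurtosis with default
    arguments (assumed: bias=True, fisher=True).
    """
    n = len(values)
    if n == 0:
        raise ValueError("need at least one value")
    mean = sum(values) / n
    # Central moments of order 2, 3 and 4, each divided by n (biased form)
    m2 = sum((x - mean) ** 2 for x in values) / n
    m3 = sum((x - mean) ** 3 for x in values) / n
    m4 = sum((x - mean) ** 4 for x in values) / n
    if m2 == 0:
        raise ValueError("all values are identical; moments ratios are undefined")
    skewness = m3 / m2 ** 1.5
    excess_kurtosis = m4 / m2 ** 2 - 3.0  # subtract 3 so a normal curve gives ~0
    return skewness, excess_kurtosis


# Made-up data for illustration only
symmetric = [1, 2, 3, 4, 5]        # skewness 0, excess kurtosis about -1.3
right_tailed = [1, 1, 1, 2, 10]    # long right tail, so skewness is positive

print(sample_moments(symmetric))
print(sample_moments(right_tailed))
```

A negative excess kurtosis, as in the first dataset, corresponds to the "Platykurtic" tip in the summary output.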
Interpretation Tips:
for i, variable_name in enumerate(numerical_variables):
    # Index by position (i) and recompute the Anderson-Darling result here,
    # so each summary reports its own variable rather than the last one processed
    data = df[variable_name].dropna()
    anderson_result = anderson(data)
    result.append(f"\n< Distribution Analysis Summary for: ['{variable_name}'] >\n")
    result.append(f"➡ Min: {stats_dict['min'][i]:.2f}\n➡ Max: {stats_dict['max'][i]:.2f}\n➡ Mean: {stats_dict['mean'][i]:.2f}\n➡ Median: {stats_dict['median'][i]:.2f}\n➡ Mode(s): {stats_dict['mode'][i]}")
    result.append(f"➡ Variance: {stats_dict['variance'][i]:.2f}\n➡ Standard Deviation: {stats_dict['std_dev'][i]:.2f}")
    result.append(f"\n➡ Skewness: {stats_dict['skewness'][i]:.2f}\n ☻ Tip: Symmetric if ~0, left-skewed if <0, right-skewed if >0")
    result.append(f"\n➡ Kurtosis: {stats_dict['kurtosis'][i]:.2f}\n ☻ Tip: Mesokurtic if ~0, Platykurtic if <0, Leptokurtic if >0")
    result.append(f"\n★ Shapiro-Wilk Test for Normality:\n • H0: Data is normally distributed.\n • H1: Data is not normally distributed.\n ➡ p-value = {stats_dict['shapiro_p'][i]:.4f}\n {'✘ Conclusion: Not normally distributed.' if stats_dict['shapiro_p'][i] < 0.05 else '✔ Conclusion: Normally distributed.'}\n")
    result.append(f"\n★ Anderson-Darling Test for Normality:\n ➡ statistic = {stats_dict['anderson_stat'][i]:.4f}\n ➡ significance levels = {anderson_result.significance_level}\n ➡ critical values = {anderson_result.critical_values}\n ☻ Tip: Compare the statistic to critical values. Data is likely not normally distributed if the statistic > critical value.\n")
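The Anderson-Darling tip asks the reader to compare the statistic against each critical value by hand. A rough sketch of that comparison is shown here, assuming the critical values and significance levels are paired by position and that the levels are given in percent, as SciPy's anderson() reports them for the normal case. The helper name rejected_levels is hypothetical, and the numbers are illustrative only, not output from a real dataset.

```python
def rejected_levels(statistic, critical_values, significance_levels):
    """Return the significance levels (in percent) at which normality is rejected.

    Normality is rejected at a given level when the test statistic
    exceeds the critical value paired with that level.
    """
    return [
        level
        for critical, level in zip(critical_values, significance_levels)
        if statistic > critical
    ]


# Illustrative values only; real critical values depend on the sample size
example_critical_values = [0.56, 0.64, 0.77, 0.90, 1.07]
example_levels = [15.0, 10.0, 5.0, 2.5, 1.0]

print(rejected_levels(0.70, example_critical_values, example_levels))
# prints [15.0, 10.0]
```

In this made-up case the statistic exceeds only the first two critical values, so normality would be rejected at the 15% and 10% levels but not at 5%.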
Construct DataFrame:
distribution_df = pd.concat(
    [pd.DataFrame({variable_name: [stats_dict[stat][i]] for i, variable_name in enumerate(numerical_variables)}, index=[stat])
     for stat, values in stats_dict.items()], axis=0
).T
distribution_df.columns = stats_functions
distribution_df.index.name = 'Variable/Statistic'
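The reshaping above can be hard to follow at a glance. As a plain-Python picture of the resulting wide layout (one row per variable, one column per statistic), the sketch below pivots a dictionary of per-statistic lists into per-variable rows. It uses nested dicts as a stand-in for the DataFrame; the helper name to_wide_rows, the variable names, and the values are all hypothetical.

```python
def to_wide_rows(stats_dict, variables):
    """Pivot {statistic: [value per variable]} into {variable: {statistic: value}}.

    Assumes every list in stats_dict has one entry per variable,
    in the same order as `variables`.
    """
    return {
        variable: {stat: values[i] for stat, values in stats_dict.items()}
        for i, variable in enumerate(variables)
    }


# Made-up statistics for two hypothetical variables
example_stats = {"min": [1.0, 10.0], "max": [5.0, 20.0]}
example_variables = ["age", "income"]

rows = to_wide_rows(example_stats, example_variables)
print(rows["income"])
# prints {'min': 10.0, 'max': 20.0}
```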
Output Information:
If method is 'all', a note is appended describing what the function returns and how to use it.
if method.lower() == 'all':
    result.append(f"\n✎ * NOTE: If method='distribution_analysis', aside from the overview above, the function RETURNS:")
    result.append(f"■ 1 - Dataframe: where index are your variables, columns are all the calculated statistics (wide format for readability)")
    result.append(f"☻ HOW TO: df = explore_num(yourdf, yourlist, method='distribution_analysis')")
Combine and Output Results:
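The result lines are joined into one string. Depending on the output parameter, the summary is printed before the DataFrame is returned, or the DataFrame is returned without printing.
combined_result = "\n".join(result)
if output.lower() == 'print':
    print(combined_result)
    return distribution_df
elif output.lower() == 'return':
    return distribution_df
else:
    raise ValueError("Invalid output method. Choose 'print' or 'return'.")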
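Because both valid branches return the same object, the dispatch could also be written by validating the option first and printing only when asked. This is a sketch of that alternative, not the code used in explore_num(); the function name deliver and its parameters are hypothetical.

```python
def deliver(report_lines, payload, output="print"):
    """Print the joined report if requested, and return the payload either way."""
    mode = output.lower()
    # Reject unknown options up front, before any side effects
    if mode not in ("print", "return"):
        raise ValueError("Invalid output method. Choose 'print' or 'return'.")
    if mode == "print":
        print("\n".join(report_lines))
    return payload


# Usage with placeholder values standing in for the result list and DataFrame
summary = deliver(["line one", "line two"], {"rows": 2}, output="return")
```

Validating before printing means an invalid option never produces partial output.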
See the Full Function:
The full implementation can be found in the datasafari repository.
Description:
Method Functionality Idea:
The distribution_analysis method analyses the distribution characteristics of the numerical variables in the DataFrame. It computes descriptive statistics (minimum, maximum, mean, median, mode, variance, and standard deviation), calculates skewness and kurtosis, and runs the Shapiro-Wilk and Anderson-Darling normality tests.
How it operates:
For each variable listed in numerical_variables, the method drops missing values, computes the descriptive statistics, and runs the normality tests. It then appends the descriptive statistics, skewness, kurtosis, and the outcome of the normality tests to the result list.
Usage:
To conduct a distribution analysis of numerical variables, call explore_num() with method='distribution_analysis'. The call returns the results as a DataFrame with one row per variable and one column per statistic. The text summary is printed when output='print'.
Example:
df = explore_num(yourdf, yourlist, method='distribution_analysis')
Notes:
Tips are provided to aid interpretation of skewness, kurtosis, and the normality tests.