ETA444 / datasafari

DataSafari simplifies complex data science tasks into straightforward, powerful one-liners.
https://datasafari.dev
GNU General Public License v3.0

Implement new explore_num() method: 'distribution_analysis' #28

Closed · ETA444 closed this issue 7 months ago

ETA444 commented 9 months ago

Description:


Method Functionality Idea:

The distribution_analysis method conducts an analysis of the distribution characteristics of numerical variables in the DataFrame. It computes descriptive statistics such as minimum, maximum, mean, median, mode, variance, and standard deviation. Additionally, it calculates skewness, kurtosis, and performs normality tests using the Shapiro-Wilk and Anderson-Darling tests.

How it operates:

For each numerical variable specified in numerical_variables, the method computes descriptive statistics and performs normality testing after removing any missing values. It then appends the results, including descriptive statistics, skewness, kurtosis, and the outcome of normality tests, to the result list.
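
A minimal sketch of what this per-variable computation could look like for a single column (illustrative only; the toy DataFrame and column name below are placeholders, not the final implementation):

import pandas as pd
from scipy.stats import skew, kurtosis, shapiro, anderson

df = pd.DataFrame({'feature1': [2.3, 4.1, 3.8, 5.0, 4.4, 4.4, 6.2, 4.7]})  # toy data
data = df['feature1'].dropna()  # missing values are removed before testing

summary = {
    'min': data.min(), 'max': data.max(),
    'mean': data.mean(), 'median': data.median(),
    'mode': data.mode().tolist(),
    'variance': data.var(), 'std_dev': data.std(),
    'skewness': skew(data), 'kurtosis': kurtosis(data),
    'shapiro_p': shapiro(data)[1],              # Shapiro-Wilk p-value
    'anderson_stat': anderson(data).statistic,  # Anderson-Darling statistic
}
print(summary)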

Usage:

To conduct a distribution analysis of numerical variables using the distribution_analysis method:

explore_num(df, numerical_variables, method='distribution_analysis')

This method returns the distribution analysis results as a list.

Example:

distribution_results = explore_num(df, ['feature1', 'feature2', 'feature3'], method='distribution_analysis', output='print')
print(distribution_results)

Notes:


Tips are provided to aid interpretation of skewness, kurtosis, and normality tests.

ETA444 commented 7 months ago

Implementation Summary:

The distribution_analysis method within the explore_num() function is designed to analyse the distribution of numerical variables, reporting descriptive statistics, skewness, and kurtosis, and running normality tests. The goal is to provide insights into the distribution of the data, which is crucial for certain statistical analyses.

Purpose:

The purpose of this implementation is to analyse the distribution characteristics of numerical data, including skewness, kurtosis, and normality tests.
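
For intuition on how these quantities behave, here is a small illustrative example with synthetic data (not part of the implementation):

import numpy as np
from scipy.stats import skew, kurtosis, shapiro

rng = np.random.default_rng(0)
normal_sample = rng.normal(size=1000)       # roughly symmetric, mesokurtic
skewed_sample = rng.exponential(size=1000)  # right-skewed, heavy right tail

print(skew(normal_sample), kurtosis(normal_sample))  # both near 0
print(skew(skewed_sample), kurtosis(skewed_sample))  # both clearly > 0
_, p = shapiro(skewed_sample)
print(p < 0.05)  # True: the Shapiro-Wilk test rejects normality here

The tips printed by the implementation follow the same reading: values near 0 suggest a roughly normal shape, while larger magnitudes flag skew or unusual tail weight.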

Code Breakdown:

  1. Title and Overview:

    • Purpose: To provide a title and overview for the distribution analysis section.
    result.append(f"\n<<______DISTRIBUTION ANALYSIS______>>\n")
    result.append(f"✎ Overview of Results*\n")
  2. Define and Initialize Variables:

    • Purpose: To define the statistics to be calculated and initialize a dictionary to store the results.
    stats_functions = [
       'min', 'max', 'mean', 'median', 'mode', 'variance',
       'std_dev', 'skewness', 'kurtosis', 'shapiro_p', 'anderson_stat'
    ]
    stats_dict = {stat: [] for stat in stats_functions}
  3. Calculate Statistics:

    • Purpose: To calculate the descriptive statistics, skewness, kurtosis, and normality tests for each variable.
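    # Imports this snippet relies on (presumably imported once in the full function):
    # from scipy.stats import skew, kurtosis, shapiro, anderson
    # import pandas as pd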
    for variable_name in numerical_variables:
       data = df[variable_name].dropna()
       var_min, var_max = data.min(), data.max()
       mean = data.mean()
       median = data.median()
       mode = data.mode().tolist()
       variance = data.var()
       std_dev = data.std()
       skewness = skew(data)
       kurt = kurtosis(data)
       shapiro_stat, shapiro_p = shapiro(data)
       anderson_stat = anderson(data)
    
       stats_dict['min'].append(var_min)
       stats_dict['max'].append(var_max)
       stats_dict['mean'].append(mean)
       stats_dict['median'].append(median)
       stats_dict['mode'].append(mode[0] if mode else pd.NA)
       stats_dict['variance'].append(variance)
       stats_dict['std_dev'].append(std_dev)
       stats_dict['skewness'].append(skewness)
       stats_dict['kurtosis'].append(kurt)
       stats_dict['shapiro_p'].append(shapiro_p)
       stats_dict['anderson_stat'].append(anderson_stat.statistic)
  4. Interpretation Tips:

    • Purpose: To interpret the results and provide tips for skewness, kurtosis, and normality tests.
    # Iterate with the index so each summary reports that variable's own statistics
    for i, variable_name in enumerate(numerical_variables):
       anderson_result = anderson(df[variable_name].dropna())  # re-run to get this variable's critical values
       result.append(f"\n< Distribution Analysis Summary for: ['{variable_name}'] >\n")
       result.append(f"➡ Min: {stats_dict['min'][i]:.2f}\n➡ Max: {stats_dict['max'][i]:.2f}\n➡ Mean: {stats_dict['mean'][i]:.2f}\n➡ Median: {stats_dict['median'][i]:.2f}\n➡ Mode(s): {stats_dict['mode'][i]}")
       result.append(f"➡ Variance: {stats_dict['variance'][i]:.2f}\n➡ Standard Deviation: {stats_dict['std_dev'][i]:.2f}")
       result.append(f"\n➡ Skewness: {stats_dict['skewness'][i]:.2f}\n   ☻ Tip: Symmetric if ~0, left-skewed if <0, right-skewed if >0")
       result.append(f"\n➡ Kurtosis: {stats_dict['kurtosis'][i]:.2f}\n   ☻ Tip: Mesokurtic if ~0, Platykurtic if <0, Leptokurtic if >0")
       result.append(f"\n★ Shapiro-Wilk Test for Normality:\n  • H0: Data is normally distributed.\n  • H1: Data is not normally distributed.\n   ➡ p-value = {stats_dict['shapiro_p'][i]:.4f}\n   {'✘ Conclusion: Not normally distributed.' if stats_dict['shapiro_p'][i] < 0.05 else '✔ Conclusion: Normally distributed.'}\n")
       result.append(f"\n★ Anderson-Darling Test for Normality:\n   ➡ statistic = {stats_dict['anderson_stat'][i]:.4f}\n   ➡ significance levels = {anderson_result.significance_level}\n   ➡ critical values = {anderson_result.critical_values}\n   ☻ Tip: Compare the statistic to critical values. Data is likely not normally distributed if the statistic > critical value.\n")
  5. Construct DataFrame:

    • Purpose: To construct a DataFrame from the statistics dictionary for easy readability (a simpler one-call equivalent is sketched after this breakdown).
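    # Each statistic becomes a one-row frame (columns = the variables); stacking those
    # rows and transposing gives rows = variables, columns = statistics (wide format).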
    distribution_df = pd.concat(
       [pd.DataFrame({variable_name: [stats_dict[stat][i]] for i, variable_name in enumerate(numerical_variables)}, index=[stat])
        for stat, values in stats_dict.items()], axis=0
    ).T
    distribution_df.columns = stats_functions
    distribution_df.index.name = 'Variable/Statistic'
  6. Output Information:

    • Purpose: To add notes about the output when method is 'all' and how to use the function.
    if method.lower() == 'all':
       result.append(f"\n✎ * NOTE: If method='distribution_analysis', aside from the overview above, the function RETURNS:")
       result.append(f"■ 1 - Dataframe: where index are your variables, columns are all the calculated statistic (wide format for readability)")
       result.append(f"☻ HOW TO: df = explore_num(yourdf, yourlist, method='distribution_analysis')")
  7. Combine and Output Results:

    • Purpose: To combine all results into a string and determine the appropriate output based on the output parameter.
    combined_result = "\n".join(result)
    
    if output.lower() == 'print':
       print(combined_result)
       return distribution_df
    elif output.lower() == 'return':
       return distribution_df
    else:
       raise ValueError("Invalid output method. Choose 'print' or 'return'.")
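
As a side note on step 5: since stats_dict maps each statistic name to a list with one value per variable, the same wide-format table can likely be built in a single call. A minimal sketch, assuming the stats_dict and numerical_variables defined above:

# Simpler equivalent of step 5 (same result, no concat/transpose round-trip)
distribution_df = pd.DataFrame(stats_dict, index=numerical_variables)
distribution_df.index.name = 'Variable/Statistic'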

See the Full Function:

The full implementation can be found in the datasafari repository.