ETA444 / datasafari

DataSafari simplifies complex data science tasks into straightforward, powerful one-liners.
https://datasafari.dev
GNU General Public License v3.0

Implement new evaluate_normality() method: 'anderson' #73

Closed: ETA444 closed this issue 6 months ago

ETA444 commented 8 months ago

Title: Enhancing Normality Testing with the Anderson-Darling Method

Description: This update introduces the Anderson-Darling test as a new method for evaluating the normality of a distribution within groups defined by a categorical variable. The test is applicable across a wide range of sample sizes and places particular weight on the distribution tails, which makes it effective for detecting outliers and heavy-tailed distributions and a valuable addition to the normality testing module.

Example Usage:

import pandas as pd
import numpy as np

# Import path assumed here; adjust to match your DataSafari installation
from datasafari import evaluate_normality

# Load example dataset
data = {
    'Group': np.random.choice(['A', 'B', 'C'], 100),
    'Data': np.random.normal(0, 1, 100)
}
df = pd.DataFrame(data)

# Evaluate normality using the Anderson-Darling test
normality_results = evaluate_normality(df, 'Data', 'Group', method='anderson')
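
When used inside a broader workflow, the pipeline flag (covered in the implementation notes below) changes the return value to a single boolean consensus across groups. A usage sketch, assuming the signature referenced in the implementation summary:

# With pipeline=True, the call is expected to return one boolean indicating whether
# every group passed the Anderson-Darling normality check (see step 4 of the breakdown below).
group_consensus = evaluate_normality(df, 'Data', 'Group', method='anderson', pipeline=True)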

Expected Outcome: Users obtain a per-group assessment of normality based on the Anderson-Darling statistic and its critical value. Because the test emphasizes the distribution tails, it is particularly useful for detecting outliers or heavy-tailed distributions.

Additional Context: The Anderson-Darling test complements the existing normality testing methods (such as Shapiro-Wilk) with a tail-sensitive alternative that remains applicable across a wide range of sample sizes, contributing to more accurate and reliable statistical analyses.

ETA444 commented 6 months ago

Implementation Summary:

The 'anderson' method in the evaluate_normality() function utilizes the Anderson-Darling test to evaluate the normality of a numeric variable within groups defined by a grouping variable. The Anderson-Darling test is useful for detecting deviations from normality, particularly in the tails of the distribution, making it suitable for assessing outliers or heavy-tailed distributions.
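
For reference, the decision rule the method builds on can be reproduced directly with scipy.stats.anderson. The sketch below is a minimal standalone illustration (not DataSafari code): a sample is treated as normal at the 5% level when the test statistic falls below the corresponding critical value.

import numpy as np
from scipy.stats import anderson

# Minimal illustration of the Anderson-Darling decision rule used by the 'anderson' method
sample = np.random.normal(0, 1, 100)
result = anderson(sample, dist='norm')

# For dist='norm', significance_level is [15, 10, 5, 2.5, 1], so index 2 is the 5% level
critical_value_5pct = result.critical_values[2]
is_normal = result.statistic < critical_value_5pct
print(f"statistic={result.statistic:.4f}, 5% critical value={critical_value_5pct:.4f}, normal={is_normal}")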

Code Breakdown (a condensed, standalone sketch combining these steps follows after the list):

  1. Compute Anderson-Darling Statistic and Critical Values:

    • Purpose: Calculate the Anderson-Darling test statistic and critical values for each group.
    anderson_stats = [
       anderson(df[df[grouping_variable] == group][target_variable]).statistic 
       for group in groups
    ]
    anderson_critical_values = [
       anderson(df[df[grouping_variable] == group][target_variable]).critical_values[2] 
       for group in groups
    ]
    • Explanation:
      • The code block iterates through each group, performs the Anderson-Darling test on the target_variable, and stores the results.
      • The .statistic attribute gives the test statistic.
      • The .critical_values attribute provides the critical values; the index [2] corresponds to a significance level of 0.05.
  2. Determine Normality:

    • Purpose: Determine whether the variable in each group follows a normal distribution based on the Anderson-Darling test.
    anderson_normality = [
       True if c_val > anderson_stats[n] else False 
       for n, c_val in enumerate(anderson_critical_values)
    ]
    • Explanation:
      • The code block compares the test statistic with the critical value for each group.
      • If the critical value is greater than the test statistic, normality is assumed (True); otherwise, it's rejected (False).
  3. Prepare Output:

    • Purpose: Format the output for each group's test results and prepare console output.
    anderson_info = {
       group: {'stat': anderson_stats[n], 'p': anderson_critical_values[n], 'normality': anderson_normality[n]} 
       for n, group in enumerate(groups)
    }
    anderson_text = [
       f"Results for '{key}' group in variable ['{target_variable}']:\n  ➡ statistic: {value['stat']}\n  ➡ p-value: {value['p']}\n{(f'  ∴ Normality: Yes (H0 cannot be rejected)' if value['normality'] else f'  ∴ Normality: No (H0 rejected)')}\n\n"
       for key, value in anderson_info.items()
    ]
    anderson_title = f"< NORMALITY TESTING: ANDERSON-DARLING >\n\n"
    anderson_tip = "☻ Tip: The Anderson-Darling test is a versatile test that can be applied to any sample size and is especially useful for comparing against multiple distribution types, not just the normal. It places more emphasis on the tails of the distribution than the Shapiro-Wilk test, making it useful for detecting outliers or heavy-tailed distributions.\n"
    • Explanation:
      • The dictionary anderson_info holds the results for each group, with keys as group names and values as dictionaries containing the test statistic, the 5% critical value, and the normality conclusion. Note that the critical value is stored under the key 'p' and printed as 'p-value', even though the Anderson-Darling test does not produce a conventional p-value.
      • The anderson_text list formats these results for each group.
      • anderson_title and anderson_tip are used for console output headers and tips.
  4. Output Results and Return:

    • Purpose: Output the results and return the appropriate values based on the pipeline parameter.
    # saving info
    output_info['anderson'] = anderson_info
    normality_info['anderson_group_consensus'] = all(anderson_normality)
    
    # end it here if non-consensus method
    if method == 'anderson':
       print(anderson_title, *anderson_text, anderson_tip)
       return output_info if not pipeline else normality_info['anderson_group_consensus']
    • Explanation:
      • The results are saved in the output_info and normality_info dictionaries.
      • If the method is 'anderson' (rather than the 'consensus' option), the function prints the console output and returns output_info when pipeline is False, or the boolean group consensus (normality_info['anderson_group_consensus']) when pipeline is True.
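
Putting the four steps together, here is a condensed, self-contained sketch of the group-wise logic described above (variable names mirror the breakdown; the full evaluate_normality() implementation additionally handles input validation, other methods, and the consensus path):

import numpy as np
import pandas as pd
from scipy.stats import anderson

# Example data with a grouping variable and a numeric target variable
df = pd.DataFrame({
    'Group': np.random.choice(['A', 'B', 'C'], 100),
    'Data': np.random.normal(0, 1, 100)
})
target_variable, grouping_variable = 'Data', 'Group'
groups = df[grouping_variable].unique().tolist()

# Step 1: Anderson-Darling statistic and 5% critical value per group
anderson_results = [anderson(df[df[grouping_variable] == group][target_variable]) for group in groups]
anderson_stats = [result.statistic for result in anderson_results]
anderson_critical_values = [result.critical_values[2] for result in anderson_results]

# Step 2: normality is assumed when the critical value exceeds the statistic
anderson_normality = [c_val > stat for stat, c_val in zip(anderson_stats, anderson_critical_values)]

# Step 3: per-group results keyed by group name
anderson_info = {
    group: {'stat': anderson_stats[n], 'p': anderson_critical_values[n], 'normality': anderson_normality[n]}
    for n, group in enumerate(groups)
}

# Step 4: group consensus is True only if every group looks normal
anderson_group_consensus = all(anderson_normality)
print(anderson_info, anderson_group_consensus)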

Link to Full Code: evaluate_normality.py.