ETA444 / datasafari

DataSafari simplifies complex data science tasks into straightforward, powerful one-liners.
https://datasafari.dev
GNU General Public License v3.0

Implement new evaluate_normality() method: 'normaltest' #74

Closed ETA444 closed 6 months ago

ETA444 commented 8 months ago

Title: Introducing D'Agostino-Pearson Normality Test for Enhanced Normality Assessment

Description: This update adds the D'Agostino-Pearson normality test (implemented in SciPy as scipy.stats.normaltest) to the normality testing module as method='normaltest'. The test combines sample skewness and kurtosis into a single test statistic, which makes it sensitive to departures from normality caused by asymmetry or by unusually heavy or light tails. It is best suited to medium and large samples.

Example Usage:

import pandas as pd
import numpy as np

# evaluate_normality() is assumed to be already imported from the datasafari package

# Build a synthetic example dataset: three groups, normally distributed values
data = {
    'Group': np.random.choice(['A', 'B', 'C'], 100),
    'Data': np.random.normal(0, 1, 100)
}
df = pd.DataFrame(data)

# Evaluate normality using the D'Agostino-Pearson normality test
normality_results = evaluate_normality(df, 'Data', 'Group', method='normaltest')

Expected Outcome: For each group, the call reports the test statistic, the p-value, and whether the null hypothesis of normality is rejected at the 0.05 level. Because the statistic is built from skewness and kurtosis, a rejection points to asymmetry, unusual tail thickness, or both, which helps users decide how to proceed with their analysis.
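To make the expected outcome concrete, the sketch below calls SciPy's scipy.stats.normaltest directly on two synthetic groups, one drawn from a normal distribution and one from an exponential distribution. It does not call evaluate_normality(); the group names, sample sizes, and seed are assumptions chosen purely for illustration.

```python
import numpy as np
import pandas as pd
from scipy.stats import normaltest

# Synthetic data (illustrative assumption): one symmetric group, one strongly skewed group
rng = np.random.default_rng(42)
df = pd.DataFrame({
    'Group': ['normal'] * 500 + ['skewed'] * 500,
    'Data': np.concatenate([rng.normal(0, 1, 500), rng.exponential(1.0, 500)]),
})

# Run the D'Agostino-Pearson test once per group
pvalues = {
    group: float(normaltest(values).pvalue)
    for group, values in df.groupby('Group')['Data']
}

# Apply the same 0.05 decision rule described in this issue
for group, p in pvalues.items():
    verdict = 'H0 cannot be rejected' if p > 0.05 else 'H0 rejected'
    print(f"{group}: p-value = {p:.4g} ({verdict})")
```

The exponential group should be rejected decisively because of its strong right skew. The verdict for the normal group depends on the random draw, since a true null is still rejected about 5% of the time at this threshold.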

Additional Context: The D'Agostino-Pearson test offers a good balance between sensitivity and specificity in medium to large samples, and it detects departures from normality that involve asymmetry or tail thickness. These properties make it a useful complement to the other methods in the normality testing module.
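Since the test is aimed at medium to large samples, it can help to check group sizes before interpreting the results. The following is a minimal sketch of such a check; the helper name small_groups and the threshold of 20 are assumptions for illustration (the threshold echoes the sample size below which SciPy's kurtosis test warns) and are not part of DataSafari.

```python
import numpy as np
import pandas as pd

MIN_GROUP_SIZE = 20  # illustrative assumption, not a DataSafari setting


def small_groups(df, grouping_variable, min_size=MIN_GROUP_SIZE):
    """Hypothetical helper: return {group: n} for groups with fewer than min_size rows."""
    sizes = df.groupby(grouping_variable).size()
    return {group: int(n) for group, n in sizes.items() if n < min_size}


# Example: group 'B' is too small for the test to be reliable
df = pd.DataFrame({
    'Group': ['A'] * 50 + ['B'] * 12,
    'Data': np.random.default_rng(0).normal(0, 1, 62),
})
print(small_groups(df, 'Group'))  # → {'B': 12}
```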

ETA444 commented 6 months ago

Implementation Summary:

The 'normaltest' method in the evaluate_normality() function uses D'Agostino and Pearson's normality test to evaluate the normality of a numeric variable within groups defined by a grouping variable. For each group, it tests the null hypothesis that the sample comes from a normal distribution, using a statistic built from the sample's skewness and kurtosis.
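As a rough illustration of how skewness and kurtosis enter the test, the sketch below rebuilds the statistic from SciPy's separate skewness and kurtosis tests: the statistic is the sum of the two squared z-scores, and the p-value comes from a chi-squared distribution with 2 degrees of freedom. The sample data and variable names are assumptions for illustration.

```python
import numpy as np
from scipy.stats import chi2, kurtosistest, normaltest, skewtest

rng = np.random.default_rng(0)
sample = rng.normal(0, 1, 200)  # illustrative data

# z-scores from the two component tests
z_skew = skewtest(sample).statistic
z_kurt = kurtosistest(sample).statistic

# Combine them: K^2 = z_skew^2 + z_kurt^2, compared against chi-squared with 2 df
k2 = z_skew**2 + z_kurt**2
p_value = chi2.sf(k2, 2)

# Compare with the combined test as exposed by SciPy
result = normaltest(sample)
print(np.isclose(k2, result.statistic), np.isclose(p_value, result.pvalue))
```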

Code Breakdown:

  1. Compute Normaltest Statistics and P-Values:

    • Purpose: Calculate the normaltest statistic and p-values for each group.
    # Run the test once per group and reuse the result for both lists
    normaltest_results = [
       normaltest(df[df[grouping_variable] == group][target_variable])
       for group in groups
    ]
    normaltest_stats = [result.statistic for result in normaltest_results]
    normaltest_pvals = [result.pvalue for result in normaltest_results]
    • Explanation:
      • The code block iterates through each group, performs D'Agostino and Pearson's normality test on the target_variable, and stores the results.
      • The .statistic attribute provides the test statistic, and the .pvalue attribute gives the p-value.
  2. Determine Normality:

    • Purpose: Determine whether the variable in each group follows a normal distribution based on the normaltest results.
    normaltest_normality = [True if p > 0.05 else False for p in normaltest_pvals]
    • Explanation:
      • The code block checks the p-values for each group.
      • If the p-value is greater than 0.05, the null hypothesis of normality cannot be rejected and the group is treated as normal (True); otherwise, normality is rejected (False).
  3. Prepare Output:

    • Purpose: Format the output for each group's test results and prepare console output.
    normaltest_info = {
       group: {'stat': normaltest_stats[n], 'p': normaltest_pvals[n], 'normality': normaltest_normality[n]} 
       for n, group in enumerate(groups)
    }
    normaltest_text = [
       f"Results for '{key}' group in variable ['{target_variable}']:\n  ➡ statistic: {value['stat']}\n  ➡ p-value: {value['p']}\n{('  ∴ Normality: Yes (H0 cannot be rejected)' if value['normality'] else '  ∴ Normality: No (H0 rejected)')}\n\n"
       for key, value in normaltest_info.items()
    ]
    normaltest_title = "< NORMALITY TESTING: D'AGOSTINO-PEARSON NORMALTEST >\n\n"
    normaltest_tip = "☻ Tip: The D'Agostino-Pearson normality test, or simply 'normaltest', is best applied when the sample size is larger, as it combines skewness and kurtosis to form a test statistic. This test is useful for detecting departures from normality that involve asymmetry and tail thickness, offering a good balance between sensitivity and specificity in medium to large sample sizes.\n"
    • Explanation:
      • The dictionary normaltest_info holds the results for each group, with keys as group names and values as dictionaries containing the test statistic, p-value, and normality conclusion.
      • The normaltest_text list formats these results for each group.
      • normaltest_title and normaltest_tip are used for console output headers and tips.
  4. Output Results and Return:

    • Purpose: Output the results and return the appropriate values based on the pipeline parameter.
    # saving info
    output_info['normaltest'] = normaltest_info
    normality_info['normaltest_group_consensus'] = all(normaltest_normality)
    
    # end it here if non-consensus method
    if method == 'normaltest':
       print(normaltest_title, *normaltest_text, normaltest_tip)
       return output_info if not pipeline else normality_info['normaltest_group_consensus']
    • Explanation:
      • The results are saved in output_info and normality_info dictionaries.
      • If the method is 'normaltest' (and not 'consensus'), the function prints the console output and returns the appropriate results based on the pipeline flag.
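The four steps above can be condensed into a standalone sketch. This is not the code from evaluate_normality.py: the function name normaltest_by_group and the alpha parameter are hypothetical, groups are split with pandas groupby rather than boolean masks, and the console output and pipeline handling are omitted.

```python
import numpy as np
import pandas as pd
from scipy.stats import normaltest


def normaltest_by_group(df, target_variable, grouping_variable, alpha=0.05):
    """Hypothetical helper: per-group D'Agostino-Pearson results plus group consensus."""
    info = {}
    for group, values in df.groupby(grouping_variable)[target_variable]:
        result = normaltest(values)  # one call yields both statistic and p-value
        info[group] = {
            'stat': float(result.statistic),
            'p': float(result.pvalue),
            'normality': bool(result.pvalue > alpha),
        }
    # Consensus is True only if no group rejects normality
    consensus = all(entry['normality'] for entry in info.values())
    return info, consensus


# Illustrative data: three groups drawn from the same normal distribution
rng = np.random.default_rng(1)
df = pd.DataFrame({
    'Group': rng.choice(['A', 'B', 'C'], 300),
    'Data': rng.normal(0, 1, 300),
})
info, consensus = normaltest_by_group(df, 'Data', 'Group')
for group, entry in info.items():
    print(group, entry)
print('consensus:', consensus)
```

The returned dictionary mirrors the 'stat', 'p', and 'normality' keys of normaltest_info, and the consensus flag corresponds to all(normaltest_normality).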

Link to Full Code: evaluate_normality.py.