ETA444 / datasafari

DataSafari simplifies complex data science tasks into straightforward, powerful one-liners.
https://datasafari.dev
GNU General Public License v3.0

Implement new evaluate_normality() method: 'lilliefors' #75

Closed ETA444 closed 6 months ago

ETA444 commented 8 months ago

Title: Introducing Lilliefors' Test for Normality Assessment

Description: This update adds the Lilliefors test to the normality testing module. The Lilliefors test is an adaptation of the Kolmogorov-Smirnov test for the case where the mean and variance of the normal distribution are not known in advance and are instead estimated from the sample. It is well suited to small and moderately sized samples and is more sensitive to deviations from normality in the center of the distribution than in the tails. Adding it gives users one more way to detect departures from normality with evaluate_normality().

Example Usage:

import pandas as pd
import numpy as np
# evaluate_normality is provided by DataSafari and must be imported before this runs

# Build a synthetic example dataset
data = {
    'Group': np.random.choice(['A', 'B', 'C'], 100),
    'Data': np.random.normal(0, 1, 100)
}
df = pd.DataFrame(data)

# Evaluate normality using Lilliefors' test
normality_results = evaluate_normality(df, 'Data', 'Group', method='lilliefors')

Expected Outcome: Users can run the Lilliefors test for each group through evaluate_normality(), which is most useful for small to moderately sized samples. Because the test is most sensitive to deviations from normality in the center of the distribution, it complements tests that put more weight on the tails.

Additional Context: The Lilliefors test is a practical choice when the mean and variance of the underlying distribution are not known and have to be estimated from the data. Its sensitivity to deviations in the central part of the distribution makes it a useful companion to the other methods in the normality testing module.

ETA444 commented 6 months ago

Implementation Summary:

The 'lilliefors' method in the evaluate_normality() function applies the Lilliefors test, which is an adaptation of the Kolmogorov-Smirnov test for normality. This test is especially useful for small to moderately sized samples and does not require the mean and variance to be known parameters. It is particularly sensitive to deviations from normality in the center of the distribution.
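To make "does not require the mean and variance to be known" concrete, here is a minimal standard-library sketch of the idea behind the test. It is not the code used by evaluate_normality(); the function names `lilliefors_statistic` and `monte_carlo_p_value` are hypothetical, and the p-value is estimated by simulation rather than taken from published tables.

```python
import random
from statistics import NormalDist, mean, stdev


def lilliefors_statistic(sample):
    """Largest gap between the empirical CDF and a normal CDF whose
    mean and standard deviation are estimated from the sample itself."""
    xs = sorted(sample)
    n = len(xs)
    fitted = NormalDist(mean(xs), stdev(xs))  # parameters estimated, not assumed
    d_plus = max((i + 1) / n - fitted.cdf(x) for i, x in enumerate(xs))
    d_minus = max(fitted.cdf(x) - i / n for i, x in enumerate(xs))
    return max(d_plus, d_minus)


def monte_carlo_p_value(sample, n_sim=1000, seed=0):
    """Estimate the p-value by simulating the statistic under normality.

    The statistic does not change when the data are shifted or rescaled,
    so simulating from a standard normal is enough.
    """
    rng = random.Random(seed)
    observed = lilliefors_statistic(sample)
    n = len(sample)
    exceed = sum(
        lilliefors_statistic([rng.gauss(0, 1) for _ in range(n)]) >= observed
        for _ in range(n_sim)
    )
    return (exceed + 1) / (n_sim + 1)


# Usage: a sample drawn from a normal distribution with unknown parameters
rng = random.Random(42)
normal_sample = [rng.gauss(10, 2) for _ in range(50)]
print(lilliefors_statistic(normal_sample), monte_carlo_p_value(normal_sample, n_sim=200))
```

Estimating the parameters from the same data changes the null distribution of the statistic, which is why the ordinary Kolmogorov-Smirnov critical values cannot be reused and the test needs its own tables or a simulation like the one above.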

Code Breakdown:

  1. Calculate Lilliefors Statistic and P-values:

    • Purpose: Compute the Lilliefors test statistic and corresponding p-values for each group.
    lilliefors_stats = [
       lilliefors(df[df[grouping_variable] == group][target_variable])[0] 
       for group in groups
    ]
    lilliefors_pvals = [
       lilliefors(df[df[grouping_variable] == group][target_variable])[1] 
       for group in groups
    ]
    • Explanation:
      • Each list comprehension iterates through the groups and applies the Lilliefors test to the target_variable values of that group.
      • The test returns a tuple, where the first element is the test statistic and the second element is the p-value.
      • The first comprehension keeps the statistic (index 0) in lilliefors_stats and the second keeps the p-value (index 1) in lilliefors_pvals, so the test runs twice per group.
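Since the snippet in this step evaluates the test separately for the statistic and for the p-value, one possible variation is to call it once per group and split the result afterwards. The sketch below uses plain dictionaries instead of a DataFrame, and `run_test_per_group` and `stub_test` are hypothetical names; `stub_test` only stands in for the real test function so the example runs on its own.

```python
def run_test_per_group(samples_by_group, test_fn):
    """Call test_fn once per group and split its (statistic, p-value) pairs."""
    results = {group: test_fn(sample) for group, sample in samples_by_group.items()}
    stats = [stat for stat, _ in results.values()]
    pvals = [pval for _, pval in results.values()]
    return stats, pvals


# Hypothetical stand-in for the real test: it records each call and returns
# a made-up (statistic, p-value) pair so the example needs no extra packages.
call_log = []


def stub_test(sample):
    call_log.append(len(sample))
    return (max(sample) - min(sample), 0.5)


samples = {'A': [1.0, 2.0, 4.0], 'B': [3.0, 3.5]}
stats, pvals = run_test_per_group(samples, stub_test)
print(stats, pvals, len(call_log))  # the stub was called once per group
```

The output lists keep the same order as the groups, so they can be indexed in the same way as lilliefors_stats and lilliefors_pvals.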
  2. Determine Normality:

    • Purpose: Determine whether each group follows a normal distribution based on the Lilliefors test results.
    lilliefors_normality = [p > 0.05 for p in lilliefors_pvals]
    • Explanation:
      • The code block checks whether each group's p-value is greater than 0.05. True means the null hypothesis of normality cannot be rejected at the 5% level, so the group is treated as normal; False means the null hypothesis is rejected.
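The decision rule can be sketched with the significance level as a parameter. This is an illustration only: the `flag_normality` helper and its `alpha` argument are hypothetical, and the snippet in this step hard-codes 0.05.

```python
def flag_normality(p_values, alpha=0.05):
    """True where H0 (normality) is not rejected at significance level alpha."""
    return [p > alpha for p in p_values]


p_values = [0.20, 0.03, 0.05]
print(flag_normality(p_values))        # [True, False, False]
print(flag_normality(p_values, 0.01))  # [True, True, True]
```

The comparison is strict, so a p-value exactly equal to the threshold counts as a rejection.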
  3. Prepare Output:

    • Purpose: Format the output for each group's test results and prepare console output.
    lilliefors_info = {
       group: {
           'stat': lilliefors_stats[n], 
           'p': lilliefors_pvals[n], 
           'normality': lilliefors_normality[n]
       } for n, group in enumerate(groups)
    }
    lilliefors_text = [
       f"Results for '{key}' group in variable ['{target_variable}']:\n  ➡ statistic: {value['stat']}\n  ➡ p-value: {value['p']}\n{(f'  ∴ Normality: Yes (H0 cannot be rejected)' if value['normality'] else f'  ∴ Normality: No (H0 rejected)')}\n\n" 
       for key, value in lilliefors_info.items()
    ]
    lilliefors_title = f"< NORMALITY TESTING: LILLIEFORS' TEST >\n\n"
    lilliefors_tip = "☻ Tip: The Lilliefors test is an adaptation of the Kolmogorov-Smirnov test for normality with the benefit of not requiring the mean and variance to be known parameters. It's particularly useful for small to moderately sized samples and is sensitive to deviations from normality in the center of the distribution rather than the tails. This makes it complementary to tests like the Anderson-Darling when a comprehensive assessment of normality is needed.\n"
    • Explanation:
      • The dictionary lilliefors_info holds the results for each group, with group names as keys and dictionaries containing the test statistic, p-value, and normality conclusion as values.
      • The lilliefors_text list formats these results for each group.
      • lilliefors_title and lilliefors_tip are used for console output headers and tips.
  4. Output Results and Return:

    • Purpose: Output the results and return the appropriate values based on the pipeline parameter.
    # saving info
    output_info['lilliefors'] = lilliefors_info
    normality_info['lilliefors_group_consensus'] = all(lilliefors_normality)
    
    # end it here if non-consensus method
    if method == 'lilliefors':
       print(lilliefors_title, *lilliefors_text, lilliefors_tip)
       return output_info if not pipeline else normality_info['lilliefors_group_consensus']
    • Explanation:
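      • The per-group results are saved in output_info, and normality_info records the group consensus, which is True only if every group passed the test.
      • If the method is 'lilliefors' (and not 'consensus'), the function prints the console output and returns output_info when pipeline is False, or the boolean group consensus when pipeline is True.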
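The return behaviour of step 4 can be illustrated in isolation. The `finalize` function below is a hypothetical, simplified stand-in that assumes the per-group dictionary layout from step 3; it leaves out the printing and the other methods handled by evaluate_normality().

```python
def finalize(per_group_info, pipeline=False):
    """Return the full results, or only the all-groups verdict in pipeline mode."""
    consensus = all(info['normality'] for info in per_group_info.values())
    return consensus if pipeline else {'lilliefors': per_group_info}


info = {
    'A': {'stat': 0.08, 'p': 0.41, 'normality': True},
    'B': {'stat': 0.21, 'p': 0.01, 'normality': False},
}
print(finalize(info, pipeline=True))  # False, because group 'B' was rejected
print(finalize(info))                 # full per-group results under 'lilliefors'
```

Returning a single boolean in pipeline mode lets a caller branch on the result directly, while the default return value keeps the per-group details available for inspection.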

Link to Full Code: evaluate_normality.py.