Closed ETA444 closed 6 months ago
Implementation Summary:
The 'consensus' method in the evaluate_normality() function combines the results of multiple normality tests to reach a consensus decision on the normality of a numeric variable within groups defined by a categorical variable. The function uses several well-known tests (Shapiro-Wilk, Anderson-Darling, D'Agostino-Pearson, and Lilliefors) and aggregates their results to decide if the data is normal.
Code Breakdown:
Shapiro-Wilk Test:
if method in ['shapiro', 'consensus']:
    # Per-group statistic and p-value; a group is treated as normal when p > 0.05
    shapiro_stats = [shapiro(df[df[grouping_variable] == group][target_variable]).statistic for group in groups]
    shapiro_pvals = [shapiro(df[df[grouping_variable] == group][target_variable]).pvalue for group in groups]
    shapiro_normality = [p > 0.05 for p in shapiro_pvals]
    shapiro_info = {group: {'stat': shapiro_stats[n], 'p': shapiro_pvals[n], 'normality': shapiro_normality[n]} for n, group in enumerate(groups)}
    output_info['shapiro'] = shapiro_info
    # The test votes "normal" only if every group passes
    normality_info['shapiro_group_consensus'] = all(shapiro_normality)
Anderson-Darling Test:
if method in ['anderson', 'consensus']:
    anderson_stats = [anderson(df[df[grouping_variable] == group][target_variable]).statistic for group in groups]
    # critical_values[2] is the critical value at the 5% significance level
    anderson_critical_values = [anderson(df[df[grouping_variable] == group][target_variable]).critical_values[2] for group in groups]
    # A group is treated as normal when its statistic is below that critical value
    anderson_normality = [c_val > anderson_stats[n] for n, c_val in enumerate(anderson_critical_values)]
    # Note: the 'p' key holds the 5% critical value here, not a p-value
    anderson_info = {group: {'stat': anderson_stats[n], 'p': anderson_critical_values[n], 'normality': anderson_normality[n]} for n, group in enumerate(groups)}
    output_info['anderson'] = anderson_info
    normality_info['anderson_group_consensus'] = all(anderson_normality)
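Unlike the other three tests, the Anderson-Darling block has no p-value to threshold; it compares the statistic with a critical value. Below is a minimal sketch of that rule, assuming the test result exposes parallel sequences of critical values and significance levels (in percent), as SciPy's anderson() result does for the normal distribution, where position 2 corresponds to 5%. The helper name anderson_is_normal and all numbers are hypothetical.

```python
from typing import Sequence


def anderson_is_normal(
    statistic: float,
    critical_values: Sequence[float],
    significance_levels: Sequence[float],
    level: float = 5.0,
) -> bool:
    """Return True when the statistic is below the critical value for `level` (percent)."""
    # Look the level up by value instead of relying on a fixed position.
    index = list(significance_levels).index(level)
    return statistic < critical_values[index]


# Illustrative numbers only (not output from a real run).
levels = [15.0, 10.0, 5.0, 2.5, 1.0]
criticals = [0.55, 0.63, 0.75, 0.88, 1.04]
print(anderson_is_normal(0.40, criticals, levels))  # True
print(anderson_is_normal(0.80, criticals, levels))  # False
```

Looking the level up by value makes the 5% choice explicit; the comparison itself is the same as `c_val > anderson_stats[n]` in the block above.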
D'Agostino-Pearson Test:
if method in ['normaltest', 'consensus']:
    normaltest_stats = [normaltest(df[df[grouping_variable] == group][target_variable]).statistic for group in groups]
    normaltest_pvals = [normaltest(df[df[grouping_variable] == group][target_variable]).pvalue for group in groups]
    normaltest_normality = [p > 0.05 for p in normaltest_pvals]
    normaltest_info = {group: {'stat': normaltest_stats[n], 'p': normaltest_pvals[n], 'normality': normaltest_normality[n]} for n, group in enumerate(groups)}
    output_info['normaltest'] = normaltest_info
    normality_info['normaltest_group_consensus'] = all(normaltest_normality)
Lilliefors Test:
if method in ['lilliefors', 'consensus']:
    # lilliefors() returns a (statistic, p-value) pair: [0] is the statistic, [1] the p-value
    lilliefors_stats = [lilliefors(df[df[grouping_variable] == group][target_variable])[0] for group in groups]
    lilliefors_pvals = [lilliefors(df[df[grouping_variable] == group][target_variable])[1] for group in groups]
    lilliefors_normality = [p > 0.05 for p in lilliefors_pvals]
    lilliefors_info = {group: {'stat': lilliefors_stats[n], 'p': lilliefors_pvals[n], 'normality': lilliefors_normality[n]} for n, group in enumerate(groups)}
    output_info['lilliefors'] = lilliefors_info
    normality_info['lilliefors_group_consensus'] = all(lilliefors_normality)
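The Shapiro-Wilk, D'Agostino-Pearson, and Lilliefors blocks share one pattern: a statistic and p-value per group, a `p > 0.05` flag per group, and an all-groups verdict. The sketch below factors that pattern into a hypothetical helper, summarize_test, which runs the test once per group rather than twice. It assumes test_fn returns something that unpacks into a (statistic, p-value) pair; the stand-in fake_test exists only to make the example self-contained and is not a real normality test.

```python
from typing import Callable, Dict, Hashable, Sequence, Tuple


def summarize_test(
    samples_by_group: Dict[Hashable, Sequence[float]],
    test_fn: Callable[[Sequence[float]], Tuple[float, float]],
    alpha: float = 0.05,
) -> Tuple[Dict[Hashable, dict], bool]:
    """Run test_fn once per group; return (per-group info, all-groups verdict)."""
    info = {}
    for group, sample in samples_by_group.items():
        stat, p = test_fn(sample)  # one call yields both values
        info[group] = {'stat': stat, 'p': p, 'normality': p > alpha}
    group_consensus = all(entry['normality'] for entry in info.values())
    return info, group_consensus


# Stand-in for a real test: a fixed (statistic, p-value) pair chosen by sample length.
def fake_test(sample):
    return (0.9, 0.20) if len(sample) > 2 else (0.5, 0.01)


info, verdict = summarize_test({'a': [1, 2, 3], 'b': [1, 2]}, fake_test)
print(verdict)  # False, because group 'b' has p <= 0.05
```

The Anderson-Darling block does not fit this helper as written, since its decision compares a statistic with a critical value instead of thresholding a p-value.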
Consensus:
if method == 'consensus':
    # Each test contributes one vote: its all-groups result
    true_count = [group_normality_consensus for group_normality_consensus in normality_info.values()].count(True)
    false_count = [group_normality_consensus for group_normality_consensus in normality_info.values()].count(False)
    half_point = len(normality_info) / 2
    if true_count > half_point:
        consensus_percent = (true_count / len(normality_info)) * 100
        normality_consensus_result = f" ➡ Result: Consensus is reached.\n ➡ {consensus_percent}% of tests suggest Normality."
        normality_consensus = True
    elif true_count < half_point:
        consensus_percent = (false_count / len(normality_info)) * 100
        normality_consensus_result = f" ➡ Result: Consensus is reached.\n ➡ {consensus_percent}% of tests suggest Non-Normality."
        normality_consensus = False
    else:
        # Tie: fall back to Shapiro-Wilk and Anderson-Darling, which must both indicate normality
        normality_consensus_result = " ➡ Result: Consensus is not reached. (50% Normality / 50% Non-normality)"
        normality_consensus = all([normality_info['shapiro_group_consensus'], normality_info['anderson_group_consensus']])
    print(f"< NORMALITY TESTING: CONSENSUS >\n{normality_consensus_result}\n")
    return output_info if not pipeline else normality_consensus
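The return value depends on the pipeline parameter: output_info (the per-test, per-group details) by default, or the boolean normality_consensus when pipeline is True.
Link to Full Code: evaluate_normality.py.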
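The voting step in the Consensus block above can be isolated as a small pure function, which makes the tie-break easier to see. This is a minimal sketch and not the code in evaluate_normality.py: the function name vote is hypothetical, and the dictionary keys follow the normality_info keys used in the breakdown.

```python
from typing import Dict, Tuple


def vote(normality_info: Dict[str, bool]) -> Tuple[bool, str]:
    """Majority vote over per-test verdicts, with the Shapiro/Anderson tie-break."""
    total = len(normality_info)
    true_count = sum(normality_info.values())
    if true_count * 2 > total:
        return True, f"{true_count / total * 100}% of tests suggest Normality"
    if true_count * 2 < total:
        return False, f"{(total - true_count) / total * 100}% of tests suggest Non-Normality"
    # Even split: defer to Shapiro-Wilk and Anderson-Darling, which must both pass.
    tie_break = (normality_info['shapiro_group_consensus']
                 and normality_info['anderson_group_consensus'])
    return tie_break, "Consensus is not reached (50% / 50%)"


split = {
    'shapiro_group_consensus': True,
    'anderson_group_consensus': True,
    'normaltest_group_consensus': False,
    'lilliefors_group_consensus': False,
}
print(vote(split))  # (True, 'Consensus is not reached (50% / 50%)')
```

With four tests, a 2-2 split therefore yields True only when Shapiro-Wilk and Anderson-Darling are the two tests that indicated normality.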
Title: Implementing Consensus Method for Normality Assessment
Description: This update introduces a consensus method for evaluating normality that combines four tests: Shapiro-Wilk, Anderson-Darling, D'Agostino-Pearson, and Lilliefors. A test counts as indicating normality only if every group passes it, and the data is treated as normal when more than 50% of the tests agree. On an even split, the result falls back to Shapiro-Wilk and Anderson-Darling, both of which must indicate normality. The aim is a steadier conclusion in cases where individual tests yield conflicting results.
Example Usage:
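A minimal sketch of inspecting the result, assuming the call signature implied by the code breakdown (df, target_variable, grouping_variable, method, pipeline). The call itself is shown only as a comment because that signature is inferred, not confirmed; the column names 'Score' and 'Group' and every number in the stand-in dictionary are made up for illustration.

```python
# Assumed call (signature inferred from the code breakdown, not confirmed):
# output_info = evaluate_normality(df, target_variable='Score',
#                                  grouping_variable='Group', method='consensus')

# Hand-written stand-in with the same shape as output_info; numbers are illustrative.
output_info = {
    'shapiro': {
        'A': {'stat': 0.98, 'p': 0.41, 'normality': True},
        'B': {'stat': 0.91, 'p': 0.03, 'normality': False},
    },
    'normaltest': {
        'A': {'stat': 1.2, 'p': 0.55, 'normality': True},
        'B': {'stat': 7.9, 'p': 0.02, 'normality': False},
    },
}

# Collect, per test, the groups that failed the normality check.
failed = {
    test: [group for group, res in groups.items() if not res['normality']]
    for test, groups in output_info.items()
}
print(failed)  # {'shapiro': ['B'], 'normaltest': ['B']}
```

With pipeline=True the function would instead return a single boolean, which suits use as a branch condition in a larger workflow.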
Expected Outcome: With the consensus method, the function prints whether a consensus was reached and the percentage of tests behind it, and returns either the per-test, per-group details or, in pipeline mode, a single boolean. Because the decision rests on several normality tests instead of one, users can place more confidence in the conclusion about whether the data follows a normal distribution.
Additional Context: The four tests respond to different kinds of departure from normality, so they can disagree on the same data. The consensus method makes the handling of that disagreement explicit, including the tie case, instead of leaving the choice of test to the user. This is in line with the project's aim of providing statistical tools that support rigorous analysis.