Closed: ETA444 closed this issue 5 months ago
Implementation Summary:
The hypothesis_predictor_core_n() function runs a hypothesis test on a numerical target variable across groups defined by a categorical variable. It selects the test from the number of groups and two caller-supplied flags, one for normality of the data and one for homogeneity of variances across groups. With two groups it uses an independent-samples t-test or a Mann-Whitney U test; otherwise it uses one-way ANOVA or a Kruskal-Wallis H-test. The parametric test (t-test or ANOVA) is chosen only when both flags are True.
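The selection rule described above can be summarised as a small pure function. This is a minimal sketch, not the project's code: choose_test is a hypothetical name, and the strings it returns are the scipy.stats function names that the implementation below uses as dictionary keys.

```python
def choose_test(n_groups: int, normality: bool, equal_variances: bool) -> str:
    """Return the scipy.stats function name matching the selection rule.

    Hypothetical helper for illustration only. As in the issue's code,
    every group count other than 2 falls through to the multi-group tests.
    """
    parametric = normality and equal_variances
    if n_groups == 2:
        return 'ttest_ind' if parametric else 'mannwhitneyu'
    return 'f_oneway' if parametric else 'kruskal'


# Print the full decision table for 2 and 3 groups.
for n_groups in (2, 3):
    for normality in (True, False):
        for equal_variances in (True, False):
            chosen = choose_test(n_groups, normality, equal_variances)
            print(n_groups, normality, equal_variances, '->', chosen)
```

Note that a single violated assumption is enough to switch to the non-parametric test.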
Code Breakdown:
Initial Setup and Error Handling:
# Error Handling
# TypeErrors
if not isinstance(df, pd.DataFrame):
    raise TypeError("predictor_core_numerical(): The 'df' parameter must be a pandas DataFrame.")
if not isinstance(target_variable, str):
    raise TypeError("predictor_core_numerical(): The 'target_variable' must be a string.")
if not isinstance(grouping_variable, str):
    raise TypeError("predictor_core_numerical(): The 'grouping_variable' must be a string.")
if not isinstance(normality_bool, bool):
    raise TypeError("predictor_core_numerical(): The 'normality_bool' must be a boolean.")
if not isinstance(equal_variances_bool, bool):
    raise TypeError("predictor_core_numerical(): The 'equal_variances_bool' must be a boolean.")

# ValueErrors
if df.empty:
    raise ValueError("predictor_core_n(): The input DataFrame is empty.")
if target_variable not in df.columns:
    raise ValueError(f"predictor_core_n(): The target variable '{target_variable}' was not found in the DataFrame.")
if grouping_variable not in df.columns:
    raise ValueError(f"predictor_core_n(): The grouping variable '{grouping_variable}' was not found in the DataFrame.")

target_variable_is_numerical = evaluate_dtype(df, [target_variable], output='list_n')[0]
if not target_variable_is_numerical:
    raise ValueError(f"predictor_core_n(): The target variable '{target_variable}' must be a numerical variable.")

grouping_variable_is_categorical = evaluate_dtype(df, [grouping_variable], output='list_c')[0]
if not grouping_variable_is_categorical:
    raise ValueError(f"predictor_core_n(): The grouping variable '{grouping_variable}' must be a categorical variable.")
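Because the checks above require normality_bool and equal_variances_bool to be genuine Python booleans, the caller has to compute them before calling the function. The issue does not say how they are derived, so the following is only a sketch under stated assumptions: it uses the Shapiro-Wilk test for normality and Levene's test for equal variances, and derive_flags and alpha are hypothetical names.

```python
import numpy as np
from scipy.stats import levene, shapiro


def derive_flags(samples, alpha=0.05):
    """Return (normality_bool, equal_variances_bool) as plain Python bools.

    Assumption: Shapiro-Wilk and Levene are stand-ins chosen for this
    sketch; the project may derive the flags differently.
    """
    # all() returns a built-in bool, so this passes isinstance(x, bool).
    normality_bool = all(shapiro(s).pvalue > alpha for s in samples)
    # A NumPy comparison yields numpy.bool_, which is NOT a Python bool,
    # so convert explicitly to avoid the TypeError raised above.
    equal_variances_bool = bool(levene(*samples).pvalue > alpha)
    return normality_bool, equal_variances_bool


# Synthetic data, generated only to exercise the helper.
rng = np.random.default_rng(0)
samples = [rng.normal(0.0, 1.0, size=40), rng.normal(0.5, 1.0, size=40)]

normality_bool, equal_variances_bool = derive_flags(samples)
print(type(normality_bool).__name__, type(equal_variances_bool).__name__)
```

The explicit bool() conversion matters here: a numpy.bool_ value would fail the isinstance(..., bool) checks even though it behaves like a boolean elsewhere.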
Main Function:
# Main Function
groups = df[grouping_variable].unique().tolist()
samples = [df[df[grouping_variable] == group][target_variable] for group in groups]

# define output object
output_info = {}

if len(samples) == 2:  # two sample testing
    if normality_bool and equal_variances_bool:  # parametric testing
        stat, p_val = ttest_ind(*samples)
        test_name = 'Independent Samples T-Test'
        conclusion = "The data do not provide sufficient evidence to conclude a significant difference between the group means, failing to reject the null hypothesis." if p_val > 0.05 else "The data provide sufficient evidence to conclude a significant difference between the group means, rejecting the null hypothesis."
        output_info['ttest_ind'] = {'stat': stat, 'p_val': p_val, 'conclusion': conclusion, 'test_name': test_name, 'normality': normality_bool, 'equal_variance': equal_variances_bool}
    else:  # non-parametric testing
        stat, p_val = mannwhitneyu(*samples)
        test_name = 'Mann-Whitney U Rank Test (Two Independent Samples)'
        conclusion = "The data do not provide sufficient evidence to conclude a significant difference in group distributions, failing to reject the null hypothesis." if p_val > 0.05 else "The data provide sufficient evidence to conclude a significant difference in group distributions, rejecting the null hypothesis."
        output_info['mannwhitneyu'] = {'stat': stat, 'p_val': p_val, 'conclusion': conclusion, 'test_name': test_name, 'normality': normality_bool, 'equal_variance': equal_variances_bool}
else:  # more than two samples
    if normality_bool and equal_variances_bool:
        stat, p_val = f_oneway(*samples)
        test_name = f'One-way ANOVA (with {len(samples)} groups)'
        conclusion = "The data do not provide sufficient evidence to conclude a significant difference among the group means, failing to reject the null hypothesis." if p_val > 0.05 else "The data provide sufficient evidence to conclude a significant difference among the group means, rejecting the null hypothesis."
        output_info['f_oneway'] = {'stat': stat, 'p_val': p_val, 'conclusion': conclusion, 'test_name': test_name, 'normality': normality_bool, 'equal_variance': equal_variances_bool}
    else:
        stat, p_val = kruskal(*samples)
        test_name = f'Kruskal-Wallis H-test (with {len(samples)} groups)'
        conclusion = "The data do not provide sufficient evidence to conclude a significant difference among the group distributions, failing to reject the null hypothesis." if p_val > 0.05 else "The data provide sufficient evidence to conclude a significant difference among the group distributions, rejecting the null hypothesis."
        output_info['kruskal'] = {'stat': stat, 'p_val': p_val, 'conclusion': conclusion, 'test_name': test_name, 'normality': normality_bool, 'equal_variance': equal_variances_bool}
Each branch stores its result in the output_info dictionary.
Constructing Console Output:
# construct console output and return
print(f"< HYPOTHESIS TESTING: {test_name}>\nBased on:\n ➡ Normality assumption: {'✔' if normality_bool else '✘'}\n ➡ Equal variances assumption: {'✔' if equal_variances_bool else '✘'}\n ➡ Nr. of Groups: {len(samples)} groups\n ∴ predict_hypothesis() is performing {test_name}:\n")
print(f"Results of {test_name}:\n ➡ statistic: {stat}\n ➡ p-value: {p_val}\n ∴ Conclusion: {conclusion}\n")
return output_info
The function returns the output_info dictionary containing all the relevant information.
Link to Full Code: predict_hypothesis.py.
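To illustrate the shape of the returned dictionary, the sketch below rebuilds one entry by hand and then unpacks it. It does not call hypothesis_predictor_core_n() itself: the two samples are made-up numbers, and the shortened conclusion strings are placeholders for the full sentences shown in the code breakdown.

```python
from scipy.stats import ttest_ind

# Two small illustrative samples (made-up numbers, not project data).
group_a = [5.1, 4.9, 5.6, 5.8, 5.0, 5.3]
group_b = [6.2, 6.0, 5.9, 6.4, 6.1, 6.3]

res = ttest_ind(group_a, group_b)
p_val = float(res.pvalue)

# Same keys as the entries built in the main function; the outer key is
# the name of the scipy.stats function that was used.
output_info = {
    'ttest_ind': {
        'stat': float(res.statistic),
        'p_val': p_val,
        'conclusion': 'fail to reject H0' if p_val > 0.05 else 'reject H0',
        'test_name': 'Independent Samples T-Test',
        'normality': True,
        'equal_variance': True,
    }
}

# Only one test runs per call, so the dictionary holds a single entry
# and the caller can unpack it without knowing the key in advance.
test_key, result = next(iter(output_info.items()))
print(f"{result['test_name']} ({test_key}): p = {result['p_val']:.4f}")
```

Keying the result by test name lets downstream code branch on which test ran, while the inner dictionary keeps the same fields for all four tests.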
Title: Numerical Hypothesis Testing Function - Back End Core of predict_hypothesis()
Description: The hypothesis_predictor_core_n function conducts hypothesis testing on numerical data with categorical grouping variables, selecting appropriate statistical tests based on data characteristics such as normality and equal variances. This enhancement aims to improve the function's flexibility, error handling, and informative output to facilitate hypothesis testing in data analysis workflows.
Proposed Changes:
Expected Outcome: With this enhancement, the hypothesis_predictor_core_n function will become a more robust and user-friendly tool for conducting hypothesis testing on numerical data with categorical grouping variables. It will provide clearer guidance on the appropriate statistical tests to use based on the characteristics of the data, helping users make informed decisions in their data analysis tasks.
Additional Context: Improving the functionality of the hypothesis_predictor_core_n function aligns with our goal of providing high-quality statistical analysis tools that support efficient and reliable data analysis workflows. This enhancement addresses common needs and challenges encountered in hypothesis testing scenarios involving numerical data and categorical grouping variables, contributing to the overall usability and effectiveness of our data analysis toolkit.