Closed ETA444 closed 6 months ago
Implementation Summary:
The predict_hypothesis() function is designed to automatically select and execute the appropriate hypothesis tests based on input variables from a DataFrame. It combines the functionality of various modules in DataSafari to evaluate normality, variance, data types, and contingency tables, and uses two core functions (hypothesis_predictor_core_n and hypothesis_predictor_core_c) for hypothesis testing.
Code Breakdown:
Initial Setup and Error Handling:
# Error Handling
# TypeErrors
if not isinstance(df, pd.DataFrame):
    raise TypeError("predict_hypothesis(): The 'df' parameter must be a pandas DataFrame.")
if not isinstance(var1, str) or not isinstance(var2, str):
    raise TypeError("predict_hypothesis(): The 'var1' and 'var2' parameters must be strings.")
if not isinstance(normality_method, str):
    raise TypeError("predict_hypothesis(): The 'normality_method' parameter must be a string.")
if not isinstance(variance_method, str):
    raise TypeError("predict_hypothesis(): The 'variance_method' parameter must be a string.")
if not isinstance(exact_tests_alternative, str):
    raise TypeError("predict_hypothesis(): The 'exact_tests_alternative' parameter must be a string.")
if not isinstance(yates_min_sample_size, int):
    raise TypeError("predict_hypothesis(): The 'yates_min_sample_size' parameter must be an integer.")
# ValueErrors
if df.empty:
    # message prefix corrected to name this function
    raise ValueError("predict_hypothesis(): The input DataFrame is empty.")
valid_normality_methods = ['shapiro', 'anderson', 'normaltest', 'lilliefors', 'consensus']
if normality_method.lower() not in valid_normality_methods:
    raise ValueError(f"predict_hypothesis(): Invalid 'normality_method' value. Expected one of {valid_normality_methods}, got '{normality_method}'.")
valid_variance_methods = ['levene', 'bartlett', 'fligner', 'consensus']
if variance_method.lower() not in valid_variance_methods:
    raise ValueError(f"predict_hypothesis(): Invalid 'variance_method' value. Expected one of {valid_variance_methods}, got '{variance_method}'.")
valid_alternatives = ['two-sided', 'less', 'greater']
if exact_tests_alternative.lower() not in valid_alternatives:
    raise ValueError(f"predict_hypothesis(): Invalid 'exact_tests_alternative' value. Expected one of {valid_alternatives}, got '{exact_tests_alternative}'.")
if yates_min_sample_size < 1:
    raise ValueError("predict_hypothesis(): The 'yates_min_sample_size' must be at least 1.")
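The string-option checks above all follow the same pattern: verify the type first, then compare the lowercased value against a list of accepted options. A minimal sketch of that pattern is shown below. The helper name validate_choice is hypothetical and is not part of DataSafari; it only illustrates the order of the checks.

```python
def validate_choice(name, value, valid_options, func_name="predict_hypothesis"):
    """Hypothetical helper: type-check a string option, then check it against a list."""
    if not isinstance(value, str):
        raise TypeError(f"{func_name}(): The '{name}' parameter must be a string.")
    normalized = value.lower()
    if normalized not in valid_options:
        raise ValueError(
            f"{func_name}(): Invalid '{name}' value. "
            f"Expected one of {valid_options}, got '{value}'."
        )
    return normalized


VALID_NORMALITY_METHODS = ['shapiro', 'anderson', 'normaltest', 'lilliefors', 'consensus']

# Case differences are tolerated because the value is lowercased before the lookup.
method = validate_choice('normality_method', 'Shapiro', VALID_NORMALITY_METHODS)

# A wrong type is reported before any value check runs.
try:
    validate_choice('normality_method', 42, VALID_NORMALITY_METHODS)
except TypeError as exc:
    type_error_message = str(exc)

# An unknown option raises ValueError and lists the accepted options.
try:
    validate_choice('normality_method', 'ks', VALID_NORMALITY_METHODS)
except ValueError as exc:
    value_error_message = str(exc)
```

Running the type checks before the value checks means a call such as normality_method=42 fails with a TypeError rather than an AttributeError from calling .lower() on an integer.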
Determine Hypothesis Testing Type:
# Main Function
# determine appropriate testing procedure and variable interpretation
data_types = evaluate_dtype(df, col_names=[var1, var2], output='dict')
if data_types[var1] == 'categorical' and data_types[var2] == 'numerical':
    hypothesis_testing = 'numerical'
    grouping_variable = var1
    target_variable = var2
    print(f"< INITIALIZING predict_hypothesis() >\n")
    print(f"Performing {hypothesis_testing} hypothesis testing with:")
    print(f" ➡ Grouping variable: '{grouping_variable}' (with groups: {df[grouping_variable].unique()})\n ➡ Target variable: '{target_variable}'\n")
    print(f"Output Contents:\n (1) Results of Normality Testing\n (2) Results of Variance Testing\n (3) Results of Hypothesis Testing\n\n")
elif data_types[var1] == 'numerical' and data_types[var2] == 'categorical':
    hypothesis_testing = 'numerical'
    grouping_variable = var2
    target_variable = var1
    print(f"< INITIALIZING predict_hypothesis() >\n")
    print(f"Performing {hypothesis_testing} hypothesis testing with:")
    print(f" ➡ Grouping variable: '{grouping_variable}' (with groups: {df[grouping_variable].unique()})\n ➡ Target variable: '{target_variable}'\n")
    print(f"Output Contents:\n (1) Results of Normality Testing\n (2) Results of Variance Testing\n (3) Results of Hypothesis Testing\n\n")
elif data_types[var1] == 'categorical' and data_types[var2] == 'categorical':
    hypothesis_testing = 'categorical'
    categorical_variable1 = var1
    categorical_variable2 = var2
    print(f"< INITIALIZING predict_hypothesis() >")
    print(f"Performing {hypothesis_testing} hypothesis testing with:\n")
    print(f" ➡ Categorical variable 1: '{categorical_variable1}'\n ➡ Categorical variable 2: '{categorical_variable2}'\n\n")
else:
    raise ValueError(f"predict_hypothesis(): Both of the provided variables are numerical.\n - To do numerical hypothesis testing, provide a numerical variable (target variable) and a categorical variable (grouping variable).\n - To do categorical hypothesis testing, provide two categorical variables.")
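The branching above can be read as a mapping from a pair of data types to a testing type and a set of variable roles. The sketch below isolates that mapping in plain Python. The helper resolve_testing_roles and the column names 'diet' and 'weight' are made up for illustration; the input dictionary only assumes the {column: 'categorical' | 'numerical'} shape that the code above reads from evaluate_dtype(..., output='dict').

```python
def resolve_testing_roles(data_types, var1, var2):
    """Hypothetical helper mirroring the branching on the data-type dictionary."""
    pair = (data_types[var1], data_types[var2])
    if pair == ('categorical', 'numerical'):
        return {'hypothesis_testing': 'numerical',
                'grouping_variable': var1, 'target_variable': var2}
    if pair == ('numerical', 'categorical'):
        # Same test type; only the roles of the two variables are swapped.
        return {'hypothesis_testing': 'numerical',
                'grouping_variable': var2, 'target_variable': var1}
    if pair == ('categorical', 'categorical'):
        return {'hypothesis_testing': 'categorical',
                'categorical_variable1': var1, 'categorical_variable2': var2}
    raise ValueError("Both of the provided variables are numerical.")


# Illustrative data-type dictionary with made-up column names.
example_types = {'diet': 'categorical', 'weight': 'numerical'}

roles_a = resolve_testing_roles(example_types, 'diet', 'weight')
roles_b = resolve_testing_roles(example_types, 'weight', 'diet')
```

Both calls resolve to the same roles, which reflects that the order in which the two variables are passed does not matter for numerical testing.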
The function determines the type of hypothesis testing from the data types of var1 and var2. It prints out relevant details for the chosen hypothesis testing type.
Execute Hypothesis Testing:
# perform appropriate hypothesis testing process
if hypothesis_testing == 'numerical':
    # evaluate test assumptions
    normality_bool = evaluate_normality(df, target_variable, grouping_variable, method=normality_method.lower(), pipeline=True)
    equal_variance_bool = evaluate_variance(df, target_variable, grouping_variable, normality_info=normality_bool, method=variance_method.lower(), pipeline=True)
    # perform hypothesis testing
    output_info = hypothesis_predictor_core_n(df, target_variable, grouping_variable, normality_bool, equal_variance_bool)
    return output_info
elif hypothesis_testing == 'categorical':
    # create contingency table
    contingency_table = pd.crosstab(df[categorical_variable1], df[categorical_variable2])
    # evaluate test criteria
    chi2_viability, yates_correction_viability, barnard_viability, boschloo_viability, fisher_viability = evaluate_contingency_table(contingency_table, min_sample_size_yates=yates_min_sample_size, pipeline=True, quiet=True)
    # perform hypothesis testing
    output_info = hypothesis_predictor_core_c(contingency_table, chi2_viability, barnard_viability, boschloo_viability, fisher_viability, yates_correction_viability, alternative=exact_tests_alternative.lower())
    return output_info
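The categorical branch above starts from a contingency table, that is, a count of how often each pair of categories occurs together. As a rough illustration of what that table holds, the sketch below builds the same kind of counts in plain Python. It is a stand-in for pd.crosstab and not the DataSafari implementation; the helper contingency_counts and the sample observations are made up.

```python
from collections import Counter


def contingency_counts(rows, cols):
    """Count co-occurrences of paired categorical values (plain-Python stand-in for pd.crosstab)."""
    pair_counts = Counter(zip(rows, cols))
    # Labels are sorted here only to give the table a stable layout.
    row_labels = sorted(set(rows))
    col_labels = sorted(set(cols))
    table = [[pair_counts[(r, c)] for c in col_labels] for r in row_labels]
    return row_labels, col_labels, table


# Made-up observations for two categorical variables.
smoker = ['yes', 'no', 'no', 'yes', 'no', 'yes']
outcome = ['ill', 'well', 'well', 'ill', 'ill', 'well']

row_labels, col_labels, table = contingency_counts(smoker, outcome)
total_observations = sum(sum(row) for row in table)
```

Here table is [[1, 2], [2, 1]] with rows ['no', 'yes'] and columns ['ill', 'well']. A table of this shape, together with its cell counts and total sample size, is the input on which the viability flags in the code above are evaluated.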
The function calls hypothesis_predictor_core_n for numerical hypothesis testing or hypothesis_predictor_core_c for categorical hypothesis testing, based on the determined type. It returns the output_info dictionary containing the relevant information for each type of hypothesis test.
Link to Full Code: predict_hypothesis.py.
Title: Automated Hypothesis Testing Function
Description: The predict_hypothesis function serves as an automated tool for selecting and executing appropriate hypothesis tests based on input variables from a DataFrame. It combines the functionality of various modules in DataSafari:
evaluate_normality()
evaluate_variance()
evaluate_dtype()
evaluate_contingency_table()
As well as the two back-end cores that do the hypothesis testing (not available in public API):
hypothesis_predictor_core_n()
hypothesis_predictor_core_c()