Develop front-end of predict_hypothesis()

Implementation Summary:

The predict_hypothesis() function automatically selects and runs the appropriate hypothesis tests for two variables from a DataFrame. It combines the DataSafari modules that evaluate data types, normality, variance, and contingency tables, then delegates the testing itself to one of two core functions: hypothesis_predictor_core_n for numerical testing or hypothesis_predictor_core_c for categorical testing.

Code Breakdown:

Initial Setup and Error Handling:

Purpose: Validate input types and values to ensure correct data is being analyzed.

# Error Handling
# TypeErrors
if not isinstance(df, pd.DataFrame):
   raise TypeError("predict_hypothesis(): The 'df' parameter must be a pandas DataFrame.")

if not isinstance(var1, str) or not isinstance(var2, str):
   raise TypeError("predict_hypothesis(): The 'var1' and 'var2' parameters must be strings.")

if not isinstance(normality_method, str):
   raise TypeError("predict_hypothesis(): The 'normality_method' parameter must be a string.")

if not isinstance(variance_method, str):
   raise TypeError("predict_hypothesis(): The 'variance_method' parameter must be a string.")

if not isinstance(exact_tests_alternative, str):
   raise TypeError("predict_hypothesis(): The 'exact_tests_alternative' parameter must be a string.")

if not isinstance(yates_min_sample_size, int):
   raise TypeError("predict_hypothesis(): The 'yates_min_sample_size' parameter must be an integer.")

Explanation:
- The function checks input types first (the TypeError checks above) and then validates input values (the ValueError checks below), so invalid arguments fail with a descriptive message before any analysis starts.

# ValueErrors
if df.empty:
   raise ValueError("predict_hypothesis(): The input DataFrame is empty.")

valid_normality_methods = ['shapiro', 'anderson', 'normaltest', 'lilliefors', 'consensus']
if normality_method.lower() not in valid_normality_methods:
   raise ValueError(f"predict_hypothesis(): Invalid 'normality_method' value. Expected one of {valid_normality_methods}, got '{normality_method}'.")

valid_variance_methods = ['levene', 'bartlett', 'fligner', 'consensus']
if variance_method.lower() not in valid_variance_methods:
   raise ValueError(f"predict_hypothesis(): Invalid 'variance_method' value. Expected one of {valid_variance_methods}, got '{variance_method}'.")

valid_alternatives = ['two-sided', 'less', 'greater']
if exact_tests_alternative.lower() not in valid_alternatives:
   raise ValueError(f"predict_hypothesis(): Invalid 'exact_tests_alternative' value. Expected one of {valid_alternatives}, got '{exact_tests_alternative}'.")

if yates_min_sample_size < 1:
   raise ValueError("predict_hypothesis(): The 'yates_min_sample_size' must be at least 1.")
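
One subtlety of the checks above: bool is a subclass of int in Python, so isinstance(True, int) is True and yates_min_sample_size=True would pass both the type check and the "at least 1" check. The sketch below shows a stricter variant. The helper name validate_min_sample_size is hypothetical and not part of DataSafari; it only illustrates the idea.

```python
def validate_min_sample_size(value):
    """Hypothetical helper: stricter version of the yates_min_sample_size checks."""
    # bool is a subclass of int, so it has to be excluded explicitly
    if isinstance(value, bool) or not isinstance(value, int):
        raise TypeError("The 'yates_min_sample_size' parameter must be an integer.")
    if value < 1:
        raise ValueError("The 'yates_min_sample_size' must be at least 1.")
    return value


# The plain isinstance check on its own accepts booleans
print(isinstance(True, int))

# The stricter helper rejects booleans, non-integers, and values below 1
for candidate in (40, True, 0, "40"):
    try:
        print(repr(candidate), "->", validate_min_sample_size(candidate))
    except (TypeError, ValueError) as error:
        print(repr(candidate), "->", type(error).__name__)
```

Whether rejecting booleans is worth the extra condition is a design choice; the original checks are sufficient for ordinary integer input.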

Determine Hypothesis Testing Type:

Purpose: Determine whether to perform numerical or categorical hypothesis testing based on the input variables' data types.

# Main Function
# determine appropriate testing procedure and variable interpretation
data_types = evaluate_dtype(df, col_names=[var1, var2], output='dict')

if data_types[var1] == 'categorical' and data_types[var2] == 'numerical':
   hypothesis_testing = 'numerical'
   grouping_variable = var1
   target_variable = var2
   print(f"< INITIALIZING predict_hypothesis() >\n")
   print(f"Performing {hypothesis_testing} hypothesis testing with:")
   print(f"  ➡ Grouping variable: '{grouping_variable}' (with groups: {df[grouping_variable].unique()})\n  ➡ Target variable: '{target_variable}'\n")
   print(f"Output Contents:\n (1) Results of Normality Testing\n (2) Results of Variance Testing\n (3) Results of Hypothesis Testing\n\n")
elif data_types[var1] == 'numerical' and data_types[var2] == 'categorical':
   hypothesis_testing = 'numerical'
   grouping_variable = var2
   target_variable = var1
   print(f"< INITIALIZING predict_hypothesis() >\n")
   print(f"Performing {hypothesis_testing} hypothesis testing with:")
   print(f"  ➡ Grouping variable: '{grouping_variable}' (with groups: {df[grouping_variable].unique()})\n  ➡ Target variable: '{target_variable}'\n")
   print(f"Output Contents:\n (1) Results of Normality Testing\n (2) Results of Variance Testing\n (3) Results of Hypothesis Testing\n\n")
elif data_types[var1] == 'categorical' and data_types[var2] == 'categorical':
   hypothesis_testing = 'categorical'
   categorical_variable1 = var1
   categorical_variable2 = var2
   print(f"< INITIALIZING predict_hypothesis() >\n")
   print(f"Performing {hypothesis_testing} hypothesis testing with:")
   print(f"  ➡ Categorical variable 1: '{categorical_variable1}'\n  ➡ Categorical variable 2: '{categorical_variable2}'\n\n")
else:
   raise ValueError(f"predict_hypothesis(): Both of the provided variables are numerical.\n - To do numerical hypothesis testing, provide a numerical variable (target variable) and a categorical variable (grouping variable).\n - To do categorical hypothesis testing, provide two categorical variables.")

Explanation:
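- The function chooses numerical or categorical testing from the data types of var1 and var2. For numerical testing the argument order does not matter: whichever variable is categorical becomes the grouping variable, and the numerical one becomes the target variable. Any other combination of types reaches the else branch and raises a ValueError. The function also prints the chosen testing type and variable roles.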
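
The role assignment described above can be sketched in isolation. This is a minimal illustration under stated assumptions: simple_dtype is a hypothetical stand-in for evaluate_dtype() that treats every non-numeric column as categorical, and assign_roles is a hypothetical helper name; neither reflects how DataSafari classifies columns internally.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def simple_dtype(series):
    """Hypothetical stand-in for evaluate_dtype(): numeric columns are
    'numerical', everything else is treated as 'categorical'."""
    return 'numerical' if is_numeric_dtype(series) else 'categorical'


def assign_roles(df, var1, var2):
    """Return (testing type, first role, second role).

    For numerical testing the roles are (grouping variable, target variable);
    for categorical testing they are the two categorical variables in order.
    """
    types = (simple_dtype(df[var1]), simple_dtype(df[var2]))
    if types == ('categorical', 'numerical'):
        return 'numerical', var1, var2
    if types == ('numerical', 'categorical'):
        return 'numerical', var2, var1
    if types == ('categorical', 'categorical'):
        return 'categorical', var1, var2
    raise ValueError("Both variables are numerical; at least one must be categorical.")


df = pd.DataFrame({
    'group': ['a', 'a', 'b', 'b'],
    'score': [1.2, 3.4, 2.2, 5.1],
    'passed': ['yes', 'no', 'yes', 'yes'],
})

# Argument order does not change the roles for numerical testing
print(assign_roles(df, 'group', 'score'))
print(assign_roles(df, 'score', 'group'))
print(assign_roles(df, 'group', 'passed'))
```

Returning the roles as a tuple, rather than setting branch-local variables, would also let the three print blocks in the original be written once.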

Execute Hypothesis Testing:

Purpose: Perform appropriate hypothesis testing based on the determined type.

# perform appropriate hypothesis testing process
if hypothesis_testing == 'numerical':
   # evaluate test assumptions
   normality_bool = evaluate_normality(df, target_variable, grouping_variable, method=normality_method.lower(), pipeline=True)
   equal_variance_bool = evaluate_variance(df, target_variable, grouping_variable, normality_info=normality_bool, method=variance_method.lower(), pipeline=True)

   # perform hypothesis testing
   output_info = hypothesis_predictor_core_n(df, target_variable, grouping_variable, normality_bool, equal_variance_bool)
   return output_info
elif hypothesis_testing == 'categorical':
   # create contingency table
   contingency_table = pd.crosstab(df[categorical_variable1], df[categorical_variable2])

   # evaluate test criteria
   chi2_viability, yates_correction_viability, barnard_viability, boschloo_viability, fisher_viability = evaluate_contingency_table(contingency_table, min_sample_size_yates=yates_min_sample_size, pipeline=True, quiet=True)

   # perform hypothesis testing
   output_info = hypothesis_predictor_core_c(contingency_table, chi2_viability, barnard_viability, boschloo_viability, fisher_viability, yates_correction_viability, alternative=exact_tests_alternative.lower())
   return output_info

Explanation:
- For numerical testing, the function first evaluates the normality and equal-variance assumptions, then passes both results to hypothesis_predictor_core_n.
- For categorical testing, it builds a contingency table with pd.crosstab, uses evaluate_contingency_table to determine which tests are viable (chi-square, Yates' correction, Barnard, Boschloo, Fisher), and passes those flags to hypothesis_predictor_core_c.
- Either branch returns the output_info dictionary, which contains the relevant results for the hypothesis tests that were run.
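
To make the categorical branch more concrete, the sketch below builds a contingency table with pd.crosstab and derives the expected frequencies by hand. The data and column names are invented for illustration. The "all expected counts at least 5" check is a common textbook rule of thumb for the chi-square test and is an assumption here; it is not necessarily the criterion that evaluate_contingency_table applies.

```python
import numpy as np
import pandas as pd

# Invented example data: two categorical variables
df = pd.DataFrame({
    'treatment': ['drug', 'drug', 'drug', 'placebo', 'placebo', 'placebo', 'drug', 'placebo'],
    'outcome': ['improved', 'improved', 'same', 'same', 'same', 'improved', 'improved', 'same'],
})

# Same call as in the categorical branch: rows = first variable, columns = second
contingency_table = pd.crosstab(df['treatment'], df['outcome'])
print(contingency_table)

# Expected count per cell under independence: row total * column total / n
observed = contingency_table.to_numpy()
row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
expected = row_totals * col_totals / observed.sum()
print(expected)

# Assumed rule of thumb (not DataSafari's confirmed logic):
# chi-square is considered reasonable when every expected count is >= 5
chi2_rule_of_thumb = bool((expected >= 5).all())

# Exact tests such as Fisher's are classically defined for 2x2 tables
is_2x2 = contingency_table.shape == (2, 2)

print("chi-square rule of thumb satisfied:", chi2_rule_of_thumb)
print("2x2 table:", is_2x2)
```

With a sample this small every expected count falls below 5, which is the kind of situation where an exact test is usually preferred over chi-square.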

Link to Full Code: predict_hypothesis.py.

ETA444 / datasafari

Develop front-end of predict_hypothesis() #97