ETA444 / datasafari

DataSafari simplifies complex data science tasks into straightforward, powerful one-liners.
https://datasafari.dev
GNU General Public License v3.0

Develop front-end of predict_hypothesis() #97

ETA444 closed 6 months ago

ETA444 commented 8 months ago

Title: Automated Hypothesis Testing Function

Description: The predict_hypothesis function serves as an automated tool for selecting and executing appropriate hypothesis tests based on input variables from a DataFrame.

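It combines the functionality of several DataSafari modules:

  • evaluate_dtype()
  • evaluate_normality()
  • evaluate_variance()
  • evaluate_contingency_table()

It also relies on the two back-end cores that perform the hypothesis testing (not available in the public API):

  • hypothesis_predictor_core_n()
  • hypothesis_predictor_core_c()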

ETA444 commented 6 months ago

Implementation Summary:

The predict_hypothesis() function is designed to automatically select and execute the appropriate hypothesis tests based on input variables from a DataFrame. It combines the functionality of various modules in DataSafari to evaluate normality, variance, data types, and contingency tables, and uses two core functions (hypothesis_predictor_core_n and hypothesis_predictor_core_c) for hypothesis testing.
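
The selection logic summarized above can be sketched in isolation. This is a minimal, standard-library-only illustration and not the DataSafari implementation: the helper name choose_testing_route and its return shape are hypothetical, and the input dictionary only mimics the 'numerical' / 'categorical' labels that evaluate_dtype() is shown returning in this issue.

```python
def choose_testing_route(data_types, var1, var2):
    """Pick a testing route from dtype labels (hypothetical helper).

    `data_types` maps column names to 'numerical' or 'categorical',
    mirroring the dictionary shape used in this issue.
    """
    kinds = (data_types[var1], data_types[var2])

    # One categorical + one numerical variable: numerical testing, with the
    # categorical variable as the grouping variable.
    if kinds == ('categorical', 'numerical'):
        return {'testing': 'numerical', 'grouping': var1, 'target': var2}
    if kinds == ('numerical', 'categorical'):
        return {'testing': 'numerical', 'grouping': var2, 'target': var1}

    # Two categorical variables: categorical testing.
    if kinds == ('categorical', 'categorical'):
        return {'testing': 'categorical', 'variables': (var1, var2)}

    # Anything else (two numerical variables) cannot be routed to a test.
    raise ValueError("Provide at least one categorical variable.")


# Argument order does not matter: the categorical column becomes the grouping variable.
route = choose_testing_route(
    {'diet': 'categorical', 'weight': 'numerical'}, 'weight', 'diet'
)
print(route)
```

Returning a small dictionary keeps the routing decision separate from the printing and from the actual test execution, which makes the branch logic easy to check on its own.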

Code Breakdown:

  1. Initial Setup and Error Handling:

    • Purpose: Validate input types and values to ensure correct data is being analyzed.
    # Error Handling
    # TypeErrors
    if not isinstance(df, pd.DataFrame):
       raise TypeError("predict_hypothesis(): The 'df' parameter must be a pandas DataFrame.")
    
    if not isinstance(var1, str) or not isinstance(var2, str):
       raise TypeError("predict_hypothesis(): The 'var1' and 'var2' parameters must be strings.")
    
    if not isinstance(normality_method, str):
       raise TypeError("predict_hypothesis(): The 'normality_method' parameter must be a string.")
    
    if not isinstance(variance_method, str):
       raise TypeError("predict_hypothesis(): The 'variance_method' parameter must be a string.")
    
    if not isinstance(exact_tests_alternative, str):
       raise TypeError("predict_hypothesis(): The 'exact_tests_alternative' parameter must be a string.")
    
    if not isinstance(yates_min_sample_size, int):
       raise TypeError("predict_hypothesis(): The 'yates_min_sample_size' parameter must be an integer.")
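    • Explanation:
      • The function first checks that every argument has the expected type, then validates the values: the DataFrame must not be empty, the method and alternative names must be among the supported options (compared case-insensitively), and yates_min_sample_size must be at least 1.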
    # ValueErrors
    if df.empty:
       raise ValueError("predict_hypothesis(): The input DataFrame is empty.")
    
    valid_normality_methods = ['shapiro', 'anderson', 'normaltest', 'lilliefors', 'consensus']
    if normality_method.lower() not in valid_normality_methods:
       raise ValueError(f"predict_hypothesis(): Invalid 'normality_method' value. Expected one of {valid_normality_methods}, got '{normality_method}'.")
    
    valid_variance_methods = ['levene', 'bartlett', 'fligner', 'consensus']
    if variance_method.lower() not in valid_variance_methods:
       raise ValueError(f"predict_hypothesis(): Invalid 'variance_method' value. Expected one of {valid_variance_methods}, got '{variance_method}'.")
    
    valid_alternatives = ['two-sided', 'less', 'greater']
    if exact_tests_alternative.lower() not in valid_alternatives:
       raise ValueError(f"predict_hypothesis(): Invalid 'exact_tests_alternative' value. Expected one of {valid_alternatives}, got '{exact_tests_alternative}'.")
    
    if yates_min_sample_size < 1:
       raise ValueError("predict_hypothesis(): The 'yates_min_sample_size' must be at least 1.")
  2. Determine Hypothesis Testing Type:

    • Purpose: Determine whether to perform numerical or categorical hypothesis testing based on the input variables' data types.
    # Main Function
    # determine appropriate testing procedure and variable interpretation
    data_types = evaluate_dtype(df, col_names=[var1, var2], output='dict')
    
    if data_types[var1] == 'categorical' and data_types[var2] == 'numerical':
       hypothesis_testing = 'numerical'
       grouping_variable = var1
       target_variable = var2
       print(f"< INITIALIZING predict_hypothesis() >\n")
       print(f"Performing {hypothesis_testing} hypothesis testing with:")
       print(f"  ➡ Grouping variable: '{grouping_variable}' (with groups: {df[grouping_variable].unique()})\n  ➡ Target variable: '{target_variable}'\n")
       print(f"Output Contents:\n (1) Results of Normality Testing\n (2) Results of Variance Testing\n (3) Results of Hypothesis Testing\n\n")
    elif data_types[var1] == 'numerical' and data_types[var2] == 'categorical':
       hypothesis_testing = 'numerical'
       grouping_variable = var2
       target_variable = var1
       print(f"< INITIALIZING predict_hypothesis() >\n")
       print(f"Performing {hypothesis_testing} hypothesis testing with:")
       print(f"  ➡ Grouping variable: '{grouping_variable}' (with groups: {df[grouping_variable].unique()})\n  ➡ Target variable: '{target_variable}'\n")
       print(f"Output Contents:\n (1) Results of Normality Testing\n (2) Results of Variance Testing\n (3) Results of Hypothesis Testing\n\n")
    elif data_types[var1] == 'categorical' and data_types[var2] == 'categorical':
       hypothesis_testing = 'categorical'
       categorical_variable1 = var1
       categorical_variable2 = var2
       print(f"< INITIALIZING predict_hypothesis() >\n")
       print(f"Performing {hypothesis_testing} hypothesis testing with:")
       print(f"  ➡ Categorical variable 1: '{categorical_variable1}'\n  ➡ Categorical variable 2: '{categorical_variable2}'\n\n")
    else:
       raise ValueError(f"predict_hypothesis(): Both of the provided variables are numerical.\n - To do numerical hypothesis testing, provide a numerical variable (target variable) and a categorical variable (grouping variable).\n - To do categorical hypothesis testing, provide two categorical variables.")
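    • Explanation:
      • The function calls evaluate_dtype() on var1 and var2. One categorical and one numerical variable lead to numerical testing, with the categorical variable used as the grouping variable and the numerical one as the target. Two categorical variables lead to categorical testing, and two numerical variables raise a ValueError. The function then prints the details of the chosen procedure.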
  3. Execute Hypothesis Testing:

    • Purpose: Perform appropriate hypothesis testing based on the determined type.
    # perform appropriate hypothesis testing process
    if hypothesis_testing == 'numerical':
       # evaluate test assumptions
       normality_bool = evaluate_normality(df, target_variable, grouping_variable, method=normality_method.lower(), pipeline=True)
       equal_variance_bool = evaluate_variance(df, target_variable, grouping_variable, normality_info=normality_bool, method=variance_method.lower(), pipeline=True)
    
       # perform hypothesis testing
       output_info = hypothesis_predictor_core_n(df, target_variable, grouping_variable, normality_bool, equal_variance_bool)
       return output_info
    elif hypothesis_testing == 'categorical':
       # create contingency table
       contingency_table = pd.crosstab(df[categorical_variable1], df[categorical_variable2])
    
       # evaluate test criteria
       chi2_viability, yates_correction_viability, barnard_viability, boschloo_viability, fisher_viability = evaluate_contingency_table(contingency_table, min_sample_size_yates=yates_min_sample_size, pipeline=True, quiet=True)
    
       # perform hypothesis testing
       output_info = hypothesis_predictor_core_c(contingency_table, chi2_viability, barnard_viability, boschloo_viability, fisher_viability, yates_correction_viability, alternative=exact_tests_alternative.lower())
       return output_info
    • Explanation:
      • For numerical testing, the function first evaluates the normality and equal-variance assumptions, then passes both results to hypothesis_predictor_core_n.
      • For categorical testing, it builds a contingency table with pd.crosstab, checks which tests are viable with evaluate_contingency_table, and passes the results to hypothesis_predictor_core_c.
      • In both cases it returns output_info, a dictionary containing the relevant information for each hypothesis test performed.
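
To make the contingency table in the categorical branch of step 3 concrete, here is a minimal sketch of the kind of frequency table pd.crosstab produces, written with the standard library only. The helper cross_tabulate and the sample records are hypothetical and exist only for illustration; the actual function uses pd.crosstab on the DataFrame columns.

```python
from collections import Counter


def cross_tabulate(rows, col_a, col_b):
    """Count how often each pair of categories occurs together.

    `rows` is a list of dicts; returns (row_labels, col_labels, table),
    where table[i][j] is the count for (row_labels[i], col_labels[j]).
    """
    counts = Counter((row[col_a], row[col_b]) for row in rows)
    row_labels = sorted({a for a, _ in counts})
    col_labels = sorted({b for _, b in counts})
    # Counter returns 0 for pairs that never occur, so the table has no gaps.
    table = [[counts[(a, b)] for b in col_labels] for a in row_labels]
    return row_labels, col_labels, table


# Hypothetical sample data: two categorical variables with two levels each.
records = [
    {'smoker': 'yes', 'outcome': 'ill'},
    {'smoker': 'yes', 'outcome': 'ill'},
    {'smoker': 'yes', 'outcome': 'well'},
    {'smoker': 'no', 'outcome': 'ill'},
    {'smoker': 'no', 'outcome': 'well'},
    {'smoker': 'no', 'outcome': 'well'},
]

row_labels, col_labels, table = cross_tabulate(records, 'smoker', 'outcome')
print(row_labels, col_labels, table)
```

A 2x2 table like this one is the shape for which exact tests such as Fisher's are defined, which is presumably why the viability checks run on the table before a test is chosen.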

Link to Full Code: predict_hypothesis.py.