ETA444 / datasafari

DataSafari simplifies complex data science tasks into straightforward, powerful one-liners.
https://datasafari.dev
GNU General Public License v3.0

Develop hypothesis_predictor_core_n() for predict_hypothesis() #89

Closed ETA444 closed 5 months ago

ETA444 commented 6 months ago

Title: Numerical Hypothesis Testing Function - Back End Core of predict_hypothesis()

Description: The hypothesis_predictor_core_n function conducts hypothesis testing on numerical data with categorical grouping variables, selecting appropriate statistical tests based on data characteristics such as normality and equal variances. This enhancement aims to improve the function's flexibility, error handling, and informative output to facilitate hypothesis testing in data analysis workflows.

Proposed Changes:

Expected Outcome: With this enhancement, the hypothesis_predictor_core_n function will become a more robust and user-friendly tool for conducting hypothesis testing on numerical data with categorical grouping variables. It will provide clearer guidance on the appropriate statistical tests to use based on the characteristics of the data, helping users make informed decisions in their data analysis tasks.

Additional Context: Improving the functionality of the hypothesis_predictor_core_n function aligns with our goal of providing high-quality statistical analysis tools that support efficient and reliable data analysis workflows. This enhancement addresses common needs and challenges encountered in hypothesis testing scenarios involving numerical data and categorical grouping variables, contributing to the overall usability and effectiveness of our data analysis toolkit.

ETA444 commented 5 months ago

Implementation Summary:

The hypothesis_predictor_core_n() function tests whether a numerical target variable differs between the groups defined by a categorical grouping variable. It selects the test from the number of groups and two assumption flags, normality and homogeneity of variances. With two groups it runs an independent samples t-test when both assumptions hold and a Mann-Whitney U test otherwise; with more than two groups it runs a one-way ANOVA when both hold and a Kruskal-Wallis H-test otherwise.

Code Breakdown:

  1. Initial Setup and Error Handling:

    • Purpose: Validate input types and values to ensure correct data is being analyzed.
    # Error Handling
    # TypeErrors (message prefix matches the ValueErrors below)
    if not isinstance(df, pd.DataFrame):
       raise TypeError("predictor_core_n(): The 'df' parameter must be a pandas DataFrame.")
    
    if not isinstance(target_variable, str):
       raise TypeError("predictor_core_n(): The 'target_variable' must be a string.")
    
    if not isinstance(grouping_variable, str):
       raise TypeError("predictor_core_n(): The 'grouping_variable' must be a string.")
    
    if not isinstance(normality_bool, bool):
       raise TypeError("predictor_core_n(): The 'normality_bool' must be a boolean.")
    
    if not isinstance(equal_variances_bool, bool):
       raise TypeError("predictor_core_n(): The 'equal_variances_bool' must be a boolean.")
    
    # ValueErrors
    if df.empty:
       raise ValueError("predictor_core_n(): The input DataFrame is empty.")
    
    if target_variable not in df.columns:
       raise ValueError(f"predictor_core_n(): The target variable '{target_variable}' was not found in the DataFrame.")
    
    if grouping_variable not in df.columns:
       raise ValueError(f"predictor_core_n(): The grouping variable '{grouping_variable}' was not found in the DataFrame.")
    
    target_variable_is_numerical = evaluate_dtype(df, [target_variable], output='list_n')[0]
    if not target_variable_is_numerical:
       raise ValueError(f"predictor_core_n(): The target variable '{target_variable}' must be a numerical variable.")
    
    grouping_variable_is_categorical = evaluate_dtype(df, [grouping_variable], output='list_c')[0]
    if not grouping_variable_is_categorical:
       raise ValueError(f"predictor_core_n(): The grouping variable '{grouping_variable}' must be a categorical variable.")
    • Explanation:
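      • The function first checks argument types: df must be a pandas DataFrame, both variable names must be strings, and both assumption flags must be booleans. It then validates values: the DataFrame must not be empty, both columns must exist in it, the target variable must be numerical, and the grouping variable must be categorical (the last two are checked with evaluate_dtype()).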
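One practical consequence of the strict isinstance(..., bool) checks is that truthy values such as 1 are rejected, and the same holds for numpy.bool_ values produced by NumPy comparisons, which would need an explicit bool(...) conversion first. The following is a minimal sketch of that behavior; check_flag is a hypothetical helper written for illustration, not part of DataSafari.

```python
def check_flag(name, value):
    """Mirror of the boolean type checks above (hypothetical helper)."""
    if not isinstance(value, bool):
        raise TypeError(f"The '{name}' must be a boolean.")
    return value


# A real Python bool passes the check unchanged.
accepted = check_flag('normality_bool', True)

# An int is rejected even though bool is a subclass of int.
try:
    check_flag('normality_bool', 1)
    int_rejected = False
except TypeError:
    int_rejected = True

# Converting explicitly before the call avoids the TypeError.
converted = check_flag('normality_bool', bool(1))
```

Callers that compute the flags from other results may therefore want to wrap them in bool(...) before passing them in.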
  2. Main Function:

    • Purpose: Determine the appropriate test based on data characteristics and perform the hypothesis testing.
    # Main Function
    groups = df[grouping_variable].unique().tolist()
    samples = [df[df[grouping_variable] == group][target_variable] for group in groups]
    
    # define output object
    output_info = {}
    
    if len(samples) == 2:  # two sample testing
       if normality_bool and equal_variances_bool:  # parametric testing
           stat, p_val = ttest_ind(*samples)
           test_name = 'Independent Samples T-Test'
           conclusion = "The data do not provide sufficient evidence to conclude a significant difference between the group means, failing to reject the null hypothesis." if p_val > 0.05 else "The data provide sufficient evidence to conclude a significant difference between the group means, rejecting the null hypothesis."
           output_info['ttest_ind'] = {'stat': stat, 'p_val': p_val, 'conclusion': conclusion, 'test_name': test_name, 'normality': normality_bool, 'equal_variance': equal_variances_bool}
       else:  # non-parametric testing
           stat, p_val = mannwhitneyu(*samples)
           test_name = 'Mann-Whitney U Rank Test (Two Independent Samples)'
           conclusion = "The data do not provide sufficient evidence to conclude a significant difference in group distributions, failing to reject the null hypothesis." if p_val > 0.05 else "The data provide sufficient evidence to conclude a significant difference in group distributions, rejecting the null hypothesis."
           output_info['mannwhitneyu'] = {'stat': stat, 'p_val': p_val, 'conclusion': conclusion, 'test_name': test_name, 'normality': normality_bool, 'equal_variance': equal_variances_bool}
    else:  # more than two samples
       if normality_bool and equal_variances_bool:
           stat, p_val = f_oneway(*samples)
           test_name = f'One-way ANOVA (with {len(samples)} groups)'
           conclusion = "The data do not provide sufficient evidence to conclude a significant difference among the group means, failing to reject the null hypothesis." if p_val > 0.05 else "The data provide sufficient evidence to conclude a significant difference among the group means, rejecting the null hypothesis."
           output_info['f_oneway'] = {'stat': stat, 'p_val': p_val, 'conclusion': conclusion, 'test_name': test_name, 'normality': normality_bool, 'equal_variance': equal_variances_bool}
       else:
           stat, p_val = kruskal(*samples)
           test_name = f'Kruskal-Wallis H-test (with {len(samples)} groups)'
           conclusion = "The data do not provide sufficient evidence to conclude a significant difference among the group distributions, failing to reject the null hypothesis." if p_val > 0.05 else "The data provide sufficient evidence to conclude a significant difference among the group distributions, rejecting the null hypothesis."
           output_info['kruskal'] = {'stat': stat, 'p_val': p_val, 'conclusion': conclusion, 'test_name': test_name, 'normality': normality_bool, 'equal_variance': equal_variances_bool}
    • Explanation:
      • The function checks the number of groups defined by the categorical variable and then applies an appropriate test based on the data characteristics.
      • Two-Group Testing:
      • If there are exactly two groups, the function runs an independent samples t-test (parametric) when both the normality and equal variances assumptions hold, and a Mann-Whitney U test (non-parametric) otherwise.
      • It then constructs a conclusion statement from the p-value: the null hypothesis is rejected when the p-value is 0.05 or lower.
      • Multiple-Group Testing:
      • For any other number of groups, the function runs a one-way ANOVA (parametric) when both assumptions hold and a Kruskal-Wallis H-test (non-parametric) otherwise. This branch is intended for more than two groups; a grouping variable with a single group is not handled separately.
      • It constructs a similar conclusion statement using the same 0.05 threshold.
      • The results, including the test statistic, p-value, conclusion, test name, and the two assumption flags, are saved in the output_info dictionary under the name of the SciPy function that was run.
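The selection logic described above can be condensed into a small decision function. This is a sketch under stated assumptions rather than the project's implementation: select_test_name is a hypothetical helper, and the guard for fewer than two groups is an addition for illustration that the code above does not contain.

```python
def select_test_name(n_groups, normality, equal_variances):
    """Return the scipy.stats function name for a given design (hypothetical helper)."""
    if n_groups < 2:
        # Assumption: guard added for this sketch only.
        raise ValueError("At least two groups are required.")

    # Parametric tests are used only when BOTH assumptions hold.
    parametric = normality and equal_variances

    if n_groups == 2:
        return 'ttest_ind' if parametric else 'mannwhitneyu'
    return 'f_oneway' if parametric else 'kruskal'


two_group_choice = select_test_name(2, normality=True, equal_variances=False)
three_group_choice = select_test_name(3, normality=True, equal_variances=True)
```

The returned names are the same strings the function uses as keys in output_info, so a single failed assumption is enough to move from the parametric to the non-parametric test.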
  3. Constructing Console Output:

    • Purpose: Display the results of the hypothesis testing in a clear format.
    # construct console output and return
    print(f"< HYPOTHESIS TESTING: {test_name}>\nBased on:\n  ➡ Normality assumption: {'✔' if normality_bool else '✘'}\n  ➡ Equal variances assumption: {'✔' if equal_variances_bool else '✘'}\n  ➡ Nr. of Groups: {len(samples)} groups\n  ∴ predict_hypothesis() is performing {test_name}:\n")
    print(f"Results of {test_name}:\n  ➡ statistic: {stat}\n  ➡ p-value: {p_val}\n  ∴ Conclusion: {conclusion}\n")
    return output_info
    • Explanation:
      • The function prints two formatted blocks: a header with the test name, the two assumption flags, and the number of groups, followed by the statistic, p-value, and conclusion.
      • The assumption flags are shown as ✔ or ✘ so the reason for the chosen test is visible at a glance.
      • The function then returns the output_info dictionary, which holds a single entry keyed by the SciPy function name of the test that was run (for example 'ttest_ind').
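Because the returned dictionary holds one entry whose key depends on the test that ran, calling code cannot hard-code the key. A minimal sketch of reading the result generically follows; the dictionary is hand-written to match the shape described above, and its numbers are made up for illustration, not real test results.

```python
# Illustrative stand-in for the returned dictionary (values are invented).
output_info = {
    'mannwhitneyu': {
        'stat': 312.0,
        'p_val': 0.012,
        'conclusion': 'placeholder conclusion text',
        'test_name': 'Mann-Whitney U Rank Test (Two Independent Samples)',
        'normality': False,
        'equal_variance': True,
    }
}

# Take the single (key, result) pair without knowing the key in advance.
test_key, result = next(iter(output_info.items()))

# Same threshold the conclusion strings use: reject when p_val <= 0.05.
significant = result['p_val'] <= 0.05
summary = f"{result['test_name']}: p = {result['p_val']:.3f}"
```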

Link to Full Code: predict_hypothesis.py.