Develop hypothesis_predictor_core_n() for predict_hypothesis()

Implementation Summary:

The hypothesis_predictor_core_n() function compares a numerical target variable across groups defined by a categorical variable. It chooses the test from the number of groups and two assumption flags passed in by the caller: normality of the data and homogeneity of variances across groups. With two groups it runs an independent samples t-test when both assumptions hold and a Mann-Whitney U test otherwise. With more than two groups it runs a one-way ANOVA when both assumptions hold and a Kruskal-Wallis test otherwise.
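The selection rule in the summary can be written as a small decision function. This is only a sketch: choose_test() is a hypothetical helper, not part of datasafari, and the guard against fewer than two groups is an added assumption that the excerpt below does not contain.

```python
def choose_test(n_groups: int, normality: bool, equal_variances: bool) -> str:
    """Return the key of the test the selection rule would pick (hypothetical helper)."""
    if n_groups < 2:
        # Added assumption: a comparison needs at least two groups.
        raise ValueError("At least two groups are needed for a comparison.")
    # The parametric route is taken only when BOTH assumptions hold.
    parametric = normality and equal_variances
    if n_groups == 2:
        return "ttest_ind" if parametric else "mannwhitneyu"
    return "f_oneway" if parametric else "kruskal"


# Enumerate every combination for two and three groups.
for n_groups in (2, 3):
    for normality in (True, False):
        for equal_variances in (True, False):
            print(n_groups, normality, equal_variances,
                  choose_test(n_groups, normality, equal_variances))
```

Note that a single failed assumption is enough to switch to the non-parametric test.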

Code Breakdown:

Initial Setup and Error Handling:

Purpose: Validate input types and values to ensure correct data is being analyzed.

# Error Handling
# TypeErrors
if not isinstance(df, pd.DataFrame):
   raise TypeError("predictor_core_n(): The 'df' parameter must be a pandas DataFrame.")

if not isinstance(target_variable, str):
   raise TypeError("predictor_core_n(): The 'target_variable' must be a string.")

if not isinstance(grouping_variable, str):
   raise TypeError("predictor_core_n(): The 'grouping_variable' must be a string.")

if not isinstance(normality_bool, bool):
   raise TypeError("predictor_core_n(): The 'normality_bool' must be a boolean.")

if not isinstance(equal_variances_bool, bool):
   raise TypeError("predictor_core_n(): The 'equal_variances_bool' must be a boolean.")

# ValueErrors
if df.empty:
   raise ValueError("predictor_core_n(): The input DataFrame is empty.")

if target_variable not in df.columns:
   raise ValueError(f"predictor_core_n(): The target variable '{target_variable}' was not found in the DataFrame.")

if grouping_variable not in df.columns:
   raise ValueError(f"predictor_core_n(): The grouping variable '{grouping_variable}' was not found in the DataFrame.")

target_variable_is_numerical = evaluate_dtype(df, [target_variable], output='list_n')[0]
if not target_variable_is_numerical:
   raise ValueError(f"predictor_core_n(): The target variable '{target_variable}' must be a numerical variable.")

grouping_variable_is_categorical = evaluate_dtype(df, [grouping_variable], output='list_c')[0]
if not grouping_variable_is_categorical:
   raise ValueError(f"predictor_core_n(): The grouping variable '{grouping_variable}' must be a categorical variable.")

Explanation:
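- The function first checks parameter types and raises a TypeError on a mismatch. It then validates the values and raises a ValueError if the DataFrame is empty, if either column is missing, or if evaluate_dtype() does not classify the target variable as numerical and the grouping variable as categorical.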
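The same order of checks (types first, then values) can be tried out in isolation. The sketch below is a simplified stand-in, not the datasafari code: validate_inputs() is a hypothetical name, and pandas' own is_numeric_dtype() is used in place of evaluate_dtype(), which may classify columns differently.

```python
import pandas as pd


def validate_inputs(df, target_variable, grouping_variable):
    """Hypothetical, simplified version of the checks described above."""
    # Type checks come first, so later value checks can rely on the types.
    if not isinstance(df, pd.DataFrame):
        raise TypeError("The 'df' parameter must be a pandas DataFrame.")
    for name, value in (("target_variable", target_variable),
                        ("grouping_variable", grouping_variable)):
        if not isinstance(value, str):
            raise TypeError(f"The '{name}' must be a string.")

    # Value checks.
    if df.empty:
        raise ValueError("The input DataFrame is empty.")
    for column in (target_variable, grouping_variable):
        if column not in df.columns:
            raise ValueError(f"The column '{column}' was not found in the DataFrame.")

    # Assumption: a plain pandas dtype test stands in for evaluate_dtype().
    if not pd.api.types.is_numeric_dtype(df[target_variable]):
        raise ValueError(f"The target variable '{target_variable}' must be numerical.")


frame = pd.DataFrame({"group": ["a", "a", "b", "b"], "value": [1.0, 2.0, 3.0, 4.0]})
validate_inputs(frame, "value", "group")  # passes silently
```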

Main Function:

Purpose: Determine the appropriate test based on data characteristics and perform the hypothesis testing.

# Main Function
groups = df[grouping_variable].unique().tolist()
samples = [df[df[grouping_variable] == group][target_variable] for group in groups]

# define output object
output_info = {}

if len(samples) == 2:  # two sample testing
   if normality_bool and equal_variances_bool:  # parametric testing
       stat, p_val = ttest_ind(*samples)
       test_name = 'Independent Samples T-Test'
       conclusion = "The data do not provide sufficient evidence to conclude a significant difference between the group means, failing to reject the null hypothesis." if p_val > 0.05 else "The data provide sufficient evidence to conclude a significant difference between the group means, rejecting the null hypothesis."
       output_info['ttest_ind'] = {'stat': stat, 'p_val': p_val, 'conclusion': conclusion, 'test_name': test_name, 'normality': normality_bool, 'equal_variance': equal_variances_bool}
   else:  # non-parametric testing
       stat, p_val = mannwhitneyu(*samples)
       test_name = 'Mann-Whitney U Rank Test (Two Independent Samples)'
       conclusion = "The data do not provide sufficient evidence to conclude a significant difference in group distributions, failing to reject the null hypothesis." if p_val > 0.05 else "The data provide sufficient evidence to conclude a significant difference in group distributions, rejecting the null hypothesis."
       output_info['mannwhitneyu'] = {'stat': stat, 'p_val': p_val, 'conclusion': conclusion, 'test_name': test_name, 'normality': normality_bool, 'equal_variance': equal_variances_bool}
else:  # more than two samples
   if normality_bool and equal_variances_bool:
       stat, p_val = f_oneway(*samples)
       test_name = f'One-way ANOVA (with {len(samples)} groups)'
       conclusion = "The data do not provide sufficient evidence to conclude a significant difference among the group means, failing to reject the null hypothesis." if p_val > 0.05 else "The data provide sufficient evidence to conclude a significant difference among the group means, rejecting the null hypothesis."
       output_info['f_oneway'] = {'stat': stat, 'p_val': p_val, 'conclusion': conclusion, 'test_name': test_name, 'normality': normality_bool, 'equal_variance': equal_variances_bool}
   else:
       stat, p_val = kruskal(*samples)
       test_name = f'Kruskal-Wallis H-test (with {len(samples)} groups)'
       conclusion = "The data do not provide sufficient evidence to conclude a significant difference among the group distributions, failing to reject the null hypothesis." if p_val > 0.05 else "The data provide sufficient evidence to conclude a significant difference among the group distributions, rejecting the null hypothesis."
       output_info['kruskal'] = {'stat': stat, 'p_val': p_val, 'conclusion': conclusion, 'test_name': test_name, 'normality': normality_bool, 'equal_variance': equal_variances_bool}

Explanation:
- The function counts the groups defined by the categorical variable and chooses a test from that count and the two assumption flags.
- Two-Group Testing:
  - With exactly two groups, the function runs an independent samples T-Test (parametric) when both the normality and equal variances assumptions hold, and a Mann-Whitney U test (non-parametric) when either one fails.
  - It then builds a conclusion statement from the p-value, failing to reject the null hypothesis when the p-value is above 0.05 and rejecting it otherwise.
- Multiple-Group Testing:
  - With any other number of groups, the function runs a one-way ANOVA (parametric) when both assumptions hold, and a Kruskal-Wallis H test (non-parametric) when either one fails.
  - It builds the conclusion statement in the same way.
- The statistic, p-value, conclusion, test name, and both assumption flags are saved in the output_info dictionary under a key named after the test function that was called ('ttest_ind', 'mannwhitneyu', 'f_oneway', or 'kruskal').
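A condensed, runnable version of this step is sketched below, under stated assumptions. The four test functions are imported from scipy.stats, run_group_test() and the reject_null key are hypothetical, the groups are split with groupby() rather than with unique(), and the guard against fewer than two groups is an addition that the excerpt does not have. The sample values are made up for illustration.

```python
import pandas as pd
from scipy.stats import f_oneway, kruskal, mannwhitneyu, ttest_ind

ALPHA = 0.05  # the threshold hard-coded in the excerpt

# (exactly two groups?, both assumptions hold?) -> (result key, test function)
TESTS = {
    (True, True): ("ttest_ind", ttest_ind),
    (True, False): ("mannwhitneyu", mannwhitneyu),
    (False, True): ("f_oneway", f_oneway),
    (False, False): ("kruskal", kruskal),
}


def run_group_test(df, target_variable, grouping_variable,
                   normality_bool, equal_variances_bool):
    """Hypothetical condensed version of the test-selection step."""
    # groupby() skips missing group labels by default, so a NaN label
    # does not produce an empty sample.
    samples = [values for _, values in df.groupby(grouping_variable)[target_variable]]
    if len(samples) < 2:
        # Added guard: an assumption of this sketch, not in the excerpt.
        raise ValueError("At least two groups are needed for a comparison.")

    parametric = normality_bool and equal_variances_bool
    key, test = TESTS[(len(samples) == 2, parametric)]
    stat, p_val = test(*samples)
    # The null hypothesis is rejected unless p_val > ALPHA, as in the excerpt.
    return {key: {"stat": float(stat), "p_val": float(p_val),
                  "reject_null": bool(p_val <= ALPHA)}}


# Made-up illustration data: two clearly separated groups.
scores = pd.DataFrame({
    "group": ["a"] * 5 + ["b"] * 5,
    "value": [1.0, 2.0, 3.0, 4.0, 5.0, 11.0, 12.0, 13.0, 14.0, 15.0],
})
result = run_group_test(scores, "value", "group", True, True)
print(result)
```

Keeping the mapping in one lookup table means the conclusion logic is written once instead of once per branch.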

Constructing Console Output:

Purpose: Display the results of the hypothesis testing in a clear format.

# construct console output and return
print(f"< HYPOTHESIS TESTING: {test_name}>\nBased on:\n  ➡ Normality assumption: {'✔' if normality_bool else '✘'}\n  ➡ Equal variances assumption: {'✔' if equal_variances_bool else '✘'}\n  ➡ Nr. of Groups: {len(samples)} groups\n  ∴ predict_hypothesis() is performing {test_name}:\n")
print(f"Results of {test_name}:\n  ➡ statistic: {stat}\n  ➡ p-value: {p_val}\n  ∴ Conclusion: {conclusion}\n")
return output_info

Explanation:
- The function prints two formatted blocks: a header with the test name, both assumptions, and the number of groups, followed by the statistic, p-value, and conclusion.
- A check mark or a cross shows whether each assumption was passed in as met.
- The function then returns the output_info dictionary containing all the relevant information.
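Because output_info holds a single entry whose key depends on the test that ran, a caller has to look up that key before reading the results. A minimal sketch of that step follows; unpack_result() is a hypothetical helper and the dictionary values are placeholders, not real test output.

```python
def unpack_result(output_info):
    """Return (test_key, details) from a single-entry result dictionary."""
    if len(output_info) != 1:
        raise ValueError("Expected exactly one test entry.")
    test_key = next(iter(output_info))
    return test_key, output_info[test_key]


# Placeholder values for illustration only.
example_output = {
    "kruskal": {
        "stat": 1.0,
        "p_val": 0.5,
        "conclusion": "...",
        "test_name": "Kruskal-Wallis H-test (with 3 groups)",
        "normality": False,
        "equal_variance": True,
    }
}

test_key, details = unpack_result(example_output)
print(test_key, details["p_val"])
```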

Link to Full Code: predict_hypothesis.py.

ETA444 / datasafari

Develop hypothesis_predictor_core_n() for predict_hypothesis() #89