Develop hypothesis_predictor_core_c() for predict_hypothesis()

Implement `hypothesis_predictor_core_c()` Function for `predict_hypothesis()`

Implementation Summary:

The hypothesis_predictor_core_c() function performs categorical hypothesis testing on a contingency table. It assesses the association between two categorical variables by running the statistical tests that suit the table's shape and its minimum expected and observed frequencies. It also honors methodological preferences such as Yates' correction for chi-square tests and the alternative hypothesis for exact tests.
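As a minimal sketch of the kind of input the function expects, the snippet below builds a contingency table with `pd.crosstab` and computes the descriptive quantities the function reports. The data and the column names `smoker` and `exercise` are made up for illustration, and the assumption that `expected_freq` comes from `scipy.stats.contingency` is not confirmed by this write-up.

```python
import numpy as np
import pandas as pd
from scipy.stats.contingency import expected_freq

# Hypothetical raw data: one row per respondent.
df = pd.DataFrame({
    "smoker":   ["yes", "yes", "no", "no", "no", "yes", "no", "no", "yes", "no"],
    "exercise": ["low", "low", "high", "high", "low", "low", "high", "low", "high", "high"],
})

# pd.crosstab names the index and the columns after the input Series,
# so the variable names travel with the table.
contingency_table = pd.crosstab(df["smoker"], df["exercise"])

# Descriptive quantities of the kind the function reports.
min_expected_frequency = expected_freq(contingency_table).min()
min_observed_frequency = contingency_table.min().min()
sample_size = int(np.sum(contingency_table.values))

print(contingency_table)
print("variables:", contingency_table.index.name, "/", contingency_table.columns.name)
print("shape:", contingency_table.shape, "sample size:", sample_size)
print("min observed:", min_observed_frequency, "min expected:", min_expected_frequency)
```

Building the table this way means no separate arguments for the variable names are needed later.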

Code Breakdown:

Initial Setup and Error Handling:
- Purpose: Validate input types and values to ensure correct data is being analyzed.

 # Error Handling
 # TypeErrors
 if not isinstance(contingency_table, pd.DataFrame):
     raise TypeError("predictor_core_c(): The 'contingency_table' parameter must be a pandas DataFrame.")

 if not isinstance(chi2_viability, bool):
     raise TypeError("predictor_core_c(): The 'chi2_viability' parameter must be a boolean.")

 if not isinstance(barnard_viability, bool):
     raise TypeError("predictor_core_c(): The 'barnard_viability' parameter must be a boolean.")

 if not isinstance(boschloo_viability, bool):
     raise TypeError("predictor_core_c(): The 'boschloo_viability' parameter must be a boolean.")

 if not isinstance(fisher_viability, bool):
     raise TypeError("predictor_core_c(): The 'fisher_viability' parameter must be a boolean.")

 if not isinstance(yates_correction_viability, bool):
     raise TypeError("predictor_core_c(): The 'yates_correction_viability' parameter must be a boolean.")

 if not isinstance(alternative, str):
     raise TypeError("predictor_core_c(): The 'alternative' parameter must be a string.")

 # ValueErrors
 if contingency_table.empty:
     raise ValueError("predictor_core_c(): The input contingency table is empty.")

 valid_alternatives = ['two-sided', 'less', 'greater']
 if alternative not in valid_alternatives:
     raise ValueError(f"predictor_core_c(): Invalid 'alternative' value. Expected one of {valid_alternatives}, got '{alternative}'.")

Explanation:
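- The function checks input types first and raises a TypeError if the table is not a pandas DataFrame, a viability flag is not a boolean, or the alternative is not a string.
- It then validates values and raises a ValueError if the table is empty or the alternative is not one of 'two-sided', 'less', or 'greater'.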
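The validation order can be tried out in isolation. The sketch below is not the project's code: `validate_inputs` and `error_type` are hypothetical helpers that mirror the checks shown above in condensed form, so the behaviour for a few bad inputs is easy to see.

```python
import pandas as pd

VALID_ALTERNATIVES = ("two-sided", "less", "greater")


def validate_inputs(contingency_table, alternative, **flags):
    """Hypothetical helper mirroring the checks above: types first, then values."""
    if not isinstance(contingency_table, pd.DataFrame):
        raise TypeError("The 'contingency_table' parameter must be a pandas DataFrame.")
    for name, value in flags.items():
        if not isinstance(value, bool):
            raise TypeError(f"The '{name}' parameter must be a boolean.")
    if not isinstance(alternative, str):
        raise TypeError("The 'alternative' parameter must be a string.")
    if contingency_table.empty:
        raise ValueError("The input contingency table is empty.")
    if alternative not in VALID_ALTERNATIVES:
        raise ValueError(
            f"Invalid 'alternative' value. Expected one of {VALID_ALTERNATIVES}, got '{alternative}'."
        )


def error_type(*args, **kwargs):
    """Return the name of the exception raised by validate_inputs, or None."""
    try:
        validate_inputs(*args, **kwargs)
    except (TypeError, ValueError) as exc:
        return type(exc).__name__
    return None


table = pd.DataFrame([[8, 2], [1, 5]])

print(error_type([[8, 2], [1, 5]], "two-sided"))            # a plain list is not a DataFrame
print(error_type(table, "two-sided", chi2_viability=1))     # 1 is an int, not a bool
print(error_type(pd.DataFrame(), "two-sided"))              # empty table
print(error_type(table, "Two-Sided"))                       # membership check is case-sensitive
print(error_type(table, "less", chi2_viability=False))      # valid input
```

Note that the membership check is case-sensitive, so a value such as 'Two-Sided' is rejected even though the later code calls `alternative.lower()`.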

Main Function:
- Purpose: Perform appropriate categorical tests based on data characteristics and generate results.

 # Main Function
 categorical_variable1 = contingency_table.index.name
 categorical_variable2 = contingency_table.columns.name
 min_expected_frequency = expected_freq(contingency_table).min()
 min_observed_frequency = contingency_table.min().min()
 contingency_table_shape = contingency_table.shape
 sample_size = np.sum(contingency_table.values)

 # define output object
 output_info = {}

 if chi2_viability:
     if yates_correction_viability:
         stat, p_val, dof, expected_frequencies = chi2_contingency(contingency_table, correction=True)
         test_name = f"Chi-square test (with Yates' Correction)"
         chi2_tip = f"\n\n☻ Tip: The Chi-square test of independence with Yates' Correction is used for 2x2 contingency tables with small sample sizes. Yates' Correction makes the test more conservative, reducing the Type I error rate by adjusting for the continuity of the chi-squared distribution. This correction is typically applied when sample sizes are small (often suggested for total sample sizes less than about 40), aiming to avoid overestimation of statistical significance."
         conclusion = f"There is no statistically significant association between {categorical_variable1} and {categorical_variable2} (p = {p_val:.3f})." if p_val > 0.05 else f"There is a statistically significant association between {categorical_variable1} and {categorical_variable2} (p = {p_val:.3f})."
         chi2_output_info = {'stat': stat, 'p_val': p_val, 'conclusion': conclusion, 'yates_correction': yates_correction_viability, 'tip': chi2_tip, 'test_name': test_name}
         output_info['chi2_contingency'] = chi2_output_info
     else:
         stat, p_val, dof, expected_frequencies = chi2_contingency(contingency_table, correction=False)
         test_name = f"Chi-square test (without Yates' Correction)"
         chi2_tip = f"\n\n☻ Tip: The Chi-square test of independence without Yates' Correction is preferred when analyzing larger contingency tables or when sample sizes are sufficiently large, even for 2x2 tables (often suggested for total sample sizes greater than 40). Removing Yates' Correction can increase the test's power by not artificially adjusting for continuity, making it more sensitive to detect genuine associations between variables in settings where the assumptions of the chi-squared test are met."
         conclusion = f"There is no statistically significant association between {categorical_variable1} and {categorical_variable2} (p = {p_val:.3f})." if p_val > 0.05 else f"There is a statistically significant association between {categorical_variable1} and {categorical_variable2} (p = {p_val:.3f})."
         chi2_output_info = {'stat': stat, 'p_val': p_val, 'conclusion': conclusion, 'yates_correction': yates_correction_viability, 'tip': chi2_tip, 'test_name': test_name}
         output_info['chi2_contingency'] = chi2_output_info

Explanation:
- When chi2_viability is True, the function runs chi2_contingency(), applying Yates' correction only if yates_correction_viability is also True.
- It constructs a conclusion by comparing the p-value against the 0.05 significance level.
- The results, including the test statistic, p-value, conclusion, tip, and test name, are saved in the output_info dictionary under the 'chi2_contingency' key.
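To make the effect of Yates' correction concrete, here is a minimal sketch that runs `scipy.stats.chi2_contingency` on the same 2x2 table with and without the correction. The counts and the names `group` and `outcome` are made up for illustration.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Illustrative 2x2 table; the counts are invented for this example.
table = pd.DataFrame(
    [[12, 5], [6, 14]],
    index=pd.Index(["a", "b"], name="group"),
    columns=pd.Index(["x", "y"], name="outcome"),
)

# Same data, with and without the continuity correction.
stat_c, p_c, dof, expected = chi2_contingency(table, correction=True)
stat_u, p_u, _, _ = chi2_contingency(table, correction=False)

for label, stat, p_val in (("with Yates", stat_c, p_c), ("without Yates", stat_u, p_u)):
    verdict = "no significant association" if p_val > 0.05 else "significant association"
    print(f"{label}: statistic = {stat:.3f}, p = {p_val:.3f} -> {verdict}")
```

The corrected statistic is never larger than the uncorrected one, which is why the corrected test is described as more conservative.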

Performing Exact Tests:

- Purpose: Handle cases where chi-square tests are not suitable, and employ exact tests.

else:
 if barnard_viability:
     barnard_test = barnard_exact(contingency_table, alternative=alternative.lower())
     barnard_stat = barnard_test.statistic
     barnard_p_val = barnard_test.pvalue
     barnard_test_name = f"Barnard's exact test ({alternative.lower()})"
     barnard_tip = f"\n\n☻ Tip: Barnard's exact test is often preferred for its power, especially in unbalanced designs or with small sample sizes, without the need for the continuity correction that Fisher's test applies."
     if alternative.lower() == 'two-sided':
         barnard_conclusion = f"There is no statistically significant association between {categorical_variable1} and {categorical_variable2} (p = {barnard_p_val:.3f})." if barnard_p_val > 0.05 else f"There is a statistically significant association between {categorical_variable1} and {categorical_variable2} (p = {barnard_p_val:.3f})."
     elif alternative.lower() == 'less':
         barnard_conclusion = f"The data do not support a statistically significant decrease in the frequency of {categorical_variable1} compared to {categorical_variable2} (p = {barnard_p_val:.3f})." if barnard_p_val > 0.05 else f"The data support a statistically significant decrease in the frequency of {categorical_variable1} compared to {categorical_variable2} (p = {barnard_p_val:.3f})."
     elif alternative.lower() == 'greater':
         barnard_conclusion = f"The data do not support a statistically significant increase in the frequency of {categorical_variable1} compared to {categorical_variable2} (p = {barnard_p_val:.3f})." if barnard_p_val > 0.05 else f"The data support a statistically significant increase in the frequency of {categorical_variable1} compared to {categorical_variable2} (p = {barnard_p_val:.3f})."

     # consolidate info for output
     barnard_output_info = {'stat': barnard_stat, 'p_val': barnard_p_val, 'conclusion': barnard_conclusion, 'alternative': alternative.lower(), 'tip': barnard_tip, 'test_name': barnard_test_name}
     output_info['barnard_exact'] = barnard_output_info

Explanation:
- The function runs Barnard's exact test and words the conclusion according to the chosen alternative hypothesis.
- The results are saved in the output_info dictionary under the 'barnard_exact' key.
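As a rough illustration of how the alternative hypothesis changes the result, the sketch below calls `scipy.stats.barnard_exact` on one 2x2 table for each of the three alternatives. The counts are illustrative only and are not taken from this project.

```python
from scipy.stats import barnard_exact

# Illustrative 2x2 counts.
table = [[7, 12], [8, 3]]

# Run the same test under each supported alternative hypothesis.
results = {
    alt: barnard_exact(table, alternative=alt)
    for alt in ("two-sided", "less", "greater")
}

for alt, res in results.items():
    verdict = "not significant" if res.pvalue > 0.05 else "significant"
    print(f"{alt:>9}: statistic = {res.statistic:.3f}, p = {res.pvalue:.3f} ({verdict})")
```

The statistic is the same in all three runs; only the p-value depends on the alternative, which is why the conclusion text has to branch on it.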


  if boschloo_viability:
      boschloo_test = boschloo_exact(contingency_table, alternative=alternative.lower())
      boschloo_stat = boschloo_test.statistic
      boschloo_p_val = boschloo_test.pvalue
      boschloo_test_name = f"Boschloo's exact test ({alternative.lower()})"
      boschloo_tip = f"\n\n☻ Tip: Boschloo's exact test is an extension that increases power by combining the strengths of Fisher's and Barnard's tests, focusing on the most extreme probabilities."
      if alternative.lower() == 'two-sided':
          boschloo_conclusion = f"There is no statistically significant association between {categorical_variable1} and {categorical_variable2} (p = {boschloo_p_val:.3f})." if boschloo_p_val > 0.05 else f"There is a statistically significant association between {categorical_variable1} and {categorical_variable2} (p = {boschloo_p_val:.3f})."
      elif alternative.lower() == 'less':
          boschloo_conclusion = f"The data do not support a statistically significant decrease in the frequency of {categorical_variable1} compared to {categorical_variable2} (p = {boschloo_p_val:.3f})." if boschloo_p_val > 0.05 else f"The data support a statistically significant decrease in the frequency of {categorical_variable1} compared to {categorical_variable2} (p = {boschloo_p_val:.3f})."
      elif alternative.lower() == 'greater':
          boschloo_conclusion = f"The data do not support a statistically significant increase in the frequency of {categorical_variable1} compared to {categorical_variable2} (p = {boschloo_p_val:.3f})." if boschloo_p_val > 0.05 else f"The data support a statistically significant increase in the frequency of {categorical_variable1} compared to {categorical_variable2} (p = {boschloo_p_val:.3f})."

      # consolidate info for output
      boschloo_output_info = {'stat': boschloo_stat, 'p_val': boschloo_p_val, 'conclusion': boschloo_conclusion, 'alternative': alternative.lower(), 'tip': boschloo_tip, 'test_name': boschloo_test_name}
      output_info['boschloo_exact'] = boschloo_output_info

  if fisher_viability:
      fisher_test = fisher_exact(contingency_table, alternative=alternative.lower())
      fisher_stat = fisher_test[0]
      fisher_p_val = fisher_test[1]
      fisher_test_name = f"Fisher's exact test ({alternative.lower()})"
      fisher_tip = f"\n☻ Tip: Fisher's exact test is traditionally used for small sample sizes, providing exact p-values under the null hypothesis of independence."
      if alternative.lower() == 'two-sided':
          fisher_conclusion = f"There is no statistically significant association between {categorical_variable1} and {categorical_variable2} (p = {fisher_p_val:.3f})." if fisher_p_val > 0.05 else f"There is a statistically significant association between {categorical_variable1} and {categorical_variable2} (p = {fisher_p_val:.3f})."
      elif alternative.lower() == 'less':
          fisher_conclusion = f"The data do not support a statistically significant decrease in the frequency of {categorical_variable1} compared to {categorical_variable2} (p = {fisher_p_val:.3f})." if fisher_p_val > 0.05 else f"The data support a statistically significant decrease in the frequency of {categorical_variable1} compared to {categorical_variable2} (p = {fisher_p_val:.3f})."
      elif alternative.lower() == 'greater':
          fisher_conclusion = f"The data do not support a statistically significant increase in the frequency of {categorical_variable1} compared to {categorical_variable2} (p = {fisher_p_val:.3f})." if fisher_p_val > 0.05 else f"The data support a statistically significant increase in the frequency of {categorical_variable1} compared to {categorical_variable2} (p = {fisher_p_val:.3f})."

      # consolidate info for output
      fisher_output_info = {'stat': fisher_stat, 'p_val': fisher_p_val, 'conclusion': fisher_conclusion, 'alternative': alternative.lower(), 'tip': fisher_tip, 'test_name': fisher_test_name}
      output_info['fisher_exact'] = fisher_output_info
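The three exact-test branches repeat the same conclusion wording. One possible way to centralize it is sketched below; `build_conclusion` is a hypothetical helper, not part of the project, and the table counts and the names `group` and `outcome` are invented for illustration.

```python
import pandas as pd
from scipy.stats import boschloo_exact, fisher_exact


def build_conclusion(var1, var2, p_val, alternative, alpha=0.05):
    """Hypothetical helper that centralizes the wording used by each exact test."""
    significant = p_val <= alpha  # same cut-off as the `p_val > 0.05` checks above
    if alternative == "two-sided":
        article = "a" if significant else "no"
        return (f"There is {article} statistically significant association "
                f"between {var1} and {var2} (p = {p_val:.3f}).")
    direction = "decrease" if alternative == "less" else "increase"
    verb = "support" if significant else "do not support"
    return (f"The data {verb} a statistically significant {direction} in the "
            f"frequency of {var1} compared to {var2} (p = {p_val:.3f}).")


# Illustrative table with named axes, as the function expects.
table = pd.DataFrame(
    [[7, 12], [8, 3]],
    index=pd.Index(["treated", "control"], name="group"),
    columns=pd.Index(["improved", "unchanged"], name="outcome"),
)
alternative = "less"

# fisher_exact unpacks to (odds ratio, p-value); boschloo_exact exposes .pvalue.
_, fisher_p = fisher_exact(table, alternative=alternative)
boschloo_p = boschloo_exact(table, alternative=alternative).pvalue

for name, p_val in (("Fisher", fisher_p), ("Boschloo", boschloo_p)):
    print(f"{name}: {build_conclusion(table.index.name, table.columns.name, p_val, alternative)}")
```

A helper like this would keep the three branches in step if the wording or the significance level ever changes.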


Constructing Console Output:
- Purpose: Display the results of the hypothesis testing in a clear format.

 # console output & return
 generate_test_names = [info['test_name'] for info in output_info.values()]
 print(f"< HYPOTHESIS TESTING: {generate_test_names}>\nBased on:\n  ➡ Sample size: {sample_size}\n  ➡ Minimum Observed Frequency: {min_observed_frequency}\n  ➡ Minimum Expected Frequency: {min_expected_frequency}\n  ➡ Contingency table shape: {contingency_table_shape[0]}x{contingency_table_shape[1]}\n  ∴ Performing {generate_test_names}:")
 # print a general tip only when more than one test was performed
 if len(output_info) > 1:
     print("\n☻ Tip: Consider the tips provided for each test and assess which of the exact tests provided is most suitable for your data.")
 else:
     print('')
 # print the results of each test in turn
 for info in output_info.values():
     print(f"Results of {info['test_name']}:\n  ➡ statistic: {info['stat']}\n  ➡ p-value: {info['p_val']}\n  ∴ Conclusion: {info['conclusion']}{info['tip']}\n")
 return output_info

Explanation:
- The function constructs a string to display the test names, assumptions, and results.
- The console output clearly indicates the chosen tests, sample size, and conclusions based on the p-values.
- The function then returns the output_info dictionary containing all the relevant information.
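Because the return value is a plain dictionary keyed by test, a caller can tabulate it directly. The sketch below assumes the layout described above, reduced to three of its keys, and fills it with values computed by SciPy on illustrative counts; it does not call the project's function.

```python
import pandas as pd
from scipy.stats import boschloo_exact, fisher_exact

table = [[7, 12], [8, 3]]  # illustrative counts
alternative = "two-sided"

boschloo = boschloo_exact(table, alternative=alternative)
fisher_stat, fisher_p = fisher_exact(table, alternative=alternative)

# Same layout as the returned dictionary, reduced to a few of its keys.
output_info = {
    "boschloo_exact": {
        "test_name": f"Boschloo's exact test ({alternative})",
        "stat": boschloo.statistic,
        "p_val": boschloo.pvalue,
    },
    "fisher_exact": {
        "test_name": f"Fisher's exact test ({alternative})",
        "stat": fisher_stat,
        "p_val": fisher_p,
    },
}

# One row per test, one column per stored field.
summary = pd.DataFrame.from_dict(output_info, orient="index")
print(summary[["stat", "p_val"]].round(3))
```

Keep in mind that 'stat' means different things per test: for Fisher's test SciPy reports an odds ratio, while for Boschloo's test it reports the Fisher p-value used as the test statistic.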

Link to Full Code: predict_hypothesis.py.

ETA444 / datasafari