ETA444 / datasafari

DataSafari simplifies complex data science tasks into straightforward, powerful one-liners.
https://datasafari.dev
GNU General Public License v3.0

Develop hypothesis_predictor_core_c() for predict_hypothesis() #96

Closed ETA444 closed 4 months ago

ETA444 commented 5 months ago

Title: Categorical Hypothesis Testing Function - Back End Core of predict_hypothesis()

Description: The hypothesis_predictor_core_c function currently conducts categorical hypothesis testing using contingency tables and appropriate statistical tests. This enhancement aims to improve the function's flexibility, error handling, output clarity, and informative tips for users.

Proposed Changes:

Expected Outcome: With this enhancement, the hypothesis_predictor_core_c function will become a more robust and user-friendly tool for conducting categorical hypothesis testing using contingency tables. It will provide clearer guidance on which statistical tests suit the characteristics of the data, helping users make informed decisions in their data analysis tasks.

Additional Context: Improving the functionality of the hypothesis_predictor_core_c function aligns with our goal of providing high-quality statistical analysis tools that support efficient and reliable data analysis workflows. This enhancement addresses common needs and challenges encountered in categorical hypothesis testing scenarios involving contingency tables, contributing to the overall usability and effectiveness of our data analysis toolkit.

ETA444 commented 4 months ago

Implement hypothesis_predictor_core_c() Function for predict_hypothesis()

Implementation Summary:

The hypothesis_predictor_core_c() function conducts categorical hypothesis testing using contingency tables and appropriate statistical tests. The function assesses the association between two categorical variables by applying a series of statistical tests based on the shape of the contingency table, the minimum expected and observed frequencies, and specific methodological preferences, including Yates' correction for chi-square tests and alternatives for exact tests.
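
The summary above does not show how the viability flags passed to the function are decided. A minimal sketch of one plausible rule set follows. The helper name `derive_viability_flags`, the single `exact_tests_viability` flag, and both thresholds are assumptions for illustration (a minimum expected frequency of 5 is a common rule of thumb, and the sample size of 40 mirrors the tip text in the code below); the actual logic lives in predict_hypothesis.py and may differ.

```python
import pandas as pd
from scipy.stats.contingency import expected_freq


def derive_viability_flags(contingency_table: pd.DataFrame) -> dict:
    """Hypothetical helper: choose tests from table shape and frequencies."""
    expected = expected_freq(contingency_table)  # expected counts under independence
    is_2x2 = contingency_table.shape == (2, 2)
    sample_size = int(contingency_table.to_numpy().sum())

    # Assumed rule of thumb: chi-square needs every expected count >= 5.
    chi2_ok = bool(expected.min() >= 5)
    return {
        'chi2_viability': chi2_ok,
        # Assumed: Yates' correction only for small 2x2 tables.
        'yates_correction_viability': chi2_ok and is_2x2 and sample_size < 40,
        # Assumed: exact tests as the fallback for 2x2 tables.
        'exact_tests_viability': (not chi2_ok) and is_2x2,
    }


small = pd.DataFrame([[8, 2], [1, 5]])      # minimum expected count is below 5
large = pd.DataFrame([[30, 20], [20, 30]])  # every expected count is 25
flags_small = derive_viability_flags(small)
flags_large = derive_viability_flags(large)
print(flags_small)
print(flags_large)
```

Under these assumed rules, the small table is routed to the exact tests and the large one to the uncorrected chi-square test.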

Code Breakdown:

  1. Initial Setup and Error Handling:

    • Purpose: Validate input types and values to ensure correct data is being analyzed.
 # Error Handling
 # TypeErrors
 if not isinstance(contingency_table, pd.DataFrame):
     raise TypeError("predictor_core_c(): The 'contingency_table' parameter must be a pandas DataFrame.")

 if not isinstance(chi2_viability, bool):
     raise TypeError("predictor_core_c(): The 'chi2_viability' parameter must be a boolean.")

 if not isinstance(barnard_viability, bool):
     raise TypeError("predictor_core_c(): The 'barnard_viability' parameter must be a boolean.")

 if not isinstance(boschloo_viability, bool):
     raise TypeError("predictor_core_c(): The 'boschloo_viability' parameter must be a boolean.")

 if not isinstance(fisher_viability, bool):
     raise TypeError("predictor_core_c(): The 'fisher_viability' parameter must be a boolean.")

 if not isinstance(yates_correction_viability, bool):
     raise TypeError("predictor_core_c(): The 'yates_correction_viability' parameter must be a boolean.")

 if not isinstance(alternative, str):
     raise TypeError("predictor_core_c(): The 'alternative' parameter must be a string.")

 # ValueErrors
 if contingency_table.empty:
     raise ValueError("predictor_core_c(): The input contingency table is empty.")

 valid_alternatives = ['two-sided', 'less', 'greater']
 if alternative not in valid_alternatives:
     raise ValueError(f"predictor_core_c(): Invalid 'alternative' value. Expected one of {valid_alternatives}, got '{alternative}'.")
  2. Main Function:

    • Purpose: Perform appropriate categorical tests based on data characteristics and generate results.
 # Main Function
 categorical_variable1 = contingency_table.index.name
 categorical_variable2 = contingency_table.columns.name
 min_expected_frequency = expected_freq(contingency_table).min()
 min_observed_frequency = contingency_table.min().min()
 contingency_table_shape = contingency_table.shape
 sample_size = np.sum(contingency_table.values)

 # define output object
 output_info = {}

 if chi2_viability:
     if yates_correction_viability:
         stat, p_val, dof, expected_frequencies = chi2_contingency(contingency_table, correction=True)
         test_name = f"Chi-square test (with Yates' Correction)"
         chi2_tip = f"\n\n☻ Tip: The Chi-square test of independence with Yates' Correction is used for 2x2 contingency tables with small sample sizes. Yates' Correction makes the test more conservative, reducing the Type I error rate by adjusting for the continuity of the chi-squared distribution. This correction is typically applied when sample sizes are small (often suggested for total sample sizes less than about 40), aiming to avoid overestimation of statistical significance."
         conclusion = f"There is no statistically significant association between {categorical_variable1} and {categorical_variable2} (p = {p_val:.3f})." if p_val > 0.05 else f"There is a statistically significant association between {categorical_variable1} and {categorical_variable2} (p = {p_val:.3f})."
         chi2_output_info = {'stat': stat, 'p_val': p_val, 'conclusion': conclusion, 'yates_correction': yates_correction_viability, 'tip': chi2_tip, 'test_name': test_name}
         output_info['chi2_contingency'] = chi2_output_info
     else:
         stat, p_val, dof, expected_frequencies = chi2_contingency(contingency_table, correction=False)
         test_name = f"Chi-square test (without Yates' Correction)"
         chi2_tip = f"\n\n☻ Tip: The Chi-square test of independence without Yates' Correction is preferred when analyzing larger contingency tables or when sample sizes are sufficiently large, even for 2x2 tables (often suggested for total sample sizes greater than 40). Removing Yates' Correction can increase the test's power by not artificially adjusting for continuity, making it more sensitive to detect genuine associations between variables in settings where the assumptions of the chi-squared test are met."
         conclusion = f"There is no statistically significant association between {categorical_variable1} and {categorical_variable2} (p = {p_val:.3f})." if p_val > 0.05 else f"There is a statistically significant association between {categorical_variable1} and {categorical_variable2} (p = {p_val:.3f})."
         chi2_output_info = {'stat': stat, 'p_val': p_val, 'conclusion': conclusion, 'yates_correction': yates_correction_viability, 'tip': chi2_tip, 'test_name': test_name}
         output_info['chi2_contingency'] = chi2_output_info
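
To make the effect of the `correction` argument concrete, here is a small standalone sketch. The data are made up for illustration, and the table is built with `pd.crosstab`, which is an assumption about how callers might construct it; it is convenient because it sets `index.name` and `columns.name`, the two attributes the function reads for the variable names.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical raw data: 40 observations of two categorical variables.
df = pd.DataFrame({
    'group': ['a'] * 20 + ['b'] * 20,
    'outcome': ['yes'] * 12 + ['no'] * 8 + ['yes'] * 6 + ['no'] * 14,
})

# crosstab names the index and columns after the input Series.
table = pd.crosstab(df['group'], df['outcome'])
print(table.index.name, table.columns.name)

# Same table, with and without Yates' continuity correction.
stat_yates, p_yates, dof, _ = chi2_contingency(table, correction=True)
stat_plain, p_plain, _, _ = chi2_contingency(table, correction=False)

print(f"with Yates:    stat={stat_yates:.3f}, p={p_yates:.3f}")
print(f"without Yates: stat={stat_plain:.3f}, p={p_plain:.3f}")
```

On a 2x2 table the corrected statistic is smaller and the p-value larger, which is the conservatism the tip describes.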
  3. Performing Exact Tests:

    • Purpose: Handle cases where chi-square tests are not suitable, and employ exact tests.
 else:
     if barnard_viability:
         barnard_test = barnard_exact(contingency_table, alternative=alternative.lower())
         barnard_stat = barnard_test.statistic
         barnard_p_val = barnard_test.pvalue
         barnard_test_name = f"Barnard's exact test ({alternative.lower()})"
         barnard_tip = f"\n\n☻ Tip: Barnard's exact test is often preferred for its power, especially in unbalanced designs or with small sample sizes, because it is unconditional and does not fix both table margins as Fisher's test does."
         if alternative.lower() == 'two-sided':
             barnard_conclusion = f"There is no statistically significant association between {categorical_variable1} and {categorical_variable2} (p = {barnard_p_val:.3f})." if barnard_p_val > 0.05 else f"There is a statistically significant association between {categorical_variable1} and {categorical_variable2} (p = {barnard_p_val:.3f})."
         elif alternative.lower() == 'less':
             barnard_conclusion = f"The data do not support a statistically significant decrease in the frequency of {categorical_variable1} compared to {categorical_variable2} (p = {barnard_p_val:.3f})." if barnard_p_val > 0.05 else f"The data support a statistically significant decrease in the frequency of {categorical_variable1} compared to {categorical_variable2} (p = {barnard_p_val:.3f})."
         elif alternative.lower() == 'greater':
             barnard_conclusion = f"The data do not support a statistically significant increase in the frequency of {categorical_variable1} compared to {categorical_variable2} (p = {barnard_p_val:.3f})." if barnard_p_val > 0.05 else f"The data support a statistically significant increase in the frequency of {categorical_variable1} compared to {categorical_variable2} (p = {barnard_p_val:.3f})."

         # consolidate info for output
         barnard_output_info = {'stat': barnard_stat, 'p_val': barnard_p_val, 'conclusion': barnard_conclusion, 'alternative': alternative.lower(), 'tip': barnard_tip, 'test_name': barnard_test_name}
         output_info['barnard_exact'] = barnard_output_info
    • Explanation:
    • The function uses Barnard's exact test and constructs appropriate output based on the chosen alternative hypothesis.
    • The results are saved in the output_info dictionary.
    
     if boschloo_viability:
         boschloo_test = boschloo_exact(contingency_table, alternative=alternative.lower())
         boschloo_stat = boschloo_test.statistic
         boschloo_p_val = boschloo_test.pvalue
         boschloo_test_name = f"Boschloo's exact test ({alternative.lower()})"
         boschloo_tip = f"\n\n☻ Tip: Boschloo's exact test is an extension that increases power by combining the strengths of Fisher's and Barnard's tests, focusing on the most extreme probabilities."
         if alternative.lower() == 'two-sided':
             boschloo_conclusion = f"There is no statistically significant association between {categorical_variable1} and {categorical_variable2} (p = {boschloo_p_val:.3f})." if boschloo_p_val > 0.05 else f"There is a statistically significant association between {categorical_variable1} and {categorical_variable2} (p = {boschloo_p_val:.3f})."
         elif alternative.lower() == 'less':
             boschloo_conclusion = f"The data do not support a statistically significant decrease in the frequency of {categorical_variable1} compared to {categorical_variable2} (p = {boschloo_p_val:.3f})." if boschloo_p_val > 0.05 else f"The data support a statistically significant decrease in the frequency of {categorical_variable1} compared to {categorical_variable2} (p = {boschloo_p_val:.3f})."
         elif alternative.lower() == 'greater':
             boschloo_conclusion = f"The data do not support a statistically significant increase in the frequency of {categorical_variable1} compared to {categorical_variable2} (p = {boschloo_p_val:.3f})." if boschloo_p_val > 0.05 else f"The data support a statistically significant increase in the frequency of {categorical_variable1} compared to {categorical_variable2} (p = {boschloo_p_val:.3f})."

         # consolidate info for output
         boschloo_output_info = {'stat': boschloo_stat, 'p_val': boschloo_p_val, 'conclusion': boschloo_conclusion, 'alternative': alternative.lower(), 'tip': boschloo_tip, 'test_name': boschloo_test_name}
         output_info['boschloo_exact'] = boschloo_output_info

     if fisher_viability:
         fisher_test = fisher_exact(contingency_table, alternative=alternative.lower())
         fisher_stat = fisher_test[0]
         fisher_p_val = fisher_test[1]
         fisher_test_name = f"Fisher's exact test ({alternative.lower()})"
         fisher_tip = f"\n☻ Tip: Fisher's exact test is traditionally used for small sample sizes, providing exact p-values under the null hypothesis of independence."
         if alternative.lower() == 'two-sided':
             fisher_conclusion = f"There is no statistically significant association between {categorical_variable1} and {categorical_variable2} (p = {fisher_p_val:.3f})." if fisher_p_val > 0.05 else f"There is a statistically significant association between {categorical_variable1} and {categorical_variable2} (p = {fisher_p_val:.3f})."
         elif alternative.lower() == 'less':
             fisher_conclusion = f"The data do not support a statistically significant decrease in the frequency of {categorical_variable1} compared to {categorical_variable2} (p = {fisher_p_val:.3f})." if fisher_p_val > 0.05 else f"The data support a statistically significant decrease in the frequency of {categorical_variable1} compared to {categorical_variable2} (p = {fisher_p_val:.3f})."
         elif alternative.lower() == 'greater':
             fisher_conclusion = f"The data do not support a statistically significant increase in the frequency of {categorical_variable1} compared to {categorical_variable2} (p = {fisher_p_val:.3f})." if fisher_p_val > 0.05 else f"The data support a statistically significant increase in the frequency of {categorical_variable1} compared to {categorical_variable2} (p = {fisher_p_val:.3f})."

         # consolidate info for output
         fisher_output_info = {'stat': fisher_stat, 'p_val': fisher_p_val, 'conclusion': fisher_conclusion, 'alternative': alternative.lower(), 'tip': fisher_tip, 'test_name': fisher_test_name}
         output_info['fisher_exact'] = fisher_output_info
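
As a quick way to see how the three exact tests relate, the sketch below runs each of them on the same 2x2 table through SciPy. The counts are made up for illustration, and the `p_values` dictionary is a hypothetical stand-in for the richer `output_info` structure used in the function.

```python
import numpy as np
from scipy.stats import barnard_exact, boschloo_exact, fisher_exact

# Hypothetical 2x2 table with small counts, where chi-square is questionable.
table = np.array([[8, 2],
                  [1, 5]])
alternative = 'two-sided'

# barnard_exact and boschloo_exact return result objects exposing
# .statistic and .pvalue; fisher_exact unpacks as (odds ratio, p-value).
barnard_res = barnard_exact(table, alternative=alternative)
boschloo_res = boschloo_exact(table, alternative=alternative)
_, fisher_p = fisher_exact(table, alternative=alternative)

p_values = {
    'barnard_exact': float(barnard_res.pvalue),
    'boschloo_exact': float(boschloo_res.pvalue),
    'fisher_exact': float(fisher_p),
}
for name, p in p_values.items():
    print(f"{name}: p = {p:.3f}")
```

The p-values generally differ across the three tests, which is why the function reports every viable test and leaves the final choice to the user.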

  4. Constructing Console Output:

    • Purpose: Display the results of the hypothesis testing in a clear format.
 # console output & return
 generate_test_names = [info['test_name'] for info in output_info.values()]
 print(f"< HYPOTHESIS TESTING: {generate_test_names}>\nBased on:\n  ➡ Sample size: {sample_size}\n  ➡ Minimum Observed Frequency: {min_observed_frequency}\n  ➡ Minimum Expected Frequency: {min_expected_frequency}\n  ➡ Contingency table shape: {contingency_table_shape[0]}x{contingency_table_shape[1]}\n  ∴ Performing {generate_test_names}:")
 # show the general tip only when more than one test was performed
 if len(output_info) > 1:
     print("\n☻ Tip: Consider the tips provided for each test and assess which of the exact tests provided is most suitable for your data.")
 else:
     print('')
 for info in output_info.values():
     print(f"Results of {info['test_name']}:\n  ➡ statistic: {info['stat']}\n  ➡ p-value: {info['p_val']}\n  ∴ Conclusion: {info['conclusion']}{info['tip']}\n")
 return output_info

Link to Full Code: predict_hypothesis.py.