ETA444 / datasafari

DataSafari simplifies complex data science tasks into straightforward, powerful one-liners.
https://datasafari.dev
GNU General Public License v3.0

Implement new explore_num() method: 'multicollinearity' #32

Closed ETA444 closed 7 months ago

ETA444 commented 9 months ago

Description:


Method Functionality Idea:

The multicollinearity method checks for multicollinearity among numerical variables using Variance Inflation Factors (VIF). Multicollinearity occurs when independent variables in a regression model are highly correlated with each other, which makes the coefficient estimates unstable and hard to interpret.

How it operates:

The method first prepares the data by dropping every row that has a missing value in any of the selected numerical variables. It then calculates the VIF for each numerical variable using the calculate_vif function. The VIF measures how much the variance of an estimated regression coefficient is inflated by that variable's correlation with the other variables. Higher VIF values indicate stronger multicollinearity concerns. The method returns a DataFrame displaying the VIF values for each numerical variable.
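
For reference, the textbook definition of the VIF can be written as below. This is the standard formula, not something quoted from the datasafari code; here R_j^2 is the coefficient of determination obtained by regressing variable j on all the other numerical variables (with an intercept).

```latex
\mathrm{VIF}_j = \frac{1}{1 - R_j^2}
```

For example, a variable that the others explain with R_j^2 = 0.9 has VIF = 1 / (1 - 0.9) = 10, while a variable that is uncorrelated with the rest (R_j^2 = 0) has VIF = 1.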

Usage:

To check for multicollinearity among numerical variables using VIF:

vif_df = explore_num(df, numerical_variables, method='multicollinearity')
print(vif_df)

This method returns a DataFrame containing the VIF values for each numerical variable.

Example:

# Check for multicollinearity using VIF for numerical variables 'Feature1' and 'Feature2'
vif_df = explore_num(df, ['Feature1', 'Feature2'], method='multicollinearity')
print(vif_df)

Notes:


ETA444 commented 7 months ago

Implementation Summary:

The implementation of the multicollinearity method within the explore_num() function is designed to check for multicollinearity among the numerical variables in a DataFrame. This is done using the Variance Inflation Factor (VIF). Multicollinearity can impact the stability and interpretability of regression models, and this method helps to identify such issues.

Code Breakdown:

  1. Function Definition and Parameters:
import pandas as pd
from typing import List, Optional


def explore_num(
        df: pd.DataFrame,
        numerical_variables: List[str],
        method: str = 'all',
        output: str = 'print',
        threshold_z: int = 3
) -> Optional[str]:
    """
    Analyze numerical variables in a DataFrame for distribution characteristics, outlier detection, normality tests, skewness, kurtosis, correlation analysis, and multicollinearity detection.

    # Full docstring continues...
    """
  2. Method: 'multicollinearity'
if method.lower() in ['multicollinearity', 'all']:
    # work on the selected columns only, dropping rows with missing values
    data = df[numerical_variables].dropna()

    vifs = calculate_vif(data, numerical_variables)
    result.append("\n<<______MULTICOLLINEARITY CHECK - VIF______>>\n")
    result.append(f"Variance Inflation Factors:\n{vifs.to_string()}\n")
    result.append("☻ Tip: VIF > 10 indicates potential multicollinearity concerns.")
  3. Combining Results and Output:
combined_result = "\n".join(result)

if output.lower() == 'print':
    print(combined_result)

    # in print mode, a 'multicollinearity'-only call also returns the VIF table
    if method.lower() == 'multicollinearity':
        return vifs
elif output.lower() == 'return':
    # in return mode, the formatted report string is returned;
    # a second `return vifs` branch here would be unreachable, so it is omitted
    if method.lower() in ['all', 'multicollinearity']:
        return combined_result
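
The calculate_vif helper called in the 'multicollinearity' branch is not shown in this issue. As a rough illustration only, the sketch below shows one way such a helper could compute VIFs. The name calculate_vif_sketch, the output column names, and the synthetic data are all assumptions made for this example, and the real datasafari helper may work differently. It relies on the identity that, for regressions with an intercept, the VIF of each variable equals the matching diagonal entry of the inverse correlation matrix.

```python
import numpy as np
import pandas as pd


def calculate_vif_sketch(data: pd.DataFrame, columns: list) -> pd.DataFrame:
    """Hypothetical stand-in for calculate_vif; NOT the datasafari implementation.

    Expects at least two numerical columns.
    """
    # Listwise deletion, mirroring the dropna() step shown in the issue.
    clean = data[columns].dropna()
    # Correlation matrix of the selected variables (one variable per column).
    corr = np.corrcoef(clean.to_numpy(dtype=float), rowvar=False)
    # Diagonal of the inverse correlation matrix: entry j equals 1 / (1 - R_j^2).
    vifs = np.diag(np.linalg.inv(corr))
    return pd.DataFrame({"Variable": columns, "VIF": vifs})


# Synthetic data (assumed for illustration): Feature2 is almost a copy of Feature1.
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.1, size=200)
x3 = rng.normal(size=200)  # generated independently of the other two
df = pd.DataFrame({"Feature1": x1, "Feature2": x2, "Feature3": x3})

vif_df = calculate_vif_sketch(df, ["Feature1", "Feature2", "Feature3"])
print(vif_df)
```

In this setup Feature1 and Feature2 should come out well above the VIF > 10 rule of thumb, while Feature3 should stay close to 1. Libraries such as statsmodels also ship a variance_inflation_factor function; note that it expects the caller to include a constant column in the design matrix, so its numbers can differ from this sketch if that step is skipped.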

See the Full Function:

The full implementation can be found in the datasafari repository.