Implement new explore_num() method: 'outliers_mahalanobis'

Description:

Method Functionality Idea:

The outliers_mahalanobis method identifies outliers in numerical variables using the Mahalanobis distance, which measures how far each observation lies from the mean vector, scaled by the covariance between the variables. Because it accounts for this covariance structure, it is well suited to detecting outliers in multivariate data, especially when the variables are correlated.

How it operates:

The method first prepares the data by dropping rows with missing values in the selected columns, then calculates the mean vector and the inverse of the covariance matrix. It then computes the Mahalanobis distance for each observation using a utility function (calculate_mahalanobis). An observation is classified as an outlier when its distance exceeds a critical value taken from the chi-square distribution at a significance level of 0.05, with degrees of freedom equal to the number of variables. The method returns a DataFrame containing the rows classified as outliers, with their original index preserved.
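The steps above can be sketched in vectorized form as follows. This is a minimal illustration and not the datasafari implementation: the function name mahalanobis_outlier_mask is hypothetical, and it assumes the squared distance is the quantity compared against the chi-square critical value (for roughly multivariate-normal data, the squared Mahalanobis distance is approximately chi-square distributed). If a helper returns the unsquared distance instead, the comparison would need the square root of the critical value.

```python
import numpy as np
from scipy.stats import chi2


def mahalanobis_outlier_mask(X, alpha=0.05):
    """Return (squared distances, boolean outlier mask) for the rows of X.

    Hypothetical helper for illustration; X is an (n_rows, n_variables) array
    with no missing values.
    """
    X = np.asarray(X, dtype=float)
    mean_vector = X.mean(axis=0)
    # atleast_2d keeps the single-variable case invertible as a 1x1 matrix
    inv_cov = np.linalg.inv(np.atleast_2d(np.cov(X, rowvar=False)))
    centered = X - mean_vector
    # Squared distance per row: (x - mu)^T S^{-1} (x - mu)
    d2 = np.einsum("ij,jk,ik->i", centered, inv_cov, centered)
    critical_value = chi2.ppf(1 - alpha, df=X.shape[1])
    return d2, d2 > critical_value


# Toy data: 200 standard-normal rows plus one planted outlier in the last row
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(size=(200, 2)), [[10.0, 10.0]]])
d2, is_outlier = mahalanobis_outlier_mask(X)
print(bool(is_outlier[-1]), int(is_outlier.sum()))
```

With a 0.05 threshold, a small share of ordinary rows is also expected to be flagged, so the count printed above will usually be larger than one.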

Usage:

To detect outliers in numerical variables using the Mahalanobis distance method:

outliers_mahalanobis_df = explore_num(df, numerical_variables, method='outliers_mahalanobis')

This method returns a DataFrame containing the rows from the original DataFrame classified as outliers based on the Mahalanobis distance.

Example:

# Detect outliers using the Mahalanobis distance method for numerical variables 'Feature1' and 'Feature2'
outliers_mahalanobis_df = explore_num(df, ['Feature1', 'Feature2'], method='outliers_mahalanobis')
print(outliers_mahalanobis_df.head())

Notes:

The Mahalanobis distance method is effective for identifying outliers in multivariate data by considering the covariance structure between variables.
Outliers are observations with Mahalanobis distances exceeding a critical value derived from the chi-square distribution.
If the method parameter is set to 'all', the Mahalanobis outlier rows are included in the printed overview. In the code below, the outlier DataFrame itself is returned only when method='outliers_mahalanobis'.
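To make the critical value concrete, the short sketch below evaluates the chi-square cutoff for one, two, and three variables at the 0.05 threshold used in this issue. The helper name chi_square_cutoff is hypothetical; only scipy.stats.chi2.ppf is a real library call.

```python
from scipy.stats import chi2


def chi_square_cutoff(n_variables, p_value_threshold=0.05):
    """Chi-square critical value with one degree of freedom per variable."""
    return float(chi2.ppf(1 - p_value_threshold, df=n_variables))


# The cutoff grows with the number of variables being analyzed
cutoffs = {k: round(chi_square_cutoff(k), 2) for k in (1, 2, 3)}
print(cutoffs)  # {1: 3.84, 2: 5.99, 3: 7.81}
```

Because the cutoff depends on the number of variables, the same observation may or may not be flagged depending on which columns are passed in numerical_variables.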

Implementation Summary:

The implementation of the outliers_mahalanobis method within the explore_num() function focuses on detecting outliers in the numerical variables of a DataFrame using the Mahalanobis distance. The Mahalanobis distance is a useful metric for identifying multivariate outliers, particularly when the variables are correlated.

Code Breakdown:

Function Definition and Parameters:

def explore_num(
        df: pd.DataFrame,
        numerical_variables: List[str],
        method: str = 'all',
        output: str = 'print',
        threshold_z: int = 3
) -> Optional[str]:
    """
    Analyze numerical variables in a DataFrame for distribution characteristics, outlier detection, normality tests, skewness, kurtosis, correlation analysis, and multicollinearity detection.

    # Full docstring continues...
    """

Purpose: The explore_num() function analyzes numerical variables in a DataFrame.
Parameters: The function takes a DataFrame (df), a list of numerical variables (numerical_variables), a method (method), an output format (output), and a threshold for Z-score outliers (threshold_z).

Method: 'outliers_mahalanobis'

if method.lower() in ['outliers_mahalanobis', 'all']:
    # definitions #
    # use non-na df: data
    data = df.copy()
    data = data[numerical_variables].dropna()

    try:
        # calculate the mean and inverse of the covariance matrix
        mean_vector = data.mean().values
        inv_cov_matrix = inv(np.cov(data, rowvar=False))

        # apply the utility function to calculate Mahalanobis distance for each observation
        data['mahalanobis'] = data.apply(lambda row: calculate_mahalanobis(row.values, mean_vector, inv_cov_matrix), axis=1)

        # determine outliers based on the chi-square distribution
        p_value_threshold = 0.05
        critical_value = chi2.ppf((1 - p_value_threshold), df=len(numerical_variables))

        # classify outliers based on mahalanobis distance relative to critical value
        outliers_mahalanobis_df = data[data['mahalanobis'] > critical_value]

        # clean up df
        data.drop(columns=['mahalanobis'], inplace=True)

        # construct console output
        result.append(f"\n<<______OUTLIERS - MAHALANOBIS METHOD______>>\n")
        result.append(f"Identified outliers based on Mahalanobis distance exceeding the critical value ({critical_value:.2f}) from the chi-square distribution (p-val < {p_value_threshold}).\n")
        result.append(outliers_mahalanobis_df.to_string())

        # appends (continued) #
        # (6-9) method='outliers_mahalanobis' info
        if method.lower() == 'all':
            result.append(f"\n✎ * NOTE: If method='outliers_mahalanobis', aside from the overview above, the function RETURNS:")
            result.append(f"■ 1 - Dataframe: Rows from the original df that were classified as outliers. (preserved index)")
            result.append(f"☻ HOW TO: df = explore_num(yourdf, yourlist, method='outliers_mahalanobis')")

    except np.linalg.LinAlgError as error:
        result.append(f"Error calculating Mahalanobis distance: {error}")

Purpose: This section identifies outliers using the Mahalanobis distance.
Steps:
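- Data Preparation: A copy of the DataFrame is created, restricted to the numerical variables, and rows with missing values are dropped.
- Mean and Covariance: The mean vector and inverse covariance matrix are calculated for the numerical variables.
- Distance Calculation: The Mahalanobis distance is calculated for each observation with calculate_mahalanobis and stored in a temporary 'mahalanobis' column.
- Outlier Identification: Observations whose distance exceeds the chi-square critical value (significance level 0.05, degrees of freedom equal to the number of variables) are classified as outliers.
- Output: The critical value and the outlier rows are appended to the result list for display.
- Error Handling: If the covariance matrix cannot be inverted, the resulting LinAlgError is caught and reported in the output.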
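The calculate_mahalanobis utility is not shown in this issue. As a rough idea of what a per-row helper with the same three arguments could look like, here is a hypothetical stand-in, calculate_mahalanobis_sketch. It assumes the helper returns the squared distance, which is what makes a direct comparison with the chi-square critical value valid; the actual utility may differ.

```python
import numpy as np


def calculate_mahalanobis_sketch(x, mean_vector, inv_cov_matrix):
    """Hypothetical stand-in for calculate_mahalanobis.

    Returns the squared Mahalanobis distance of one observation x
    from mean_vector, given the inverse covariance matrix.
    """
    diff = np.asarray(x, dtype=float) - np.asarray(mean_vector, dtype=float)
    return float(diff @ inv_cov_matrix @ diff)


# With an identity covariance the result is the squared Euclidean distance
d2_identity = calculate_mahalanobis_sketch([3.0, 4.0], [0.0, 0.0], np.eye(2))

# A variance of 4 on the first variable shrinks that axis's contribution
inv_cov = np.linalg.inv(np.diag([4.0, 1.0]))
d2_scaled = calculate_mahalanobis_sketch([2.0, 0.0], [0.0, 0.0], inv_cov)
print(d2_identity, d2_scaled)
```

A helper of this shape can be passed to data.apply(..., axis=1) in the same way as the call in the code above.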

Combining Results and Output:

combined_result = "\n".join(result)

if output.lower() == 'print':
    print(combined_result)

    if method.lower() == 'outliers_mahalanobis':
        return outliers_mahalanobis_df
elif output.lower() == 'return':
    if method.lower() == 'outliers_mahalanobis':
        return outliers_mahalanobis_df

Purpose: This section combines the results and determines the output format.
Output: When output is 'print', the combined results are printed. For the outliers_mahalanobis method, the DataFrame containing the outliers is returned in both cases, whether output is 'print' or 'return'.

See the Full Function:

The full implementation can be found in the datasafari repository.

ETA444 / datasafari