ETA444 / datasafari

DataSafari simplifies complex data science tasks into straightforward, powerful one-liners.
https://datasafari.dev
GNU General Public License v3.0

Implement new explore_num() method: 'outliers_mahalanobis' #31

Closed: ETA444 closed this issue 7 months ago

ETA444 commented 9 months ago

Description:


Method Functionality Idea:

The outliers_mahalanobis method identifies outliers in numerical variables using the Mahalanobis distance, which measures how far each observation lies from the mean vector in multidimensional space, scaled by the covariance between the variables. Because it accounts for that covariance structure, it is particularly effective for detecting outliers in multivariate data.
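
For reference, the standard definition of the distance and the usual chi-square decision rule can be written as below. The symbols are not taken from the issue and are assumptions for illustration: x is an observation, \mu the mean vector, \Sigma the covariance matrix, p the number of variables, and \alpha the significance level.

```latex
D_M(x) = \sqrt{(x - \mu)^{\top} \Sigma^{-1} (x - \mu)},
\qquad
\text{flag } x \text{ as an outlier if } D_M(x)^2 > \chi^2_{p,\,1-\alpha}
```

The rule relies on the squared distance approximately following a chi-square distribution with p degrees of freedom when the data are roughly multivariate normal, so it is the squared distance that is normally compared with the critical value.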

How it operates:

The method first drops rows with missing values in the selected variables, then computes the mean vector and the inverse of the covariance matrix. It calculates the Mahalanobis distance of each observation with a utility function (calculate_mahalanobis) and compares it with a critical value derived from the chi-square distribution, using as many degrees of freedom as there are variables. Observations whose distance exceeds the critical value are classified as outliers. The method returns a DataFrame containing the rows from the original DataFrame classified as outliers.
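
The steps above can be sketched as a standalone function. This is a minimal illustration, not the datasafari implementation: the name mahalanobis_outliers and the alpha parameter are hypothetical, the distances are computed in one vectorized call rather than row by row, the squared distance is compared with the chi-square critical value, and at least two variables are assumed.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2


def mahalanobis_outliers(df: pd.DataFrame, columns: list, alpha: float = 0.05) -> pd.DataFrame:
    """Hypothetical sketch: rows whose squared Mahalanobis distance exceeds the chi-square cutoff."""
    # Step 1: keep only the selected columns and drop rows with missing values
    data = df[columns].dropna()
    values = data.to_numpy(dtype=float)

    # Step 2: mean vector and inverse covariance matrix (needs two or more columns)
    mean_vector = values.mean(axis=0)
    inv_cov = np.linalg.inv(np.cov(values, rowvar=False))

    # Step 3: squared Mahalanobis distance for every row at once
    centered = values - mean_vector
    d_squared = np.einsum('ij,jk,ik->i', centered, inv_cov, centered)

    # Step 4: chi-square critical value, one degree of freedom per variable
    critical_value = chi2.ppf(1 - alpha, df=len(columns))

    # Step 5: select the flagged rows from the original DataFrame (index preserved)
    return df.loc[data.index[d_squared > critical_value]]


# Synthetic data with one planted extreme point at index 0
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.normal(size=(200, 2)), columns=['Feature1', 'Feature2'])
demo.loc[0] = [10.0, -10.0]
outliers = mahalanobis_outliers(demo, ['Feature1', 'Feature2'])
print(outliers.head())
```

Selecting by index from the original DataFrame keeps every original column in the result, which is one possible reading of "rows from the original DataFrame"; it assumes the index labels are unique.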

Usage:

To detect outliers in numerical variables using the Mahalanobis distance method:

outliers_mahalanobis_df = explore_num(df, numerical_variables, method='outliers_mahalanobis')

This method returns a DataFrame containing the rows from the original DataFrame classified as outliers based on the Mahalanobis distance.

Example:

# Detect outliers using the Mahalanobis distance method for numerical variables 'Feature1' and 'Feature2'
outliers_mahalanobis_df = explore_num(df, ['Feature1', 'Feature2'], method='outliers_mahalanobis')
print(outliers_mahalanobis_df.head())

Notes:


ETA444 commented 7 months ago

Implementation Summary:

The implementation of the outliers_mahalanobis method within the explore_num() function focuses on detecting outliers in the numerical variables of a DataFrame using the Mahalanobis distance. The Mahalanobis distance is a useful metric for identifying multivariate outliers, particularly when the variables are correlated.

Code Breakdown:

  1. Function Definition and Parameters:
def explore_num(
        df: pd.DataFrame,
        numerical_variables: List[str],
        method: str = 'all',
        output: str = 'print',
        threshold_z: int = 3
) -> Optional[str]:
    """
    Analyze numerical variables in a DataFrame for distribution characteristics, outlier detection, normality tests, skewness, kurtosis, correlation analysis, and multicollinearity detection.

    # Full docstring continues...
    """
  2. Method: 'outliers_mahalanobis':
if method.lower() in ['outliers_mahalanobis', 'all']:
    # definitions #
    # use non-na df: data
    data = df.copy()
    data = data[numerical_variables].dropna()

    try:
        # calculate the mean and inverse of the covariance matrix
        mean_vector = data.mean().values
        inv_cov_matrix = inv(np.cov(data, rowvar=False))

        # apply the utility function to calculate Mahalanobis distance for each observation
        data['mahalanobis'] = data.apply(lambda row: calculate_mahalanobis(row.values, mean_vector, inv_cov_matrix), axis=1)

        # determine outliers based on the chi-square distribution
        p_value_threshold = 0.05
        critical_value = chi2.ppf((1 - p_value_threshold), df=len(numerical_variables))

        # classify outliers based on mahalanobis distance relative to critical value
        outliers_mahalanobis_df = data[data['mahalanobis'] > critical_value]

        # clean up df
        data.drop(columns=['mahalanobis'], inplace=True)

        # construct console output
        result.append(f"\n<<______OUTLIERS - MAHALANOBIS METHOD______>>\n")
        result.append(f"Identified outliers based on Mahalanobis distance exceeding the critical value ({critical_value:.2f}) from the chi-square distribution (p-val < {p_value_threshold}).\n")
        result.append(outliers_mahalanobis_df.to_string())

        # appends (continued) #
        # (6-9) method='outliers_mahalanobis' info
        if method.lower() == 'all':
            result.append(f"\n✎ * NOTE: If method='outliers_mahalanobis', aside from the overview above, the function RETURNS:")
            result.append(f"■ 1 - Dataframe: Rows from the original df that were classified as outliers. (preserved index)")
            result.append(f"☻ HOW TO: df = explore_num(yourdf, yourlist, method='outliers_mahalanobis')")

    except np.linalg.LinAlgError as error:
        result.append(f"Error calculating Mahalanobis distance: {error}")
  3. Combining Results and Output:
combined_result = "\n".join(result)

if output.lower() == 'print':
    print(combined_result)

    if method.lower() == 'outliers_mahalanobis':
        return outliers_mahalanobis_df
elif output.lower() == 'return':
    if method.lower() == 'outliers_mahalanobis':
        return outliers_mahalanobis_df
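
The calculate_mahalanobis utility called in the method code is not shown in this comment. A minimal sketch of what such a helper might look like follows, assuming it takes one observation, the mean vector, and the inverse covariance matrix, and returns the non-squared distance. The body is an assumption and the actual datasafari helper may differ; the result is cross-checked against scipy.spatial.distance.mahalanobis.

```python
import numpy as np
from scipy.spatial.distance import mahalanobis


def calculate_mahalanobis(observation, mean_vector, inv_cov_matrix) -> float:
    """Hypothetical helper: non-squared Mahalanobis distance of one observation."""
    diff = np.asarray(observation, dtype=float) - np.asarray(mean_vector, dtype=float)
    return float(np.sqrt(diff @ inv_cov_matrix @ diff))


# Small worked example with a hand-picked covariance matrix
mean = np.array([0.0, 0.0])
inv_cov = np.linalg.inv(np.array([[2.0, 0.5], [0.5, 1.0]]))
point = np.array([1.0, 2.0])

distance = calculate_mahalanobis(point, mean, inv_cov)

# Cross-check against SciPy's implementation of the same quantity
assert np.isclose(distance, mahalanobis(point, mean, inv_cov))
print(distance)
```

If the helper returns the non-squared distance as in this sketch, the value would need to be squared before it is compared with the chi-square critical value. Whether the datasafari helper already returns the squared form is not visible from the code in this thread.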

See the Full Function:

The full implementation can be found in the datasafari repository.