Closed ETA444 closed 7 months ago
Implementation Summary:
The implementation of the outliers_mahalanobis method within the explore_num() function focuses on detecting outliers in the numerical variables of a DataFrame using the Mahalanobis distance. The Mahalanobis distance is a useful metric for identifying multivariate outliers, particularly when the variables are correlated.
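To make the point about correlated variables concrete, here is a minimal sketch. The helper calculate_mahalanobis is not shown in this issue, so the function below is a hypothetical stand-in that assumes the unsquared form sqrt((x - mean)^T S^-1 (x - mean)); the name mahalanobis_distance and the toy covariance matrix are illustrative only and are not taken from datasafari.

```python
import numpy as np


def mahalanobis_distance(x, mean_vector, inv_cov_matrix):
    """Hypothetical helper: unsquared Mahalanobis distance of one observation."""
    diff = np.asarray(x, dtype=float) - np.asarray(mean_vector, dtype=float)
    return float(np.sqrt(diff @ inv_cov_matrix @ diff))


# Toy covariance matrix: two variables with strong positive correlation.
mean_vector = np.array([0.0, 0.0])
cov = np.array([[1.0, 0.9],
                [0.9, 1.0]])
inv_cov_matrix = np.linalg.inv(cov)

along = np.array([2.0, 2.0])     # consistent with the positive correlation
against = np.array([2.0, -2.0])  # contradicts the positive correlation

# Both points are equally far from the mean in Euclidean terms.
euclid_along = float(np.linalg.norm(along - mean_vector))
euclid_against = float(np.linalg.norm(against - mean_vector))

# The Mahalanobis distance penalizes the point that breaks the correlation.
d_along = mahalanobis_distance(along, mean_vector, inv_cov_matrix)
d_against = mahalanobis_distance(against, mean_vector, inv_cov_matrix)

print(f"Euclidean:   {euclid_along:.3f} vs {euclid_against:.3f}")
print(f"Mahalanobis: {d_along:.3f} vs {d_against:.3f}")
```

The two points are indistinguishable by Euclidean distance, while the Mahalanobis distance is several times larger for the point that runs against the correlation. This is the property the method relies on.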
Code Breakdown:
def explore_num(
        df: pd.DataFrame,
        numerical_variables: List[str],
        method: str = 'all',
        output: str = 'print',
        threshold_z: int = 3
) -> Optional[str]:
    """
    Analyze numerical variables in a DataFrame for distribution characteristics, outlier detection, normality tests, skewness, kurtosis, correlation analysis, and multicollinearity detection.
    # Full docstring continues...
    """
The explore_num() function analyzes numerical variables in a DataFrame. It takes a DataFrame (df), a list of numerical variables (numerical_variables), a method (method), an output format (output), and a threshold for Z-score outliers (threshold_z).

    if method.lower() in ['outliers_mahalanobis', 'all']:

        # definitions #
        # use non-na df: data
        data = df.copy()
        data = data[numerical_variables].dropna()

        try:
            # calculate the mean and inverse of the covariance matrix
            mean_vector = data.mean().values
            inv_cov_matrix = inv(np.cov(data, rowvar=False))

            # apply the utility function to calculate Mahalanobis distance for each observation
            data['mahalanobis'] = data.apply(lambda row: calculate_mahalanobis(row.values, mean_vector, inv_cov_matrix), axis=1)

            # determine outliers based on the chi-square distribution
            p_value_threshold = 0.05
            critical_value = chi2.ppf((1 - p_value_threshold), df=len(numerical_variables))

            # classify outliers based on mahalanobis distance relative to critical value
            outliers_mahalanobis_df = data[data['mahalanobis'] > critical_value]

            # clean up df
            data.drop(columns=['mahalanobis'], inplace=True)

            # construct console output
            result.append(f"\n<<______OUTLIERS - MAHALANOBIS METHOD______>>\n")
            result.append(f"Identified outliers based on Mahalanobis distance exceeding the critical value ({critical_value:.2f}) from the chi-square distribution (p-val < {p_value_threshold}).\n")
            result.append(outliers_mahalanobis_df.to_string())

            # appends (continued) #
            # (6-9) method='outliers_mahalanobis' info
            if method.lower() == 'all':
                result.append(f"\n✎ * NOTE: If method='outliers_mahalanobis', aside from the overview above, the function RETURNS:")
                result.append(f"■ 1 - Dataframe: Rows from the original df that were classified as outliers. (preserved index)")
                result.append(f"☻ HOW TO: df = explore_num(yourdf, yourlist, method='outliers_mahalanobis')")

        except np.linalg.LinAlgError as error:
            result.append(f"Error calculating Mahalanobis distance: {error}")
    combined_result = "\n".join(result)

    if output.lower() == 'print':
        print(combined_result)
        if method.lower() == 'outliers_mahalanobis':
            return outliers_mahalanobis_df
    elif output.lower() == 'return':
        if method.lower() == 'outliers_mahalanobis':
            return outliers_mahalanobis_df
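Depending on the output parameter, the results are either printed or returned. For the outliers_mahalanobis method, the code above returns the DataFrame containing the outliers in both cases: after printing the overview when output is 'print', and on its own when output is set to 'return'.

See the Full Function: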
The full implementation can be found in the datasafari repository.
Description:
Method Functionality Idea:
The outliers_mahalanobis method identifies outliers in numerical variables using the Mahalanobis distance, which measures the distance of each observation from the mean in multidimensional space. This method is particularly effective for detecting outliers in multivariate data and takes into account the covariance structure between variables.

How it operates:
The method first prepares the data by removing missing values and calculates the mean vector and the inverse of the covariance matrix. Then, it calculates the Mahalanobis distance for each observation using a utility function (calculate_mahalanobis). Next, it determines outliers based on the Mahalanobis distance exceeding a critical value derived from the chi-square distribution. Observations with Mahalanobis distances greater than the critical value are classified as outliers. The method returns a DataFrame containing the rows from the original DataFrame classified as outliers.

Usage:
To detect outliers in numerical variables using the Mahalanobis distance method:

    df = explore_num(yourdf, yourlist, method='outliers_mahalanobis')

This method returns a DataFrame containing the rows from the original DataFrame classified as outliers based on the Mahalanobis distance.
Example:
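A minimal, self-contained sketch of the steps described above, run on synthetic data. It does not call explore_num(); the function name mahalanobis_outliers and the sample data are hypothetical. One assumption to note: the sketch compares the squared Mahalanobis distance with the chi-square critical value, which is the textbook form for approximately multivariate normal data. Whether calculate_mahalanobis in datasafari returns the squared or unsquared distance is not shown in this issue.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2


def mahalanobis_outliers(df, numerical_variables, p_value_threshold=0.05):
    """Hypothetical stand-in for the method='outliers_mahalanobis' branch."""
    # Keep only the requested columns and drop rows with missing values.
    data = df[numerical_variables].dropna()
    values = data.to_numpy(dtype=float)

    # Center the data and invert the sample covariance matrix.
    diff = values - values.mean(axis=0)
    inv_cov = np.linalg.inv(np.cov(values, rowvar=False))

    # Squared Mahalanobis distance per row: diff_i^T @ inv_cov @ diff_i.
    d_squared = np.einsum("ij,jk,ik->i", diff, inv_cov, diff)

    # Critical value from the chi-square distribution, one degree of
    # freedom per variable.
    critical_value = chi2.ppf(1 - p_value_threshold, df=len(numerical_variables))

    # Boolean indexing preserves the original index of the outlier rows.
    return data[d_squared > critical_value]


# Synthetic, positively correlated data.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 0.8 * x + rng.normal(scale=0.3, size=200)

# Row 200 contradicts the correlation; row 201 has a missing value.
x = np.append(x, [3.0, np.nan])
y = np.append(y, [-3.0, 1.0])
sample = pd.DataFrame({"x": x, "y": y})

outliers = mahalanobis_outliers(sample, ["x", "y"])
print(outliers)
```

The row at index 200 is flagged because it runs against the correlation between x and y, and the row with the missing value never reaches the distance calculation. With a 0.05 threshold, a few ordinary rows are usually flagged as well, since about 5% of well-behaved observations are expected to exceed the critical value.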
Notes:
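When the method parameter is set to 'all', the Mahalanobis results are included in the combined overview, together with a note on how to obtain the outlier rows. In the code shown above, the DataFrame of outlier rows is itself returned only when method='outliers_mahalanobis'.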
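A further note on the except branch in the code above: np.linalg.inv raises np.linalg.LinAlgError when the covariance matrix is exactly singular, for instance when one variable is an exact multiple of another. As far as the excerpt shows, outliers_mahalanobis_df is never assigned in that case, so the later return statement would fail when method='outliers_mahalanobis'. The sketch below illustrates the error and one possible fallback, the Moore-Penrose pseudo-inverse. The fallback is an assumption for illustration and is not part of the datasafari code shown here.

```python
import numpy as np

# Covariance matrix of two variables where the second is exactly twice
# the first, which makes the matrix singular.
singular_cov = np.array([[1.0, 2.0],
                         [2.0, 4.0]])

try:
    inv_cov_matrix = np.linalg.inv(singular_cov)
    used_fallback = False
except np.linalg.LinAlgError as error:
    # Assumption: fall back to the pseudo-inverse instead of giving up.
    print(f"Error calculating inverse: {error}")
    inv_cov_matrix = np.linalg.pinv(singular_cov)
    used_fallback = True

print(f"Used pseudo-inverse fallback: {used_fallback}")
```

Note that np.linalg.inv only raises for exactly singular matrices. A nearly singular covariance matrix can be inverted without an error while producing numerically unstable distances, so highly collinear variables deserve a check before this method is applied.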