ETA444 / datasafari

DataSafari simplifies complex data science tasks into straightforward, powerful one-liners.
https://datasafari.dev
GNU General Public License v3.0

Implement new calculator util: calculate_mahalanobis() #14

Closed ETA444 closed 7 months ago

ETA444 commented 9 months ago

Description:


Method Functionality Idea:

The calculate_mahalanobis function calculates the Mahalanobis distance of an observation from a distribution. It is intended for use in the explore_num method, which relies on this calculator for outlier detection.

How it operates:

The Mahalanobis distance is a measure of the distance between a point and a distribution, considering the covariance among variables. This function computes the Mahalanobis distance of a single observation from the mean of a distribution, given the inverse of the covariance matrix of the distribution.
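
In symbols, this corresponds to the standard definition below. The symbol names are chosen here for illustration and do not appear in the issue: $x$ is the observation, $\mu$ the mean vector, and $\Sigma^{-1}$ the inverse of the covariance matrix.

```latex
D_M(x) = \sqrt{(x - \mu)^{\top} \, \Sigma^{-1} \, (x - \mu)}
```

When $\Sigma$ is the identity matrix, this reduces to the ordinary Euclidean distance between $x$ and $\mu$.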

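Parameters:

    • x (numpy.ndarray or pandas.Series): A 1D array of the observation or a single row from a DataFrame.
    • mean (numpy.ndarray): The mean vector of the distribution. Must be 1D and of the same length as x.
    • inv_cov_matrix (numpy.ndarray): The inverse of the covariance matrix of the distribution. Must be square, with size matching the number of elements in x.

Returns:

    • float: The Mahalanobis distance of the observation x from the distribution defined by mean and inv_cov_matrix.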

Examples:

mean_vector = np.array([0, 0])
observation = np.array([1, 1])
cov_matrix = np.array([[1, 0.5], [0.5, 1]])
inv_cov_matrix = np.linalg.inv(cov_matrix)
calculate_mahalanobis(observation, mean_vector, inv_cov_matrix)
# Output: approximately 1.1547 (the square root of 4/3)
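
Worked by hand under the standard definition (the square root of the quadratic form), the example above evaluates as follows. Here $x - \mu = (1, 1)^{\top}$ and $\Sigma$ is cov_matrix; this is a check of the arithmetic, not output copied from the library.

```latex
\Sigma^{-1}
  = \frac{1}{1 - 0.5^{2}}
    \begin{pmatrix} 1 & -0.5 \\ -0.5 & 1 \end{pmatrix}
  = \begin{pmatrix} 4/3 & -2/3 \\ -2/3 & 4/3 \end{pmatrix}

(x - \mu)^{\top} \Sigma^{-1} (x - \mu)
  = \tfrac{4}{3} - \tfrac{2}{3} - \tfrac{2}{3} + \tfrac{4}{3}
  = \tfrac{4}{3},
\qquad
D_M = \sqrt{4/3} \approx 1.1547
```
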
ETA444 commented 7 months ago

Implementation Summary:

The calculate_mahalanobis() function calculates the Mahalanobis distance for an observation from a distribution, which is useful for identifying how far an observation is from the mean, considering covariance among variables.

Purpose:

The function's purpose is to compute the Mahalanobis distance, which measures how many standard deviations an observation is from the mean of a distribution, taking into account the correlations among variables.

Code Breakdown:

  1. Purpose of the Function:

    • Purpose: To calculate the Mahalanobis distance for an observation from a distribution.
    def calculate_mahalanobis(
       x: Union[np.ndarray, pd.Series],
       mean: np.ndarray,
       inv_cov_matrix: np.ndarray
    ) -> float:
    • The Mahalanobis distance is effective for determining how far an observation is from the mean of a distribution, considering the covariance among variables.
  2. Parameter Definitions:

    • Purpose: To define the function's parameters.
    Parameters
    ----------
    x : numpy.ndarray or pandas.Series
       A 1D array of the observation or a single row from a DataFrame.
    mean : numpy.ndarray
       The mean vector of the distribution from which distances are calculated.
       Must be 1D and of the same length as `x`.
    inv_cov_matrix : numpy.ndarray
       The inverse of the covariance matrix of the distribution. This matrix
       must be square and its size should match the number of elements in `x`.
  3. Return Definition:

    • Purpose: To define the function's return type.
    Returns
    -------
    float
       The Mahalanobis distance of the observation `x` from the distribution
       defined by `mean` and `inv_cov_matrix`.
  4. Raise Definitions:

    • Purpose: To define the exceptions the function can raise.
    Raises
    ------
    ValueError
       If `x` and `mean` do not have the same length.
    LinAlgError
       If the inverse covariance matrix is singular and cannot be used for
       distance calculation.
  5. Check Lengths of x and mean:

    • Purpose: To ensure x and mean have the same length.
    if len(x) != len(mean):
       raise ValueError("The observation and mean must have the same length.")
  6. Calculate Mahalanobis Distance:

  7. Return Result:

    • Purpose: To return the computed distance.
    return distance
  8. Examples:

    • Purpose: To provide examples of how to use the function.
    Examples
    --------
    >>> mean_vector = np.array([0, 0])
    >>> observation = np.array([1, 1])
    >>> cov_matrix = np.array([[1, 0.5], [0.5, 1]])
    >>> inv_cov_matrix = np.linalg.inv(cov_matrix)
    >>> calculate_mahalanobis(observation, mean_vector, inv_cov_matrix)
    1.1547...
  9. Notes:

    • Purpose: To provide additional context and applications for the function.
    Notes
    -----
    The Mahalanobis distance is widely used in outlier detection and cluster analysis.
    It is scale-invariant and takes into account the correlations of the data set.
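
The body of the "Calculate Mahalanobis Distance" step is not shown in the breakdown above. A minimal sketch of how the documented pieces could fit together follows. It assumes the distance is the square root of the quadratic form; the name calculate_mahalanobis_sketch is hypothetical, and this is not the code from the datasafari repository.

```python
import numpy as np


def calculate_mahalanobis_sketch(x, mean, inv_cov_matrix):
    """Hypothetical stand-in for calculate_mahalanobis(); not the library's code."""
    # np.asarray also accepts a pandas.Series, so pandas is not imported here.
    x = np.asarray(x, dtype=float)
    mean = np.asarray(mean, dtype=float)
    inv_cov_matrix = np.asarray(inv_cov_matrix, dtype=float)

    # Length check quoted in the breakdown: observation and mean must align.
    if len(x) != len(mean):
        raise ValueError("The observation and mean must have the same length.")

    # Assumed computation: sqrt of (x - mean)^T * inv_cov_matrix * (x - mean).
    diff = x - mean
    distance = float(np.sqrt(diff @ inv_cov_matrix @ diff))

    return distance


# With an identity inverse covariance the result is the Euclidean distance.
print(calculate_mahalanobis_sketch([3.0, 4.0], [0.0, 0.0], np.eye(2)))  # 5.0
```

The sketch does not try to reproduce the LinAlgError handling described under Raises. An exactly singular covariance matrix would typically already fail in np.linalg.inv, before a function like this is called.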

See the Full Function:

The full implementation can be found in the datasafari repository.
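
The issue mentions outlier detection in explore_num but does not show how a cutoff is applied there. One common approach, assumed here purely for illustration, is to flag rows whose squared Mahalanobis distance exceeds a chi-square quantile. The names flag_outliers_2d and alpha are hypothetical. The sketch is restricted to two columns because the chi-square quantile with 2 degrees of freedom has the closed form -2 ln(alpha), which avoids a SciPy dependency.

```python
import math

import numpy as np


def flag_outliers_2d(data, alpha=0.025):
    """Flag rows of two-column data by squared Mahalanobis distance (illustrative only)."""
    data = np.asarray(data, dtype=float)
    if data.ndim != 2 or data.shape[1] != 2:
        raise ValueError("This sketch only handles two-column data.")

    # Estimate the distribution from the data itself.
    mean = data.mean(axis=0)
    inv_cov = np.linalg.inv(np.cov(data, rowvar=False))

    # Squared distance for every row: diff_i^T * inv_cov * diff_i.
    diffs = data - mean
    squared = np.einsum("ij,jk,ik->i", diffs, inv_cov, diffs)

    # Upper-tail chi-square quantile for 2 degrees of freedom (assumed cutoff).
    threshold = -2.0 * math.log(alpha)
    return squared > threshold


# A 5x5 grid of points around the origin plus one distant point at (20, 20).
grid = [[i, j] for i in range(-2, 3) for j in range(-2, 3)]
data = np.array(grid + [[20, 20]])
flags = flag_outliers_2d(data)
print(np.flatnonzero(flags))  # [25]
```

Because the mean and covariance are estimated from data that includes the distant point, that point inflates the covariance and pulls its own distance down. Robust estimators are often used for this reason, though whether explore_num does so is not stated in the issue.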