ChristianGoueguel / HotellingEllipse

The HotellingEllipse package helps draw the Hotelling's T-squared ellipse on a PCA or PLS score scatterplot by computing the Hotelling's T-squared statistic and providing the ellipse's coordinates, semi-minor, and semi-major axes lengths.
https://christiangoueguel.github.io/HotellingEllipse/
Other
7 stars 0 forks source link

Issue with Mahalanobis distance computation #3

Closed morelju closed 4 months ago

morelju commented 4 months ago

Hi there, R returns the following when running ellipseParam on a spectral dataset: Error in solve.default(cov, ...) : system is computationally singular: reciprocal condition number = 1.37309e-48 As I understood it, this is related to the value of the tol parameter of the mahalanobis function. My question is: do you have any suggestion on how to adjust this tol parameter within ellipseParam? Thanks!

Julien

ChristianGoueguel commented 4 months ago

Hi @morelju The error message you're encountering indicates that the covariance matrix is computationally singular, which means it's nearly impossible to invert accurately. This issue often arises, for instance, in spectral datasets due to their high dimensionality and multicollinearity. Let me break down the problem and offer some solutions: (1) Cause of the error:

(2) Reasons for singularity:

(3) Recommended solutions: Apply Principal Component Analysis (PCA) to your dataset before using ellipseParam(). Consider other dimensionality reduction techniques (PLS, ICA, etc...) if appropriate for your project.

morelju commented 4 months ago

Hi @ChristianGoueguel Thanks a lot for your explanations! I actually got the error message after applying the PCA - I tried to run the ellipseParam() on the scores. I will try a couple of other dimensionality reduction and see if this solves my problem. I would close the comment once a solution is found, if that's ok with you. Thanks again,

Julien

morelju commented 4 months ago

Hi again @ChristianGoueguel , In the end things work if I drop some PCs before using ellipseParam(), which was a bit counter-intuitive for me as I would have thought that the k parameter of ellipseParam() would handle this. Not sure if I misunderstood it or if this is a bug in your code?

ChristianGoueguel commented 4 months ago

Hi @morelju, As I mentioned in my previous comment, the error message originates from the covariance matrix inversion using the solve() function. The resolution you found by dropping a few principal components before using ellipseParam() confirms that the issue stemmed from principal components in your scores dataset with very low variance. Retaining only the principal components that effectively summarize your dataset is indeed a good practice.

To document and illustrate the error you encountered, I've conducted a few tests:

  1. Loading the spectral dataset:
data("specData", package = "HotellingEllipse")
  1. Running PCA on the spectral dataset:
pca_mod <- specData |>
     dplyr::select(tidyselect::where(is.numeric)) |>
     FactoMineR::PCA(scale.unit = FALSE, graph = FALSE)
  1. Retrieving the PCA scores:
pca_scores <- pca_mod |>
     purrr::pluck("ind", "coord") |>
     tibble::as_tibble()
print(pca_scores)
# A tibble: 100 × 5
     Dim.1    Dim.2  Dim.3   Dim.4 Dim.5
     <dbl>    <dbl>  <dbl>   <dbl> <dbl>
 1 25306.  -10831.  -1851.   -83.4 -560.
 2   -67.3   1137.  -2946.  2495.  -568.
 3 -1822.     -22.0 -2305.  1640.  -409.
 4 -1238.    3734.   4039. -2428.   379.
 5  3299.    4727.   -888. -1089.   262.
 6  5006.     -49.5  2534.  1917.  -970.
 7 -8325.   -5607.    960. -3361.   103.
 8 -4955.   -1056.   2510.  -397.  -354.
 9 -1610.    1271.  -2556.  2268.  -760.
10 19582.    2289.    886.  -843.  1483.
# ℹ 90 more rows
# ℹ Use `print(n = ...)` to see more rows

As you can see, pca_scores has 5 components and is used as an example in the package README.

To demonstrate the error conditions, I've added a sixth column, Dim.6, and conducted two tests:

HotellingEllipse::ellipseParam(pca_scores) Error in solve.default(cov, ...) : Lapack routine dgesv: system is exactly singular: U[6,6] = 0


- **Test 2:** `Dim.6` **with near-zero variance.**
```r
tol <- .Machine$double.eps
x <- stats::rnorm(n, mean = 0, sd = sqrt(tol))
pca_scores <- pca_scores |> dplyr::mutate(Dim.6 = x * (sqrt(tol) / stats::sd(x)))

HotellingEllipse::ellipseParam(pca_scores)
Error in solve.default(cov, ...) : 
  system is computationally singular: reciprocal condition number = 5.2279e-24

Key points to note:

Potential future improvements: Both tests were conducted using the default parameters, which means the Mahalanobis distance is computed before the selection of the number of component k; leading to issues when low-variance components are present in the dataset. I could modify the ellipseParam function so that users specify k components before calculating the Mahalanobis distance and add a variance threshold or warning message to avoid the risk of including low-variance components that could lead to singularity issues.

If you have any questions or need further clarification, please don't hesitate to ask. Otherwise, I'll proceed to close this issue.

Thank you for bringing this to my attention,

morelju commented 4 months ago

Hi @ChristianGoueguel All good with me now, thanks again for your clarifications!

ChristianGoueguel commented 3 months ago

An updated version (v1.2.0) is available on CRAN.