MathMarEcol/pdyer_aus_bio

Selecting top n predictors for clustering #7

Closed: PhDyellow closed this issue 2 years ago

PhDyellow commented 3 years ago

While testing, I noticed that including all of the predictors makes it hard to obtain meaningful pairwise similarities.

The problem is probably caused by estimating the covariance matrix for 28 predictors from only 20 samples. When I attempt to invert the covariance matrix, it is singular (20 samples can only yield a sample covariance of rank at most 19, well short of 28). Even with pseudo-inverses, the determinant is so close to 0 that the Bhattacharyya distances come out as infinite. An infinite distance gives a similarity of 0, and every pair ended up infinitely distant.
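A minimal numpy sketch of the failure mode (illustrative only, not the pipeline code; the Gaussian form of the Bhattacharyya distance and an `exp(-d)` similarity are assumptions here):

```python
import numpy as np

rng = np.random.default_rng(0)

def bhattacharyya(mu1, cov1, mu2, cov2):
    # Bhattacharyya distance between two Gaussians:
    # d = (1/8) (mu1-mu2)' C^-1 (mu1-mu2) + (1/2) ln(det C / sqrt(det C1 * det C2))
    # with C = (C1 + C2) / 2
    cov = (cov1 + cov2) / 2.0
    diff = mu1 - mu2
    term1 = diff @ np.linalg.pinv(cov) @ diff / 8.0   # pseudo-inverse, as in the issue
    _, logdet = np.linalg.slogdet(cov)
    _, logdet1 = np.linalg.slogdet(cov1)
    _, logdet2 = np.linalg.slogdet(cov2)
    term2 = 0.5 * (logdet - 0.5 * (logdet1 + logdet2))
    return term1 + term2

n_samples, n_pred = 20, 28
x1 = rng.normal(size=(n_samples, n_pred))
x2 = rng.normal(size=(n_samples, n_pred))

cov1 = np.cov(x1, rowvar=False)   # rank <= 19, so det(cov1) == 0
cov2 = np.cov(x2, rowvar=False)

print(np.linalg.matrix_rank(cov1))                         # 19, not 28: singular
d = bhattacharyya(x1.mean(axis=0), cov1, x2.mean(axis=0), cov2)
print(d, np.exp(-d))                                       # inf distance, similarity 0
```

The log-determinant term blows up because `det(cov1)` and `det(cov2)` are exactly 0, which matches the all-pairs-infinitely-distant behaviour described above.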

When I selected the top 3 predictors and used only those for calculating the Bhattacharyya distances, the results were much better: the histogram of similarities showed a smooth, roughly normal distribution peaking around 0.6, with some spikes at 1 and 0.

I suspect that including uninformative predictors created directions with essentially zero variance, which made the covariance matrices singular and broke the Bhattacharyya distance calculation.
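This suspicion is easy to reproduce: even with more samples than predictors, a single (near-)constant predictor collapses one direction of the covariance matrix and drives its determinant to zero (again a hypothetical numpy sketch, not project code):

```python
import numpy as np

rng = np.random.default_rng(1)

# 20 samples: 3 informative predictors plus one constant, uninformative column
informative = rng.normal(size=(20, 3))
flat = np.full((20, 1), 5.0)          # zero variance along this direction
x = np.hstack([informative, flat])

cov = np.cov(x, rowvar=False)
sign, logdet = np.linalg.slogdet(cov)
print(np.linalg.matrix_rank(cov))     # 3, not 4: one direction has collapsed
print(sign, logdet)                   # sign 0, logdet -inf, i.e. det == 0
```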

I propose selecting only a subset of predictors: the top n most informative predictors for each GF model.

n can be:

  1. a constant
  2. decided adaptively, by taking the smallest set of top predictors whose importances sum to x% of the total importance (see the sketch below).

Option 2 is analogous to keeping enough PCA components to explain 80% of the total variance.
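A sketch of option 2 (the function name, the `x_frac` parameter, and the predictor names and importance values are all hypothetical; in practice the importances would come from the fitted GF model):

```python
import numpy as np

def select_top_predictors(importances, names, x_frac=0.8):
    # Keep the smallest set of predictors whose importances
    # sum to at least x_frac of the total importance (option 2)
    order = np.argsort(importances)[::-1]            # most important first
    cum = np.cumsum(importances[order]) / importances.sum()
    n = int(np.searchsorted(cum, x_frac)) + 1        # first index reaching x_frac
    return [names[i] for i in order[:n]]

# hypothetical GF importances for five predictors
names = ["sst", "salinity", "depth", "chl_a", "nitrate"]
imp = np.array([0.40, 0.25, 0.15, 0.12, 0.08])
print(select_top_predictors(imp, names))   # ['sst', 'salinity', 'depth']
```

Option 1 is just `order[:n]` with a fixed n.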