elki-project / elki

ELKI Data Mining Toolkit
https://elki-project.github.io/
GNU Affero General Public License v3.0

Distance-based cluster evaluation algorithms fail if input numbers are too big #76

Closed bastian-wur closed 4 years ago

bastian-wur commented 4 years ago

Hi everyone,

I'm currently trying to cluster a matrix. The values in the matrix are very large (the biggest is about 10e+300), and the matrix is also quite dense. K-means produced results, but all internal cluster evaluation algorithms failed to produce anything useful. These are the results from k-means with k=4:

Distance-based Davies Bouldin Index 0.0
Distance-based Density Based Clustering Validation NaN
Distance-based C-Index 1.0
Distance-based PBM-Index NaN
Distance-based Silhouette +-NaN NaN
Distance-based Simp. Silhouette +-NaN NaN
Distance-based Mean distance Infinity
Distance-based Sum of Squares Infinity
Distance-based RMSD Infinity
Distance-based Variance Ratio Criteria NaN
# Concordance
Concordance Gamma 0.9999772178605122
Concordance Tau 0.04571359658825246

In the meantime, I do the clustering only on the exponents (so 10e+300 becomes 300), and I now get useful output. So... I have no idea what is causing this, but I guess something should warn the user.

kno10 commented 4 years ago

At 1e+300 I am not at all surprised that k-means fails to produce results: you simply exceed the range of double-precision floating point. K-means by definition minimizes squared errors, and the square of 1e+300 exceeds the floating-point range; the resulting overflow likely causes the NaN and infinite values you see.
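A minimal sketch of the failure mode (shown in Python for brevity; the same IEEE 754 double behavior applies to ELKI's Java code): squaring 1e+300 overflows the double range to infinity, and arithmetic on those infinities then produces NaN.

```python
import math

x = 1e300
sq = x * x       # exceeds the double maximum (~1.8e308) -> inf
diff = sq - sq   # inf - inf is undefined -> nan

print(sq)    # inf
print(diff)  # nan

assert math.isinf(sq)
assert math.isnan(diff)
```

This is exactly the pattern of a variance-style computation (sum of squares minus a squared mean), which is why indices such as the Variance Ratio Criteria report NaN while plain sums report Infinity.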

Closing as won't fix: supporting arbitrary precision would ruin performance, so this is not going to happen. Instead, rescale your data: either transform to log space (at this magnitude, this may or may not be more meaningful), or simply multiply by a constant such as 1e-300.
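The two suggested workarounds can be sketched as follows (hypothetical input values; the log transform assumes strictly positive data):

```python
import math

values = [1e300, 5e299, 2e150, 3.0]   # hypothetical large inputs

# Option 1: rescale by a constant, e.g. 1e-300
scaled = [v * 1e-300 for v in values]

# Option 2: work in log space (requires strictly positive values)
logged = [math.log10(v) for v in values]

# squared errors now stay within the double range
assert all(math.isfinite(v * v) for v in scaled)
assert all(math.isfinite(v * v) for v in logged)
```

Note that the two options are not equivalent: constant rescaling preserves ratios between values and thus the relative geometry of squared-error clustering, while the log transform changes which points end up close to each other.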