elki-project / elki

ELKI Data Mining Toolkit
https://elki-project.github.io/
GNU Affero General Public License v3.0

Distance-based cluster evaluation algorithms fail if input numbers are too big #76

Closed bastian-wur closed 4 years ago

bastian-wur commented 4 years ago

Hi everyone,

I'm currently trying to cluster a matrix. The values in the matrix are very large (the biggest is about 10e+300), and the matrix is also quite dense. K-means produced results, but all internal cluster evaluation algorithms failed to produce anything useful. These are the results from k-means with k=4:

Distance-based Davies Bouldin Index 0.0
Distance-based Density Based Clustering Validation NaN
Distance-based C-Index 1.0
Distance-based PBM-Index NaN
Distance-based Silhouette +-NaN NaN
Distance-based Simp. Silhouette +-NaN NaN
Distance-based Mean distance Infinity
Distance-based Sum of Squares Infinity
Distance-based RMSD Infinity
Distance-based Variance Ratio Criteria NaN
# Concordance
Concordance Gamma 0.9999772178605122
Concordance Tau 0.04571359658825246

In the meantime, I do the clustering only on the exponents (so 10e+300 becomes 300), and I now get useful output. So... I have no idea what is causing this, but I guess something should warn the user.

kno10 commented 4 years ago

At 1e+300 I am not at all surprised that k-means fails to produce results: you simply exceed the range of double-precision floating point. K-means by definition minimizes squared errors, and the square of 1e+300 exceeds the floating-point range; the resulting overflow likely causes the NaN and infinite values you see.
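A minimal sketch of the failure mode (shown in Python for brevity; the same IEEE 754 double behavior applies to ELKI's Java code): squaring 1e+300 overflows the double range to infinity, and arithmetic on those infinities then produces NaN.

```python
import math

x = 1e300
sq = x * x       # exceeds the double maximum (~1.8e308) -> inf
diff = sq - sq   # inf - inf is undefined -> nan

print(sq)    # inf
print(diff)  # nan

assert math.isinf(sq)
assert math.isnan(diff)
```

This is exactly the pattern of a variance-style computation (sum of squares minus a squared mean), which is why indices such as the Variance Ratio Criteria report NaN while plain sums report Infinity.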

Closing as won't fix: supporting arbitrary precision would ruin performance, so this is not going to happen. Instead, rescale your data: either transform to log space (at this magnitude, this may or may not be more meaningful), or simply multiply by a constant such as 1e-300.
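The two suggested workarounds can be sketched as follows (hypothetical input values; the log transform assumes strictly positive data):

```python
import math

values = [1e300, 5e299, 2e150, 3.0]   # hypothetical large inputs

# Option 1: rescale by a constant, e.g. 1e-300
scaled = [v * 1e-300 for v in values]

# Option 2: work in log space (requires strictly positive values)
logged = [math.log10(v) for v in values]

# squared errors now stay within the double range
assert all(math.isfinite(v * v) for v in scaled)
assert all(math.isfinite(v * v) for v in logged)
```

Note that the two options are not equivalent: constant rescaling preserves ratios between values and thus the relative geometry of squared-error clustering, while the log transform changes which points end up close to each other.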