elki-project / elki

ELKI Data Mining Toolkit
https://elki-project.github.io/
GNU Affero General Public License v3.0
785 stars 323 forks source link

Anderberg Hierarchical clustering - maximum array size reached #72

Closed dkajtoch closed 4 years ago

dkajtoch commented 4 years ago

I am trying to run hierarchical clustering with around 170k vectors of size 768 each and ELKI throws the message

This implementation does not scale to data sets larger than 65536 instances (~16 GB RAM), at which point the Java maximum array size is reached.

my code looks like this:

java -cp "src/dependency/*:src/elki/*" de.lmu.ifi.dbs.elki.application.KDDCLIApplication -dbc.in input.tsv -algorithm clustering.hierarchical.AnderbergHierarchicalClustering
kno10 commented 4 years ago

This algorithm requires a quadratic distance matrix. It uses a linearized memory layout of the upper triangular matrix. Java arrays are limited in size to 2^31 instances, hence the maximum data set size supported is 65536 (because 65537*65536/2 > 2^31)

Do the math: 170k instances, that is 170000 169999 / 2 8 bytes = 107 GB RAM. The runtime will be similarly bad.

You need to sample your data. If you really need to process the entire data this way (and you do have enough RAM), you really need to implement this yourself (then you could also use single-precision, for example). You will also want to parallelize the distance matrix computation, too. Because the runtime also goes up quadratic (worst case: cubic) with data set size. You can likely also use some pre-aggregation to first reduce the data set size, then cluster only the aggregates.

Closing as won't fix. The algorithm is quadratic, it will not scale to large data. And that is exactly what the error message says, the maximum size supported is 65536.

dkajtoch commented 4 years ago

Alright, thanks @kno10!