Closed dkajtoch closed 4 years ago
This algorithm requires a quadratic distance matrix. It uses a linearized memory layout of the upper triangular matrix. Java arrays are limited in size to 2^31 instances, hence the maximum data set size supported is 65536 (because 65537*65536/2 > 2^31)
Do the math: 170k instances, that is 170000 169999 / 2 8 bytes = 107 GB RAM. The runtime will be similarly bad.
You need to sample your data. If you really need to process the entire data this way (and you do have enough RAM), you really need to implement this yourself (then you could also use single-precision, for example). You will also want to parallelize the distance matrix computation, too. Because the runtime also goes up quadratic (worst case: cubic) with data set size. You can likely also use some pre-aggregation to first reduce the data set size, then cluster only the aggregates.
Closing as won't fix. The algorithm is quadratic, it will not scale to large data. And that is exactly what the error message says, the maximum size supported is 65536.
Alright, thanks @kno10!
I am trying to run hierarchical clustering with around 170k vectors of size 768 each and ELKI throws the message
my code looks like this: