CartoDB / cartodb-postgresql

PostgreSQL extension for CartoDB
BSD 3-Clause "New" or "Revised" License
111 stars 53 forks source link

Use kmeans-postgresql for Jenks #183

Open pnorman opened 8 years ago

pnorman commented 8 years ago

Jenk's natural breaks are a special case of kmeans clustering in 1d. The kmeans-postgresql is a C implementation of kmeans.

Benchmarking shows kmeans-postgresql to be orders of magnitude faster than the plpgsql impelementation currently in cartodb-postgresql

Times for 5 bucket Jenks
Rows      kmeans-postgresql  CDB_JenksBins
1k        1.75ms             912ms
233k      374ms              5795ms

Using kmeans internally for the math would be a substantial speed increase.

jgoizueta commented 8 years ago

Having this function in the users' databases could be useful to create point overviews by clustering points spatially.

Like here: http://gis.stackexchange.com/questions/11567/spatial-clustering-with-postgis

jgoizueta commented 8 years ago

Did a little clustering experiment here and sadly it seems to slow for this purpose.

pnorman commented 8 years ago

Clustering isn't something you'd normally run interactively for display, but for analysis. With 4.1m points I get times of 19s for 10 buckets and 137s for 100 buckets.

The common heuristic for kmeans is O(nkdi), where n is the number of d-dimensional vectors, k the number of clusters and i the number of iterations needed until convergence.

It's not a tool for everything, but all clustering except simple grid clustering will have the same problems