elastic / elasticsearch

Free and Open Source, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch
Other
1.53k stars 24.9k forks source link

Add K-means clustering feature #5512

Closed geekpete closed 11 months ago

geekpete commented 10 years ago

Add k-means clustering to allow detection of clusters in data sets. http://en.wikipedia.org/wiki/K-means_clustering

Would be useful for geo points but also other use cases too.

Thanks to https://github.com/koobs for suggesting this one in Sydney Elastic Training.

polyfractal commented 5 years ago

@barracuda317 Not really, no. GeoTile just overlays a fixed grid over the area and aggregates documents into those grid cells. The grids are constructed irregardless of the data distribution (think of it more like a heatmap).

Clustering like k-means dynamically identifies regions of data that are "similar" and groups them together into a cluster. Clustering can give you individual clusters that are different shapes, sizes, and densities. For a practical example, a clustering algo might group all the values inside a city together, then cluster the rural countryside together as a different group (much larger but also more sparse)

jamesdorfman commented 5 years ago

@colings86 is there a reason why this was never completed? If the issue is time commitment, I'm very interested in this feature and would like to try finishing the implementation.

mayya-sharipova commented 5 years ago

@jamesdorfman Can you please describe your use case. Are you interested to have k-means clustering on geo data (they can be up to 8 dims)?

lessless commented 5 years ago

@mayya-sharipova +1 to exactly that use case

jamesdorfman commented 5 years ago

@mayya-sharipova yes, I was specifically interested in implementing the geo data use case. This thread made it seem as though this specific feature is highly desired.

Furthermore, I'm not completely certain about how difficult this will be to implement, so I also think that the restricted use case of clustering only geo data is a good starting point.

LaurentChardin commented 5 years ago

Another use case was to group ranges of prices of products within an index, and use k-cluster to propose cluster of prices to use with price selection.

jamesdorfman commented 5 years ago

Upon further research and experimentation it seems that a more straightforward approach would be to implement an agglomerative hierarchical clustering algorithm, rather than k-means clustering.

K-means involves creating k buckets, and then reassigning data points at each iteration of the algorithm. On the other hand, in agglomerative hierarchical clustering each point is initially placed in its own cluster. Then, these clusters are merged together on subsequent iterations. https://en.wikipedia.org/wiki/Hierarchical_clustering

I am currently working on implementing this clustering feature as a histogram multi-bucket aggregation. The k-means approach would involve moving documents between buckets at each iteration; however, the hierarchical method would simply entail creating a bucket for each document and then merging them until the desired number of buckets is reached. This functionality is very similar to the existing Auto Date Histogram Aggregation, where buckets are created and then merged. Since bucket merging functionality was already created for that aggregation, this approach is significantly easier to implement.

Furthermore, it seems that both methods can produce clusters of similar quality. https://www.cs.utah.edu/~piyush/teaching/4-10-print.pdf

Please let me know if this line of reasoning makes sense :)

arshad171 commented 5 years ago

@jamesdorfman Isn't Agglomerative Hierarchical Clustering expensive in terms time and space complexities compared to K-means?

Agglomerative clustering: time complexity: O(n^3) space complexity: O(n^2)

K-means: time complexity: O(n k m) space complexity: O((n + k)m)

And since K-means has implementations that support incremental learning, the space complexity can be further reduced to make it constant.

sroui commented 3 years ago

+1

geekpete commented 1 year ago

Do we think this issue is still relevant? I see it's now 10 years ago that I first opened it.

ddavidebor commented 1 year ago

This brings back so many memories...

wchaparro commented 11 months ago

Closing this as not planned in Aggregations. If we are going to develop this - it will go into ES|QL, where we are focusing future development.

koobs commented 10 months ago

😢