elastic / elasticsearch

Free and Open Source, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch

Add K-means clustering feature #5512

Closed geekpete closed 11 months ago

geekpete commented 10 years ago

Add k-means clustering to allow detection of clusters in data sets. http://en.wikipedia.org/wiki/K-means_clustering
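For readers unfamiliar with the algorithm, here is a minimal sketch of plain k-means (Lloyd's algorithm) on 2-D points. It is not Elasticsearch code: the function name `kmeans`, its parameters, and the random seeding are all assumptions made for illustration. It also uses plain Euclidean distance, which is only an approximation for geo points.

```python
import random


def kmeans(points, k, iterations=20, seed=0):
    """Lloyd's algorithm on 2-D points; returns (centroids, assignments)."""
    rng = random.Random(seed)
    # Start from k distinct input points (one of several common seeding choices).
    centroids = rng.sample(points, k)
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        for i, (x, y) in enumerate(points):
            assignments[i] = min(
                range(k),
                key=lambda c: (x - centroids[c][0]) ** 2 + (y - centroids[c][1]) ** 2,
            )
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [points[i] for i, a in enumerate(assignments) if a == c]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = (
                    sum(p[0] for p in members) / len(members),
                    sum(p[1] for p in members) / len(members),
                )
    return centroids, assignments
```

Every iteration touches every point, which is why the cost grows quickly on large document sets.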

Would be useful for geo points but also other use cases too.

Thanks to https://github.com/koobs for suggesting this one in Sydney Elastic Training.

savioteles commented 10 years ago

It would be great! I really need this feature. Is there any estimate of when you will start coding?

geekpete commented 10 years ago

Not that I'm seeking to drop her in it, but Britta https://twitter.com/a2tirb would definitely have the madskillz to build this feature but not sure of her priorities/bandwidth/interest to attack this feature with sticks.

I'm also not sure how new features are selected or voted up for prioritisation by elasticsearch overlords either.

brwe commented 10 years ago

If I recall correctly, @geekpete proposed to have that in the context of aggregations, that is, build clusters and then use these as buckets inside the aggregations framework. Indeed, this would be an extremely useful feature.

While it would be a lot of fun to implement, unfortunately I do not think we will implement it in the near future. Anyone coming up with a pull request for this is of course more than welcome :-)

For now I can only point you to the carrot2 plugin which does an excellent job in clustering search results.

koobs commented 10 years ago

I'll add the comment that 'k' clusters ought to be user-suppliable as an argument to the aggregation for maximum value, with possible k-values being:

For context, I brought this up @ the ElasticSearch Training in response to a brief conversation about search vs 'insight' in relation to data, the former where you know what you're looking for, the latter where you don't, or might not. The specific example was geospatial result sets with arbitrary demography data fields. It was a great session @brwe!

mishakogan commented 10 years ago

I would also like to cast my vote for some kind of automated clustering feature. Carrot2 is great but, as far as I understand, can only work on small amounts of data. It would be great to have something that clusters ALL the data all the time. Maybe a custom clustering analyzer?

clintongormley commented 10 years ago

@brwe would #8110 help here?

brwe commented 10 years ago

@clintongormley not really. Bucket reducers from #8110 would run on the final aggregation but clustering needs the documents.

jpountz commented 10 years ago

@brwe I think implementing clustering as a reducer could help reduce the cost very significantly? K-means is costly so running such an algorithm on a dataset containing lots of documents could be very slow. On the other hand, if we take geo-clustering as an example, we could make it very fast (though a bit lossy) by working on top of the output of the geo-hash grid aggregation as a bucket reducer?
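A rough sketch of the idea above, under stated assumptions: instead of clustering every document, run k-means over grid buckets and weight each bucket by its document count. The input shape `(lat, lon, doc_count)` and the function name `weighted_kmeans` are hypothetical. This is not the geo-hash grid response format or any Elasticsearch API, and seeding with the heaviest buckets is an arbitrary choice.

```python
def weighted_kmeans(buckets, k, iterations=20):
    """Cluster (lat, lon, doc_count) buckets; doc_count acts as the weight."""
    # Seed with the k heaviest buckets (an arbitrary, deterministic choice).
    seeds = sorted(buckets, key=lambda b: b[2], reverse=True)[:k]
    centroids = [(lat, lon) for lat, lon, _ in seeds]
    assignments = [0] * len(buckets)
    for _ in range(iterations):
        # Assignment step runs over buckets, not over individual documents.
        for i, (lat, lon, _) in enumerate(buckets):
            assignments[i] = min(
                range(k),
                key=lambda c: (lat - centroids[c][0]) ** 2 + (lon - centroids[c][1]) ** 2,
            )
        # Update step: count-weighted mean of the member buckets.
        for c in range(k):
            members = [b for b, a in zip(buckets, assignments) if a == c]
            total = sum(b[2] for b in members)
            if total > 0:  # keep the old centroid if a cluster ends up empty
                centroids[c] = (
                    sum(b[0] * b[2] for b in members) / total,
                    sum(b[1] * b[2] for b in members) / total,
                )
    return centroids, assignments
```

The work per iteration depends on the number of buckets rather than the number of documents. The result is lossy because every document is treated as if it sat at its bucket's position.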

brwe commented 10 years ago

True, I should distinguish use cases. For up to 2d it might help indeed. For text clustering I do not see it.

yehosef commented 9 years ago

just found this - would be great. +1

dsingley commented 9 years ago

+1

ghost commented 9 years ago

Searched for this... this would be a great feature. Other mining algorithms too.

colings86 commented 9 years ago

Implementing this as a pipeline aggregation should now be possible. In that case we would first collect values into buckets using other aggregations and then use the pipeline aggregation to create clusters from those buckets.

lessless commented 8 years ago

that would be mad!

lessless commented 8 years ago

@koobs is there a recording of this session somewhere out there?

koobs commented 8 years ago

@lessless I hope not :)

irony commented 8 years ago

This would really be awesome!

audriusbugas commented 8 years ago

+1

reinier-pv commented 8 years ago

:+1:

trupin commented 8 years ago

+1

chenryn commented 8 years ago

+1

marfago commented 8 years ago

+1

hkulekci commented 8 years ago

+1

amazium commented 8 years ago

+1

s1monw commented 8 years ago

I am removing the discuss label and making it adopt me - there has been enough discussion on this.

ryanrozich commented 8 years ago

+1 carrot2 is good for text clustering but does not use the aggregations framework, would be great to have a text clustering option that we can build sub-aggregations / child aggregations underneath.

lessless commented 8 years ago

will we be able to use it with geopoints?

SimoneTosato commented 8 years ago

+1

ddavidebor commented 8 years ago

+1 This feature would be so helpful and powerful

iamdeit commented 8 years ago

What happened with this feature? Is there any work in progress? I think it would be really useful.

tol commented 7 years ago

+1

nknize commented 7 years ago

It's certainly gotten some attention. While a bit stalled at the moment, due to other priorities, @colings86 has a branch for geo_point k-means: https://github.com/colings86/elasticsearch/tree/feature/geokmeans

So a little more patience and this feature will be available soon.

sebastianovide commented 7 years ago

+1

mkarakucuk commented 7 years ago

+1

a-tokyo commented 7 years ago

+1

vasily-kirichenko commented 6 years ago

+1

ItshakEli commented 6 years ago

+1

debb-hp-com commented 6 years ago

+1

lessless commented 6 years ago

@nknize, @colings86 pushed the last update to his branch in Jul '17. Is it ready, or was it pushed out by higher-priority work?

colings86 commented 6 years ago

@lessless There has not yet been further work on this and it's still a little way off. There are actually other cluster-like aggregations which are likely to be merged first (e.g. https://github.com/elastic/elasticsearch/pull/26659), as they are a bit simpler to implement and easier to validate and test for a first implementation of aggregations which merge buckets at collection-time. Although we would like to make progress here, it's not something that is currently being tackled as a main task due to other priorities.

geekpete commented 6 years ago

Could we collate potential future features on a special section of the roadmap perhaps? This ticket could be closed and referenced to the "potential future features" area of the roadmap. This might help to clear a number of other github tickets that don't have major focus if priority is on other work at the moment.

colings86 commented 6 years ago

Stalled waiting on https://github.com/elastic/elasticsearch/pull/26659

/cc @elastic/es-search-aggs

lessless commented 6 years ago

Still desired :)

LaurentChardin commented 6 years ago

Indeed !! very desired !

lessless commented 6 years ago

@colings86 should "stalled" label be removed now? #26659 was closed in favor of #28993 which is merged now

colings86 commented 6 years ago

It is true that because #28993 is merged the "stalled" label can be removed.

Destroy666x commented 6 years ago

I also confirm it's very desired and I'd be happy to see it.

ivssh commented 6 years ago

+1 for this

ThomasSolti commented 5 years ago

+1

barracuda317 commented 5 years ago

Is the size parameter in https://www.elastic.co/guide/en/elasticsearch/reference/7.0/search-aggregations-bucket-geotilegrid-aggregation.html something like k-means-clustering for geo-search?