cockroachdb / cockroach

CockroachDB — the cloud native, distributed SQL database designed for high availability, effortless scale, and control over data placement.
https://www.cockroachlabs.com
Other
30.03k stars 3.79k forks source link

geo/geomfn: implement ST_ClusterWithin({geometry,float8}) #48901

Open otan opened 4 years ago

otan commented 4 years ago

Implement ST_ClusterWithin on arguments {geometry,float8}, which should adopt PostGIS behaviour.

Observers: Please react to this issue if you need this functionality.

For Geometry builtins, please do the following:

You can follow #48552 for an example PR.

:robot: This issue was synced with a spreadsheet by gsheets-to-github-issues by otan on 2023-09-03T23:16:38Z. Changes to titles, body and labels may be overwritten.

klesniewski commented 3 years ago

We are migrating parts of our application to CockroachDB and we need ST_ClusterWithin to implement displaying assets on the map.

image

Typically tenants within our system have hundreds to tens of thousands of assets. Unfortunately, putting all tenant's assets on the map individually doesn't work well - server response would be big (or spread over many pages) and client side would have a lot of work to render the map what results in poor user experience. Therefore, our backend takes care of clustering and returns clusters of assets. ST_ClusterWithin would help us by pushing the work of computing clusters to the database engine.

otan commented 3 years ago

Thanks @klesniewski - are there any other clustering algorithms that you find you commonly need?

klesniewski commented 3 years ago

At the moment this is the only use case we have for clustering in transactional processing. I don't have much experience with clustering algorithms in Postgres, but from what I read ST_ClusterWithin sounded closest to what we need. The other ones considered were ST_ClusterDBSCAN and ST_ClusterKMeans (as compared here).

Basically it is not as important for us how exactly the points are clustered together as long it looks quite intuitively in UI. What matters is that ideally we can return for each cluster a number of records in it, and if the cluster is of small size (let's say 1 or 3), then actually have the record data, so that we can put it on the map unclustered. The more performant it is, the better.

klesniewski commented 3 years ago

A little update on our use case.

We have looked into different methods and ruled out DBSCAN and KMeans. We have ruled out KMeans as it requires to specify number of clusters we want to find. We have ruled out DBSCAN because it is not kind of clustering we are looking here for. DBSCAN can cluster non-linearly separable clusters, which is a nice property, but nor desirable in this use case: image

According to article DBSCAN Clustering in PostGIS, ST_ClusterWithin is a special case of DBSCAN, so as such we have ruled it out too.

We are currently implementing a solution based on geohash, roughly as described here, here or here.