geo/geomfn: implement ST_ClusterWithin({geometry,float8})

otan commented 4 years ago

Implement ST_ClusterWithin on arguments {geometry,float8}, which should adopt PostGIS behaviour.

Observers: Please react to this issue if you need this functionality.

For Geometry builtins, please do the following:

Ideally add a relevant helper function in pkg/geo/geomfn (parse and output related functions can go in pkg/geo). Add exhaustive unit tests here - you can run through example test cases and make sure that PostGIS and CRDB return the same result within a degree of accuracy (1cm for geography).
- When using GEOS, you can reference the C API for which functions are available. Unfortunately, Windows is not currently supported when using GEOS.
Create a new builtin that references this function in pkg/sql/sem/builtins/geo_builtins.go. Note that we currently do not support optional arguments, so we define functions that have optional arguments once without the optional argument (using the default value in the optional position), and once with the optional argument.
Modify the tests in pkg/sql/logictest/testdata/logic_test/geospatial to call this functionality at least once. You can call make testbaselogic FILES='geospatial' TESTFLAGS='-rewrite' to regenerate the output. Tests here should just ensure the builtin is linked end to end (your exhaustive unit tests go the above mentioned packages!).
Ensure the documentation is regenerated by calling make buildshort. You can also play with it by calling ./cockroach demo --empty afterwards.
Submit your PR - make sure to follow the guidelines from creating your first PR.

You can follow #48552 for an example PR.

_{:robot: This issue was synced with a spreadsheet by gsheets-to-github-issues by otan on 2023-09-03T23:16:38Z. Changes to titles, body and labels may be overwritten.}

klesniewski commented 3 years ago

We are migrating parts of our application to CockroachDB and we need ST_ClusterWithin to implement displaying assets on the map.

Typically tenants within our system have hundreds to tens of thousands of assets. Unfortunately, putting all tenant's assets on the map individually doesn't work well - server response would be big (or spread over many pages) and client side would have a lot of work to render the map what results in poor user experience. Therefore, our backend takes care of clustering and returns clusters of assets. ST_ClusterWithin would help us by pushing the work of computing clusters to the database engine.

otan commented 3 years ago

Thanks @klesniewski - are there any other clustering algorithms that you find you commonly need?

klesniewski commented 3 years ago

At the moment this is the only use case we have for clustering in transactional processing. I don't have much experience with clustering algorithms in Postgres, but from what I read ST_ClusterWithin sounded closest to what we need. The other ones considered were ST_ClusterDBSCAN and ST_ClusterKMeans (as compared here).

Basically it is not as important for us how exactly the points are clustered together as long it looks quite intuitively in UI. What matters is that ideally we can return for each cluster a number of records in it, and if the cluster is of small size (let's say 1 or 3), then actually have the record data, so that we can put it on the map unclustered. The more performant it is, the better.

klesniewski commented 3 years ago

A little update on our use case.

We have looked into different methods and ruled out DBSCAN and KMeans. We have ruled out KMeans as it requires to specify number of clusters we want to find. We have ruled out DBSCAN because it is not kind of clustering we are looking here for. DBSCAN can cluster non-linearly separable clusters, which is a nice property, but nor desirable in this use case:

According to article DBSCAN Clustering in PostGIS, ST_ClusterWithin is a special case of DBSCAN, so as such we have ruled it out too.

We are currently implementing a solution based on geohash, roughly as described here, here or here.

cockroachdb / cockroach

geo/geomfn: implement ST_ClusterWithin({geometry,float8}) #48901