Open otan opened 4 years ago
We are migrating parts of our application to CockroachDB and we need ST_ClusterWithin to implement displaying assets on the map.
Typically tenants within our system have hundreds to tens of thousands of assets. Unfortunately, putting all tenant's assets on the map individually doesn't work well - server response would be big (or spread over many pages) and client side would have a lot of work to render the map what results in poor user experience. Therefore, our backend takes care of clustering and returns clusters of assets. ST_ClusterWithin would help us by pushing the work of computing clusters to the database engine.
Thanks @klesniewski - are there any other clustering algorithms that you find you commonly need?
At the moment this is the only use case we have for clustering in transactional processing. I don't have much experience with clustering algorithms in Postgres, but from what I read ST_ClusterWithin sounded closest to what we need. The other ones considered were ST_ClusterDBSCAN and ST_ClusterKMeans (as compared here).
Basically it is not as important for us how exactly the points are clustered together as long it looks quite intuitively in UI. What matters is that ideally we can return for each cluster a number of records in it, and if the cluster is of small size (let's say 1 or 3), then actually have the record data, so that we can put it on the map unclustered. The more performant it is, the better.
A little update on our use case.
We have looked into different methods and ruled out DBSCAN and KMeans. We have ruled out KMeans as it requires to specify number of clusters we want to find. We have ruled out DBSCAN because it is not kind of clustering we are looking here for. DBSCAN can cluster non-linearly separable clusters, which is a nice property, but nor desirable in this use case:
According to article DBSCAN Clustering in PostGIS, ST_ClusterWithin is a special case of DBSCAN, so as such we have ruled it out too.
We are currently implementing a solution based on geohash, roughly as described here, here or here.
Implement
ST_ClusterWithin
on arguments {geometry,float8}, which should adopt PostGIS behaviour.Observers: Please react to this issue if you need this functionality.
For Geometry builtins, please do the following:
pkg/geo/geomfn
(parse and output related functions can go inpkg/geo
). Add exhaustive unit tests here - you can run through example test cases and make sure that PostGIS and CRDB return the same result within a degree of accuracy (1cm for geography).pkg/sql/sem/builtins/geo_builtins.go
. Note that we currently do not support optional arguments, so we define functions that have optional arguments once without the optional argument (using the default value in the optional position), and once with the optional argument.pkg/sql/logictest/testdata/logic_test/geospatial
to call this functionality at least once. You can callmake testbaselogic FILES='geospatial' TESTFLAGS='-rewrite'
to regenerate the output. Tests here should just ensure the builtin is linked end to end (your exhaustive unit tests go the above mentioned packages!).make buildshort
. You can also play with it by calling./cockroach demo --empty
afterwards.You can follow #48552 for an example PR.
:robot: This issue was synced with a spreadsheet by gsheets-to-github-issues by otan on 2023-09-03T23:16:38Z. Changes to titles, body and labels may be overwritten.