Open Boarders opened 5 years ago
This commit gives a version of K-medians using essentially Lloyds's algorithm for refinement and RGC (centroids of random subsamples) for initialisation. It is abstract enough that it would work as both an implementation of k-medians or k-means. This also exposes some parallelism in how assignment is done for each cluster, but finding the correct way to chunk the input will require experimentation.
Still left on this issue is to define a k-medians function on CharacterSequences
.
Currently our clustering pass uses the algorithms in
AI.Clustering.Hierarchical
. As it stands there is no easy way to make use of the k-means algorithm within this library.We should write a k-medians algorithm which is linear time and has good convergence properties (it should also expose a good amount of parallelism). This will entail writing a k-medians function for each of our different character types i.e. functions that looks something like:
In the case of a discrete character this will function will look like an inclusion-exclusion formula and we should be able to write similar things for other character types.
This would then be added as a Clustering option (currently asking for median clustering gives an error explaining that this is not yet implemented).